Please refer to the comments on Issue https://github.com/titu1994/LSTM-FCN/issues/4.
The dimension shuffle layer transforms the univariate time series problem into a multivariate problem with a single time step. Doing so reduces the capacity of the LSTM block, so on its own it is not a strong classifier; it works in conjunction with the FCN block, which is the primary feature extractor.
Our motivation for doing so was manifold:

- a regular LSTM severely overfits the simple classification problems of the UCR datasets and gets much lower accuracy than the SOTA;
- an LSTM with dimension shuffle alone severely underfits the task due to its reduced capacity;
- the FCN alone gets good performance, but not as good as the concatenation of the FCN and LSTM branches;
- we needed to process the sequential information quickly without losing all of the sequential semantics of the data.
The dimension-shuffled LSTM achieves fast training because it sees only a single time step for univariate problems (for a multivariate input time series, it becomes M time steps, where M is the number of variables), and it augments the performance of the CNN.
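For readers who want to see what the dimension shuffle looks like in practice, here is a minimal sketch of the shuffled LSTM branch, assuming Keras (via `tensorflow.keras`); the layer width and series length are illustrative, not the values from the paper or the repository:

```python
# Minimal sketch of a dimension-shuffled LSTM branch (illustrative, not the repo's exact code).
from tensorflow.keras.layers import Input, Permute, LSTM

T = 1200                    # series length (example value)
inp = Input(shape=(T, 1))   # univariate series: T time steps, 1 variable
x = Permute((2, 1))(inp)    # dimension shuffle -> shape (1, T): one time step, T "variables"
x = LSTM(8)(x)              # the LSTM unrolls over a single time step, hence fast training
# For a multivariate input with M variables, Input(shape=(T, M)) shuffles to (M, T),
# i.e. the LSTM then processes M time steps.
```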
As your work requires a multivariate input (2 variables), I suggest referring to our follow up work - https://arxiv.org/abs/1801.04503 which discusses the extension of this model to multivariate time series classification. The model architecture and training scripts are available at: https://github.com/titu1994/MLSTM-FCN
As for why we chose not to use a bidirectional LSTM or a stack of LSTMs: they simply overfit on the simple UCR datasets, and the additional capacity reduces the overall performance of the LSTM-FCN model.
@titu1994 thanks a lot for your comprehensive answer, highly appreciated!
I started to question your approach because, in my case, the dimension shuffle had a negative effect compared to a non-shuffled version! But of course, this may be because my problem is more complicated than the UCR problems. In that context (the model already being too complex), a BiLSTM or stacked LSTMs indeed make no sense from your perspective (though I found them to improve performance).
Again, thanks for your time and work!
I have a follow-up question: what's the difference between (1) dimension shuffle + LSTM with 1 time step, and (2) simply feeding the whole time series to fully connected layers with tanh activation, where the input size equals the number of time steps of the input series?
We had similar questions from others as well, which is why we performed an extensive ablation study which can be found here - Insights into LSTM Fully Convolutional Networks for Time Series Classification.
In it, we replaced the dimension-shuffled LSTM with a dimension-shuffled GRU, a basic RNN, and a fully connected layer with a sigmoid activation function (which is similar to your (2), but with sigmoid instead of tanh).
We find that the LSTM with dimension shuffle beats the rest in a large majority of cases. In addition, we find that the simple fully connected layer with sigmoid activation comes closer to the LSTM's performance than any of the other RNNs.
We used sigmoid because 3 of the 4 LSTM activations are sigmoid, and we think the LSTM's complex gating is what boosts its performance compared to a single fully connected layer with sigmoid/tanh activation.
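To make the comparison concrete, here is a small sketch of the two alternatives side by side, assuming Keras; the layer sizes are hypothetical and only meant to illustrate the structural difference:

```python
# (1) dimension shuffle + single-time-step LSTM vs. (2) a plain fully connected layer.
from tensorflow.keras.layers import Input, Permute, LSTM, Dense, Flatten

T = 128                                    # series length (example value)
inp = Input(shape=(T, 1))

# (1) gated update over one time step with T "variables"
lstm_out = LSTM(8)(Permute((2, 1))(inp))

# (2) a single fully connected layer over the raw series
#     (sigmoid as in the ablation; tanh as in the question is analogous)
dense_out = Dense(8, activation='sigmoid')(Flatten()(inp))
```

The key difference is the LSTM's input, forget, and output gates: even with a single time step, the cell computes several gated nonlinear combinations of the input, whereas the dense layer applies one affine map followed by a single activation.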
@titu1994 I am actually working with the MALSTM-FCN architecture for a classification task and I am impressed by its performance so far. What I want to ask you is: do you think a sliding window would increase performance? In time series, it is often necessary to find small temporal patterns; wouldn't a sliding window help find them? Have you tested it?
Thank you in advance for a quick response!
Hi,
thanks a lot for your proposed architecture and work!
I fully get your idea and want to refer to and include your architecture in my thesis.
The only thing I do not get: why do you use this dimension shuffle layer before feeding the data into the LSTM? Doesn't that cause the LSTM to completely neglect the temporal information in the time series?
Another thing that came to my mind: did you try or think about using a BiLSTM or more than one LSTM layer? Is there a special reason why you haven't tried those?
I am working with time series of 2 variables with a length of 1200 time ticks each (sensor data).
Glad for any response, folks!