timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/

Misalignment in LSTM Application and Subsequent Operations in _TSSequencerEncoderLayer #875

Open prashantkhatri23 opened 5 months ago

prashantkhatri23 commented 5 months ago

Issue Description

I've noticed a potential issue in the implementation of the _TSSequencerEncoderLayer class: the LSTM layer appears to be applied along the channel axis (feature size) rather than the temporal axis (sequence length). This is evident from how the LSTM layer is initialized.
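
To see why the current initialization recurs over channels, note that nn.LSTM's input_size is the size of the last dimension of its input, and the recurrence steps over the preceding (time) dimension. A minimal shape check makes the distinction concrete (the names lstm_over_channels / lstm_over_time and the bs, q_len, d_model values are illustrative only; this assumes batch_first inputs and is independent of how tsai transposes tensors internally):

    import torch
    import torch.nn as nn

    bs, q_len, d_model = 8, 50, 128

    # Current behavior: input_size == q_len forces the last dim to be q_len,
    # so the recurrence steps over the d_model (channel) dimension.
    lstm_over_channels = nn.LSTM(q_len, q_len, bidirectional=True, batch_first=True)
    out, _ = lstm_over_channels(torch.randn(bs, d_model, q_len))
    print(out.shape)  # torch.Size([8, 128, 100]) -> 2 * q_len features per step

    # Proposed behavior: input_size == d_model, so the recurrence steps
    # over the q_len (time) dimension.
    lstm_over_time = nn.LSTM(d_model, d_model, bidirectional=True, batch_first=True)
    out, _ = lstm_over_time(torch.randn(bs, q_len, d_model))
    print(out.shape)  # torch.Size([8, 50, 256]) -> 2 * d_model features per step

With that in mind, the specific changes I'd propose are: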

  1. LSTM Layer Initialization: Currently, the LSTM layer is initialized as follows:

    self.bilstm = nn.LSTM(q_len, q_len, num_layers=1, bidirectional=True, bias=lstm_bias)

    This should be revised to:

    self.bilstm = nn.LSTM(d_model, d_model, num_layers=1, bidirectional=True, bias=lstm_bias)
  2. Fully Connected Layer Adjustment: The self.fc layer needs to be updated to match the new LSTM output width, since a bidirectional LSTM with hidden size d_model emits 2 * d_model features per time step:

    self.fc = nn.Linear(2 * d_model, d_model)
  3. Modifications in Forward Pass: The forward method needs corresponding modifications to process the data correctly through the LSTM layer (a consolidated sketch of the patched layer follows this list):

    • For the pre-normalization case:
      x = self.drop_path(self.dropout(self.fc(self.bilstm(self.lstm_norm(x))[0]))) + x
    • For the non pre-normalization case:
      x = self.lstm_norm(self.drop_path(self.dropout_t(self.fc(self.bilstm(x)[0]))) + x)
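
Putting the three changes together, here is a minimal, self-contained sketch of how the patched layer could look. This is not tsai's exact code: the class name PatchedSequencerEncoderLayer, the nn.Identity drop_path placeholder, the simplified feed-forward sublayer, and the batch_first=True assumption are all illustrative; only the bilstm and fc changes mirror the proposal above.

    import torch
    import torch.nn as nn

    class PatchedSequencerEncoderLayer(nn.Module):
        # Hypothetical sketch of _TSSequencerEncoderLayer with the proposed
        # time-axis BiLSTM; drop_path and the feed-forward sublayer are
        # simplified stand-ins, not tsai's actual implementation.
        def __init__(self, d_model, dropout=0., lstm_bias=True, pre_norm=False):
            super().__init__()
            self.pre_norm = pre_norm
            # proposed change 1: recur over time, so input_size == hidden_size == d_model
            self.bilstm = nn.LSTM(d_model, d_model, num_layers=1, bidirectional=True,
                                  bias=lstm_bias, batch_first=True)
            # proposed change 2: project the 2 * d_model bidirectional output back to d_model
            self.fc = nn.Linear(2 * d_model, d_model)
            self.lstm_norm = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)
            self.drop_path = nn.Identity()  # placeholder for stochastic depth
            self.ff_norm = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Dropout(dropout), nn.Linear(4 * d_model, d_model))

        def forward(self, x):  # x: (bs, q_len, d_model)
            # proposed change 3: the LSTM/fc sublayer, for both norm placements
            if self.pre_norm:
                x = self.drop_path(self.dropout(self.fc(self.bilstm(self.lstm_norm(x))[0]))) + x
                x = self.drop_path(self.dropout(self.ff(self.ff_norm(x)))) + x
            else:
                x = self.lstm_norm(self.drop_path(self.dropout(self.fc(self.bilstm(x)[0]))) + x)
                x = self.ff_norm(self.drop_path(self.dropout(self.ff(x))) + x)
            return x

    # quick shape sanity check: output shape matches the input
    layer = PatchedSequencerEncoderLayer(d_model=128)
    print(layer(torch.randn(8, 50, 128)).shape)  # torch.Size([8, 50, 128])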

Additional Context:

These issues were identified during a detailed code review while integrating the model into my project. Specifically, I applied the model to two different tasks on the RAVDESS audio-visual (AV) emotion dataset:

  1. Emotion prediction using a facial embedding sequence extracted from a video.
  2. Emotion prediction using audio feature sequences.

To assess the impact of these concerns, I compared the model's performance with and without the proposed changes. The results were surprising: performance remained similar whether the LSTM was applied across the channel axis (as in the current implementation) or across the time steps (as in the proposed modification). This observation raises questions about the expected impact of the change and suggests that the model's behavior in different application contexts warrants further investigation.