kashif / pytorch-transformer-ts

Repository of Transformer based PyTorch Time Series Models
MIT License

Question regarding embedding layer #9

Open gorold opened 1 year ago

gorold commented 1 year ago

Hi @kashif, thanks for your great work in implementing all these Transformer models! I noticed that for many models, especially the long sequence time series forecasting models (Auto/ETS/NS-Transformer), you have decided to remove the enc/dec_embedding layers for the dynamic real-valued inputs and instead set d_model = self.input_size * len(self.lags_seq) + self._number_of_features, feeding the concatenated inputs directly to the Transformer layers (please correct me if I got this wrong). This makes the hyperparameter d_model not tunable; it is tied to the inputs. Could I ask what prompted this decision?
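
To make the tie concrete, here is a rough sketch with made-up numbers (not the repo's actual defaults):

import torch
import torch.nn as nn

# Illustrative sizes only -- not the repo's defaults.
input_size = 1                      # univariate target
lags_seq = [1, 2, 3, 4, 5, 6, 7]    # example lag indices
number_of_features = 4              # e.g. time features + embedded categoricals + scale

d_model = input_size * len(lags_seq) + number_of_features   # = 11, fixed by the data

# The concatenated (lagged values + covariates) tensor is fed straight in,
# so d_model cannot be chosen independently of the inputs.
x = torch.randn(8, 24, d_model)     # (batch, time, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=1, batch_first=True)
out = layer(x)                      # (8, 24, 11)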

kashif commented 1 year ago

oh thank you @gorold for your kind words!

You are right, I removed the projection and embedding layers: for the categorical covariates GluonTS already provides a feature embedding, and the datetime features serve the purpose of positional encodings, so I just concatenated everything and passed it to the transformers. Transformers can quickly overfit, so basically I didn't want more layers...

However, you are right that this causes issues when the input size is not divisible by the number of heads, and we lose one hyperparameter... So I was debating adding a projection layer just yesterday, haha!
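
To illustrate the lost hyperparameter with made-up numbers (a sketch, not code from the repo):

import torch.nn as nn

d_model = 11   # e.g. input_size * len(lags_seq) + number_of_features

# nn.MultiheadAttention requires embed_dim % num_heads == 0, so with
# d_model = 11 only 1 (or 11) heads are possible:
try:
    nn.MultiheadAttention(embed_dim=d_model, num_heads=4)
except AssertionError as err:
    print(err)   # embed_dim must be divisible by num_heads

# A projection layer decouples d_model from the inputs again:
proj = nn.Linear(d_model, 32)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)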

What would you suggest? Should I add something like:

import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Project the concatenated inputs from c_in to d_model with a Conv1d over time."""

    def __init__(self, c_in, d_model):
        super().__init__()
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=d_model,
            kernel_size=3,
            padding=1,
            padding_mode="circular",
            bias=False,
        )
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(
                    m.weight, mode="fan_in", nonlinearity="leaky_relu"
                )

    def forward(self, x):
        # x: (batch, time, c_in) -> convolve over the time axis -> (batch, time, d_model)
        return self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
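
For example, with hypothetical sizes the shapes would be:

import torch

# batch=8, time=24, c_in=11 projected to d_model=32 (illustrative sizes).
emb = ValueEmbedding(c_in=11, d_model=32)
x = torch.randn(8, 24, 11)
print(emb(x).shape)   # torch.Size([8, 24, 32])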

Should I then also add standard positional encodings on top of this projection?
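
For reference, the "standard" fixed sin/cos encoding would look roughly like this (a sketch assuming an even d_model, not code from this repo):

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    # Fixed sin/cos encoding from "Attention Is All You Need",
    # added on top of the value projection (assumes an even d_model).
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        return x + self.pe[:, : x.size(1)]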

What should the default d_model be then? I somehow feel values like 512 are too large.

Happy to hear your thoughts and then I can add it!

kashif commented 1 year ago

One more thing I forgot to mention, @gorold: with the above embedding and a padding_mode of "circular", the Conv1d fails at inference time in the decoder, where we only have a single vector... It works with the other models since they have a label window at inference time which is given to the decoder... gluonts has no label window, just two consecutive windows, as you know.

So perhaps a better idea might be to use:

nn.Conv1d(
    in_channels=feature_size,
    out_channels=d_model,
    kernel_size=3,
    padding="same",
    padding_mode="replicate",
    bias=False,
)
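
A quick sanity check of that configuration on a single-step decoder input (illustrative sizes), since that is the case that broke with "circular" padding:

import torch
import torch.nn as nn

conv = nn.Conv1d(
    in_channels=7, out_channels=32, kernel_size=3,
    padding="same", padding_mode="replicate", bias=False,
)
single_step = torch.randn(8, 7, 1)   # (batch, feature_size, time=1)
print(conv(single_step).shape)       # torch.Size([8, 32, 1])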

A few questions then:

  1. should the encoder and decoder have their own value embedding layers?
  2. should we also then add a positional encoding layer? either fixed sin/cos or learned?

thanks for any opinions!

gorold commented 1 year ago

Thanks for the quick reply!

  1. Regarding the conv layer, there is some interplay with the use of lagged_sequence_values. They almost do the same thing, since the conv layer essentially just combines the effect of a few lagged values; of course, lagged_sequence_values considers a much longer history. While lagged values were a key contributor for RNNs, I'm not too sure it makes sense to have them in models such as Auto/Fed/ETS-former, which try to extract seasonal patterns.
  2. My suggestion (if the aim of this repo is to evaluate the models as-is) would be to follow their original implementations, while for the standard Transformer you could just use nn.Linear(in_size, d_model) (see the sketch at the end of this comment).
  3. These models actually don't use positional encoding: starting from Autoformer, the official code uses the DataEmbedding_wo_pos module, whose forward call never invokes the positional embedding. Informer does use it, though. Again, my opinion would be to follow what the papers proposed.
  4. My thinking for the default d_model would be to follow what the papers originally proposed, which would be 512, but I guess the standard Transformer could be smaller like 32.
  5. Could I get some clarification on why Conv1d with the "circular" setting fails? These models are multi-horizon / direct multi-step models, where the decoder only takes future_time_feat as input (which is available from gluonts), not future_target. What are these label windows you're referring to?
  6. Yup, encoder and decoder should have separate value embedding layers.

Hope this helps!
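
A minimal sketch of the linear option for the standard Transformer (sizes are illustrative):

import torch
import torch.nn as nn

in_size, d_model = 11, 32            # illustrative sizes
value_proj = nn.Linear(in_size, d_model)

x = torch.randn(8, 24, in_size)      # (batch, time, in_size)
h = value_proj(x)                    # (batch, time, d_model)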

gorold commented 1 year ago

I just noticed that Informer, FEDformer, NS-Transformer have been implemented with greedy decoding, whereas Autoformer and ETSformer use direct multi-step forecasting.

Just a note from the Informer codebase: the future targets these models take as input are actually just zeros: https://github.com/zhouhaoyi/Informer2020/blob/ac59c7447135473fb2aafeafe94395f884d5c7a5/exp/exp_informer.py#L266-L278

Ideally, this logic should be encapsulated within the model, so it does not need to take x_dec as an input...
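
A rough sketch of what encapsulating that inside the model could look like (names here are illustrative, not from this repo), mirroring the linked Informer code that concatenates the last label_len observations with pred_len zero placeholders:

import torch

def build_decoder_input(past_target, label_len, pred_len):
    # past_target: (batch, context_length, ...) observed values.
    # Returns the last `label_len` observations followed by `pred_len`
    # zero placeholders, built inside the model instead of passed as x_dec.
    zeros = past_target.new_zeros(
        past_target.size(0), pred_len, *past_target.shape[2:]
    )
    return torch.cat([past_target[:, -label_len:], zeros], dim=1)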