gorold opened this issue 1 year ago
Oh, thank you @gorold for your kind words!

You are right, I removed the projection and embedding layers since, for the categorical covariates, GluonTS already had the feature embedding, and the datetime features serve the purpose of positional encodings; thus, I just concatenated everything and passed it to the transformers. Transformers can quickly overfit, so I didn't want more layers, basically...

However, you are right, this causes issues with the input size not being divisible by the number of heads, and we lose one hyper-parameter... So I was debating adding a projection layer just yesterday, haha!
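To make the constraint concrete, here is a rough sketch with made-up sizes (the attribute names just mirror the GluonTS-style ones discussed in this thread, not actual model code):

```python
# Illustrative only (made-up sizes): without a projection, d_model is whatever
# the feature concatenation happens to be, so the number of attention heads is
# constrained rather than being a free hyper-parameter.
input_size = 1                     # e.g. univariate target
lags_seq = [1, 2, 3, 4, 5, 6, 7]   # hypothetical lag indices
number_of_features = 9             # time features + embeddings + scale, etc.

d_model = input_size * len(lags_seq) + number_of_features  # 7 + 9 = 16
nhead = 4
assert d_model % nhead == 0, "embed_dim must be divisible by num_heads"
```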
What would you suggest? Should I add something like:
```python
import torch.nn as nn


class ValueEmbedding(nn.Module):
    """Projects the concatenated input features to d_model via a Conv1d over time."""

    def __init__(self, c_in, d_model):
        super().__init__()
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=d_model,
            kernel_size=3,
            padding=1,
            padding_mode="circular",
            bias=False,
        )
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(
                    m.weight, mode="fan_in", nonlinearity="leaky_relu"
                )

    def forward(self, x):
        # x: (batch, time, c_in) -> (batch, time, d_model)
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x
```
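For context, a hypothetical shape check of how I'd use it (made-up sizes, not code from the repo):

```python
import torch

emb = ValueEmbedding(c_in=16, d_model=32)  # made-up sizes
x = torch.randn(8, 24, 16)                 # (batch, context_length, features)
print(emb(x).shape)                        # torch.Size([8, 24, 32])
```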
Should I also then add standard positional encodings on top of this projection?
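By standard positional encodings I mean the fixed sinusoidal ones from the original Transformer paper; a minimal sketch (not tied to any particular codebase, assumes an even `d_model`) would be:

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sin/cos positional encoding, added to the projected inputs."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, time, d_model)
        return x + self.pe[: x.size(1)]
```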
What should the default `d_model` be then? I somehow feel 512 etc. are too large?
Happy to hear your thoughts and then I can add it!
One more thing I forgot to mention, @gorold: with the above embedding and `padding_mode="circular"`, at inference time the decoder only has a single vector and the `Conv1d` fails... It works with the other models since they have the labels window at inference time, which is given to the decoder... GluonTS has no labels window, just two consecutive windows, as you know.
So perhaps a better idea might be to use:
```python
nn.Conv1d(
    in_channels=feature_size,
    out_channels=d_model,
    kernel_size=3,
    padding="same",
    padding_mode="replicate",
    bias=False,
)
```
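A quick sanity check (made-up sizes; needs a reasonably recent PyTorch, since `padding="same"` was added in 1.9) that this variant copes with the single-step decoder input at inference:

```python
import torch
import torch.nn as nn

feature_size, d_model = 16, 32  # made-up sizes
proj = nn.Conv1d(feature_size, d_model, kernel_size=3,
                 padding="same", padding_mode="replicate", bias=False)

# At inference the decoder only sees one time step.
x = torch.randn(8, 1, feature_size)             # (batch, time=1, features)
out = proj(x.permute(0, 2, 1)).transpose(1, 2)  # (batch, time=1, d_model)
print(out.shape)                                # torch.Size([8, 1, 32])
```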
So a few questions then: should I add such a projection layer, should I also add positional encodings on top, and what should the default `d_model` be? Thanks for any opinions!
Thanks for the quick reply!
On the conv projection vs. `lagged_sequence_values`: they almost do the same thing, since the conv layer essentially just combines the effect of some lagged values. Of course, `lagged_sequence_values` considers a much longer history. While lags were a key contributor for RNNs, I'm not too sure whether it makes sense to have them in models such as Auto/Fed/ETS-former, which try to extract seasonal patterns. For the projection itself, a simple `nn.Linear(in_size, d_model)` should do.

On positional encodings: these models use the `DataEmbedding_wo_pos` module which, if we look at the `forward` call, doesn't invoke `positional_embedding`. Informer uses it though. Again, my opinion would be to follow what the papers proposed.

For `d_model`, my suggestion would be to follow what the papers originally proposed, which would be 512, but I guess the standard Transformer could be smaller, like 32.

Hope this helps!
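For concreteness, the simpler projection I have in mind is just a linear layer mapping the concatenated features to a tunable `d_model`, something like this hypothetical sketch:

```python
import torch.nn as nn


class InputProjection(nn.Module):
    """Hypothetical sketch: map the concatenated features to a tunable d_model."""

    def __init__(self, in_size, d_model):
        super().__init__()
        self.proj = nn.Linear(in_size, d_model)

    def forward(self, x):
        # x: (batch, time, in_size) -> (batch, time, d_model)
        return self.proj(x)
```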
I just noticed that Informer, FEDformer, and NS-Transformer have been implemented with greedy decoding, whereas Autoformer and ETSformer use direct multi-step forecasting.
Just a note from the Informer codebase: the future targets which these models take as input are actually just zeros: https://github.com/zhouhaoyi/Informer2020/blob/ac59c7447135473fb2aafeafe94395f884d5c7a5/exp/exp_informer.py#L266-L278

Ideally this logic should be encapsulated within the model, so that it does not take `x_dec` ...
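Roughly, the pattern in the linked code is the following (paraphrased sketch with made-up sizes, not the actual lines):

```python
import torch

# The decoder input is the label window followed by zero placeholders
# for the prediction horizon.
batch_size, label_len, pred_len, target_dim = 32, 48, 24, 1
batch_y = torch.randn(batch_size, label_len + pred_len, target_dim)

dec_inp = torch.zeros(batch_size, pred_len, target_dim)
dec_inp = torch.cat([batch_y[:, :label_len, :], dec_inp], dim=1)
```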
Hi @kashif, thanks for your great work in implementing all these Transformer models! I noticed that for many models, especially the long sequence time series forecasting models (Auto/ETS/NS-Transformer), you have decided to remove the `enc/dec_embedding` layers for the dynamic, real inputs and directly set `d_model = self.input_size * len(self.lags_seq) + self._number_of_features`, passing the concatenated features straight into the Transformer layers (please correct me if I got this wrong). This makes the hyperparameter `d_model` not tunable, but tied to the inputs. Could I ask what prompted this decision?