maxjcohen / transformer

Implementation of Transformer model (originally from Attention is All You Need) applied to Time Series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0

The training loss is a constant from beginning #27

Closed LightingFx closed 3 years ago

LightingFx commented 3 years ago

Hi max, thank you for the nice project! I have a problem when I run the model with my data. I changed the input and output parameters according to my data set, but when I trained the model the training loss was constant from the beginning, and the validation loss was also constant. I have reduced the learning rate, but it didn't help. The test loss was huge and the accuracy is poor.

[Epoch 1/15]: 100%|██████████| 10000/10000 [00:32<00:00, 306.73it/s, loss=80.6, val_loss=80.5]
[Epoch 2/15]: 100%|██████████| 10000/10000 [00:33<00:00, 299.26it/s, loss=80.5, val_loss=80.5]
[Epoch 3/15]: 100%|██████████| 10000/10000 [00:33<00:00, 299.21it/s, loss=80.5, val_loss=80.5]
......

By the way, my data is a DataFrame, and I used a DataLoader to organize it; the input size is 7 and the output size is 1. I used MSELoss (when I used OZELoss, the training loss was NaN).
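Roughly, the way the DataFrame is wrapped looks like the sketch below (the column names, window length, and batch size are placeholders, not the exact code):

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical data: 7 input features and 1 target column.
df = pd.DataFrame(np.random.randn(200, 8),
                  columns=[f"f{i}" for i in range(7)] + ["target"])

class WindowedSeries(Dataset):
    """Slice a DataFrame into fixed-length windows of inputs and targets."""

    def __init__(self, df: pd.DataFrame, input_cols, target_col, time_step: int = 50):
        self.x = torch.tensor(df[input_cols].values, dtype=torch.float32)    # (N, 7)
        self.y = torch.tensor(df[[target_col]].values, dtype=torch.float32)  # (N, 1)
        self.time_step = time_step

    def __len__(self):
        return len(self.x) - self.time_step + 1

    def __getitem__(self, idx):
        sl = slice(idx, idx + self.time_step)
        return self.x[sl], self.y[sl]  # (time_step, 7), (time_step, 1)

dataset = WindowedSeries(df, input_cols=[f"f{i}" for i in range(7)], target_col="target")
loader = DataLoader(dataset, batch_size=8, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 50, 7]) torch.Size([8, 50, 1])
```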

So do you know where the mistake is? Looking forward to your reply.

maxjcohen commented 3 years ago

Hi, having a constant loss throughout the training can be a real headache, let's try a couple of things:

  • What is the exact shape of a batch returned by your dataloader?
  • Have you properly normalized your data?
  • Could you try using one of the LSTM/GRU models in the benchmark to see if the problem persists?

You should keep the MSELoss for now, the OZELoss is specific to my use case and may not fit your problem anyway.
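For instance, the first two points above can be checked with something like this (a minimal sketch; `loader` stands for your own dataloader):

```python
import torch
from torch.utils.data import DataLoader

def inspect_batch(loader: DataLoader) -> None:
    """Print the shape and per-feature statistics of one batch from `loader`."""
    x, y = next(iter(loader))                  # expected x: (batch_size, time_step, input_size)
    print("x:", tuple(x.shape), "y:", tuple(y.shape))

    # After proper normalization each input feature should have a sensible scale,
    # e.g. roughly zero mean / unit std, or values in [0, 1] depending on the scheme.
    flat = x.reshape(-1, x.shape[-1])          # (batch_size * time_step, input_size)
    print("per-feature mean:", flat.mean(dim=0))
    print("per-feature std: ", flat.std(dim=0))
    print("target range:    ", y.min().item(), y.max().item())
```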

LightingFx commented 3 years ago

Thanks for your reply. Firstly, I used an LSTM identical to the one in the benchmark, and the results were good, so I wanted to see how the Transformer compares. Secondly, I have tried both MinMax and log normalization; the log normalization worked better with the LSTM, so I used it. Thirdly, the shape returned by the dataloader is [batch_size, time_step, input_size]. So should I adjust the other parameters, or does the Transformer just not fit my data?
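For reference, the log + MinMax scheme could look roughly like this (a simplified sketch, with the scaling fitted on the training split only; assumes non-negative values for the log transform):

```python
import numpy as np
import pandas as pd

def log_minmax_normalize(train: pd.DataFrame, val: pd.DataFrame):
    """Apply log1p then MinMax scaling, fitting the min/max on the training split only."""
    train_log = np.log1p(train)
    val_log = np.log1p(val)
    lo, hi = train_log.min(), train_log.max()   # per-column min/max from the training data
    scale = (hi - lo).replace(0, 1.0)           # avoid division by zero for constant columns
    return (train_log - lo) / scale, (val_log - lo) / scale

# Hypothetical split of a DataFrame `df` into train/validation before scaling:
# train_df, val_df = df.iloc[:8000], df.iloc[8000:]
# train_norm, val_norm = log_minmax_normalize(train_df, val_df)
```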

maxjcohen commented 3 years ago

  • LSTM working properly is not a good sign at all
  • Either should be fine
  • Which is the correct shape.

One thing you could try is reducing the number of layers of the Transformer: firstly, because smaller models usually converge more easily, and secondly, because it makes it easier to pinpoint where the gradient vanishes or explodes (if that is indeed the reason the model is stuck).
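One way to check for vanishing or exploding gradients is to look at the gradient norms of each parameter after the backward pass, sketched here with plain PyTorch (the model, loss, and optimizer are placeholders):

```python
import torch

def log_gradient_norms(model: torch.nn.Module) -> None:
    """Print the gradient norm of every parameter after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            norm = param.grad.norm().item()
            # Norms close to 0 in the early layers suggest vanishing gradients,
            # very large norms suggest exploding gradients.
            print(f"{name:60s} {norm:.3e}")

# Typical use inside the training loop:
# loss = criterion(model(x), y)
# loss.backward()
# log_gradient_norms(model)
# optimizer.step()
```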

LightingFx commented 3 years ago

Thanks! I removed some of the layer norms and residual connections in the sub-layers, and then the loss decreased normally. Besides, the Transformer performs a bit better than the LSTM on my data.
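The kind of modification described above can be sketched as follows (a simplified self-attention sub-layer with optional residual connection and LayerNorm, not the exact code from this repository):

```python
import torch
import torch.nn as nn

class SelfAttentionSubLayer(nn.Module):
    """Self-attention sub-layer with optional residual connection and LayerNorm."""

    def __init__(self, d_model: int, n_heads: int,
                 use_residual: bool = True, use_layernorm: bool = True):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model) if use_layernorm else nn.Identity()
        self.use_residual = use_residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, time_step, d_model)
        out, _ = self.attention(x, x, x)
        if self.use_residual:
            out = out + x          # residual connection around the attention block
        return self.norm(out)      # LayerNorm, or identity when disabled
```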

maxjcohen commented 3 years ago

It's funny, the layers you removed are the ones that supposedly improve convergence on all networks. I guess the transformer architecture still isn't that well understood in that regard.

If you want to get to the bottom of this convergence issue, you could now try plotting attention maps. You could also add these layers back, one at a time, to isolate the problem.
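A minimal sketch of what plotting an attention map could look like, assuming the attention weights for one head are available as a (time_step, time_step) array (how you retrieve them depends on the model):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_attention_map(weights: np.ndarray, title: str = "Attention map") -> None:
    """Display a (time_step, time_step) attention matrix as a heatmap."""
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(weights, aspect="auto", origin="lower")
    ax.set_xlabel("Key position")
    ax.set_ylabel("Query position")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    plt.show()

# Example with random weights, just to show the call:
plot_attention_map(np.random.rand(50, 50))
```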