maxjcohen / transformer

Implementation of Transformer model (originally from Attention is All You Need) applied to Time Series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0

Hello, thanks for your great work, I'm confused about the dataset. #54

Open StarDxxx opened 2 years ago

StarDxxx commented 2 years ago

Hello sir, I'm confused about the dataset. Could you share dataset_57M.npz or another demo dataset? I just don't understand the dataset's structure.

maxjcohen commented 2 years ago

Hello, for the dataset used in these examples, please see #2 . The expected structure of the input data is described in the Transformer's documentation; you can implement your own dataset as long as it matches this input shape.

chuzheng88 commented 2 years ago

> Hello, for the dataset used in these examples, please see #2 . The expected structure of the input data is described in the Transformer's documentation; you can implement your own dataset as long as it matches this input shape.

Hi, I have read the doc. For the inputs and outputs of the model, I understand them as follows: d_input and d_output are the numbers of input and output features. For example, if we use PM2.0 and PM5 to predict a pollution level, then d_input and d_output are 2 and 1, respectively. However, I don't understand the parameter K in the input and output tensors of shape (batch_size, K, d_output).

chuzheng88 commented 2 years ago

In other words, I want to deal with a regression task, which can be described as follows: there are two features in X, with X = [[x01, x02, ..., x0j], [x11, x12, ..., x1j]], and there is one feature in Y (the labels), with Y = [y1, y2, ..., yj]. To keep it simple, we use two sequences to predict one sequence, e.g. the sin and cos functions predicting the tan function. In this case, how should we construct the dataset?

maxjcohen commented 2 years ago

K is the length of the time series. In your example K=j, so each batch of data should consist of inputs with shape (batch_size, j, 2) and outputs with shape (batch_size, j, 1).
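For illustration, here is a minimal sketch of such a dataset with a standard PyTorch Dataset / DataLoader setup (the class name and the target function below are made up for the example, they are not part of this repository):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Toy dataset with two input features per time step (e.g. sin and cos)
# and one output feature per time step, over sequences of length K.
class ToyTimeSeriesDataset(Dataset):
    def __init__(self, n_samples=1000, K=12):
        t = np.linspace(0, 2 * np.pi, K)
        phases = np.random.rand(n_samples) * 2 * np.pi
        # Inputs: shape (n_samples, K, 2)
        self.x = np.stack(
            [np.stack([np.sin(t + p), np.cos(t + p)], axis=-1) for p in phases]
        ).astype(np.float32)
        # Outputs: a single target sequence, shape (n_samples, K, 1)
        self.y = (self.x[..., :1] * self.x[..., 1:]).astype(np.float32)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return torch.from_numpy(self.x[idx]), torch.from_numpy(self.y[idx])

loader = DataLoader(ToyTimeSeriesDataset(), batch_size=8)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 12, 2]) torch.Size([8, 12, 1])
```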

chuzheng88 commented 2 years ago

> K is the length of the time series. In your example K=j, so each batch of data should consist of inputs with shape (batch_size, j, 2) and outputs with shape (batch_size, j, 1).

Thanks for your reply. In this case, can the parameter attention_size be set to any value <= K?

maxjcohen commented 2 years ago

Yes, exactly!
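For reference, a sketch of how this might look when instantiating the model. The constructor signature and argument names are assumed from this repository's tst package and its example notebooks, and the hyper-parameter values are purely illustrative:

```python
import torch
from tst import Transformer  # package name assumed from this repository

K = 12          # time-series length
d_input = 2     # e.g. sin and cos
d_output = 1    # single target sequence

# Hyper-parameter values below are illustrative, not prescriptive.
net = Transformer(
    d_input, d_model=64, d_output=d_output,
    q=8, v=8, h=8, N=4,
    attention_size=K,        # any value <= K, as confirmed above
    dropout=0.2, chunk_mode=None, pe='original',
)

x = torch.rand(8, K, d_input)   # (batch_size, K, d_input)
y_hat = net(x)                  # expected shape: (batch_size, K, d_output)
print(y_hat.shape)
```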

chuzheng88 commented 2 years ago

> Yes, exactly!

Hi, I used a dataset X generated by the sin function to predict Y (generated by the cos function), with K set to 12. When validating, the loss = nan, and I don't know why. The whole code is shown in the attached screenshots.

maxjcohen commented 2 years ago

Hi, I don't see directly where a NaN could come from; I encourage you to debug during the validation loss computation in order to see which tensor or function is malfunctioning.
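A hedged sketch of what such a check could look like inside the validation loop (net, dataloader_val and loss_function are placeholder names from this discussion, not the repo's exact training code; the loss argument order may differ in your script):

```python
import torch

with torch.no_grad():
    for x, y in dataloader_val:          # placeholder validation DataLoader
        if torch.isnan(x).any() or torch.isnan(y).any():
            print("NaN already present in the input/target batch")
        y_hat = net(x)
        if torch.isnan(y_hat).any():
            print("NaN produced by the forward pass")
        loss = loss_function(y, y_hat)   # argument order may differ in your code
        if torch.isnan(loss):
            print("NaN introduced by the loss computation")
```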

chuzheng88 commented 2 years ago

> Hi, I don't see directly where a NaN could come from; I encourage you to debug during the validation loss computation in order to see which tensor or function is malfunctioning.

In fact, the loss is already nan during training, as shown in the attached screenshot.

In my opinion, with loss_function = OZELoss(alpha=0.3) the training loss shouldn't be nan, but I don't understand why it is.

Furthermore, I used the compute_loss function to calculate the loss when validating, as shown in the attached screenshot.

chuzheng88 commented 2 years ago

Is my dataset wrong? (See the attached screenshot of the data.)
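One quick check worth running on the raw arrays before wrapping them in a Dataset: NaNs or infinities in the data propagate straight into the loss, and a target built from the tan function blows up near odd multiples of pi/2, which alone can make both training and validation loss nan. The names x_train / y_train below are placeholders for the arrays shown in the screenshot:

```python
import numpy as np

# Basic sanity checks on the raw arrays.
print(x_train.shape, y_train.shape)      # expect (n_samples, K, 2) and (n_samples, K, 1)
print(np.isnan(x_train).any(), np.isnan(y_train).any())   # any NaN in the data?
print(np.isinf(x_train).any(), np.isinf(y_train).any())   # any inf, e.g. tan near pi/2?
print(x_train.dtype, y_train.dtype)      # float32 is the usual expectation
```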