maxjcohen / transformer

Implementation of Transformer model (originally from Attention is All You Need) applied to Time Series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0

Understand the dataset dimension #28

Closed clsx524 closed 3 years ago

clsx524 commented 3 years ago

I am using the npz_check function to generate the npz file. Before it dumps the data to npz, I printed out the dimensions of R, Z and X: R is (7500, 19), X is (7500, 8, 672), and Z is (7500, 18, 672). There are 7500 rows and 672 entries per time series, as described by the challenge; 19, 8 and 18 are the numbers of labels for R, X and Z defined in labels.JSON. I am wondering why R is not defined with 672 entries: is there any particular reason to define it like this?

The npz_check function and these variables are defined in this file: https://github.com/maxjcohen/ozechallenge_benchmark/blob/master/src/utils.py#L218
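
For reference, a quick way to double-check what ended up in the generated archive (the file name below is a placeholder, not the benchmark's actual output name):

```python
import numpy as np

# Print the shape of every array stored in the generated .npz file.
with np.load("dataset.npz") as data:
    for key in data.files:
        print(key, data[key].shape)  # e.g. R (7500, 19), Z (7500, 18, 672), X (7500, 8, 672)
```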

maxjcohen commented 3 years ago

Hi,

Any issue regarding the data challenge should be posted in the data challenge repo that you just linked, to keep things tidy.

Now regarding your question: R contains characteristics of the building that do not evolve over time, so there is no reason to add the 672-length time dimension. In the preprocessing of the benchmark, and of this repo, I tile R in order to easily concatenate all input tensors; this is just an implementation trick.
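
For illustration, a minimal sketch of that trick with NumPy (shapes taken from the question; variable names are placeholders, not the benchmark's actual code):

```python
import numpy as np

R = np.random.rand(7500, 19)       # static building characteristics
Z = np.random.rand(7500, 18, 672)  # time-dependent variables

# Repeat the static variables along a new time axis:
# (7500, 19) -> (7500, 19, 1) -> (7500, 19, 672)
R_tiled = np.tile(R[:, :, np.newaxis], (1, 1, Z.shape[-1]))

# Now R and Z share the time dimension and can be stacked on the feature axis.
x = np.concatenate([R_tiled, Z], axis=1)
print(x.shape)  # (7500, 37, 672)
```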

clsx524 commented 3 years ago

Ah, I missed the tile function in this repo. Now I get the right dimension for R.

In the train notebook of this repo, the x and y read from the dataset have dimensions (7500, 18, 691) and (7500, 8, 672) respectively. I understand it as: there are 7500 rows, 18 labels for Z, 672 + 19 (latent labels for R) = 691 entries in one time series, and 8 labels to predict.

When it comes to calculating the loss at loss = loss_function(y.to(device), netout), netout is supposed to have the same dimension as y, which is (7500, 8, 672). So my question is: where in the network does the transformation from 18 to 8 happen?

Btw, great work, and I appreciate that you open sourced it.

maxjcohen commented 3 years ago

I think you misunderstood the role of R here: the 19 variables are not to be added to the time dimension. If you choose to tile R, you would obtain a tensor of shape (7500, 19, 672), which you could then concatenate with Z to obtain a (7500, 19+18, 672) tensor.

To understand how the Transformer converts a tensor of dimension 18 to 8, you should take a look at the original paper, or one of the detailed analyses: The Annotated Transformer and The Illustrated Transformer.
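
In short, the feature dimension is changed by learned linear projections around the attention blocks, not by the attention itself. A toy sketch of the idea in PyTorch (d_model and the layer names are illustrative, not this repo's exact architecture; note the time axis is moved to the middle compared to the (N, features, T) layout of the npz arrays):

```python
import torch
import torch.nn as nn

batch, seq_len = 32, 672
d_input, d_model, d_output = 37, 64, 8   # 19 + 18 input features, 8 targets

x = torch.rand(batch, seq_len, d_input)  # (N, T, 37) after concatenating R and Z

embed = nn.Linear(d_input, d_model)      # input embedding: 37 -> d_model
# ... encoder / decoder attention layers operate on d_model here ...
project = nn.Linear(d_model, d_output)   # output projection: d_model -> 8

netout = project(embed(x))               # (N, T, 8), matching y's feature size
print(netout.shape)                      # torch.Size([32, 672, 8])
```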

clsx524 commented 3 years ago

Thanks. I figured out the issue.

diegoquintanav commented 3 years ago

@maxjcohen did you see any improvement during training by providing the time-independent sequences in R? As I see it, the distribution of attention weights should be uniform in these sequences. Can you share more about the intuition behind this? Thanks!

maxjcohen commented 3 years ago

Hi, I added variables contained in R at each time step to simplify the implementation. Otherwise, the input vector could have had a different dimension at different time steps.

I haven't looked at the weights yet, but I agree they should be uniform, although you could argue that some cyclic patterns could appear. For instance, the "window area" variable matters mostly during sunny hours, which could show up in the Transformer's weights.
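
If you do end up looking at the weights, here is a generic way to visualize a single attention map with matplotlib (this assumes you have already extracted a (T, T) matrix from one head; how to pull it out of the model depends on the implementation):

```python
import numpy as np
import matplotlib.pyplot as plt

# `attn` stands in for a (T, T) attention matrix from one head:
# rows = query time steps, columns = key time steps.
attn = np.random.rand(672, 672)
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalize, like a softmax output

plt.imshow(attn, aspect="auto", cmap="viridis")
plt.xlabel("key time step")
plt.ylabel("query time step")
plt.colorbar(label="attention weight")
plt.title("Attention map (daily/cyclic patterns would show up as bands)")
plt.show()
```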