gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Need your suggestions, Thanks #10

Closed chuzheng88 closed 2 years ago

chuzheng88 commented 2 years ago

Hi, I have read the paper (A Transformer-based Framework for Multivariate Time Series Representation Learning); it is very meaningful work. I want to use this project for a prediction task. My task is a regression problem, and my dataset can be described as follows:

X = [ [[time series sequence 1]], [[time series sequence 2]], ..., [[time series sequence s]] ]
y = [ [[label 1]], [[label 2]], ..., [[label s]] ]

where the sequences are not all the same length. So I want to use just your model definition (in [ts_transformer.py](https://github.com/gzerveas/mvts_transformer/blob/master/src/models/ts_transformer.py)) and pad all sequences in my dataset to the same length before feeding them into the TST model.

Is there anything I need to pay attention to for this task? Or do you have other suggestions?

Thanks.

gzerveas commented 2 years ago

Hi, if I understand correctly, your concern is whether you need to do the padding yourself. The answer is no, the code will handle that for you: the function collate_unsuperv inside dataset.py will do the padding and will also create the padding masks. You only need to provide a maximum sequence length parameter ( max_seq_len ) that makes sense for your data; for example, it could be set to the length of your longest sequence.
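For intuition, here is a minimal sketch of what that collate step does conceptually. This is only an illustration, not the actual code of collate_unsuperv, and the convention that True marks real time steps is an assumption you should verify against dataset.py:

```python
import torch

def pad_and_mask(sequences, max_seq_len):
    """Pad variable-length (seq_len, feat_dim) tensors to a common length and
    build boolean padding masks (assumed convention: True = real data, False = padding)."""
    feat_dim = sequences[0].shape[-1]
    X = torch.zeros(len(sequences), max_seq_len, feat_dim)
    padding_masks = torch.zeros(len(sequences), max_seq_len, dtype=torch.bool)
    for i, seq in enumerate(sequences):
        end = min(seq.shape[0], max_seq_len)   # sequences longer than max_seq_len are trimmed
        X[i, :end] = seq[:end]
        padding_masks[i, :end] = True
    return X, padding_masks
```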

What you should probably do first is look at the statistics of your sequence lengths. What is the mean and variance of lengths? The computational cost of the transformer is O(N^2) with respect to the input sequence length, so if you have only a couple of very long sequences, but most of them are quite short, then it may be worth setting max_seq_len to a smaller value that will accommodate most (but not all) examples - the longer ones will be trimmed down to max_seq_len.
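A quick way to look at those statistics (here X stands for your own list of (seq_len, feat_dim) arrays):

```python
import numpy as np

lengths = np.array([len(seq) for seq in X])   # X: list of per-example arrays
print(f"mean={lengths.mean():.1f}, std={lengths.std():.1f}, "
      f"min={lengths.min()}, max={lengths.max()}")
# e.g. pick a length that covers ~95% of examples instead of the absolute maximum
print("95th percentile:", int(np.percentile(lengths, 95)))
```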

chuzheng88 commented 2 years ago

Great, thank you for your reply. First I will compute the statistics of the sequence lengths, and then decide on max_seq_len. From memory, the min/max sequence lengths are around 50 and 256, respectively, so my concern is that such a large gap will affect prediction performance whether I pad to the maximum sequence length or truncate: the sequences that get cut may contain very valuable information.

gzerveas commented 2 years ago

In your case it's absolutely fine to use max_seq_len = 256. Depending on your hardware, problems start when the length is on the order of N > 4000. In those cases, if longer sequences cannot be trimmed, then one should consider downsampling, or extracting features with a 1D convolutional layer first.
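As an illustration of the convolutional front-end idea (this module is not part of the repo, just a sketch of the suggestion):

```python
import torch.nn as nn

class ConvDownsampler(nn.Module):
    """Strided 1D convolution that shortens very long sequences before the transformer."""
    def __init__(self, feat_dim, d_model, stride=4, kernel_size=8):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, d_model, kernel_size, stride=stride,
                              padding=kernel_size // 2)

    def forward(self, x):              # x: (batch, seq_len, feat_dim)
        x = x.transpose(1, 2)          # Conv1d expects (batch, channels, seq_len)
        x = self.conv(x)               # (batch, d_model, ~seq_len/stride)
        return x.transpose(1, 2)       # back to (batch, shorter_len, d_model)
```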

However, also in your case, if the variance in lengths is really large, then one thing you can try is to aggregate all final representations per time step using mean-pooling (i.e. Z = mean(z1, ..., zN)), instead of concatenating all representations, as currently done (i.e. Z = [z1, ..., zN]). This may lead to better performance. Try both and see what works best; mean-pooling also uses fewer parameters, because the output layer no longer scales with the sequence length.
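A rough sketch of the two aggregation options, assuming Z holds the final per-time-step representations and padding_masks marks real time steps with True (the concatenation branch is only my approximation of what the current output module does):

```python
import torch

def aggregate(Z, padding_masks, mode="mean"):
    """Z: (batch, seq_len, d_model) final representations.
    padding_masks: (batch, seq_len) boolean, True for real time steps."""
    if mode == "mean":
        mask = padding_masks.unsqueeze(-1).float()
        # average only over real time steps -> (batch, d_model)
        return (Z * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    # concatenation: zero out padded steps, then flatten -> (batch, seq_len * d_model)
    Z = Z * padding_masks.unsqueeze(-1)
    return Z.reshape(Z.shape[0], -1)
```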

chuzheng88 commented 2 years ago

Thank you for your reply. I have looked at the statistics of the sequence lengths in my dataset; they are shown in the attached images. In my opinion, I will pad the shorter sequences to the maximum sequence length and then feed them to the Transformer model, because the maximum sequence length is not very long.

In addition, I'm wondering whether the padding operation has an effect on the prediction error, because the padding values may introduce harmful information for prediction tasks.

I'm also very interested in the input and output of your model when it is pre-trained with unlabeled data. I understand that the input and output of an autoencoder are the same features, and that minimizing the error between input and output forces the middle layers of the network to learn latent features. However, how does the pre-trained model proposed in your paper learn latent features from unlabeled data?

gzerveas commented 2 years ago

The input values for time steps corresponding to padding (and all respective representations in deeper layers) are completely ignored throughout the computation of all intermediate representations. Thus, during unsupervised pre-training, they play no role whatsoever. For (global, example-level) regression and classification, they are additionally set to 0 at the final layer, before concatenating representations and making a prediction. As I wrote in my previous comment, in case of a large variance in lengths, this might affect performance (you would have to check), in which case the remedy would be to aggregate all final representations per time step using mean-pooling (i.e. Z = mean(z1, ..., zN)), instead of concatenating all representations, as currently done (i.e. Z = [z1, ..., zN]).
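To illustrate the mechanism with PyTorch's stock encoder (the repo uses its own encoder layer, but the padding-mask idea is the same; note that PyTorch's src_key_padding_mask marks positions to ignore with True, hence the inversion):

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

X = torch.randn(4, 256, 64)                     # (batch, seq_len, d_model)
keep = torch.zeros(4, 256, dtype=torch.bool)
keep[:, :200] = True                            # pretend the first 200 steps are real

# padded time steps are never attended to, so they cannot influence other representations
out = encoder(X, src_key_padding_mask=~keep)
```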

Regarding the unsupervised objective, you can consider this paper's masking scheme as a very specific way of doing autoencoding, with a special noise distribution designed to extract meaningful representations from most common types of multivariate time series data.
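A stripped-down sketch of that objective: hide some input values, ask the model to reconstruct the full series, and compute the loss only on the hidden positions. In the paper the hidden positions form contiguous segments per variable (with geometrically distributed lengths) rather than independent points, and the model call below is a placeholder, so treat this purely as an illustration:

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(model, X, input_mask):
    """X: (batch, seq_len, feat_dim) clean input.
    input_mask: same shape, boolean, True at values hidden from the model."""
    X_corrupted = X * ~input_mask          # hidden values set to 0 at the input
    X_hat = model(X_corrupted)             # model predicts the full sequence (placeholder call)
    return F.mse_loss(X_hat[input_mask], X[input_mask])   # loss only on hidden values
```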

chuzheng88 commented 2 years ago

Thank you for your reply, and sorry for my late response. Regarding the unsupervised objective, I understand the masking scheme in your paper as a very specific way of doing autoencoding. However, I still don't understand the input and output of the network during training. For example, if I want to train a model (a "pre-trained model") for a specific task (regression or classification), how should I construct the network's input and output, respectively?

gzerveas commented 2 years ago

You first have to define a dataset class in datasets/data.py and load your data into a pandas dataframe. The code in datasets/dataset.py will take care of the rest. I have added more explanations in the README file.
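As a rough, hypothetical sketch of what such a class might look like (the attribute names below follow what the existing classes in data.py appear to expose; copy an existing class such as TSRegressionArchive and check the README for the exact contract):

```python
import numpy as np
import pandas as pd

class MyData:
    """Hypothetical dataset class sketch for datasets/data.py (assumed attribute names:
    all_df, feature_df, labels_df, all_IDs, max_seq_len)."""

    def __init__(self, sequences, labels):
        # sequences: list of (seq_len_i, feat_dim) arrays; labels: one target per example.
        # all_df has one row per time step; its index is the example ID, repeated per step.
        frames = [pd.DataFrame(seq, index=[i] * len(seq)) for i, seq in enumerate(sequences)]
        self.all_df = pd.concat(frames)
        self.labels_df = pd.DataFrame({"target": labels})   # indexed by example ID
        self.all_IDs = self.all_df.index.unique()
        self.feature_names = self.all_df.columns             # use all columns as features
        self.feature_df = self.all_df[self.feature_names]
        self.max_seq_len = int(self.all_df.groupby(level=0).size().max())

# toy usage with random variable-length sequences
seqs = [np.random.randn(np.random.randint(50, 257), 3) for _ in range(10)]
data = MyData(seqs, labels=np.random.randn(10))
print(data.max_seq_len, len(data.all_IDs))
```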