gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

How to perform the mask as in Figure 1? #22

Closed xiqxin1 closed 1 year ago

xiqxin1 commented 1 year ago

Hello, George

Thank you very much for your code. Could you tell me how to perform the mask as in Figure 1? I found many functions in dataset.py, but it is difficult for me to implement them without any guidance. Could you provide some more details?

Thanks a lot.

gzerveas commented 1 year ago

Hi, this mask is already implemented and used when you specify --task imputation, which is the default (i.e. you don't actually need to specify it). This is the desired behavior when you are doing pre-training; some segments per variable will be masked, and the model will be asked to predict the masked values. Internally, this is implemented by the ImputationDataset in src/datasets/dataset.py. It employs the noise_mask function in the same file to generate a boolean numpy array with the same shape as the input feature array X, with 0s at places where a feature should be masked.
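For illustration, here is a minimal sketch of the kind of geometric-segment mask described above and how it is applied to X. This is not the repository's noise_mask implementation; the function name and the parameter values (masking ratio 0.15, mean masked-segment length 3) are illustrative assumptions.

```python
import numpy as np

def geometric_noise_mask(seq_len, num_feats, masking_ratio=0.15, mean_mask_length=3):
    """Sketch: boolean mask of shape (seq_len, num_feats), False (0) where a value
    should be masked. Masked/unmasked segment lengths are drawn from geometric
    distributions so that, on average, `masking_ratio` of each column is masked."""
    mask = np.ones((seq_len, num_feats), dtype=bool)
    for f in range(num_feats):
        t = 0
        masking = np.random.rand() < masking_ratio  # random starting state
        while t < seq_len:
            if masking:
                length = np.random.geometric(1 / mean_mask_length)
                mask[t:t + length, f] = False
            else:
                # unmasked segments are longer on average, to keep the overall ratio
                length = np.random.geometric(masking_ratio / (mean_mask_length * (1 - masking_ratio)))
            t += length
            masking = not masking
    return mask

# Usage: zero-out masked positions of a feature array X of shape (seq_len, num_feats);
# the model is then trained to predict the values at those positions.
X = np.random.randn(100, 6)
mask = geometric_noise_mask(*X.shape)
X_masked = X * mask
```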

xiqxin1 commented 1 year ago

Hi George,

Thanks a lot for your reply; I modified the parameters as you recommended.

I have one more question and would appreciate your advice. My dataset is acceleration data (float) with length*dim = 1800*6. For pre-training, I compute the MSE loss only at the masked positions, as recommended in your paper.

During fine-tuning, I replaced the last linear layer with a classification layer; unfortunately, the result was poor. I'm not sure whether my implementation is wrong.

I was wondering (1) whether the data length is too long and I should downsample, (2) whether a simple MSE loss is not enough, (3) whether the dataset is too small (300 training samples), or (4) whether I should modify the output layers or add adapter components.

Thanks again for your time.

gzerveas commented 1 year ago

Hi @xiqxin1, there can be several factors that affect performance. I am not sure what your implementation looks like, but here are some tips:

With respect to preprocessing, if the signal appears very similar when downsampled (for example, you think that it looks oversampled, i.e. there is some redundancy or high frequency noise in the data) then downsampling is certainly a good first thing to try. Low-pass filtering could also help with noisy data.

When pretraining, it is useful to set aside some data as a validation set to monitor the loss, in order to make sure that you avoid under-/overfitting, that is in order to know for how long you should pretrain. Generally speaking, it helps if you first try to get some reasonable performance when training from scratch (i.e. without pretraining), in order to find out what range of hyperparameters works (e.g. batch size, number of layers, d_model), etc. You can then move on to pretraining and fine-tuning, starting from similar hyperparameters and experimenting with a few changes.

You will see that the above factors can have a large effect, so I recommend first experimenting with those. If performance still looks poor (assuming that everything is fine with the implementation), and you want to look into modifying the architecture, the module described in the paper (but not implemented in this code) has worked well in many cases: that is, add a 1D convolutional layer to extract features from the input multivariate series. This works directly for classification - for pretraining, you would need a transposed convolutional layer at the end, to make value-level predictions.
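For illustration, here is a minimal PyTorch sketch of such a convolutional front-end and a transposed-convolution head. This module is not part of the released code; the class names, layer sizes, and kernel size are illustrative assumptions, and inputs are assumed to have shape (batch, seq_len, feat_dim).

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Sketch: extract local features from a multivariate series with a 1D convolution
    before the transformer encoder. (batch, seq_len, feat_dim) -> (batch, seq_len, d_model)."""
    def __init__(self, feat_dim, d_model, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, d_model, kernel_size, padding=kernel_size // 2)

    def forward(self, x):             # x: (batch, seq_len, feat_dim)
        x = x.transpose(1, 2)         # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))
        return x.transpose(1, 2)      # back to (batch, seq_len, d_model)

class DeconvHead(nn.Module):
    """Sketch: transposed-convolution head that maps encoder outputs back to
    value-level predictions for the imputation/pretraining objective."""
    def __init__(self, d_model, feat_dim, kernel_size=5):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(d_model, feat_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, z):             # z: (batch, seq_len, d_model)
        z = z.transpose(1, 2)
        z = self.deconv(z)            # (batch, feat_dim, seq_len), same length as the input
        return z.transpose(1, 2)      # (batch, seq_len, feat_dim)
```

For classification, only the front-end is needed before the encoder; the transposed-convolution head is only required when pretraining, to recover per-timestep, per-variable predictions.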

xiqxin1 commented 1 year ago

> Hi, this mask is already implemented and used when you specify --task imputation, which is the default [...]

Hi George,

Sorry to bother you again. I notice that noise_mask does not seem to appear in the TSTransformerEncoder class (i.e., the pre-training model); I can only find padding_masks there. Could you tell me how the noise mask (attn_mask?) works in your code?

These are my command-line arguments: --output_dir experiments --comment "regression from Scratch" --name BeijingPM25Quality_fromScratch_Regression --records_file Regression_records.xls --data_dir BeijingPM25Quality/ --data_class tsra --pattern TRAIN --val_pattern TEST --epochs 100 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation

xiqxin1 commented 1 year ago

> [...] make sure that you avoid under-/overfitting, that is in order to know for how long you should pretrain.

Thank you, George, your suggestions help me a lot! I will try them! I think a pretrained model should be better than a randomly initialized model in most cases.

gzerveas commented 1 year ago

The noise_mask function is in src/datasets/dataset.py, and it is not used by the model TSTransformerEncoder itself. It is an operation used by the ImputationDataset to zero-out data samples on the fly, while the DataLoader (see line 216 in src/main.py) iterates over the dataset. This DataLoader yields batches which contain these masks as target_masks (because it is necessary to track where the input samples have been masked). The attention masks used by the model are the padding_masks, which mark what part of the input is padding (instead of actual data) and thus has to be ignored.
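To make the batch layout concrete, here is an illustrative pretraining step (not the repository's trainer). It assumes the batch contains the already-masked inputs X, the original values as targets, the noise mask as target_masks (True where values were masked), and padding_masks, and that the model takes (X, padding_masks) like TSTransformerEncoder.

```python
import torch

def pretraining_step(model, batch, optimizer):
    """Illustrative sketch of one imputation/pretraining step under the batch
    layout described above; not the repository's training loop."""
    X, targets, target_masks, padding_masks = batch
    optimizer.zero_grad()
    predictions = model(X, padding_masks)       # padding mask handled inside the model
    target_masks = target_masks.float()
    # Masked MSE: only positions corrupted by the noise mask contribute to the loss.
    diff = (predictions - targets) * target_masks
    loss = (diff ** 2).sum() / target_masks.sum()
    loss.backward()
    optimizer.step()
    return loss.item()
```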

xiqxin1 commented 1 year ago

Hi, George

Thank you very much for the detailed reply! Now I understand that the MSE loss is computed only at the positions indicated by noise_mask.

However, a few things are still unclear to me. As described in your paper, the input data are masked through noise_mask. If I were to implement this in your code, would it be a new src_key_padding_mask=noise_mask*padding_masks passed to the transformer_encoder?

In ts_transformer.py, line 240: self.transformer_encoder(inp, src_key_padding_mask=~(noise_mask*padding_masks))

Or could you tell me which part applies the noise_mask? I recall that BERT masks 15% of the input tokens before the encoder, which seems similar to your implementation.

Thank you very much.

gzerveas commented 1 year ago

Hi,

There are two different masks, the noise mask, which is related to corrupting the data, and the attention mask used for ignoring the padding. They are generated, kept and treated completely separately in the code.

The noise mask is a special pattern (on average affecting 15% of the data, but in a different way than BERT) that is applied to zero-out the data during the batch collation phase (i.e. collating individual samples into batches) by the DataLoader, specifically by applying the collate_unsuperv function in dataset.py, line 193. It is called target_masks because it is the same mask that is used on the target sequence vectors, to make sure that we are only considering the corrupted values. Therefore, by the time you feed the input to your model, this noise pattern has already been applied to your input data tensor X (in line 225), and you don't need to do anything else.

The padding masks are generated as padding_masks by the collation function, and are fed to the src_key_padding_mask argument of the transformer encoder to prevent the self-attention layers from attending to the padding. Again, this is already implemented, so you don't have to do anything else to make this work.
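To summarize how the two masks end up in a batch, here is a simplified sketch of a collation function in the spirit of what is described above. It is not the repository's collate_unsuperv; the function name, sample layout, and mask conventions are illustrative assumptions.

```python
import torch

def collate_imputation(samples, max_len=None):
    """Simplified sketch of the collation step: pad variable-length samples, apply the
    noise mask to the inputs, and return everything the training step needs.

    Each element of `samples` is assumed to be a pair (X, noise_mask) of numpy arrays
    of shape (seq_len, feat_dim); noise_mask is 0/False where a value should be hidden."""
    lengths = [X.shape[0] for X, _ in samples]
    max_len = max_len or max(lengths)
    feat_dim = samples[0][0].shape[1]

    X_batch = torch.zeros(len(samples), max_len, feat_dim)          # masked inputs
    targets = torch.zeros(len(samples), max_len, feat_dim)          # original values
    target_masks = torch.zeros(len(samples), max_len, feat_dim, dtype=torch.bool)
    padding_masks = torch.zeros(len(samples), max_len, dtype=torch.bool)

    for i, (X, noise_mask) in enumerate(samples):
        L = X.shape[0]
        X_t = torch.as_tensor(X, dtype=torch.float32)
        mask_t = torch.as_tensor(noise_mask, dtype=torch.bool)
        targets[i, :L] = X_t
        X_batch[i, :L] = X_t * mask_t      # zero-out the values hidden from the model
        target_masks[i, :L] = ~mask_t      # True exactly where values were masked
        padding_masks[i, :L] = True        # here True marks real (non-padding) positions;
                                           # flip it if your model expects the opposite convention
    return X_batch, targets, target_masks, padding_masks
```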

xiqxin1 commented 1 year ago

Thank you very much! I hadn't set breakpoints in collate_unsuperv, which is why I was wondering where you masked the input (or targets).

I will try it on my data. Thanks a lot for your very nice answers!