gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Extraction of time series representations after pre-training without fine-tuning on downstream task #31

Open meesamnaqvi opened 1 year ago

meesamnaqvi commented 1 year ago

Hi,

First of all, thank you for sharing your work; it's a very nice implementation.

I need to extract the time series representations Z_t mentioned in the unsupervised pre-training section of the paper. Could you please advise me on the best way to extract these representations after pre-training on a custom dataset?

gzerveas commented 1 year ago

Hello,

Thank you :) I have described in some detail some good ways of doing it here. The description may look long, but I think only a few lines of code here and there need to be added. I can help if you have further questions. If you end up implementing it, please consider submitting a pull request :)

meesamnaqvi commented 1 year ago

Thanks for the pointers; you described it well in the linked post. I should have gone through the existing issues before creating a new one, sorry about that. Sure, I will give it a shot and create a pull request.

I have one question before I start working on the code: after making the described changes, will I be able to extract embeddings from an already pre-trained model, i.e., in a test/inference mode? I want to use the model for time series similarity comparison, generating embeddings for the available samples and then for future samples one by one. So, if I want to generate the embedding for just a single sample (I have 3 or 4 samples each day), will that be possible, or will embeddings only be generated during pre-training?

Thank you in advance for your time and effort.

gzerveas commented 1 year ago

No worries, it's hard to discover content in closed issues :) So, if I understand correctly what you want to do, I think I addressed it in my older post, but let me do it here again, more explicitly:

Yes, you could do this embedding extraction as a separate operation, i.e., not at the end of pre-training, but by loading an already stored model checkpoint through --load_model <ckpt_filename>. You could get the embeddings of whatever set of files you define as a "test set" through the flag --test_pattern (e.g. use the file(s) containing your new and old data), simply by running the main.py script with the option --test_only testset. This would skip training and instead only iterate over the "test set", evaluating the loaded model and extracting the embeddings. The part of main.py where this happens starts at line 184.

If you don't want to compute evaluation metrics (or don't have labels for your dataset), and only want the embeddings (which I think is what you want), then you could add a command line argument --extract_embeddings_only in options.py, which, instead of running lines 194-200 in main.py, would run the embedding extraction code. So at line 194 you could do:

```python
if config['extract_embeddings_only']:
    embeddings_extractor = UnsupervisedRunner(model, loader, device, loss_module=None)
    with torch.no_grad():
        embeddings = embeddings_extractor.extract_embeddings()
    torch.save(embeddings, path)
    return
else:
    # <previous code, lines 194-200>
```

To implement this, I would replicate UnsupervisedRunner.evaluate(self, epoch_num=None, keep_all=True) into a new UnsupervisedRunner.extract_embeddings(self) member function, which would be identical except that it would not compute the loss, metrics, etc., and would only do the feature extraction in the way I describe in my previous post.
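As a very rough, hypothetical sketch of what such a member function might look like (it assumes torch is already imported in the module, that the runner exposes self.model, self.dataloader and self.device as in the existing code, and that the batch unpacking and the call producing the per-step representations are adapted to mirror what evaluate() actually does):

```python
def extract_embeddings(self):
    """Iterate over the dataloader and collect per-step representations,
    without computing loss or metrics (hypothetical sketch)."""
    self.model.eval()
    all_embeddings = []
    all_IDs = []
    for batch in self.dataloader:
        # Assumed batch layout, mirroring evaluate(); adjust to the actual one.
        X, targets, target_masks, padding_masks, IDs = batch
        X = X.to(self.device)
        padding_masks = padding_masks.to(self.device)
        # Stand-in for however the representations Z_t are exposed
        # (see the earlier post for where to take them inside the model).
        embeddings = self.model(X, padding_masks)
        all_embeddings.append(embeddings.cpu())
        all_IDs.extend(IDs)
    # Assuming batch-first tensors of shape (batch, seq_length, d_model)
    return {'embeddings': torch.cat(all_embeddings, dim=0), 'IDs': all_IDs}
```

Since the call in main.py above is already wrapped in torch.no_grad(), no gradient bookkeeping is needed inside the function itself.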

meesamnaqvi commented 1 year ago

Thank you so much for your detailed response. Today, I spent some time debugging to understand the code better and successfully extracted the embeddings based on your previous reply. You were right; it is pretty straightforward.

I am also trying to make embedding extraction as quick as possible. Initially, I might replicate the main.py functionality in a separate file to see what can be skipped, and then integrate the result back into main.py.

Once I am done, I will create a pull request.

gzerveas commented 1 year ago

Sounds great, thanks!

meesamnaqvi commented 1 year ago

Hi, I am done with the code modifications, after testing on both an existing and a custom dataset.

Before making the pull request, I would like your opinion on a feature I am thinking of adding. Currently, for a given input, the model outputs an embedding of size max_seq_len * d_model.

I want to add a pooling feature with different modes (None for the default per-step embeddings, max, mean, mean_sqrt_len, last), so that there is an option to output a single vector per sample. We could then pass the mode to the --extract_embeddings_only argument and get either the default embeddings or a single vector, depending on the user's choice.

I want to confirm that, in the case of the single-vector modes, the operation should be applied along axis 0, right? That would give a final vector of size d_model (the size of the embedding vector).

gzerveas commented 1 year ago

Hi, thank you so much for working on this! It is a valuable contribution! I will review the pull request and merge your code as soon as I find some time.

Regarding your question: the feature you suggest makes a lot of sense; aggregation is something many users may wish to do early on, e.g. to save space, instead of storing the embeddings of every single time step. The mode/option last might as well be first (the model is bi-directional, at least with the implemented training schemes). The reason first may be slightly preferable is that a popular way of aggregating the sequence is to prepend a special embedding as a prefix (corresponding to the [CLS] token in NLP) and use the output representation of this prefix as the only embedding from which to predict the output, e.g. for classification. This mode of prediction has not been implemented yet, however, so the suggestion is purely for "future compatibility".

The axis along which the aggregation operation should be applied is the one corresponding to seq_length. This is indeed axis 0 if you are taking the embeddings from line 240, or axis 1 if you are taking them from line 242. I think the latter (i.e. the embeddings after the activation function) would be preferable, as these are the embeddings from which the final prediction originates.
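To make this concrete, here is a minimal sketch of what such a pooling helper could look like (the function name and signature are hypothetical; it assumes the embeddings tensor is batch-first, i.e. (batch, seq_length, d_model) as for the post-activation embeddings from line 242, so the aggregation is along axis 1):

```python
import torch

def pool_embeddings(embeddings, mode=None):
    """Aggregate per-step embeddings of shape (batch, seq_length, d_model)
    along the seq_length axis. mode=None returns them unchanged."""
    if mode is None:
        return embeddings                       # (batch, seq_length, d_model)
    if mode == 'max':
        return embeddings.max(dim=1).values     # (batch, d_model)
    if mode == 'mean':
        return embeddings.mean(dim=1)
    if mode == 'mean_sqrt_len':
        # Sum divided by sqrt of the sequence length, as in some
        # sentence-embedding pooling schemes.
        return embeddings.sum(dim=1) / (embeddings.shape[1] ** 0.5)
    if mode == 'last':
        return embeddings[:, -1, :]
    if mode == 'first':
        return embeddings[:, 0, :]
    raise ValueError(f"Unknown pooling mode: {mode}")
```

In practice you would probably also want to exclude padded positions (using the padding masks) from the max/mean aggregation, so that they do not skew the result.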

meesamnaqvi commented 1 year ago

Great, thanks; I will wait for your response.

Regarding extraction of different types of embeddings:

I will add the feature locally and create a new pull request once you have reviewed the current one. Thanks for the recommendation of taking the embeddings from line 242; I think you are right that it makes more sense to take the embeddings after this line. Right now, I am taking the embeddings before this line, but I will change that in the next update.