tejaslodaya / timeseries-clustering-vae

Variational Recurrent Autoencoder for timeseries clustering in pytorch
GNU General Public License v3.0

Support for multivariate datasets #6

Closed agrija9 closed 4 years ago

agrija9 commented 4 years ago

Hi tejaslodaya,

I want to run the VRAE on a single-class, unlabelled, multivariate time-series data-set.

Does your implementation also support multivariate data-sets?

I saw in one of your commits to utils.py a comment "add support for multivariate", but I'm not able to see this reflected in the code.

Any insight is highly appreciated!

tejaslodaya commented 4 years ago

Hi @agrija9

> Does your implementation also support multivariate data-sets?

Yes, it does.

Let's break your problem statement down:

  1. Single-class/Multi-class - VRAE only converts sparse "time-series" into dense vectors. To classify on top of them, the vectors need to be passed to some downstream algorithm (k-means in my example).
  2. Unlabelled - You need some labelled data to justify the generated embeddings. This is similar to the case where you have word vectors: you can justify those embeddings semantically by checking analogies such as "king - queen". There has to be a way to know whether VRAE is working properly, so the train set needs labelled data for this.
  3. Multi-variate - There's a parameter called number_of_features. For a multivariate timeseries, pass the number of "variates" here. In the univariate case, pass 1.
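Concretely, the shapes involved can be sketched like this (a minimal illustration with random data; the `(n_samples, sequence_length, number_of_features)` layout and the `VRAE` call itself are assumptions based on the repo, not shown here):

```python
import numpy as np

n_samples, sequence_length = 100, 50

# Multivariate: 3 channels per timestep -> number_of_features = 3
X_multi = np.random.randn(n_samples, sequence_length, 3)

# Univariate: a single channel per timestep -> number_of_features = 1
X_uni = np.random.randn(n_samples, sequence_length, 1)

# This is the value you would pass as number_of_features to VRAE(...)
number_of_features = X_multi.shape[-1]
```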

Refer to this commit - https://github.com/tejaslodaya/timeseries-clustering-vae/commit/e7b57a6748ef18efbd9f026907e85c31817e2b42

Here, the first dimension of the LSTM input is set to num_features (in the Encoder).
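That wiring can be sketched in isolation (a hedged example: the variable names and sizes here are illustrative, not the repo's exact ones):

```python
import torch
import torch.nn as nn

num_features = 3   # the "variates" of the multivariate series
hidden_size = 90

# The LSTM's input_size is the feature count, as in the linked commit
lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size)

# PyTorch's default LSTM layout is (seq_len, batch, input_size)
x = torch.randn(50, 16, num_features)
out, (h_n, c_n) = lstm(x)
# out carries one hidden_size-dim vector per timestep per sample
```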

Let me know if this makes sense.

agrija9 commented 4 years ago

Hi @tejaslodaya ,

Thanks for your reply,

One more remark about the unlabelled data. As you say, labelled data can help us determine how well the VRAE is learning these dense vectors, by running k-means on top of them and comparing our true labels against the clusters k-means finds.

For now, suppose I just want to get the dense vectors of my data using the VRAE, say compressed to 20 dimensions, and then project them to 3 or 2 dimensions (with either PCA or t-SNE). As far as I understand, I don't need labels to do this, right?
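For reference, the projection step I have in mind looks roughly like this (random vectors stand in for the 20-dimensional latents the VRAE would emit; the extraction call itself is assumed):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the 20-dim latent vectors produced by the trained VRAE
z = np.random.randn(200, 20)

# Neither projection uses labels; labels would only color the plot
z_pca = PCA(n_components=2).fit_transform(z)
z_tsne = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(z)
```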

Best

tejaslodaya commented 4 years ago

Hi @agrija9 ,

You're correct. You don't need labels if you have a way of visualizing the clusters. If you look closely at the plots in this project's README.md, you will see a clear distinction between the two clusters.

That said, the separation depends entirely on the data and the hyperparameters you train the model with.

Let me know if you want to know anything else.

agrija9 commented 4 years ago

Hi @tejaslodaya ,

Thanks for your feedback. Another couple of doubts:

All the best!

tejaslodaya commented 4 years ago

Hi @agrija9 ,

To answer your 1st question, have a look at another issue where I commented on possible steps if you have a longer timeseries. Link: https://github.com/tejaslodaya/timeseries-clustering-vae/issues/2#issuecomment-548517460

If you still want to go ahead with raw clustering and feed in 9k dimensions, I would suggest a much deeper network with "gradually" descending layer sizes in the encoder and their mirrored counterparts in the decoder. Note: this network will have a lot of parameters and will need a larger machine and more time to train.

For example, you can go with 9k - 2048 - 512 - 128 - 32 (encoder) and 32 - 128 - 512 - 2048 - 9k (decoder), so the decoder mirrors the encoder back to the input dimension. This way you'll end up with a 32-dimensional embedding. I haven't tried such a large network. Please let me know how your embeddings shape up if you give this a try.
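The descending-layer idea above can be sketched with plain linear layers (an illustration only: the repo's actual encoder/decoder are recurrent, so treat this purely as a shape walkthrough):

```python
import torch
import torch.nn as nn

# Gradually descending encoder sizes and their mirror in the decoder
enc_dims = [9000, 2048, 512, 128, 32]
dec_dims = [32, 128, 512, 2048, 9000]  # mirrors back to the input dim

def stack(dims):
    """Build Linear+ReLU pairs, dropping the trailing activation."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

encoder, decoder = stack(enc_dims), stack(dec_dims)

x = torch.randn(4, 9000)
z = encoder(x)        # 32-dim embedding per sample
x_hat = decoder(z)    # reconstruction back to 9000 dims
```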

I don't know much about kernel-PCA. I had tried DTW for my clustering usecase, but it produced much poorer results compared to VRAE.

tejaslodaya commented 4 years ago

Thanks for the issue. I've included it in the FAQ section to increase visibility.

agrija9 commented 4 years ago

Hi @tejaslodaya,

I will give the ideas you mentioned above a try. Thanks a lot for your feedback. If anything comes up, I'll reach out to you with more questions (:

tejaslodaya commented 4 years ago

Sure. Not a problem