tejaslodaya / timeseries-clustering-vae

Variational Recurrent Autoencoder for timeseries clustering in pytorch
GNU General Public License v3.0

Support for multivariate datasets #6

Closed agrija9 closed 4 years ago

agrija9 commented 4 years ago

Hi tejaslodaya,

I want to run the VRAE on a single-class, unlabelled, multivariate time-series data-set.

Does your implementation also support multivariate data-sets?

I saw in one of your commits to utils.py a comment "add support for multivariate", but I'm not able to see this reflected in the code.

Any insight is highly appreciated!

tejaslodaya commented 4 years ago

Hi @agrija9

> Does your implementation also support multivariate data-sets?

Yes, it does.

Let's break your problem statement down:

  1. Single-class/Multi-class - VRAE only converts sparse "time-series" into dense vectors. To classify on top of them, the vectors need to be passed to some downstream algorithm (k-means in my example).
  2. Unlabelled - You need some labelled data to justify the generated embeddings. This is similar to the case where you have word vectors: you can justify those embeddings semantically by checking analogies such as "king - queen". There has to be a way to know whether VRAE is working properly, so the train set needs labelled data for this.
  3. Multi-variate - There's a parameter called number_of_features. For a multivariate timeseries, pass the number of "variates" here. In the univariate case, pass 1.
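Concretely, the shapes involved can be sketched like this (a minimal illustration with random data; the `(n_samples, sequence_length, number_of_features)` layout and the `VRAE` call itself are assumptions based on the repo, not shown here):

```python
import numpy as np

n_samples, sequence_length = 100, 50

# Multivariate: 3 channels per timestep -> number_of_features = 3
X_multi = np.random.randn(n_samples, sequence_length, 3)

# Univariate: a single channel per timestep -> number_of_features = 1
X_uni = np.random.randn(n_samples, sequence_length, 1)

# This is the value you would pass as number_of_features to VRAE(...)
number_of_features = X_multi.shape[-1]
```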

Refer to this commit - https://github.com/tejaslodaya/timeseries-clustering-vae/commit/e7b57a6748ef18efbd9f026907e85c31817e2b42

Here, the first dimension of the LSTM input is set to num_features (in the Encoder).
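That wiring can be sketched in isolation (a hedged example: the variable names and sizes here are illustrative, not the repo's exact ones):

```python
import torch
import torch.nn as nn

num_features = 3   # the "variates" of the multivariate series
hidden_size = 90

# The LSTM's input_size is the feature count, as in the linked commit
lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size)

# PyTorch's default LSTM layout is (seq_len, batch, input_size)
x = torch.randn(50, 16, num_features)
out, (h_n, c_n) = lstm(x)
# out carries one hidden_size-dim vector per timestep per sample
```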

Let me know if this makes sense.

agrija9 commented 4 years ago

Hi @tejaslodaya ,

Thanks for your reply,

One more remark about the unlabelled data. As you say, labelled data can help us determine how well the VRAE is learning these dense vectors, by running k-means on top of them and comparing our true labels against the clusters k-means finds.

For now, suppose I just want to get the dense vectors of my data using the VRAE, say compressed to 20 dimensions, and then project them to 3 or 2 dimensions (with either PCA or t-SNE). As far as I understand, I don't need labels to do this, right?
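For reference, the projection step I have in mind looks roughly like this (random vectors stand in for the 20-dimensional latents the VRAE would emit; the extraction call itself is assumed):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the 20-dim latent vectors produced by the trained VRAE
z = np.random.randn(200, 20)

# Neither projection uses labels; labels would only color the plot
z_pca = PCA(n_components=2).fit_transform(z)
z_tsne = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(z)
```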

Best

tejaslodaya commented 4 years ago

Hi @agrija9 ,

You're correct. You don't need labels if you have a way of visualizing the clusters. If you look closely at the plots in this project's README.md, you will see a clear distinction between the two clusters.

That said, the separation depends entirely on the data and the hyperparameters you train the model with.

Let me know if you want to know anything else.

agrija9 commented 4 years ago

Hi @tejaslodaya ,

Thanks for your feedback. Another couple of doubts:

All the best!

tejaslodaya commented 4 years ago

Hi @agrija9 ,

To answer your 1st question, have a look at another issue where I commented on possible steps if you have a longer timeseries. Link: https://github.com/tejaslodaya/timeseries-clustering-vae/issues/2#issuecomment-548517460

If you still want to go ahead with raw clustering and feed in 9k dimensions, I would suggest a much deeper network with "gradually" descending layer sizes in the encoder and their mirrored counterparts in the decoder. Note: this network will have a lot of parameters and will need a larger machine and more time to train.

For example, you can go with 9k - 2048 - 512 - 128 - 32 (encoder) and 32 - 128 - 512 - 2048 - 9k (decoder), so the decoder mirrors the encoder back to the input dimension. This way you'll end up with a 32-dimensional embedding. I haven't tried such a large network. Please let me know how your embeddings shape up if you give this a try.
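The descending-layer idea above can be sketched with plain linear layers (an illustration only: the repo's actual encoder/decoder are recurrent, so treat this purely as a shape walkthrough):

```python
import torch
import torch.nn as nn

# Gradually descending encoder sizes and their mirror in the decoder
enc_dims = [9000, 2048, 512, 128, 32]
dec_dims = [32, 128, 512, 2048, 9000]  # mirrors back to the input dim

def stack(dims):
    """Build Linear+ReLU pairs, dropping the trailing activation."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

encoder, decoder = stack(enc_dims), stack(dec_dims)

x = torch.randn(4, 9000)
z = encoder(x)        # 32-dim embedding per sample
x_hat = decoder(z)    # reconstruction back to 9000 dims
```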

I don't know much about kernel-PCA. I had tried DTW for my clustering usecase, but it produced much poorer results compared to VRAE.

tejaslodaya commented 4 years ago

Thanks for the issue. I've included it in the FAQ section to increase visibility.

agrija9 commented 4 years ago

Hi @tejaslodaya,

I will give the ideas you mentioned above a try. Thanks a lot for your feedback. If anything comes up, I'll reach out to you with more questions (:

tejaslodaya commented 4 years ago

Sure. Not a problem