Hi, thanks for your interest.
While the input dimension is only 60, the representation is pretty noisy, and we want to use an RNN-based network to extract a better representation that describes the whole sequence.
Since we use the final hidden state of the bidirectional GRU encoder for the kNN classification at the end, the dimension of that hidden state is 2048 (a hyper-parameter chosen by experiments). However, this dimension is pretty large, and algorithms like kNN can suffer from the 'curse of dimensionality'. It is also not common to use a 2048-dimensional vector for classification in any case.
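To make the dimensions concrete, here is a minimal PyTorch sketch of how the 2048-dimensional state arises (the hidden size of 1024 per direction is an illustrative assumption, not necessarily the exact value in the repo):

```python
import torch
import torch.nn as nn

# Bidirectional GRU over skeleton sequences: 60 input features per frame,
# two directions of 1024 each give a 2048-dimensional final state.
gru = nn.GRU(input_size=60, hidden_size=1024, bidirectional=True, batch_first=True)

x = torch.randn(8, 50, 60)                   # (batch, frames, joint coordinates)
_, h_n = gru(x)                              # h_n: (num_directions, batch, 1024)
feat = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, 2048), the vector fed to kNN
```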
In some previous 'unsupervised representation learning' works using RNN-based models, a fully connected layer is typically added as a classifier on top of the final hidden state to reduce the dimension to the number of classes, and the model is fine-tuned in a supervised setting to show that the representations have been learned. However, this is not completely unsupervised, since even one FC layer can change a lot, and the classification results may rely only on that single FC layer (we actually tried that).
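To be concrete, that supervised probe usually amounts to something like this (a sketch with placeholder values, not any particular paper's code):

```python
import torch
import torch.nn as nn

num_classes = 10                           # placeholder value
final_hidden = torch.randn(8, 2048)        # stand-in for the encoder's final state
classifier = nn.Linear(2048, num_classes)  # the single FC layer in question
logits = classifier(final_hidden)          # (batch, num_classes)
# Training this layer requires action labels, so the pipeline
# is no longer fully unsupervised.
```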
In this work, we would like to avoid any supervision in classification and use kNN to test the learned representation. You can think of this auto-encoder as a dimensionality reduction technique that compresses the final 2048-dimensional representation into a relatively low-dimensional, compact vector for evaluation. Training this auto-encoder is very simple, it converges very fast, and it helps achieve better accuracy.
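A minimal sketch of such a dimension-reduction auto-encoder (the intermediate widths here are my own illustration, not necessarily the repo's exact architecture):

```python
import torch
import torch.nn as nn

# 6-layer MLP auto-encoder: compress the 2048-d encoder state to 256-d,
# trained with a plain MSE reconstruction loss.
encoder = nn.Sequential(
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 256),
)
decoder = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 2048),
)

feat = torch.randn(8, 2048)  # stand-in for the GRU features
loss = nn.MSELoss()(decoder(encoder(feat)), feat)
```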
Thank you very much for your quick reply; this helps a lot.
So up to the final 256-dimensional vector representation, there is no supervision. But kNN is a supervised algorithm, and I can't rely on it to evaluate the representations because I don't have labels in my data.
Do you think the K-means algorithm could give a good unsupervised clustering?
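Concretely, I was thinking of something like this (a rough sketch; the features array is a stand-in for the learned 256-d representations, and the number of clusters is a guess):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = np.random.randn(500, 256)  # stand-in for the learned representations
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)
# the silhouette score needs no ground-truth labels, so it can
# sanity-check the clustering without supervision
print(silhouette_score(features, cluster_ids))
```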
I am trying to train your model to do unsupervised action clustering from untrimmed skeleton sequences, which means I don't have separate action sequences but only one long sequence of body keypoints.
So I am feeding your model short sequences that I sample randomly from the raw sequence. My goal is then to be able to cluster these short sequences into different actions/movements, without supervision.
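Roughly, my sampling looks like this (a sketch; keypoints is a hypothetical (T, 60) array holding the long untrimmed sequence):

```python
import numpy as np

def sample_windows(keypoints, window=50, n_samples=1000, seed=0):
    """Randomly cut fixed-length windows out of one long keypoint sequence."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(keypoints) - window, size=n_samples)
    return np.stack([keypoints[s:s + window] for s in starts])  # (n_samples, 50, 60)
```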
Do you think this has a chance of succeeding? Or should I work first on action segmentation?
Again, thanks a lot for your time.
Oh, I see. It looks like you are actually working on unsupervised temporal segmentation & action recognition? I am not sure whether this work would help, since it assumes there is a complete sequence that a single hidden vector can represent. However, I did try something similar to your current problem before. Maybe you can check out this paper and see if it helps: Clustering and Recognition of Spatiotemporal Features through Interpretable Embedding of Sequence to Sequence Recurrent Neural Networks
Yes, I think this is exactly what I am looking for. Thank you so much!
Hello,
First of all, I would like to thank you for releasing the implementation of your paper.
I have one question about the hidden size of the main auto-encoder.
It seems to me that the UCLA data consists of sequences of shape (50, 60), so the auto-encoder inputs have a dimension of 3000.
But the hidden layer of the auto-encoder is 2048 (very close to 3000), so my question is: how does the auto-encoder learn a useful representation in this case? It seems strange to me that the hidden layer size is not much lower...
Is it that the fixed-weights or fixed-states training strategies manage to generate better representations of the input, but with the same dimensionality as the input?
I understand that you then train another 6-layer auto-encoder on the output features of the first AE to reduce the dimension to 256, and then use these final representations for clustering.
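Just to check my understanding of the evaluation, I picture something like this (a rough sketch with stand-in arrays, not your actual script):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# stand-ins for the final 256-d representations and their action labels
train_feats, test_feats = np.random.randn(400, 256), np.random.randn(100, 256)
train_labels = np.random.randint(0, 10, 400)
test_labels = np.random.randint(0, 10, 100)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_feats, train_labels)
print(knn.score(test_feats, test_labels))
```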
I would be very grateful if you could help me answer these questions.
Have a great day!