@Snehil-Shah This is very good progress :)
I have some general improvements/ideas about this project to share and then specific things about feluda.
Largely, the project is composed of the following components.
Generally, go through the wiki to learn more. I think this should be pretty useful for setting up Feluda locally - https://github.com/tattle-made/feluda/wiki/Setup-Feluda-Locally
A quick note about operators. So far our operators work on individual items, but for this project we might, for the first time, be figuring out how to make operators that work on collections. So that part would be novel, and feel free to think about how you'd solve it.
Hi @Snehil-Shah , great work!!
I also wanted to ask one thing.
I see that you have used the code from Feluda's Video Vector Operator to select the 5 key frames, so that's great!
I understand you have used the code from the t-SNE notebook, but I just wanted to know what happens when we don't reduce the 512-dimensional vector representation to 2 dimensions and instead reduce it to, say, 5-7 dimensions. Can you investigate this and see what other ways there are to go about it? How can the results be improved by such hyperparameter tuning? (I don't have great expertise, hence just asking this question out of curiosity)
```python
from sklearn.manifold import TSNE

tsne_embeddings = TSNE(n_components=2, learning_rate=150, perplexity=20, angle=0.2, verbose=2).fit_transform(X)
```
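For reference, a rough sketch of what a 5-dimensional reduction could look like with scikit-learn (assuming `X` is the same embedding matrix as above); note that the default Barnes-Hut approximation only supports up to 3 output dimensions, so higher targets need the slower exact method:

```python
from sklearn.manifold import TSNE

# Barnes-Hut t-SNE only supports n_components <= 3, so use the exact method for 5 dimensions
tsne_5d = TSNE(
    n_components=5, method="exact", learning_rate=150, perplexity=20, verbose=2
).fit_transform(X)
```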
I read in the observations section that the videos are getting clustered into some groups and some anomalies are also present. I wanted to ask if the videos are being clustered into the labels mentioned in the dataset.
@aatmanvaidya
To my knowledge, we can't easily visualize a 5-7 dimensional vector space (let alone 512). To visually see the different groups and clusters in the data, we have to plot them in a 2D graph (X-Y space), and hence we reduce it to 2 dimensions. Although we can visualize a 3D space and make a 3D plot too, in general it becomes harder to plot, visualize, and interpret high-dimensional vectors in a vector space. But such dimensionality reduction also results in data loss, as the distances between the vectors (both absolute and cosine) are not necessarily preserved in the process. Hence, for tasks like recommendation systems, or finding the most similar results for an input, we would use all 512 dimensions of the vectors to find the nearest neighbors, and would use dimensionality reduction mostly for visualization.
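To illustrate that point, a minimal sketch of similarity search on the full vectors (the array and variable names here are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# embeddings: one 512-dimensional vector per video (illustrative placeholder data)
embeddings = np.random.rand(300, 512).astype("float32")

# Nearest-neighbour search runs on all 512 dimensions, using cosine distance
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings[:1])  # 5 most similar videos to the first one
```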
These are the labels present in the part of the dataset I used (as also mentioned in the notebook)
```python
['ApplyLipstick',
 'ApplyEyeMakeup',
 'BalanceBeam',
 'Archery',
 'BenchPress',
 'Basketball',
 'BaseballPitch',
 'BabyCrawling',
 'BasketballDunk',
 'BandMarching']
```
Just from the visual interpretation of the graph (as circled in the issue description):

- `ApplyLipstick` and `ApplyEyeMakeup` are pretty much clustered together (they are pretty similar categories).
- `BabyCrawling` videos are clustered together.
- `BalanceBeam` and some videos from `Basketball`/`BasketballDunk` that are indoors are clustered together.
- `BaseballPitch` and `Archery` are clustered together.
- `BenchPress` videos are clustered together.
- `BandMarching` videos are clustered together.

Not all original labels are distinctly classified, but it can still group similar videos together.
@aatmanvaidya I went ahead and clustered them into 10 labels (using K-Means) and tried to plot a visualization by shading each label differently (updated the notebook). This is how it went:

The coordinates of the images are different from before, as the `random_state` wasn't set when running t-SNE.
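A minimal sketch of what that step could look like (assuming `embeddings` is the array of averaged video embeddings; this is illustrative, not the notebook's exact code):

```python
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Cluster the full 512-dimensional embeddings into 10 groups
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Reduce to 2D only for plotting; fixing random_state keeps the coordinates reproducible
tsne_embeddings = TSNE(
    n_components=2, learning_rate=150, perplexity=20, angle=0.2, random_state=42
).fit_transform(embeddings)
```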
I wanted to add to the conversation you are having about dimensions that it's consistent with how we have done this in the past. For search we use the multi-dimensional vectors; the dimension reduction is strictly used for visualization. When we did the first iteration, we chose 2D for t-SNE simply because it was easier to render on a 2D canvas on the web. We can try 3D if we have the time, I guess.
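If we do try 3D, a minimal matplotlib sketch could look like this (assuming `X` is the embedding matrix and `cluster_labels` the K-Means labels from the sketch above; purely illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# 3D is still within the Barnes-Hut limit of n_components <= 3
coords = TSNE(n_components=3, learning_rate=150, perplexity=20, random_state=42).fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=cluster_labels, cmap="tab10", s=10)
plt.show()
```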
Okay, understood the point about dimensionality reduction - if it's strictly for visualization then it's fine, meaning 2D is fine.
@Snehil-Shah I looked at the updated notebook code, things look good to me for now
@Snehil-Shah Not sure if you have made up your mind between Uli and Feluda. Do consider submitting a proposal for the clustering videos project in Feluda. I think you'll appreciate the complexity in this one, and we could also use some focused, in-depth exploration of this problem as part of this project.
Also, do consider joining our Slack. As we are nearing the proposal submission deadline, it could be handy for resolving any doubts. https://admin417477.typeform.com/to/nVuNyG?typeform-source=tattle.co.in
@dennyabrain I tried doing that, but it says it requires a tattle email address or an invitation
@dennyabrain I am definitely inclined towards Feluda; it feels more challenging and will be a great learning experience. On a side note, I was thinking of submitting all three of my proposals to Tattle's projects, just so there is some flexibility on your end. Is that alright?
@Snehil-Shah that works for me. I think hopefully the two issues of audio and video have a lot of commonality and less work for you :)
regarding slack, please share your email. I'll send an invite.
closing this issue because the DMP program has started.
Related to #81
Description
@dennyabrain I tried clustering around 300 videos (from this dataset) using algorithms from your experiment's repo.
Google colab notebook
I first used your approach of taking 5 frames of a video, extracting their features using the ResNet model, and averaging them to generate the final embedding. Then, using your t-SNE reduction approach, I plotted the thumbnails on a graph:
Observations listed in the notebook
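For context, a rough sketch of the frame-averaging approach described above, assuming a torchvision ResNet-18 as the feature extractor and OpenCV for frame sampling (the function and names are illustrative, not the repo's exact code):

```python
import cv2
import numpy as np
import torch
from torchvision import models, transforms

# Pretrained ResNet with the classification head removed, used as a 512-dim feature extractor
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def video_embedding(path, n_frames=5):
    """Sample n_frames evenly from the video, embed each with ResNet, return the mean vector."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    features = []
    for idx in np.linspace(0, total - 1, n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            feat = resnet(preprocess(frame).unsqueeze(0)).flatten()
        features.append(feat.numpy())
    cap.release()
    return np.mean(features, axis=0)  # final 512-dim embedding for the whole video
```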
I will be doing some R&D on other ways to extract features from videos, and on using different models in our current approach as well (like CLIP, which I have used before).
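As one possible direction for the CLIP variant, a sketch using Hugging Face transformers for a single frame (the model name and variable names are assumptions, not a final choice; per-frame embeddings could then be averaged the same way as with ResNet):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a single key frame with CLIP's image encoder
image = Image.open("frame.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    frame_embedding = model.get_image_features(**inputs)  # shape: (1, 512)
```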
I will now be working on setting up Feluda and studying how Feluda operators work, etc. Would appreciate some directions...