tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.
https://tattle.co.in/products/feluda/
GNU General Public License v3.0

Clustering videos using vector-similarity #277

Closed Snehil-Shah closed 5 months ago

Snehil-Shah commented 6 months ago

Related to #81

Description

@dennyabrain I tried clustering around 300 videos (from this dataset) using algorithms from your experiment's repo.

Google colab notebook

I first used your approach of taking 5 frames of a video, extracting their features using the RESNET model and taking their average to generate the final embedding. And then using your approach of t-SNE reduction, plotted the thumbnails on a graph:
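As a rough sketch of the averaging step described above: pick evenly spaced frame indices, embed each frame, and take the mean of the per-frame vectors. The `embed_frame` function is a hypothetical stand-in for the ResNet feature extractor (it is not the notebook's or Feluda's actual code), and the frames are random arrays so the snippet runs without a video file.

```python
import numpy as np


def sample_frame_indices(total_frames, num_samples=5):
    """Return evenly spaced frame indices across the video."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    num_samples = min(num_samples, total_frames)
    return np.linspace(0, total_frames - 1, num=num_samples, dtype=int).tolist()


def embed_frame(frame, dim=512):
    """Hypothetical placeholder for a CNN feature extractor (e.g. ResNet).

    It returns a deterministic pseudo-random vector per frame so the
    example is runnable; a real pipeline would call the model here.
    """
    seed = int(frame.sum()) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)


def video_embedding(frames, num_samples=5):
    """Average the embeddings of a few sampled frames into one vector."""
    indices = sample_frame_indices(len(frames), num_samples)
    features = np.stack([embed_frame(frames[i]) for i in indices])
    return features.mean(axis=0)


# Synthetic "video": 100 small RGB frames.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(100, 8, 8, 3), dtype=np.uint8)

indices = sample_frame_indices(len(frames))
embedding = video_embedding(frames)
print(indices, embedding.shape)
```

Averaging keeps the embedding size fixed regardless of video length, at the cost of discarding the order of the frames.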

output (1)

Observations listed in the notebook

I will be doing some R&D on some other ways to extract features from videos and using different models in our current approach as well (like CLIP which I have used before).

I will now be working on setting up feluda and studying how feluda operators work, etc. Would appreciate some directions...

dennyabrain commented 6 months ago

@Snehil-Shah This is very good progress :)

I have some general improvements/ideas about this project to share and then specific things about feluda.

General Improvements :

  1. Improving the frame sampling strategy - Our current naive implementation just picks 5 samples from the video, but it would be nice to explore whether this can be made more effective, so that we pick "interesting" frames from the video. We can evaluate existing strategies from Shot Transition Detection techniques and pick something that gives us good results.
  2. Good luck on evaluating different models. I'd be curious to see how much of a difference that makes to our results.
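One simple idea from the shot transition detection family, sketched under assumptions: compare intensity histograms of consecutive frames and keep the frames that follow the largest changes. This is only an illustration of the technique, not an evaluated strategy; the function names, bin count, and synthetic frames are all made up for the example.

```python
import numpy as np


def frame_histogram(frame, bins=16):
    """Normalised intensity histogram of one frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()


def pick_keyframes(frames, num_keyframes=5, bins=16):
    """Pick frame 0 plus the frames that follow the largest histogram changes."""
    hists = np.stack([frame_histogram(f, bins) for f in frames])
    # diffs[i] is the L1 distance between the histograms of frame i and i + 1.
    diffs = np.abs(np.diff(hists, axis=0)).sum(axis=1)
    # Frames right after the biggest changes are likely starts of new shots.
    top = np.argsort(diffs)[::-1][: num_keyframes - 1] + 1
    return sorted({0, *top.tolist()})


# Synthetic video with three "shots" of constant brightness.
frames = np.concatenate(
    [np.full((10, 8, 8), 20), np.full((10, 8, 8), 120), np.full((10, 8, 8), 220)]
).astype(np.uint8)

keyframes = pick_keyframes(frames, num_keyframes=3)
print(keyframes)
```

On real footage a threshold on `diffs`, rather than a fixed count, might be a better fit; that choice would need to be evaluated against the current 5-frame baseline.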

Specific Feedback for Feluda

Largely the project is composed of the following components

Generally go through the wiki to learn more. I think this should be pretty useful to set up feluda locally - https://github.com/tattle-made/feluda/wiki/Setup-Feluda-Locally

A quick note about operators. So far our operators work on individual items. But for this project we might be for the first time figuring out how to make operators that work on collections. So that part would be novel and feel free to think about how you'd solve it.

aatmanvaidya commented 6 months ago

Hi @Snehil-Shah , great work!!

I also wanted to ask one thing. I see that you have used the code from Feluda's Video Vector Operator to select the 5 key frames, so that's great!

  1. I understand you have used the code from the t-SNE notebook, but I wanted to know what happens when we don't reduce the 512-dimensional vector representation to 2 dimensions and instead reduce it to, say, 5-7 dimensions. Can you look into this and see what other ways there are to go about it? How could such hyperparameter tuning improve the results? (I don't have great expertise, hence just asking this question out of curiosity)

    tsne_embeddings = TSNE(n_components=2, learning_rate=150, perplexity=20, angle=0.2, verbose=2).fit_transform(X)
  2. I read in the observations section that the videos are getting clustered into some groups and that some anomalies are also present. I wanted to ask whether the videos are being clustered according to the labels mentioned in the dataset.
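For anyone wanting to try the 5-7 dimension experiment from point 1, here is a minimal sketch with scikit-learn. To my knowledge the default Barnes-Hut method only supports fewer than 4 output dimensions, so `method="exact"` is needed for 5 components; it is slower, which should be acceptable for a few hundred videos. The random matrix `X` is a stand-in for the real embeddings, and the other parameter values are just placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the real data: 60 videos, each a 512-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 512))

# method="exact" is assumed to be required for n_components >= 4.
tsne = TSNE(n_components=5, method="exact", perplexity=20, random_state=0)
reduced = tsne.fit_transform(X)

print(reduced.shape)
```

The 5-dimensional output cannot be plotted directly, so it would have to be judged another way, for example by clustering it and comparing against the dataset labels.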

Snehil-Shah commented 6 months ago

@aatmanvaidya

  1. To my knowledge, we can't easily visualize a 5-7 dimensional vector space (let alone 512). To see the different groups and clusters in the data, we have to plot them on a 2D graph (X-Y space), and hence we reduce the vectors to 2 dimensions. We can also make a 3D plot, but in general it becomes harder to plot, visualize and interpret vectors as the number of dimensions grows. Such dimensionality reduction also loses information, as the distances between the vectors (both absolute and cosine) are not necessarily preserved in the process. Hence, for tasks like recommendation systems, or finding the most similar results for an input, we would use all 512 dimensions of the vectors to find the nearest neighbors and would use dimensionality reduction mostly for visualization.

  2. These are the labels present in the part of the dataset I used (as also mentioned in the notebook)

['ApplyLipstick',
 'ApplyEyeMakeup',
 'BalanceBeam',
 'Archery',
 'BenchPress',
 'Basketball',
 'BaseballPitch',
 'BabyCrawling',
 'BasketballDunk',
 'BandMarching']

Just from the visual interpretation of the graph: (as circled in the issue description)

Not all original labels are distinctly separated, but the approach still groups similar videos together.
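To illustrate the point above about using all 512 dimensions for similarity search, here is a small cosine-similarity nearest-neighbour sketch with NumPy. The embeddings are random stand-ins and `cosine_top_k` is a name made up for this example, not a Feluda function.

```python
import numpy as np


def cosine_top_k(query, embeddings, k=5):
    """Return indices and cosine similarities of the k rows closest to query."""
    q = query / np.linalg.norm(query)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ q
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]


# Stand-in data: 300 videos with 512-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((300, 512))

# A query that is a slightly perturbed copy of video 42.
query = embeddings[42] + 0.01 * rng.standard_normal(512)

idx, sims = cosine_top_k(query, embeddings, k=3)
print(idx, sims)
```

A brute-force scan like this is fine for a few hundred vectors; larger collections would usually call for an approximate nearest-neighbour index.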

Snehil-Shah commented 6 months ago

@aatmanvaidya I went ahead and clustered them into 10 labels (using K-Means) and tried to print a visualization by shading each image label differently (updated the notebook). This is how it went:

output_labelled

The coordinates of the images are different from before because random_state wasn't set when running t-SNE.
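A minimal sketch of the two steps mentioned above: K-Means with 10 clusters, and t-SNE with `random_state` fixed so repeated runs give the same layout. It clusters the full-dimensional vectors, which is an assumption (the comment does not say whether the notebook clustered the full embeddings or the 2D coordinates), and `X` is random stand-in data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for the real data: 100 videos with 512-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 512))

# Cluster the full embeddings into 10 groups; the seed makes labels repeatable.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Same t-SNE settings as quoted earlier in the thread, plus a fixed seed.
coords = TSNE(
    n_components=2, learning_rate=150, perplexity=20, angle=0.2, random_state=0
).fit_transform(X)

print(labels[:10], coords.shape)
```

The cluster labels can then be used to colour the thumbnails at their `coords` positions.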

dennyabrain commented 6 months ago

I wanted to add to the conversation you are having about dimensions that it's consistent with how we have done this in the past. For search we use multi-dimensional vectors. The dimension reduction is strictly used for visualization. When we did the first iteration, we chose 2D for t-SNE simply because it was easier to render on a 2D canvas on the web. We can try 3D if we have the time, I guess.

aatmanvaidya commented 6 months ago

Okay, understood the point about dimensionality reduction - if it's strictly for visualization then it's fine, meaning 2D is fine.

@Snehil-Shah I looked at the updated notebook code, things look good to me for now

dennyabrain commented 6 months ago

@Snehil-Shah Not sure if you have made up your mind between Uli and Feluda. Do consider submitting a proposal for the clustering videos project in Feluda. I think you'll appreciate the complexity in this one, and we could also use some focused, in-depth exploration of this problem as part of this project.

dennyabrain commented 6 months ago

Also, do consider joining our Slack. As we are nearing the proposal submission deadline, it could be handy for resolving any doubts. https://admin417477.typeform.com/to/nVuNyG?typeform-source=tattle.co.in

Snehil-Shah commented 6 months ago

@dennyabrain I tried doing that, but it says it requires a tattle email address or an invitation

image

Snehil-Shah commented 6 months ago

@dennyabrain I am definitely inclined towards Feluda, it feels more challenging and will be a great learning experience. On a side note, I was thinking of submitting all three of my proposals to Tattle's projects, just so there is some flexibility on your end. Is that alright?

dennyabrain commented 6 months ago

@Snehil-Shah that works for me. Hopefully the two issues, audio and video, have a lot in common, which would mean less work for you :)

regarding slack, please share your email. I'll send an invite.

aatmanvaidya commented 5 months ago

closing this issue because the DMP program has started.