tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.
https://tattle.co.in/products/feluda/
GNU General Public License v3.0

Try out Embedding models and evaluate clustering #355

Open dennyabrain opened 1 month ago

dennyabrain commented 1 month ago

Try out ResNet, CLIP, ViT, VideoMAE (or something you like) and use t-SNE (or other approaches) to evaluate clustering visually. You can do this in a Jupyter notebook and show results. Use a publicly available dataset. Evaluate whether any of these models can be fine-tuned.
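A minimal sketch of the kind of notebook experiment this describes, assuming CLIP via Hugging Face `transformers` (the checkpoint, dataset loading, and label source here are placeholders, not a prescribed setup):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

# Placeholders: any publicly available labelled image dataset works here,
# e.g. loaded via torchvision.datasets.
images = [...]  # list of PIL images
labels = [...]  # ground-truth class per image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed all images in one batch (chunk this for larger datasets).
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    embeddings = model.get_image_features(**inputs).numpy()

# Project to 2D with t-SNE and colour points by ground-truth label:
# well-separated, label-consistent blobs suggest clusterable embeddings.
points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
plt.title("CLIP image embeddings, t-SNE projection")
plt.show()
```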

aatmanvaidya commented 1 month ago

some data sources to look at

Snehil-Shah commented 1 month ago

Hi

aatmanvaidya commented 1 month ago

@Snehil-Shah was wondering if this could be worth exploring - Video2Vec. The approach is quite old, so ResNet should most likely outperform it, but just putting it out there.

aatmanvaidya commented 1 month ago

What are the hosting requirements in terms of RAM and storage for CLIP and ResNet?

Snehil-Shah commented 1 month ago

Week 1 summary (notebook)

Performed K-means clustering so as not to be deceived by the t-SNE visualization (a sketch of this check follows below):

What were the video lengths? - from 10 secs to 1 min; 5 to 10 secs from the data.
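A minimal sketch of that kind of sanity check, assuming `embeddings` and ground-truth `labels` as in the earlier sketch (the cluster count is a placeholder):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Cluster in the original high-dimensional embedding space, not in the
# 2D t-SNE projection, so the metrics aren't biased by the visualization.
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)

print("silhouette score:", silhouette_score(embeddings, pred))
print("ARI vs ground truth:", adjusted_rand_score(labels, pred))
```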

Snehil-Shah commented 4 weeks ago

For reference, all work around this issue will be updated in this notebook itself.

Snehil-Shah commented 3 weeks ago

I think we are done with the benchmarks and are in a position to finalize the embedding models to use. I went ahead and benchmarked some more video embedding models that I hadn't discussed before.

All tests, inferences, and conclusions are in the above notebook.

Here are some of the findings:

The winner(s)?

X-CLIP16 is the winner. But I believe a hybrid approach could be more novel. Video embedding models expect adjacent frames so they can capture interpolation; I propose feeding frames from high-FPS segments to the video embedding model and feeding more diverse/scattered/uniform frames (like keyframes) to an image embedding model. This way we can capture both the static and the active parts of a video.

I ran an example of this with X-CLIP and CLIP in the notebook, in case we're interested.
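A rough sketch of what that hybrid scheme could look like (`video_model`, `image_model`, and `motion_scores` are hypothetical stand-ins, not operators that exist in Feluda):

```python
import numpy as np

def hybrid_video_embedding(frames, motion_scores, video_model, image_model):
    """Sketch of the proposed hybrid scheme (all callables are stand-ins)."""
    # Dense, adjacent frames from the most active segment: video models
    # expect temporal continuity between their input frames.
    start = int(np.argmax(motion_scores))
    dense_clip = frames[start:start + 16]  # X-CLIP-style models take ~8-16 frames
    video_vec = video_model(dense_clip)

    # Sparse, uniformly spaced (keyframe-like) frames for static content.
    sparse = frames[:: max(1, len(frames) // 8)]
    image_vec = np.mean([image_model(f) for f in sparse], axis=0)

    # Fuse the two views; concatenation is just the simplest option.
    return np.concatenate([video_vec, image_vec])
```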

aatmanvaidya commented 3 weeks ago

This looks great @Snehil-Shah, I will also check out the notebook soon!

Let's now explore sampling strategies (some of which you already suggested in the comment). Let's also start applying X-CLIP to videos of different lengths and qualities (especially low-quality videos).

Snehil-Shah commented 6 days ago

Model profiles

The entire pipeline is kept the same as in the vid_vec_rep_resnet.py operator (except for removing the 10 MB constraint), in order to isolate the comparison to just the models.
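For context, a minimal sketch of how per-run CPU time and peak memory can be measured on Linux (`embed_fn` is a stand-in for the operator's embedding call; the actual profiling setup may have differed):

```python
import time
import resource

def profile(embed_fn, video_path):
    """Measure CPU time and peak resident memory for one embedding run."""
    cpu_start = time.process_time()
    embed_fn(video_path)
    cpu_seconds = time.process_time() - cpu_start
    # ru_maxrss is the process's peak RSS so far, reported in KiB on Linux.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return cpu_seconds, peak_mib

# Placeholder callable; swap in the real operator for actual numbers.
cpu, mem = profile(lambda path: None, "sample_30s.mp4")
print(f"CPU: {cpu:.2f}s, peak RSS: {mem:.1f} MiB")
```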

ResNet18:

| Video Length | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 3.34 | 106.7 MiB |
| 1m (8.86 MB) | 8.87 | 107.8 MiB |
| 5m (42.8 MB) | 58.37 | 110.3 MiB |
| 10m (85.83 MB) | 79 | 116.6 MiB |

CLIP-ViT-base-patch32:

| Video Length | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 9.87 | 1.1 GiB |
| 1m (8.86 MB) | 17.53 | 1.1 GiB |
| 5m (42.8 MB) | 78 | 1.1 GiB |
| 10m (85.83 MB) | 175 | 1.1 GiB |

The videos have a frame rate of 30 FPS and are sampled every 10 frames, making the effective sampling rate ~3 FPS.
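A minimal sketch of that sampling strategy with OpenCV (the decoding details of the actual operator may differ):

```python
import cv2

def sample_frames(video_path, every_n=10):
    """Keep every n-th decoded frame; at 30 FPS and n=10 this gives ~3 FPS."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            # OpenCV decodes to BGR; convert to RGB for the models.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames
```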

aatmanvaidya commented 5 days ago

It's interesting to see how RAM usage for CLIP doesn't change with video length. Can you test the usage for larger video files - like 20 min, 30 min, 40 min, etc.?

dennyabrain commented 5 days ago

Yes, so let's also add a 1-3 hour video clip to this benchmarking dataset. It will help us test limits, but also, practically speaking, there are now 1-3 hour long video podcasts, so it can be useful for that too.

Snehil-Shah commented 5 days ago

Similar to ResNet, the RAM usage might be increasing by a couple of MiB, but since it's reported in GiB, the increase isn't visible due to the loss of precision. In general, for both models, RAM usage stays more or less constant.

dennyabrain commented 5 days ago

@aatmanvaidya if I am not misunderstanding, the RAM usage is constant BECAUSE we implemented the chunking in the operator, right? If we were loading the entire file into memory, RAM wouldn't be constant.

Snehil-Shah commented 5 days ago

@dennyabrain I removed the file size constraint that was originally there in the operator; other than that, no chunking is taking place, so the profiles above are for all frames getting encoded.

aatmanvaidya commented 5 days ago

@dennyabrain yes, it's because of the chunking code you had written in the operator. Very early on we used to get memory errors and you did some chunking; here is your comment explaining that - https://github.com/tattle-made/DAU/issues/29#issuecomment-1927455740

@Snehil-Shah all frames are getting encoded, but they are chunked in slots of 100; this way there are no memory issues and we can extract embeddings for all frames of a video.
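A sketch of the chunked-encoding scheme being described here (`encode_batch` is a stand-in for the model's batched forward pass; as the discussion below notes, whether this chunking is actually present in vid_vec_rep_resnet.py is contested):

```python
import numpy as np

def encode_in_chunks(frames, encode_batch, chunk_size=100):
    """Encode frames in fixed-size chunks so that only `chunk_size`
    frames' worth of activations are held in memory at once."""
    vectors = []
    for i in range(0, len(frames), chunk_size):
        vectors.append(encode_batch(frames[i:i + chunk_size]))
    return np.concatenate(vectors)  # shape: (num_frames, embedding_dim)
```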

dennyabrain commented 5 days ago

@Snehil-Shah can you link me to the operator code? I am surprised that the RAM consumption does not depend on video length/size. Could it be that the model code itself is handling the chunking of the video into sections?

dennyabrain commented 5 days ago

OK, I just read Aatman's comment. I'll step out of this conversation; as long as the two of you have a handle on what's happening, I don't need to intervene. Good luck!

aatmanvaidya commented 5 days ago

here is the operator code - https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py

I just checked, and we don't have the change in the code that you recommended, Denny. It could be that the model code itself is handling the chunking of the video. Hope I am reading the code correctly?

Snehil-Shah commented 5 days ago

@aatmanvaidya @dennyabrain But there is no chunking taking place in the operator code vid_vec_rep_resnet.py; perhaps you only did it for DAU?

Basically, the above profiles are for all frames getting encoded in memory, without any chunking.

Snehil-Shah commented 5 days ago

> I just checked, and we don't have the change in the code that you recommended, Denny. It could be that the model code itself is handling the chunking of the video. Hope I am reading the code correctly?

The model is also not doing any chunking, as it processes each frame independently rather than in batches of frames.

dennyabrain commented 5 days ago

> here is the operator code - https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py
>
> I just checked, and we don't have the change in the code that you recommended, Denny. It could be that the model code itself is handling the chunking of the video. Hope I am reading the code correctly?

I remember for a fact that the ResNet operator's memory consumption was rising as we increased the video length/size, so I am a bit surprised that it has now magically plateaued. I won't be able to take a look at it right now; I'll follow the conversation here. If you all don't figure it out by the time I get to it, I can take a look. It doesn't feel like a blocker for the DMP task, just something I'd like to understand for the future.

Snehil-Shah commented 5 days ago

The execution time is definitely rising, but yeah, RAM stays consistent.