tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.
https://tattle.co.in/products/feluda/
GNU General Public License v3.0

Try out Embedding models and evaluate clustering #355

Open dennyabrain opened 1 month ago

dennyabrain commented 1 month ago

Try out ResNet, CLIP, ViT, VideoMAE (or something you like) and use t-SNE (or other approaches) to evaluate clustering visually. You can do this in a Jupyter notebook and show results. Use a publicly available dataset. Evaluate whether any of these models can be fine-tuned.
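A minimal sketch of the kind of notebook experiment this describes, assuming CLIP via Hugging Face `transformers` (the checkpoint, dataset loading, and label source here are placeholders, not a prescribed setup):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

# Placeholders: any publicly available labelled image dataset works here,
# e.g. loaded via torchvision.datasets.
images = [...]  # list of PIL images
labels = [...]  # ground-truth class per image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed all images in one batch (chunk this for larger datasets).
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    embeddings = model.get_image_features(**inputs).numpy()

# Project to 2D with t-SNE and colour points by ground-truth label:
# well-separated, label-consistent blobs suggest clusterable embeddings.
points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
plt.title("CLIP image embeddings, t-SNE projection")
plt.show()
```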

aatmanvaidya commented 1 month ago

some data sources to look at

Snehil-Shah commented 1 month ago

Hi

aatmanvaidya commented 1 month ago

@Snehil-Shah was wondering if this could be worth exploring - Video2Vec. The approach is quite old, so ResNet should most likely outperform it, but just putting it out there.

aatmanvaidya commented 1 month ago

What are the hosting requirements in terms of RAM and storage for CLIP and ResNet?

Snehil-Shah commented 1 month ago

Week 1 summary (notebook)

Performed K-means clustering so as not to be deceived by the t-SNE visualization (a sketch of this check follows below):

What were the video lengths? - from 10 secs to 1 min; 5 to 10 secs from the data.
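A minimal sketch of that kind of sanity check, assuming `embeddings` and ground-truth `labels` as in the earlier sketch (the cluster count is a placeholder):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Cluster in the original high-dimensional embedding space, not in the
# 2D t-SNE projection, so the metrics aren't biased by the visualization.
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)

print("silhouette score:", silhouette_score(embeddings, pred))
print("ARI vs ground truth:", adjusted_rand_score(labels, pred))
```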

Snehil-Shah commented 4 weeks ago

For reference, all work around this issue will be updated in this notebook itself.

Snehil-Shah commented 3 weeks ago

I think we are done with the benchmarks and are in a position to finalize the embedding models to use. I went ahead and benchmarked some more video embedding models that I hadn't discussed before.

All tests, inferences, and conclusions are in the above notebook.

Here are some of the findings:

The winner(s)?

X-CLIP16 is the winner. But I believe a hybrid approach could be more novel. Video embedding models expect adjacent frames so they can capture interpolation; I propose feeding frames from high-FPS segments to the video embedding model and feeding more diverse/scattered/uniform frames (like keyframes) to an image embedding model. This way we can capture both the static and the active parts of a video.

I ran an example of this with X-CLIP and CLIP in the notebook, in case we're interested.
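A rough sketch of what that hybrid scheme could look like (`video_model`, `image_model`, and `motion_scores` are hypothetical stand-ins, not operators that exist in Feluda):

```python
import numpy as np

def hybrid_video_embedding(frames, motion_scores, video_model, image_model):
    """Sketch of the proposed hybrid scheme (all callables are stand-ins)."""
    # Dense, adjacent frames from the most active segment: video models
    # expect temporal continuity between their input frames.
    start = int(np.argmax(motion_scores))
    dense_clip = frames[start:start + 16]  # X-CLIP-style models take ~8-16 frames
    video_vec = video_model(dense_clip)

    # Sparse, uniformly spaced (keyframe-like) frames for static content.
    sparse = frames[:: max(1, len(frames) // 8)]
    image_vec = np.mean([image_model(f) for f in sparse], axis=0)

    # Fuse the two views; concatenation is just the simplest option.
    return np.concatenate([video_vec, image_vec])
```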

aatmanvaidya commented 3 weeks ago

This looks great @Snehil-Shah, I will also check out the notebook soon!

Let's now explore sampling strategies (some of which you already suggested in the comment). Let's also start applying X-CLIP to videos of different lengths and qualities (especially low-quality videos).

Snehil-Shah commented 6 days ago

Model profiles

The entire pipeline is kept the same as in the vid_vec_rep_resnet.py operator (except for removing the 10 MB constraint), in order to isolate the comparison to just the models.
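For context, a minimal sketch of how per-run CPU time and peak memory can be measured on Linux (`embed_fn` is a stand-in for the operator's embedding call; the actual profiling setup may have differed):

```python
import time
import resource

def profile(embed_fn, video_path):
    """Measure CPU time and peak resident memory for one embedding run."""
    cpu_start = time.process_time()
    embed_fn(video_path)
    cpu_seconds = time.process_time() - cpu_start
    # ru_maxrss is the process's peak RSS so far, reported in KiB on Linux.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return cpu_seconds, peak_mib

# Placeholder callable; swap in the real operator for actual numbers.
cpu, mem = profile(lambda path: None, "sample_30s.mp4")
print(f"CPU: {cpu:.2f}s, peak RSS: {mem:.1f} MiB")
```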

ResNet18:

| Video Length | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 3.34 | 106.7 MiB |
| 1m (8.86 MB) | 8.87 | 107.8 MiB |
| 5m (42.8 MB) | 58.37 | 110.3 MiB |
| 10m (85.83 MB) | 79 | 116.6 MiB |

CLIP-ViT-base-patch32:

| Video Length | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 9.87 | 1.1 GiB |
| 1m (8.86 MB) | 17.53 | 1.1 GiB |
| 5m (42.8 MB) | 78 | 1.1 GiB |
| 10m (85.83 MB) | 175 | 1.1 GiB |

The videos have a frame rate of 30 FPS and are sampled every 10 frames, making the effective sampling rate ~3 FPS.
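A minimal sketch of that sampling strategy with OpenCV (the decoding details of the actual operator may differ):

```python
import cv2

def sample_frames(video_path, every_n=10):
    """Keep every n-th decoded frame; at 30 FPS and n=10 this gives ~3 FPS."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            # OpenCV decodes to BGR; convert to RGB for the models.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames
```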

aatmanvaidya commented 5 days ago

It's interesting to see how RAM usage for CLIP doesn't change with video length. Can you test the usage for larger video files - like 20 min, 30 min, 40 min, etc.?

dennyabrain commented 5 days ago

Yes, so let's also add a 1-3 hour video clip to this benchmarking dataset. It will help us test limits, but also, practically speaking, there are now 1-3 hour long video podcasts, so it can be useful for that too.

Snehil-Shah commented 5 days ago

Similar to ResNet, the RAM usage might be increasing by a couple of MiB, but since it's reported in GiB, the increase isn't visible due to the loss of precision. In general, for both models, RAM usage stays more or less constant.

dennyabrain commented 5 days ago

@aatmanvaidya if I am not misunderstanding, the RAM usage is constant BECAUSE we implemented the chunking in the operator, right? If we were loading the entire file into memory, RAM wouldn't be constant.

Snehil-Shah commented 5 days ago

@dennyabrain I removed the file size constraint that was originally there in the operator; other than that, no chunking is taking place, so the profiles above are for all frames getting encoded.

aatmanvaidya commented 5 days ago

@dennyabrain yes, it's because of the chunking code you had written in the operator. Very early on we used to get memory errors and you did some chunking; here is your comment explaining that - https://github.com/tattle-made/DAU/issues/29#issuecomment-1927455740

@Snehil-Shah all frames are getting encoded, but they are chunked in slots of 100; this way there are no memory issues and we can extract embeddings for all frames of a video.
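A sketch of the chunked-encoding scheme being described here (`encode_batch` is a stand-in for the model's batched forward pass; as the discussion below notes, whether this chunking is actually present in vid_vec_rep_resnet.py is contested):

```python
import numpy as np

def encode_in_chunks(frames, encode_batch, chunk_size=100):
    """Encode frames in fixed-size chunks so that only `chunk_size`
    frames' worth of activations are held in memory at once."""
    vectors = []
    for i in range(0, len(frames), chunk_size):
        vectors.append(encode_batch(frames[i:i + chunk_size]))
    return np.concatenate(vectors)  # shape: (num_frames, embedding_dim)
```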

dennyabrain commented 5 days ago

@Snehil-Shah can you link me to the operator code? I am surprised that the RAM consumption does not depend on video length/size. Could it be that the model code itself is handling the chunking of the video into sections?

dennyabrain commented 5 days ago

OK, I just read Aatman's comment. I'll step out of this conversation; as long as the two of you have a handle on what's happening, I don't need to intervene. Good luck!

aatmanvaidya commented 5 days ago

here is the operator code - https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py

I just checked, and we don't have the change in the code that you recommended, Denny. It could be that the model code itself is handling the chunking of the video. Hope I am reading the code correctly?

Snehil-Shah commented 5 days ago

@aatmanvaidya @dennyabrain But there is no chunking taking place in the operator code vid_vec_rep_resnet.py; perhaps you only did it for DAU?

Basically, the above profiles are for all frames getting encoded in memory, without any chunking.

Snehil-Shah commented 5 days ago

> I just checked, and we don't have the change in the code that you recommended, Denny. It could be that the model code itself is handling the chunking of the video. Hope I am reading the code correctly?

The model is also not doing any chunking, as it processes each frame independently rather than in batches of frames.

dennyabrain commented 5 days ago

> here is the operator code - https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py
>
> I just checked, and we don't have the change in the code that you recommended, Denny. It could be that the model code itself is handling the chunking of the video. Hope I am reading the code correctly?

I remember for a fact that the ResNet operator's memory consumption was rising as we increased the video length/size, so I am a bit surprised that it has now magically plateaued. I won't be able to take a look at it right now; I'll follow the conversation here. If you all don't figure it out by the time I get to it, I can take a look. It doesn't feel like a blocker for the DMP task, just something I'd like to understand for the future.

Snehil-Shah commented 5 days ago

The execution time is definitely rising, but yeah, RAM stays consistent.