Hi @Snehil-Shah, was wondering if this could be worth exploring - Video2Vec - the approach is very old, so ResNet should most likely perform better anyway, but just putting it out there.
what are the hosting requirements in terms of RAM and storage for CLIP and ResNet?
Performed K-means clustering to not be deceived by t-SNE visualization:
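A minimal sketch of that check, assuming scikit-learn and ground-truth labels for the clips (the ARI scoring here is an assumption, not necessarily what the notebook does):

```python
# Cluster the raw embeddings with K-means and score agreement with the
# true labels, so we don't rely on the t-SNE projection alone.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def kmeans_sanity_check(embeddings, labels, k):
    preds = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    # ARI = 1.0 means the clusters perfectly match the ground-truth labels
    return adjusted_rand_score(labels, preds)
```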
what were the video lengths? - From 10 secs to 1 min; 5 to 10 sec clips from the data.
For reference, all work around this issue will be updated in this notebook itself.
I think we are done with the benchmarks and are in a position to finalize the embedding models to use. I went ahead and benchmarked some more video embedding models that I hadn't discussed before.
All tests, inferences, and conclusions are in the above notebook. Here are some of the findings:
The winner(s)?
X-CLIP16 is the winner. But I believe a hybrid approach can be more novel. Video embedding models expect adjacent frames to be able to capture interpolation. I propose using frames from high FPS segments to feed to the video embedding model and using more diverse/scattered/uniform frames (like keyframes) to feed to an image embedding model. This way we can capture both static and active parts of a video.
I ran an example of this with X-CLIP and CLIP in the notebook, in case we're interested.
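For anyone skimming without the notebook, here is a rough sketch of what that hybrid could look like, assuming the Hugging Face X-CLIP and CLIP checkpoints (model names, frame counts, and the concatenation step are illustrative, not the notebook's exact code):

```python
import torch
from transformers import CLIPModel, CLIPProcessor, XCLIPModel, XCLIPProcessor

xclip = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
xclip_proc = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def hybrid_embedding(frames):
    """frames: list of HxWx3 RGB numpy arrays for one video (>= 8 frames)."""
    # Adjacent frames -> motion-aware video embedding; the base X-CLIP
    # checkpoint expects 8 frames per clip. Taking the first 8 stands in
    # for "frames from a high-FPS segment".
    video_inputs = xclip_proc(videos=[frames[:8]], return_tensors="pt")
    with torch.no_grad():
        video_emb = xclip.get_video_features(**video_inputs)

    # Uniformly scattered frames -> static appearance, mean-pooled over CLIP
    sparse = frames[:: max(1, len(frames) // 8)]
    image_inputs = clip_proc(images=sparse, return_tensors="pt")
    with torch.no_grad():
        image_emb = clip.get_image_features(**image_inputs).mean(dim=0, keepdim=True)

    # One vector capturing both the active and static parts of the video
    return torch.cat([video_emb, image_emb], dim=1).squeeze(0)
```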
this looks great @Snehil-Shah, I will also check out the notebook soon!
Let's now explore sampling strategies (some of which you already suggested in the comment). Let's also start applying X-CLIP on videos of different lengths and quality (especially low-quality videos).
The entire pipeline is kept the same as in the vid_vec_rep_resnet.py operator (except for removing the 10 MB constraint), in order to isolate the comparison to just the models.
**ResNet**

| Video Length | CPU Time (s) | RAM Usage |
| --- | --- | --- |
| 30s (1.92 MB) | 3.34 | 106.7 MiB |
| 1m (8.86 MB) | 8.87 | 107.8 MiB |
| 5m (42.8 MB) | 58.37 | 110.3 MiB |
| 10m (85.83 MB) | 79 | 116.6 MiB |
**CLIP**

| Video Length | CPU Time (s) | RAM Usage |
| --- | --- | --- |
| 30s (1.92 MB) | 9.87 | 1.1 GiB |
| 1m (8.86 MB) | 17.53 | 1.1 GiB |
| 5m (42.8 MB) | 78 | 1.1 GiB |
| 10m (85.83 MB) | 175 | 1.1 GiB |
The videos have a frame rate of 30 FPS and are sampled every 10th frame, making the effective sampling rate ~3 FPS.
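As a concrete reference, that every-10th-frame sampling could be done along these lines with OpenCV (the function name is illustrative, not the operator's actual code):

```python
import cv2

def sample_frames(video_path, every_n=10):
    """Decode a video and keep every Nth frame (~3 FPS for a 30 FPS source)."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV decodes BGR; convert to RGB for the embedding models
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames
```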
it's interesting to see how RAM usage for CLIP doesn't change with video length. Can you test the usage for large video files - like 20 min, 30 min, 40 min, etc.?
Yes, so let's also add a 1-3 hour video clip to this benchmarking dataset. It will help us test the limits, but practically speaking there are now 1-3 hour long video podcasts, so it can be useful for that too.
Similar to ResNet, the RAM usage might be increasing by a couple of MiBs, but as it's reported in GiBs, the change isn't visible due to loss of precision. In general, for both models, RAM usage stays more or less constant.
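One way to check that at MiB precision is a sketch like this, using the standard library's resource module (note ru_maxrss is reported in KiB on Linux, bytes on macOS; I'm assuming Linux here):

```python
import resource

def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Linux KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```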
@aatmanvaidya if I am not misunderstanding, the RAM usage is constant BECAUSE we implemented the chunking in the operator, right? If we were loading the entire file in memory, RAM wouldn't be constant.
@dennyabrain I removed any file size constraints that were originally there in the operator; other than that, no other chunking is taking place, so the profiles above are for all frames getting encoded.
@dennyabrain yes, it's because of the chunking code you had written in the operator. Very early on we used to get memory errors and you had done some chunking; here is your comment explaining that - https://github.com/tattle-made/DAU/issues/29#issuecomment-1927455740
@Snehil-Shah all frames are getting encoded, but they are chunked in batches of 100; this way there are no memory issues and we can extract embeddings for all frames of a video.
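For context, the batching being described would look roughly like this (a sketch, not the actual operator code - the batch size of 100 is from the comment above, and the helper names are made up; as discussed below, the current operator may not actually implement this):

```python
import torch

def encode_in_chunks(frames, model, chunk_size=100):
    """Embed preprocessed frame tensors in fixed-size batches so peak RAM
    stays bounded regardless of video length."""
    chunks = []
    for i in range(0, len(frames), chunk_size):
        batch = torch.stack(frames[i : i + chunk_size])
        with torch.no_grad():
            chunks.append(model(batch))
    return torch.cat(chunks)
```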
@Snehil-Shah can you link me to the operator code? I am surprised that the RAM consumption is not dependent on video length/size. Could it be possible that the model code itself is handling chunking of video into sections?
Ok, I just read Aatman's comment. I'll step out of this conversation. As long as the two of you have a handle on what's happening, I don't need to intervene. Good luck!
here is the operator code - https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py
I just checked, we don't have the change in the code that you recommended, Denny; it could be that the model code itself is handling chunking of the video. Hope I am reading the code correctly?
@aatmanvaidya @dennyabrain But there is no chunking taking place in the operator code vid_vec_rep_resnet.py - you may have only done it for DAU maybe? Basically, the above profiles are for all frames getting encoded in memory without any chunking.
The model is also not doing any chunking, as it processes each frame independently instead of in batches of frames.
I remember for a fact that the ResNet operator's memory consumption was rising as we increased the video length/size, so I am a bit surprised that it has now magically plateaued. I won't be able to take a look at it right now; I'll follow the conversation here. If you all don't figure it out by the time I get to it, I can take a look. It doesn't feel like a blocker for the DMP task - just something I'd like to understand for the future.
The execution time is definitely rising, but yeah, RAM stays consistent.
Try out ResNet, CLIP, ViT, VideoMAE (or something you like) and use t-SNE (or other approaches) to evaluate clustering visually. You can do this in a Jupyter notebook and show results. Use a publicly available dataset. Evaluate if any of these models can be fine-tuned.
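A minimal sketch of that visual evaluation, assuming embeddings and labels are already computed (the perplexity and colormap here are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels):
    """Project (n_videos, dim) embeddings to 2D and color by class label."""
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=12)
    plt.title("t-SNE of video embeddings")
    plt.show()
```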