mlfoundations / open_clip

An open source implementation of CLIP.

How do you profile the CLIP models #902

Closed: X-funbean closed this issue 1 week ago

X-funbean commented 1 week ago

Hi, I want to know how you profile the CLIP models in https://github.com/mlfoundations/open_clip/blob/main/docs/model_profile.csv, because I can't match the profile results with the tools I have tried (e.g. torchsummaryX, thop, and torchinfo). In fact, I got very different results from each. Among them, I think the closest to the FLOPs plotted in the CLIP paper, Learning Transferable Visual Models From Natural Language Supervision (figure below), is the torchinfo result, which is 14.04 G mult-adds. I also tried the code provided by @jongwook (https://github.com/openai/CLIP/issues/143#issuecomment-926327141); however, it gave a result of over 161 GFLOPs. According to the model profile provided by this repo, the compute for CLIP with ViT-B/16 should be 41.09 GFLOPs.

What profiling tool or library do you use to obtain these numbers? Kindly help me solve this problem.

[Figure: FLOPs plot from the CLIP paper]
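For reference, a minimal sketch of the kind of torchinfo measurement described above (untested; the `ViT-B-16` model name string, the 224x224 image size, and the 77-token text length are assumptions, and torchinfo reports mult-adds rather than FLOPs):

```python
import torch
import open_clip
from torchinfo import summary

# Build the CLIP model (no pretrained weights needed just to count ops).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16")
model.eval()

image = torch.randn(1, 3, 224, 224)        # single 224x224 RGB image
text = torch.randint(0, 49408, (1, 77))    # dummy token ids, context length 77

# torchinfo counts multiply-adds (MACs); 1 MAC is roughly 2 FLOPs.
stats = summary(
    model,
    input_data=[image, text],
    col_names=("num_params", "mult_adds"),
    verbose=0,
)
print(f"total mult-adds: {stats.total_mult_adds / 1e9:.2f} G")
```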

rwightman commented 1 week ago

This was measured with https://github.com/mlfoundations/open_clip/blob/main/src/training/profiler.py ... BUT it's not plug and play: with torch's MultiheadAttention module and/or F.scaled_dot_product_attention in use, you have to hack/disable things or modify fvcore (which isn't really maintained any more) so that the correct values are counted for the attention. Also, not all papers mean FLOPs when they say FLOPs; sometimes it's actually GMACs. The GFLOPs values here are GFLOPs, though.
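One possible workaround along those lines is to register a custom fvcore handle for the SDPA op so the attention matmuls are counted instead of skipped. A hedged sketch, not the repo's actual profiler: the op string, handler, and assumed (B, H, L, E) tensor layout are my assumptions, and fvcore counts one multiply-add as one "flop", so its totals are effectively MACs.

```python
import torch
import open_clip
from fvcore.nn import FlopCountAnalysis
from fvcore.nn.jit_handles import get_shape


def sdpa_flop_jit(inputs, outputs):
    # Hypothetical handler for aten::scaled_dot_product_attention.
    # Assumes q: (B, H, L, E), k: (B, H, S, E), v: (B, H, S, Ev).
    q_shape, k_shape, v_shape = (get_shape(x) for x in inputs[:3])
    b, h, l, e = q_shape
    s = k_shape[2]
    ev = v_shape[3]
    # q @ k^T plus attn @ v, counted as MACs per fvcore's convention.
    return b * h * l * s * e + b * h * l * s * ev


model, _, _ = open_clip.create_model_and_transforms("ViT-B-16")
model.eval()

image = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 49408, (1, 77))

flops = FlopCountAnalysis(model, (image, text))
flops.set_op_handle("aten::scaled_dot_product_attention", sdpa_flop_jit)
print(f"~{flops.total() / 1e9:.2f} GMACs counted (x2 for FLOPs)")
print(flops.unsupported_ops())  # check whether anything important is still skipped
```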

I'm inclined to think the numbers here are good... the rule of thumb is 2 * 12 * num_layers * dim^2 FLOPs per token (i.e. ~2 FLOPs per parameter), and summed over the image and text towers that works out to ~40 for the B/16.
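As a quick sanity check on that rule of thumb, a back-of-envelope sketch (assuming ~12 * dim^2 weights per transformer block, 197 image tokens for a 224x224 input at patch size 16, and 77 text tokens):

```python
def tower_gflops(num_layers: int, dim: int, seq_len: int) -> float:
    # ~12*dim^2 weights per block, ~2 FLOPs per weight per token.
    return 2 * 12 * num_layers * dim ** 2 * seq_len / 1e9


image = tower_gflops(num_layers=12, dim=768, seq_len=197)  # 14x14 patches + CLS
text = tower_gflops(num_layers=12, dim=512, seq_len=77)
print(f"image ~{image:.1f} G, text ~{text:.1f} G, total ~{image + text:.1f} GFLOPs")
# roughly 33.5 + 5.8 = ~39 GFLOPs, in the same ballpark as the 41.09 in model_profile.csv
```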