Closed: airogachev closed this issue 6 months ago
@airogachev I'm not sure why you think this is an issue or bug rather than the model simply being too large for the batch size. SigLIP scales better than CLIP, but it isn't magic: at the larger model sizes you only reach the ~32k total batch sizes by using lots of GPUs. It just needs fewer GPUs than an equivalent global batch size for CLIP, and the results appear a bit better at a lower global batch size.
I think the 224/256-resolution B/16 models should be able to do a batch size of 1024 on ~24-32 GB of memory, based on the paper's claims with respect to TPU-v4. I don't believe they published details of their total TPU count or per-device batch size for the larger models.
Make sure gradient checkpointing is enabled, use AMP with bfloat16, etc.
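Roughly like this (an untested sketch; the model name, loader, and loss are placeholders, but `set_grad_checkpointing` and `torch.autocast` are the relevant knobs — if you're using the bundled training script, the `--grad-checkpointing` and `--precision amp_bf16` flags should cover the same ground):

```python
import torch
import open_clip

# Sketch only -- model config name and loss are illustrative placeholders.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16-SigLIP')
model = model.cuda()
model.set_grad_checkpointing()  # recompute activations in backward to save memory

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for images, texts in loader:  # assumed DataLoader yielding (image, token) batches
    optimizer.zero_grad(set_to_none=True)  # drop grad buffers between steps
    # bf16 autocast roughly halves activation memory vs fp32; no GradScaler needed
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        img_f = model.encode_image(images.cuda(), normalize=True)
        txt_f = model.encode_text(texts.cuda(), normalize=True)
        loss = loss_fn(img_f, txt_f)  # placeholder, e.g. a sigmoid contrastive loss
    loss.backward()
    optimizer.step()
```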
The original paper claimed to use big batches. Using the current implementation, I face the problem that if I increase the batch size even to 1024, it fails on the second iteration. I use 4 cards with 44.5 GB of video memory each. So it seems that memory may be filled with something that was not cleaned up after the first batch? Any ideas?
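One way to distinguish a genuine leak from the usual cause of a second-iteration OOM is to log per-step CUDA memory: Adam-style optimizers typically allocate their state tensors on the first `optimizer.step()`, so the second iteration's peak is normally higher than the first even with nothing leaking. A minimal sketch (`train_step` and `loader` are placeholders for one forward/backward/step and the data pipeline):

```python
import torch

def log_cuda_memory(step, device=0):
    # Peak allocation since the last reset; a jump at step 1 usually means
    # something new (e.g. Adam's exp_avg/exp_avg_sq state) was allocated,
    # not that the previous batch wasn't freed.
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    cur = torch.cuda.memory_allocated(device) / 2**30
    print(f'step {step}: current {cur:.2f} GiB, peak {peak:.2f} GiB')
    torch.cuda.reset_peak_memory_stats(device)

for step, batch in enumerate(loader):  # placeholder loader
    train_step(batch)                  # placeholder: forward/backward/step
    log_cuda_memory(step)
```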