Hello,
The original SigLIP paper reports that, compared with CLIP, a 2x larger batch size fits on TPU with the base SigLIP model.
But in my experiment, I used a batch size of 14400 on 48 A100-40GB GPUs for both models, where the SigLIP and CLIP models both use the standard base-sized architecture. During training, SigLIP takes 33.5 GB per GPU while CLIP takes 37.0 GB. The numbers are close, and I could not scale the batch size up 2x as the paper suggests.
I am not using any FSDP/DeepSpeed techniques; could that be the reason? Or does the GPU type (TPU vs. A100) matter a lot? I have no idea.
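For reference, my loss is the plain data-parallel version, roughly like the sketch below (NumPy, names illustrative, not my actual training code). Note that this naive form still materializes the full B x B logits matrix, the same shape the CLIP softmax loss needs, so maybe that is where my memory goes:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Naive pairwise sigmoid loss: builds the full B x B logits matrix,
    just like the CLIP softmax loss would (no chunking across devices)."""
    # L2-normalize both towers' embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * img @ txt.T + bias    # B x B -- dominates activation memory
    labels = 2.0 * np.eye(len(img_emb)) - 1.0    # +1 for matched pairs, -1 otherwise
    # -log sigmoid(labels * logits), written stably as logaddexp(0, -z)
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
loss = siglip_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

If the paper's 2x figure depends on their chunked per-device loss computation rather than on the sigmoid loss alone, that might explain why I see little difference.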
Can anyone who ever trained a SigLIP model share your experience?
Thanks!