mlfoundations / open_clip

An open source implementation of CLIP.

OOM with batch size 1 when with ViT-bigG on 40GB GPU #296

Closed mitchellnw closed 1 year ago

mitchellnw commented 1 year ago

Similar to https://github.com/mlfoundations/open_clip/issues/261, I'm getting OOM with batch size 1 on a 40GB GPU with ViT-G.

OrangeSodahub commented 1 year ago

Weird. I once tested ViT-g-14 on an RTX 3090 (10G) and it worked; you could refer to this. Maybe you could try multiple machines.

mitchellnw commented 1 year ago

Sorry, I mean bigG, not g.

OrangeSodahub commented 1 year ago

Sorry for the misunderstanding.

rwightman commented 1 year ago

I think we've got two 'easy' options right now: DeepSpeed ZeRO (the PR for this, #264, might be worth testing) or PyTorch native FSDP. I was talking w/ someone close to TPUs & PyTorch XLA recently, and they were strongly recommending giving FSDP a try for large scale runs (there's both an XLA-specific variant and the normal PyTorch one).

Going full tensor parallelism is more work, and I feel things are about to change w/ upcoming native PyTorch features (compilation w/ annotations for parallelism), such that needing to do it Megatron style will be a thing of the past.
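
As a rough illustration of the FSDP route mentioned above, here is a minimal sketch (not the open_clip training script) of wrapping the model in PyTorch-native FSDP. The `ViT-bigG-14` config name and `open_clip.create_model_and_transforms` come from the library; the size-based auto-wrap threshold and launch setup are assumptions for illustration only.

```python
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

import open_clip

# Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK env vars.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# "ViT-bigG-14" is the open_clip config name for the model discussed here.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-bigG-14")

# Shard parameters, gradients, and optimizer state across ranks; submodules
# above ~100M params get their own FSDP unit so weights are gathered layer by
# layer during forward/backward rather than kept whole on every GPU.
model = FSDP(
    model.cuda(),
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=int(1e8)
    ),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```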

mitchellnw commented 1 year ago

Seems like progress is being made with FSDP, and we also think the OOM was because of model size + activations.
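
On the activation-memory point, a hedged sketch of enabling gradient checkpointing (trading recompute for activation memory) is below. It assumes the `set_grad_checkpointing()` method and `ViT-bigG-14` config name exposed by open_clip; the forward's return format varies by version, so it is left generic here.

```python
import torch
import open_clip

# Build the model and turn on gradient checkpointing so transformer block
# activations are recomputed in the backward pass instead of stored.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-bigG-14")
model.set_grad_checkpointing(True)
model = model.cuda()

tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
images = torch.randn(1, 3, 224, 224, device="cuda")  # batch size 1, as in the issue
texts = tokenizer(["a photo of a cat"]).to("cuda")

with torch.cuda.amp.autocast():
    out = model(images, texts)  # tuple or dict of features, depending on open_clip version
```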