Closed LuisBlanche closed 8 months ago
You can always try a smaller batch size with gradient accumulation (see https://github.com/mlfoundations/open_clip#gradient-accumulation). That said, I'm a bit surprised that you're OOMing with batch size 8 on 32GB of RAM. Is this the only thing running on that GPU?
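The mechanics are described in the README linked above; the core idea can be sketched in plain Python, without torch. The toy loss and numbers below are illustrative only — the point is that averaging gradients over micro-batches reproduces the full-batch gradient, so you get the same optimizer step while peak memory only needs to hold one micro-batch:

```python
# Toy illustration of gradient accumulation (no torch required):
# the averaged gradient over 4 micro-batches of 8 samples matches the
# gradient over a single batch of 32.

def grad(w, xs, ys):
    """Gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(32)]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

full = grad(w, xs, ys)  # one full batch of 32

accum = 0.0
for i in range(0, 32, 8):  # four micro-batches of 8
    # divide each micro-batch gradient by the number of accumulation steps
    accum += grad(w, xs[i:i + 8], ys[i:i + 8]) / 4

print(abs(full - accum) < 1e-6)  # → True: the two gradients agree
```

In a real training loop the accumulated gradient lives in the parameters' `.grad` buffers and you call `optimizer.step()` once every N micro-batches instead of every batch.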
Hi, thanks for the answer. Yes, it's the only thing apart from the Databricks notebook. I'm trying a bigger GPU now and get the same result: after a while, the Python kernel dies.
I have tried outside of Databricks on my local computer and do not get the same error, so I think this might be something other than OOM that interacts with the notebook kernel and causes it to die.
SIGABRT is more likely to be caused by running out of system (CPU) memory than GPU memory. The most likely issue is related to the dataset/dataloading: too much buffering or shuffling, hanging on to references by mistake, etc.
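For reference, the exit code in the original report decodes to this signal: exit codes above 128 conventionally mean "killed by signal (code − 128)", which Python's `signal` module can name (this assumes a POSIX system, as on Databricks):

```python
import signal

exit_code = 134  # as reported by the Databricks notebook
# Exit codes > 128 conventionally encode 128 + signal number.
sig = signal.Signals(exit_code - 128)
print(sig.name)  # → SIGABRT on Linux
```

Worth noting: when the Linux OOM killer reaps a process it sends SIGKILL, which would show up as exit code 137 (128 + 9). SIGABRT usually means `abort()` was called inside the process itself, e.g. by native code that failed an allocation or an assertion, which is consistent with running low on system memory rather than being hard-killed.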
I've had OpenCLIP running on 3x RTX 3090 w/ 64GB system memory. That is 3 train processes with 6 data loader processes per train process, on webdataset shards, so only a few GB per process in total. So it can work fine in that range.
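Rough budget math for that setup (numbers taken from the comment above; this is just arithmetic, not a measurement of actual usage):

```python
# Average system-memory budget per process in the 3x RTX 3090 setup:
system_gb = 64
train_procs = 3
workers_per_train = 6  # data loader processes per train process

total_procs = train_procs * (1 + workers_per_train)  # 21 processes
print(round(system_gb / total_procs, 1))  # → 3.0 GB per process on average
```

So each process only gets ~3GB on average; a dataloader that buffers or shuffles too aggressively can easily blow past that.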
Hi, I've been trying to finetune OpenCLIP using the following parameters:
All the rest is default. When running it on a g5.2xlarge (32GB RAM) machine in a Databricks notebook, I get: The Python process exited with exit code 134 (SIGABRT: Aborted). which I believe might be linked to an OOM error. Any advice on a smaller model or different parameters I could use to make it fit on this machine?