mlfoundations / open_clip

An open source implementation of CLIP.

OOM Error trying to finetune ViT-B-32 on Nvidia A10 #739

Closed LuisBlanche closed 8 months ago

LuisBlanche commented 10 months ago

Hi, I've been trying to finetune OpenCLIP using the following parameters:

    workers = 4
    batch_size = 8
    epochs = 10
    lr = 5e-4
    lr_scheduler = "cosine"
    beta1 = 0.9
    beta2 = 0.98
    eps = 1e-6
    warmup = 10000
    wd = 0.1
    use_bn_sync = False
    skip_scheduler = False
    lr_cooldown_end = 0
    lr_cooldown_power = 1
    save_frequency = 1
    save_most_recent = False
    zeroshot_frequency = 2
    val_frequency = 1
    model = "ViT-B-32"
    pretrained = "laion2b_s34b_b79k"
    precision = "amp"

All the rest is default. When running it on a g5.2xlarge (32GB RAM) machine in a Databricks notebook, I get: "The Python process exited with exit code 134 (SIGABRT: Aborted)", which I believe might be linked to an OOM error.

Any advice on a smaller model or different parameters I could use to make it fit on this machine?

gabrielilharco commented 10 months ago

You can always try using a smaller batch size with gradient accumulation (see https://github.com/mlfoundations/open_clip#gradient-accumulation and the sketch below). That said, I'm a bit surprised that you're OOMing with batch size 8 on 32GB of RAM. Is this the only thing running on that GPU?
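A rough sketch of what that could look like, assuming the `training.main` entry point from the README (the `--train-data` path is a placeholder; the other flags mirror the config above):

    # half the per-step batch size, accumulate gradients over 2 steps
    # (effective batch size stays 4 * 2 = 8)
    python -m training.main \
        --model ViT-B-32 \
        --pretrained laion2b_s34b_b79k \
        --precision amp \
        --train-data "/path/to/train_data.csv" \
        --batch-size 4 \
        --accum-freq 2 \
        --workers 4 \
        --epochs 10 \
        --lr 5e-4 \
        --wd 0.1 \
        --warmup 10000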

luisblanche-mirakl commented 10 months ago

Hi, thanks for the answer. Yes, it's the only thing apart from the Databricks notebook. I'm trying with a bigger GPU now and got the same result: after a while, the Python kernel dies.

LuisBlanche commented 10 months ago

I have tried outside of Databricks on my local computer and do not get the same error. I think this might be something other than OOM that interacts with the notebook kernel and causes it to die.

rwightman commented 10 months ago

SIGABRT is more likely to be caused by running out of system (CPU) memory than GPU memory. The most likely culprit is a dataset/dataloading issue: too much buffering or shuffling, hanging on to references by mistake, etc.

I've had OpenCLIP running on 3x RTX 3090s with 64GB of system memory. That's 3 train processes with 6 data loader processes per train process, reading webdataset shards, so fewer GB per train process in total. So it can work fine in that range.
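To check which memory pool is actually being exhausted, one option is to watch system and GPU memory side by side while training runs (a minimal sketch using standard Linux tools, e.g. from a terminal or a Databricks %sh cell):

    # system (CPU) memory, refreshed every 2 seconds
    watch -n 2 free -h

    # GPU memory, in a second terminal for comparison
    watch -n 2 nvidia-smi

If free shows usage climbing toward the 32GB limit while nvidia-smi stays well under the GPU's capacity, that points at the dataloading side rather than the model itself.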