philschmid / deep-learning-pytorch-huggingface

MIT License
580 stars 138 forks

Out of Memory: Cannot reproduce T5-XXL run on 8xA10G. #49

Open slai-natanijel opened 3 months ago

slai-natanijel commented 3 months ago

I am trying to reproduce the FLAN-T5-XXL (11B) results from this blog post.

I have an 8xA10G instance. Since the blog shows that you can run FLAN-T5-XXL (11B) training on a 4xA10G setup, I was surprised to see that I get a CUDA OOM error as soon as the first training epoch starts:

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacity of 21.99 GiB of which 723.06 MiB is free. Including non-PyTorch memory, this process has 21.27 GiB memory in use. Of the allocated memory 17.87 GiB is allocated by PyTorch, and 2.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
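For reference, the allocator hint from the error text is an environment variable that must be in place before PyTorch's CUDA caching allocator initializes. A minimal sketch (this only mitigates fragmentation of reserved-but-unallocated memory; it cannot fix genuine over-allocation):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator
# initializes, so it must be set before the first `import torch` in
# the training script, or exported in the shell that launches it:
#   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```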

I have even tried running at batch=1, but that didn't help, and I have double-checked that bf16 is enabled.

Additionally, I have attempted to run the default T5-11B and T5-3B models with Accelerate + DeepSpeed (ZeRO stage 3, bf16) using the instructions from this tutorial, and I also get a CUDA OOM. The only case in which I do not get an OOM is when I run the default T5-Large at batch=1.

I'm not sure where I am going wrong. The error message suggests that something is reserving almost all of the GPU memory before the "real" allocations start (only 723 MiB is free).
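For context, a rough back-of-envelope estimate (my numbers and assumptions, not from the blog) suggests that an 11B model is already tight on 8x24 GB A10Gs even if ZeRO stage 3 partitions everything, which may explain why the first allocations find so little free memory:

```python
# Rough per-GPU memory estimate for ZeRO stage 3 training.
# Assumes bf16 weights (2 B) + bf16 gradients (2 B) + fp32 Adam
# states: master weights, momentum, variance (4 B each = 12 B),
# i.e. 16 bytes/parameter, all fully partitioned across the GPUs.
# Activations, buffers, and fragmentation come on top of this.

def per_gpu_gib(n_params: float, n_gpus: int, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / n_gpus / 2**30

print(f"T5-XXL (11B) on 8 GPUs: {per_gpu_gib(11e9, 8):.1f} GiB/GPU")  # ~20.5 GiB
print(f"T5-XL   (3B) on 8 GPUs: {per_gpu_gib(3e9, 8):.1f} GiB/GPU")   # ~5.6 GiB
```

Under these assumptions, model and optimizer state alone consume most of an A10G's ~22 GiB for the 11B model, so a working run would presumably need something extra (e.g. optimizer offload).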

philschmid commented 3 months ago

What versions of the libraries do you use? What sequence length do you use?

slai-natanijel commented 3 months ago

python 3.10.9
accelerate 0.28.0
deepspeed 0.14.0
transformers 4.39.0

The output max sequence length is the default value of 128.

Are the blog post's DeepSpeed setup and the Hugging Face tutorial's Accelerate + DeepSpeed setup similar in principle?

slai-natanijel commented 3 months ago

Update: Running google/flan-t5-xl (3B parameters) with Accelerate + DeepSpeed (ZeRO stage 3, bf16) works at batch=1, but uses about 88% of memory per GPU, which still seems far too high.

Another update: It seems that DeepSpeed ZeRO stage 2 and stage 3 take up the same amount of GPU memory on my setup. Perhaps stage 3 partitioning is not actually taking effect?
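One way to sanity-check which stage is actually in effect is to inspect the DeepSpeed config before training starts. A minimal sketch using a plain config dict (the keys mirror a standard `ds_config.json`; the example values are hypothetical, and your resolved config may differ):

```python
# Inspect a DeepSpeed config dict to confirm which ZeRO stage (and
# any optimizer offload) is actually configured. In a real run, pass
# the same dict/JSON file that Accelerate hands to DeepSpeed.

def describe_zero(ds_config: dict) -> str:
    zero = ds_config.get("zero_optimization", {})
    stage = zero.get("stage", 0)
    offload = zero.get("offload_optimizer", {}).get("device", "none")
    return f"ZeRO stage {stage}, optimizer offload: {offload}"

# Hypothetical config for illustration.
example = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
}
print(describe_zero(example))  # ZeRO stage 3, optimizer offload: cpu
```

If a stage-3 run reports the same per-GPU footprint as stage 2, it is worth confirming that the stage setting is actually reaching DeepSpeed (for example, an Accelerate-side zero-stage setting silently overriding the JSON file); at runtime, comparing `torch.cuda.memory_allocated()` per rank between the two stages can also show whether parameters are really being partitioned.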