Open slai-natanijel opened 3 months ago
What versions of the libraries do you use? What sequence length do you use?
python 3.10.9
accelerate 0.28.0
deepspeed 0.14.0
transformers 4.39.0
The output max sequence length is the default value of 128.
Are the blog post's DeepSpeed setup and the Hugging Face tutorial's Accelerate+DeepSpeed setup quite similar in principle?
Update: Running google/flan-t5-xl (3B parameters) with Accelerate+DeepSpeed (level 3, bf16) seems to work at batch=1, but it uses about 88% of the memory on each GPU, which still seems far too high.
Another update: It seems that DeepSpeed level 2 and level 3 take up the same GPU memory on my setup. Perhaps level 3 is not fully activating?
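For reference, this is roughly what the relevant section of a DeepSpeed config JSON looks like when stage 3 is actually requested (field names are from the DeepSpeed config schema; the values shown here are assumptions, not the blog's exact settings):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "none" },
    "offload_optimizer": { "device": "none" }
  }
}
```

If the config Accelerate ends up passing to DeepSpeed has `"stage": 2` here (or the plugin's `zero_stage` is 2), that would explain stages 2 and 3 showing identical memory; `accelerate env` prints the active DeepSpeed plugin settings, which is worth checking.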
I am trying to reproduce the FLAN-T5-XXL (11B) results from this blog post.
I have an 8xA10G instance. Since the blog shows that you can run FLAN-T5-XXL (11B) training on a 4xA10G setup, I was surprised to see that I get a CUDA OOM error as soon as the first training epoch starts.
I have even tried to run at batch=1, but that didn't help, and I have double checked that bf16 is enabled.
Additionally, I have attempted to run the default T5-11B and T5-3B models using Accelerate + DeepSpeed (level 3, bf16) following the instructions from this tutorial, and I also get a CUDA OOM. The only case in which I do not get an OOM is when I run the default T5-Large at batch=1.
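As a rough sanity check on whether this should fit at all: the usual rule of thumb for mixed-precision Adam training under ZeRO stage 3 is about 2 + 2 + 12 bytes per parameter (bf16 weights, bf16 grads, fp32 master weights plus Adam momentum and variance), all sharded across GPUs, with activations and allocator overhead on top. A quick sketch (parameter counts are nominal, not exact):

```python
def zero3_per_gpu_gib(n_params: float, n_gpus: int) -> float:
    """Approximate per-GPU model-state footprint under ZeRO stage 3.

    Assumes bf16 weights (2 B) + bf16 grads (2 B) + fp32 Adam states
    (4 B master + 4 B momentum + 4 B variance = 12 B), all sharded.
    Activations, buffers and allocator overhead are NOT included.
    """
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / n_gpus / 2**30

# FLAN-T5-XXL (~11B params) on 8x A10G (24 GB each):
print(f"{zero3_per_gpu_gib(11e9, 8):.1f} GiB per GPU")  # ~20.5 GiB before activations
```

By this estimate, 8 GPUs leave only a few GiB of headroom for activations, and 4 GPUs would need roughly 41 GiB per card, which suggests the blog's 4xA10G run presumably relies on CPU offload of the optimizer states (with `offload_optimizer` to CPU, the on-GPU model state drops to about 4 bytes per parameter divided by the GPU count).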
I'm not sure where I am going wrong. The error message suggests that something is reserving almost all of the GPU memory before the "real" allocations start (only 723 MiB is free).
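One possible reading of that symptom: if ZeRO-3 parameter partitioning were not actually active, the full bf16 weights would be materialized on every rank. Comparing the two cases with pure arithmetic (assuming ~11B parameters and 24 GB A10Gs):

```python
GIB = 2**30
n_params, n_gpus, bf16_bytes = 11e9, 8, 2

unsharded = n_params * bf16_bytes / GIB  # every rank holds all weights
sharded = unsharded / n_gpus             # ZeRO-3 partitions them across ranks

print(f"unsharded bf16 weights: {unsharded:.1f} GiB")  # ~20.5 GiB
print(f"ZeRO-3 shard per GPU:   {sharded:.2f} GiB")    # ~2.56 GiB
```

Roughly 20.5 GiB of weights alone would by itself consume most of a 24 GB card, which would be consistent with nearly all memory being reserved before training allocations begin, whereas a proper ZeRO-3 shard is a small fraction of that.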