Closed: molokanov50 closed this issue 1 year ago
cc @muellerzr @pacman100
Hi @molokanov50, thanks for reporting. I found out that the problem is specific to this model (loading it with device_map consumes more VRAM than expected). Other models such as t5-small have comparable VRAM consumption in multi-GPU and single-GPU fine-tuning scenarios. I'll try to fix that. If you find the issue first, feel free to open a PR!
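For anyone who wants to reproduce the comparison, here is a minimal sketch (assumed, not taken from this thread; the checkpoint name is only for illustration) that prints the per-GPU memory footprint of the same model loaded with and without device_map="auto":

```python
# Hypothetical snippet to compare memory usage of a plain single-GPU load
# versus a device_map="auto" sharded load of the same checkpoint.
import torch
from transformers import AutoModelForSeq2SeqLM

NAME = "facebook/nllb-200-distilled-600M"  # assumption: same checkpoint as in the report

# Plain load: the whole model on cuda:0
model = AutoModelForSeq2SeqLM.from_pretrained(NAME).to("cuda:0")
print("single GPU:", torch.cuda.memory_allocated(0) / 2**30, "GiB")
del model
torch.cuda.empty_cache()

# Sharded load across all visible GPUs
model = AutoModelForSeq2SeqLM.from_pretrained(NAME, device_map="auto")
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}:", torch.cuda.memory_allocated(i) / 2**30, "GiB")
```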
Hello @molokanov50, if the model fits on a single GPU, I would advise you to use DDP without the device_map for faster training, as it will keep both GPUs busy all the time instead of the naive pipelining that device_map does.
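(For context, and assuming a standard Trainer-based script: DDP is what you get when the script is launched with several processes, e.g. `torchrun --nproc_per_node=2 finetune.py ...`, and no device_map is passed to from_pretrained; each GPU then holds a full replica of the model and processes its own slice of every batch.)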
Hello @pacman100, DDP unfortunately doesn't suit me, because my overall motivation is to finetune an NLLB-200 model as large as NLLB-200-3.3B. I know from my experiments (see above) that single-GPU finetuning of NLLB-200-1.3B requires 35-40 GB VRAM. This lets me estimate that to finetune NLLB-200-3.3B (roughly 3x the parameters) I would need a single 105-120 GB GPU. We have no such GPUs at the moment, so NLLB-200-3.3B cannot fit on any of the available ones.
That is exactly the case where the model doesn't fit on a single GPU. The 2-GPU parallelization of a smaller model such as NLLB-200-1.3B over smaller GPUs (such that the model cannot fit on any single one) is therefore necessary and informative: it models the case described above. Without this experiment, assembling a multi-GPU node with 120 GB of total VRAM for NLLB-200-3.3B makes no sense. We need to make sure that pipeline-parallelized NLLB-200 training can eventually consume about the same total VRAM as in the single-GPU case (perhaps after some fixes).
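One knob worth trying in this setup (a sketch under assumed per-GPU limits, not something verified in this thread) is the max_memory argument of from_pretrained, which caps how much of each GPU the automatic device_map placement is allowed to use:

```python
# Hypothetical configuration: cap each 24 GB card at ~22 GiB so the automatic
# device_map leaves headroom for activations, gradients and optimizer states.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-3.3B",             # assumption: the target checkpoint
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},  # assumed per-GPU limits
)
print(model.hf_device_map)  # inspect how the layers were split across the GPUs
```

Inspecting hf_device_map shows whether the placement alone already fills the cards before any training state is allocated.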
Hi @SunMarc, has it been possible to fix the problem in the meantime?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Who can help?
@SunMarc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I run multi-GPU and, for comparison, single-GPU finetuning of NLLB-200-distilled-600M and NLLB-200-1.3B. In multi-GPU finetuning, I'm always on 2x 24 GB GPUs (48 GB VRAM in total). I successfully finetuned NLLB-200-distilled-600M on a single 12 GB GPU, as well as NLLB-200-1.3B on a 40 GB GPU. Thus, the VRAM resources in my multi-GPU configuration are obviously greater than in either single-GPU scenario. To my surprise, NLLB-200-distilled-600M finetuning on 2 GPUs occupied 30 GB VRAM, which is 3 times more than the memory required for single-GPU finetuning. Also, for NLLB-200-1.3B finetuning on 2 GPUs I got CUDA OOM, i.e., 48 GB VRAM is insufficient for this finetuning, while a single 40 GB GPU is sufficient. This seems very strange, since in model parallelism only a part of the model resides on each GPU, so the memory used on each GPU should be less than in the single-GPU scenario.

My multi-GPU finetuning code:
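Below is a hypothetical minimal sketch of such a device_map="auto" finetuning script (assumed, not the author's actual code; the CSV column handling, checkpoint name and hyperparameters are made up for illustration, and the argument names mirror the launch command further down):

```python
# Hypothetical sketch of a device_map="auto" seq2seq finetuning script.
# All hyperparameters, column names and paths are assumptions for illustration.
import argparse

import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

parser = argparse.ArgumentParser()
parser.add_argument("--source-lang", required=True)
parser.add_argument("--target-lang", required=True)
parser.add_argument("--delimiter", default=";")
parser.add_argument("csv_path")
args = parser.parse_args()

MODEL_NAME = "facebook/nllb-200-distilled-600M"  # assumption: the 600M checkpoint

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME, src_lang=args.source_lang, tgt_lang=args.target_lang
)
# device_map="auto" shards the model across all visible GPUs (naive pipelining).
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, device_map="auto")

# Assumed CSV layout: two columns, source text and target text, no header.
df = pd.read_csv(args.csv_path, delimiter=args.delimiter, names=["source", "target"])
dataset = Dataset.from_pandas(df)

def preprocess(batch):
    return tokenizer(batch["source"], text_target=batch["target"],
                     truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["source", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="finetuned-nllb",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```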
Text of the shell file used to run my code:

python3 finetune.py --source-lang eng_Latn --target-lang rus_Cyrl --delimiter ';' data.csv

Expected behavior
Comparable (approximately equal) total VRAM consumption in multi-GPU and single-GPU finetuning scenarios.
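One way to quantify this expectation (a sketch using an assumed callback, not part of the original report) is to log the per-GPU peak memory during training with a small TrainerCallback and compare the totals between the two setups:

```python
# Hypothetical callback that logs per-GPU peak memory so multi-GPU and
# single-GPU runs can be compared on the total VRAM they actually use.
import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 100 == 0:  # assumed logging interval
            peaks = [torch.cuda.max_memory_allocated(i) / 2**30
                     for i in range(torch.cuda.device_count())]
            print(f"step {state.global_step}: per-GPU peak GiB = {peaks}, "
                  f"total = {sum(peaks):.1f}")

# Usage: trainer.add_callback(PeakMemoryCallback())
```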