Open petergaoshan opened 1 month ago
Correct, there is some more overhead when using DDP. You can use something like DeepSpeed or FSDP instead to shard the weights across GPUs.
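To illustrate the sharding this reply refers to, below is a minimal sketch (not from the original thread) of enabling FSDP through Accelerate's Python API. The model, optimizer, and launch command are placeholder assumptions; it presumes the script is started with `accelerate launch --num_processes <num_gpus>` so there is one process per GPU.

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Hedged sketch: shard model weights across GPUs with FSDP instead of
# replicating them on every GPU as plain DDP does.
# Assumes a launch like: accelerate launch --num_processes 2 train.py
fsdp_plugin = FullyShardedDataParallelPlugin()  # defaults to full sharding
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# Placeholder model standing in for the real one from the issue.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

# With FSDP, prepare the model first so the optimizer is built
# from the already-sharded (wrapped) parameters.
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
optimizer = accelerator.prepare(optimizer)
```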
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Expected behavior
I'm training a model with Accelerate and DeepSpeed ZeRO-3. As I understand it, Accelerate does data parallelism: it copies the model onto each GPU and trains on all of them, so some extra VRAM use is expected. However, ZeRO-3 is supposed to shard a single model across multiple GPUs, and since I'm using ZeRO-3 I expected it to use less VRAM. When I run it, it still seems to copy the full model onto every GPU and uses double the amount of VRAM.
When I run without Accelerate, only about 15 GB of VRAM is used.
When I run with Accelerate, about 45 GB of VRAM is used.
Is there anything I set wrong? I noticed that there needs to be one model per process, but it won't let me set 1 process with 2 GPUs.
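Since the actual accelerate/DeepSpeed configuration isn't shown in the issue, here is a hedged sketch of what enabling ZeRO stage 3 programmatically can look like, where parameters, gradients, and optimizer states are sharded across processes so per-GPU memory should drop rather than double. The model, optimizer, data, and launch command below are placeholder assumptions; it also assumes the `deepspeed` package is installed and the script is started with `accelerate launch --num_processes 2 train.py` (one process per GPU).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

# Hedged sketch: enable ZeRO stage 3 so the model is sharded across
# processes instead of fully replicated on each GPU.
# Assumes a launch like: accelerate launch --num_processes 2 train.py
plugin = DeepSpeedPlugin(
    zero_stage=3,
    zero3_init_flag=True,  # mainly relevant when loading large pretrained models
)
accelerator = Accelerator(deepspeed_plugin=plugin)

# Placeholder model and data standing in for the real ones from the issue.
model = torch.nn.Linear(2048, 2048)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 2048), torch.randn(64, 2048))
dataloader = DataLoader(dataset, batch_size=8)

# Prepare everything together; DeepSpeed wraps the model and optimizer.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for x, y in dataloader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # DeepSpeed handles the backward/step internally
    optimizer.step()
    optimizer.zero_grad()
```

With this kind of setup, each of the 2 processes holds only its shard of the parameters and optimizer states, which is the behavior the issue expects from ZeRO-3.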