Closed: Richard-LZ-Zhang closed this issue 1 year ago.
Similar issue here.
@Richard-LZ-Zhang, @lavaaa7 thanks for reporting this issue. It looks like the OOM is caused by activation memory, and could be related to https://github.com/microsoft/DeepSpeed/issues/2797.
Also, @Richard-LZ-Zhang, a minor note: your log appears to be from 2 nodes of 4 GPUs (i.e., 8 GPUs instead of 16).
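For anyone else hitting this, activation memory can usually be reduced with gradient (activation) checkpointing; a minimal sketch with the Hugging Face API (a general mitigation, not necessarily the exact fix from the linked issue; the checkpoint id is a placeholder):

```python
# Sketch: trade compute for activation memory via gradient (activation) checkpointing.
# General mitigation only, not necessarily the exact change from issue #2797.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")  # placeholder id
model.gradient_checkpointing_enable()  # recompute activations during the backward pass

# Equivalently, with the Trainer: TrainingArguments(..., gradient_checkpointing=True)
```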
Great! I can confirm the issue is solved after I made the following corrections. I had actually noticed that issue, but thought it was about gradient checkpointing, whereas I was reporting an OOM on the forward pass... Thanks to the community!
> following corrections

What are they? @Richard-LZ-Zhang
@Richard-LZ-Zhang
Describe the bug
I am trying to use DeepSpeed ZeRO-3 with the Hugging Face Trainer to fine-tune a Galactica 30B model (GPT-2-like) on 4 nodes, each with 4 A100 GPUs. I get an OOM error even though the model should fit on 16 A100s with ZeRO-3 and CPU offload. Previously I successfully trained a 6.7B model on 1 node and on 2 nodes, respectively.
The final part of the error report is below (the full log file is long and attached at the end of this post):
Interestingly, no matter how many nodes I use (1, 2, or 4), the memory report line is always: MA 0.0 GB Max_MA 0.0 GB CA 55.83 GB Max_CA 56 GB, i.e. Max_CA always stays the same.
My ds_config.json:
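Roughly along these lines, a minimal ZeRO-3 + CPU-offload config of the kind described above (the "auto" values are the usual Hugging Face integration placeholders; the exact settings here are illustrative, not necessarily the ones used in this run):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```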
My code is very simple:
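A sketch of a Trainer-based script of the kind described here; the model id, dataset, and hyperparameters are placeholders rather than the exact ones used:

```python
# Minimal sketch of a Trainer + DeepSpeed fine-tuning script.
# Model id, dataset, and hyperparameters are placeholders.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Creating TrainingArguments (which carries the DeepSpeed config) before
# from_pretrained lets the Hugging Face integration load the checkpoint under
# ZeRO-3 partitioning instead of materializing a full copy per process.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    gradient_checkpointing=True,   # cuts activation memory
    deepspeed="ds_config.json",    # the ZeRO-3 + CPU offload config above
    logging_steps=10,
)

model_name = "facebook/galactica-30b"  # placeholder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset; the real script uses its own data pipeline.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda x: len(x["text"]) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```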
ds_report output
System info:
Launcher context
deepspeed train.py
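For a multi-node run, the launcher is normally given a hostfile; a sketch with placeholder hostnames is below. If the log only reports 8 GPUs, it is worth checking that all 4 nodes appear in the hostfile with slots=4.

```bash
# hostfile (placeholder hostnames), one line per node, 4 GPU slots each:
#   node1 slots=4
#   node2 slots=4
#   node3 slots=4
#   node4 slots=4
deepspeed --hostfile=hostfile --num_nodes=4 --num_gpus=4 train.py
```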
Full log file (long, attached):