Closed: chaitanyamalaviya closed this issue 1 year ago
@jomayeri just wanted to follow up about this. The issue still persists, so I would appreciate any help. Thanks a lot!
@chaitanyamalaviya I am unable to reproduce this issue on a box with 8x V100 32GB and 500GB of CPU memory. To debug further, I would advise:
Describe the bug I am fine-tuning t5-xl models with the HuggingFace Trainer and DeepSpeed. During training (often early on), the process crashes with the message "exits with return code = -7" and no other error traceback.
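A negative return code usually means the process was terminated by a signal rather than exiting on its own: Python's `subprocess` module reports a child killed by signal N as return code -N. The following is a minimal sketch for decoding such a code, assuming the launcher surfaces the subprocess return code unchanged (not confirmed in this thread). `describe_return_code` is a hypothetical helper, not part of DeepSpeed.

```python
import signal


def describe_return_code(rc: int) -> str:
    """Hypothetical helper: turn a subprocess-style return code into readable text."""
    if rc >= 0:
        # Non-negative codes are ordinary exit statuses.
        return f"exited normally with status {rc}"
    try:
        # A negative code -N means "terminated by signal N" (subprocess convention).
        name = signal.Signals(-rc).name
    except ValueError:
        name = "unknown signal"
    return f"killed by signal {-rc} ({name})"


# Signal numbering is platform-specific, so the name printed depends on the host.
print(describe_return_code(-7))
```

On Linux x86-64, signal 7 is SIGBUS, which generally indicates a failed access to memory-mapped or shared memory. A kernel out-of-memory kill would normally show up as SIGKILL (-9) instead.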
To Reproduce I use the following command to run finetuning. Note that I am following a workflow similar to this tutorial. This issue is also relevant.
ds_config file.
ds_report output
Screenshots
1) Screenshot showing the error trace:
System info (please complete the following information):
Launcher context deepspeed launcher
Docker context Not using docker.
Additional context This does not appear to be an OOM problem, as I have tried increasing the memory allocated to my job (up to 500G) to no avail. The crash also appears to happen at random points during fine-tuning (e.g., it could be right at the beginning or 100 steps in).
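Raising the job's RAM allocation does not necessarily change the size of the shared-memory mount, which PyTorch DataLoader worker processes use to pass tensors on Linux. The following is a minimal sketch for logging both, assuming a Linux host with shared memory mounted at `/dev/shm` (the path is a parameter). `memory_snapshot` is a hypothetical helper that could be called every few training steps to see whether either value approaches a limit before the crash.

```python
import os
import resource
import shutil

GIB = 1024 ** 3


def memory_snapshot(shm_path: str = "/dev/shm") -> dict:
    """Hypothetical helper: peak RSS of this process plus shared-memory usage."""
    # Assumption: Linux, where ru_maxrss is reported in kilobytes.
    peak_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    snapshot = {"peak_rss_gib": peak_rss_kib * 1024 / GIB}
    if os.path.isdir(shm_path):
        # disk_usage works on any mounted filesystem, including tmpfs.
        usage = shutil.disk_usage(shm_path)
        snapshot["shm_total_gib"] = usage.total / GIB
        snapshot["shm_free_gib"] = usage.free / GIB
    return snapshot


print(memory_snapshot())
```

The snapshot only covers the calling process. Worker processes and other ranks would need their own calls.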
Tagging @jomayeri for help. Thanks a lot!