Open macabdul9 opened 1 year ago
I'm facing a similar situation. I tried to fine-tune ChatGLM (a Chinese LLM) via DeepSpeed inside Slurm, using only one node with 4 GPUs (sbatch --gpus=4 xxx.sh). It seems DeepSpeed spawns multiple processes and invokes main() in the Python script 4 times, so I always get a FileNotFoundError when initializing the tokenizer and model from the cache files, which are indeed there. I believe this error is caused by the concurrent processes racing on the cache: when I set CUDA_VISIBLE_DEVICES to a single GPU, the FileNotFoundError goes away entirely and only the OOM error remains, and with 2 GPUs the FileNotFoundError appears intermittently depending on how the processes interleave.
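A minimal sketch of the workaround I have in mind (assuming a Hugging Face tokenizer/model and the standard deepspeed launcher setting LOCAL_RANK; the model id below is just a placeholder, not from my actual script): let local rank 0 touch the cache first and make the other ranks wait at a barrier, so no rank ever sees half-written cache files.

```python
import os

import deepspeed
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "THUDM/chatglm-6b"  # placeholder model id, adjust to your checkpoint


def load_tokenizer_and_model():
    deepspeed.init_distributed()
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # Non-zero ranks wait here until rank 0 has populated/validated the cache.
    if local_rank != 0:
        torch.distributed.barrier()

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

    # Rank 0 releases the other ranks only after the cache is fully in place.
    if local_rank == 0:
        torch.distributed.barrier()

    return tokenizer, model
```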
And here is the environment:
Here is the error I encountered:
Here is the Slurm script:
For me, multi-GPU on a single node works fine. I get that error when I try to train on multiple nodes where not all the nodes have access to the correct virtual environment. @loadams @tjruwase @RezaYazdaniAminabadi @HeyangQin
Please help @jeffra @ShadenSmith @samyam @molly-smith @arashashari @arashb
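As a quick sanity check before launching, something like the sketch below (my own hypothetical helper, assuming passwordless ssh to the worker nodes and a DeepSpeed-style hostfile whose path I just hard-coded) can confirm whether every node resolves the same Python interpreter from the virtual environment:

```python
# Hypothetical helper: verify that every node listed in the hostfile sees the
# same Python interpreter as the launch node (i.e. the virtual environment is
# mounted/activated everywhere). Assumes passwordless ssh between nodes.
import subprocess
import sys

HOSTFILE = "hostfile"  # assumed path to the DeepSpeed hostfile
EXPECTED_PYTHON = sys.executable

with open(HOSTFILE) as f:
    # Hostfile lines look like "worker-1 slots=4"; the first token is the host.
    hosts = [line.split()[0] for line in f if line.strip()]

for host in hosts:
    result = subprocess.run(
        ["ssh", host, "command", "-v", "python"],
        capture_output=True,
        text=True,
    )
    found = result.stdout.strip()
    status = "OK" if found == EXPECTED_PYTHON else "MISMATCH"
    print(f"{host}: {found or '<no python found>'} [{status}]")
```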
I am trying to train models on multiple nodes with SLURM as a workload manager. The issue seems to be that the Python virtual environment is not available on all nodes. Please find more details below.
Job script:
Training script (distributed_runner_ds.sh)
Hostfile:
Logs:
More Details:
ds_report is fine. CC: @loadams @tjruwase @RezaYazdaniAminabadi @HeyangQin @jeffra @ShadenSmith @samyam @molly-smith @arashashari @arashb. Help is much appreciated. Thanks.
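For reference, a rough sketch (not taken from my scripts above) of one way I understand the launcher-side environment problem can be sidestepped: start the training script with srun directly from the sbatch script that activates the virtual environment, so every task inherits that environment and no pdsh/ssh hop into a bare shell is involved. It assumes MASTER_ADDR and MASTER_PORT are exported in the sbatch script:

```python
# Minimal sketch, assuming launch via `srun python train.py ...` from an sbatch
# script that has already activated the virtual environment and exported
# MASTER_ADDR / MASTER_PORT. Each srun task inherits that environment, so no
# separate launcher environment is needed on the worker nodes.
import os

import deepspeed


def setup_distributed_from_slurm():
    # Map SLURM's per-task variables to the names torch.distributed/DeepSpeed expect.
    os.environ.setdefault("RANK", os.environ.get("SLURM_PROCID", "0"))
    os.environ.setdefault("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0"))
    os.environ.setdefault("WORLD_SIZE", os.environ.get("SLURM_NTASKS", "1"))
    deepspeed.init_distributed(dist_backend="nccl")
```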