Open · timturing opened this issue 5 months ago
You could change `nproc_per_node` in `less/scripts/train/base_training_args.sh`.
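For reference, here is a minimal sketch of what the launcher setup in that script might look like; the variable names, default arguments, and values below are illustrative assumptions, not the repository's exact contents:

```bash
#!/bin/bash
# Illustrative sketch of base_training_args.sh (names/values are assumptions).

ID=$RANDOM

# Number of processes (GPUs) torchrun spawns per node.
# Raise this to train on more than one GPU.
NPROC_PER_NODE=2

# torchrun launches one worker per GPU visible via CUDA_VISIBLE_DEVICES.
header="torchrun --nproc_per_node $NPROC_PER_NODE --nnodes 1 \
    --rdzv_id=$ID --rdzv_backend c10d"

# Shared Hugging Face Trainer arguments (placeholders).
base_training_args="--do_train True \
    --bf16 True \
    --max_seq_length 2048"
```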
How did you solve this problem? I trained this step on two A6000 cards and it got stuck at the same position:

[INFO|trainer.py:568] 2024-11-08 10:51:53,438 >> Using auto half precision backend
I am following the same process as step 1. It works fine when I set nproc_per_node to 1 in base_training_args.sh (and export CUDA_VISIBLE_DEVICES to my chosen device). However, when I set it to a value larger than 1 (and set CUDA_VISIBLE_DEVICES accordingly), it always gets stuck at this point:

Also, to avoid another issue, I add

base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune"

before setting the training_args. The experiment was done on 4 H100s. The Python version is 3.9.0 and the whole pip list is below:

What should I do to make it run on multiple GPUs? By the way, it works correctly on a 2×A100 server, though the environment may not be exactly the same.
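For context, here is a rough sketch of how that modification fits into a launch script. The device IDs, model path, output directory, and the eval-based launch at the end are illustrative assumptions, not taken from the issue or the repository:

```bash
#!/bin/bash
# Rough sketch of launching the warmup training with FSDP on multiple GPUs.
# Paths and the launch mechanism below are placeholders/assumptions.

# Restrict training to the four GPUs mentioned above (illustrative IDs).
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Pull in $header and $base_training_args from the base script.
source less/scripts/train/base_training_args.sh

# Enable FSDP, as described above, to avoid the other issue.
base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune"

MODEL_PATH=meta-llama/Llama-2-7b-hf   # placeholder model
OUTPUT_DIR=out/llama2-7b-warmup       # placeholder output directory

training_args="$base_training_args \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR"

# Assumed launch: expand the torchrun header plus the assembled arguments.
eval "$header" "$training_args"
```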