banalg opened 5 months ago
It's working now. We simply halted the instance for the night, and after restarting it in the morning, the fine-tuning with all 4 GPUs worked. It "just fell into working," as we say in French ("tombé en marche"), but I would prefer to understand why we had issues in the first place. Our instance likely started on a different server than yesterday. Could you please recommend some checks to detect the hardware and software configuration details of a server that could impact parallel multi-GPU fine-tuning?
We'll wait a few days before closing this issue.
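For context, here is a minimal sketch of the kind of checks we have in mind, querying the environment details that commonly differ between otherwise identical cloud instances (this assumes PyTorch with CUDA support; `nvidia-smi topo -m` on the host additionally reports the GPU interconnect topology):

```python
import torch

# Details that can differ between instances and affect multi-GPU NCCL training:
# driver/CUDA/NCCL versions, GPU model and memory, and the number of visible devices.
print("PyTorch      :", torch.__version__)
print("CUDA runtime :", torch.version.cuda)
print("NCCL         :", ".".join(map(str, torch.cuda.nccl.version())))
print("Visible GPUs :", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```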
Did you ever figure out what was causing this? We have a similar setup to yours and tried with NCCL_P2P_DISABLE set to 1, but we're using g4.12xlarge instances rather than g5.
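In case it's useful, a quick sketch (assuming PyTorch) to check whether peer-to-peer access is reported between the visible GPUs; `NCCL_P2P_DISABLE=1` forces NCCL to avoid direct P2P transport, which can help when P2P is reported but misbehaves:

```python
import torch

# Print the GPU peer-to-peer (P2P) access matrix. Pairs that report False
# cannot use direct P2P transport, so NCCL takes a different path between them.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: {'P2P ok' if ok else 'no P2P'}")
```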
Is it possible to run these scripts on a Ray cluster as a training job?
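For concreteness, something along these lines is what I have in mind, using Ray Train's TorchTrainer. This is only a sketch and assumes the training loop from train.py could be wrapped in a per-worker function, which the repo doesn't currently expose:

```python
import torch.distributed as dist
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Ray Train initializes the torch.distributed process group for each
    # worker (NCCL when use_gpu=True), so collective calls work directly.
    dist.barrier()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
    # The actual fine-tuning loop would go here.


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```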
Hello,
We successfully fine-tuned the Mistral7b_v0.3 Instruct model using a single GPU, but we encountered issues when trying to utilize multiple GPUs.
The successful fine-tuning with one GPU (A10, 24 GB) was achieved with the following settings:
However, we have not been able to configure the setup to use more than one GPU, which limits our ability to improve training quality and the amount of knowledge the model can learn.
When using several GPUs, train.py seems to block at the `dist.barrier()` call (line 97). We bypassed this by setting the environment variable `NCCL_P2P_DISABLE=1`, but then training blocks around `batch = next(data_loader)` (line 228).

Thank you for your assistance.
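To check whether the hang comes from NCCL itself rather than from train.py, we can run a minimal standalone collective test with the same torchrun invocation (a sketch only; barrier_test.py is a hypothetical file name, not part of the repo):

```python
# barrier_test.py (hypothetical name) -- minimal NCCL sanity check.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # The same collective train.py blocks on (line 97).
    dist.barrier()

    # A tiny all-reduce to exercise real GPU-to-GPU communication.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

We would launch it the same way, e.g. `NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 barrier_test.py`; if this also hangs at the barrier, the problem lies in the NCCL/driver/topology layer rather than in train.py.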
Here are the details of our setup
Command used to run the training
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml
The config file example/config_instruct_v1.yaml
Logs of train.py
NCCL logs