I figured out how to change the communication backend faster than expected. I thought I'd add this today so that ideally you can import the code including these changes.
Changes
- Add a command-line argument dist_backend that specifies the communication backend per https://pytorch.org/docs/stable/distributed.html, effectively gloo or nccl. nccl remains the default when nothing is specified on the command line (see the sketch after this list).
- Add a slurm script so the finetuning can be run with sbatch finetune.sh instead of an interactive job.
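For illustration, here is a minimal sketch of how such an argument could be wired into the distributed setup. The argument name dist_backend matches the change above; the argparse wiring and the init_process_group call are assumptions about the surrounding code, not the exact implementation in this PR.

```python
import argparse

import torch.distributed as dist

parser = argparse.ArgumentParser()
# New option: dist_backend. nccl stays the default when nothing is passed.
parser.add_argument(
    "--dist_backend",
    choices=["nccl", "gloo"],
    default="nccl",
    help="torch.distributed backend, per "
         "https://pytorch.org/docs/stable/distributed.html",
)
args = parser.parse_args()

# init_process_group picks up rank/world size from the environment variables
# set by the launcher (torchrun or the slurm script).
dist.init_process_group(backend=args.dist_backend)
```

With something like this in place, the backend can be switched at launch time, e.g. by passing --dist_backend gloo, without touching the rest of the training code.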
Context
On the OSSC, we've been having issues with nccl. Together with SURF, we've been trying to resolve this, but so far we have only tried much smaller models, for which there seems to be no speed difference between gloo and nccl.
I ran the llama finetuning with both nccl and gloo, and for this model there are massive differences in speed: gloo is up to an order of magnitude slower (I have not checked GPU utilization yet).
On Wednesday, we can try to run the model on the OSSC with either backend. If you are lucky, nccl will work, but I'm worried it won't. The speed difference will probably not change either way.
I will check with SURF what they can do to resolve the nccl problem on short notice, but in the worst case you may need to think of different ways to finetune (i.e., single-GPU).
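To compare the two backends in isolation, a small all_reduce timing sketch like the one below could be used. This is a hypothetical micro-benchmark, not the finetuning run itself; the BENCH_BACKEND environment variable, tensor size, and iteration count are all arbitrary choices for illustration.

```python
import os
import time

import torch
import torch.distributed as dist

# Hypothetical micro-benchmark: time all_reduce under a chosen backend.
# Launch with torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set, e.g.:
#   BENCH_BACKEND=gloo torchrun --nproc_per_node=4 bench_allreduce.py
backend = os.environ.get("BENCH_BACKEND", "nccl")
dist.init_process_group(backend=backend)

if backend == "nccl":
    # nccl operates on GPU tensors; pin each rank to its own device.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    device = "cuda"
else:
    # gloo is exercised here with CPU tensors.
    device = "cpu"

x = torch.randn(64 * 1024 * 1024, device=device)  # 256 MB of float32

dist.barrier()
start = time.perf_counter()
for _ in range(10):
    dist.all_reduce(x)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 10

if dist.get_rank() == 0:
    print(f"{backend}: {elapsed:.4f} s per all_reduce")
dist.destroy_process_group()
```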
Here are some statistics from my test runs:
- when using nccl
- when using gloo