I figured out how to change the communication backend faster than expected. I thought I'd add this today so that ideally you can import the code including these changes.
Changes
- Add a command-line argument dist_backend that specifies the communication backend per https://pytorch.org/docs/stable/distributed.html, effectively gloo or nccl. nccl remains the default when nothing is specified on the command line (see the sketch after this list).
- Add a slurm script so the finetuning can be run with sbatch finetune.sh instead of an interactive job.
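For illustration, here is a minimal sketch of how such an argument could be wired into the distributed setup. The argument name dist_backend matches the change above; the argparse wiring and the init_process_group call are assumptions about the surrounding code, not the exact implementation in this PR.

```python
import argparse

import torch.distributed as dist

parser = argparse.ArgumentParser()
# New option: dist_backend. nccl stays the default when nothing is passed.
parser.add_argument(
    "--dist_backend",
    choices=["nccl", "gloo"],
    default="nccl",
    help="torch.distributed backend, per "
         "https://pytorch.org/docs/stable/distributed.html",
)
args = parser.parse_args()

# init_process_group picks up rank/world size from the environment variables
# set by the launcher (torchrun or the slurm script).
dist.init_process_group(backend=args.dist_backend)
```

With something like this in place, the backend can be switched at launch time, e.g. by passing --dist_backend gloo, without touching the rest of the training code.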
Context
On the OSSC, we've been having issues with nccl. Together with SURF, we've been trying to resolve this, but so far we have only tried much smaller models, for which there seems to be no speed difference between gloo and nccl.
I ran the llama finetuning with both nccl and gloo, and for this model there are massive differences in speed: gloo is up to an order of magnitude slower (I have not checked GPU utilization yet).
On Wednesday, we can try to run the model on the OSSC with either backend. If you are lucky, nccl will work, but I'm worried it won't. The speed difference will probably not change either way.
I will check with SURF what they can do to resolve the nccl problem on short notice, but in the worst case you may need to think of different ways to finetune (i.e., single-GPU).
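To compare the two backends in isolation, a small all_reduce timing sketch like the one below could be used. This is a hypothetical micro-benchmark, not the finetuning run itself; the BENCH_BACKEND environment variable, tensor size, and iteration count are all arbitrary choices for illustration.

```python
import os
import time

import torch
import torch.distributed as dist

# Hypothetical micro-benchmark: time all_reduce under a chosen backend.
# Launch with torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set, e.g.:
#   BENCH_BACKEND=gloo torchrun --nproc_per_node=4 bench_allreduce.py
backend = os.environ.get("BENCH_BACKEND", "nccl")
dist.init_process_group(backend=backend)

if backend == "nccl":
    # nccl operates on GPU tensors; pin each rank to its own device.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    device = "cuda"
else:
    # gloo is exercised here with CPU tensors.
    device = "cpu"

x = torch.randn(64 * 1024 * 1024, device=device)  # 256 MB of float32

dist.barrier()
start = time.perf_counter()
for _ in range(10):
    dist.all_reduce(x)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 10

if dist.get_rank() == 0:
    print(f"{backend}: {elapsed:.4f} s per all_reduce")
dist.destroy_process_group()
```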
Here are some statistics from my test runs:
- when using nccl
- when using gloo