varunsatish / llama-recipes-fertility

add option for changing dist backend #6

Closed f-hafner closed 3 months ago

f-hafner commented 3 months ago

I figured out how to change the communication backend faster than expected. I thought I'd add this today so that, ideally, you can import the code with these changes included.

Changes

Context

On the OSSC, we've been having issues with nccl, and we've been trying to resolve them together with SURF. So far we had only tried much smaller models, for which there seemed to be no speed difference between gloo and nccl. I have now run the llama finetuning with both nccl and gloo, and for this model the difference in speed is massive: gloo is up to an order of magnitude slower (I have not checked GPU utilization yet).

On Wednesday, we can try to run the model on the OSSC with either backend. If you are lucky, nccl will work, but I'm worried it won't. The speed difference between the two backends will probably persist there as well. I can check with SURF what they can do to resolve the nccl problem on short notice, but in the worst case you may need to think of different ways to finetune (i.e., single-GPU).
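For illustration, here is a minimal sketch of what a backend option could look like. It is not necessarily how this PR implements it: the flag name `--dist_backend` and the helpers `parse_backend` and `setup_distributed` are hypothetical. The only real library call is `torch.distributed.init_process_group(backend=...)`, and the sketch assumes the script is launched with `torchrun`, which sets the rank and world-size environment variables.

```python
import argparse

# The two backends compared in this PR.
SUPPORTED_BACKENDS = ("nccl", "gloo")


def parse_backend(argv=None):
    """Read a hypothetical --dist_backend flag, defaulting to nccl.

    Unknown arguments are ignored so the flag can sit alongside
    the training script's other options.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--dist_backend",
        choices=SUPPORTED_BACKENDS,
        default="nccl",
        help="communication backend for torch.distributed",
    )
    args, _unknown = parser.parse_known_args(argv)
    return args.dist_backend


def setup_distributed(backend):
    """Initialize the default process group with the chosen backend.

    Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already
    set in the environment (torchrun does this).
    """
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend=backend)


if __name__ == "__main__":
    chosen = parse_backend()
    print(f"Selected distributed backend: {chosen}")
    # setup_distributed(chosen)  # uncomment when launched via torchrun
```

With a flag like this, switching from nccl to gloo on a system where nccl fails would only require changing the launch command, not the training code.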

Here are some statistics from my test runs:

When using nccl: [image]

When using gloo: [image]