karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

Support MPI distributed training #40

Open sequoiar opened 1 month ago

chadbrewbaker commented 1 month ago

I have this in mind for the Mojo target issue, which is really about having the Makefile support composability like the one for llama.cpp. Probably copy-paste most of what llama.cpp has so the build uses mpicc. We would still need to write the MPI code itself.
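
A minimal sketch of what that opt-in could look like in the Makefile, loosely following llama.cpp's pattern (the `MPI=1` switch and `USE_MPI` define are hypothetical names, just to show the shape):

```make
# Hypothetical sketch of an opt-in MPI build, in the style of llama.cpp's Makefile.
# `MPI=1` and `USE_MPI` are illustrative; they are not flags that exist in llm.c today.
ifdef MPI
  CC      = mpicc
  CFLAGS += -DUSE_MPI
endif
```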

karpathy commented 1 month ago

definitely! but this is pretty far down the line, i think we first need to get the 1-GPU version to be super solid.

Yiltan commented 1 month ago

I regularly write MPI code, so this shouldn't be too complicated to implement. I've started looking through the CPU version to get going. However, I do have questions regarding the ML side.

There are a few options I can see:

  1. Data parallelism using MPI_Allreduce to average gradients (a rough sketch is at the end of this comment)
  2. Tensor parallelism (similar to llama.cpp)
  3. Model parallelism

Is there a preference for how this should be scaled with MPI? If option 2 or 3 seems like the best choice, do you have a suggestion as to where in the code I should dig in?
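
For option 1, a rough sketch of what the per-step gradient averaging could look like (the buffer and size names below are placeholders, not the actual llm.c variables):

```c
// Rough sketch of option 1 (data parallelism): every rank runs forward/backward on
// its own batch, then gradients are averaged across ranks before the optimizer step.
// `grads`, `num_params`, and `world_size` are placeholders, not actual llm.c names.
#include <mpi.h>

void allreduce_gradients(float *grads, long long num_params, int world_size) {
    // sum the gradient buffer across all ranks, in place
    MPI_Allreduce(MPI_IN_PLACE, grads, (int)num_params, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    // divide by the number of ranks so every rank ends up with the average
    for (long long i = 0; i < num_params; i++) {
        grads[i] /= (float)world_size;
    }
}
```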

karpathy commented 1 month ago

Sounds great! I expect to get started on the backward pass somewhere over the weekend, most likely. (I spent today still optimizing the forward pass.) Once we have the backward pass, getting data-parallel training in will be super awesome.

chadbrewbaker commented 1 month ago

I would do MPI-2, as MPI-IO is all you need and it is the most widely supported.
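
A minimal sketch of the kind of MPI-IO usage being suggested, assuming a hypothetical per-rank checkpoint-shard write (the file name, buffer, and fixed-stride layout are illustrative, not llm.c's checkpoint format):

```c
// Sketch only: each rank writes its slice of a checkpoint with MPI-IO (part of MPI-2).
// The file name, buffer, and layout are illustrative assumptions.
#include <mpi.h>

void write_checkpoint_shard(const float *shard, long long shard_count) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // every rank writes at its own offset; the collective call lets MPI-IO coalesce I/O
    MPI_Offset offset = (MPI_Offset)rank * shard_count * sizeof(float);
    MPI_File_write_at_all(fh, offset, shard, (int)shard_count, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```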

Yiltan commented 1 month ago

[attached image: llm c_train]

The MPI version of this is mostly working at this point. I've tested it on up to 8 nodes, and it reduces training time by many hours.

@karpathy Do you still have interest in an NCCL version? If so, are there multi-GPU resources that you could share?
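
For reference, a rough sketch of what the NCCL analogue of the MPI gradient allreduce might look like on the GPU side (assuming an already-initialized ncclComm_t and cudaStream_t; the names are placeholders, not llm.c code):

```c
// Sketch only: NCCL analogue of the MPI gradient allreduce, summing across GPUs.
// `d_grads` is a device pointer; `comm` and `stream` are assumed to be initialized
// elsewhere. Averaging (divide by world size) would be folded into a small kernel
// or into the optimizer update.
#include <cuda_runtime.h>
#include <nccl.h>

void allreduce_gradients_nccl(float *d_grads, size_t num_params,
                              ncclComm_t comm, cudaStream_t stream) {
    // in-place sum across all GPUs in the communicator
    ncclAllReduce(d_grads, d_grads, num_params, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);
}
```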