sequoiar opened 1 month ago
definitely! but this is pretty far down the line, i think we first need to get the 1-GPU version to be super solid.
I regularly write MPI code, so this shouldn't be too complicated to implement. I've started looking through the CPU version to get started. However, I do have some questions regarding the ML side.
There are a few options I can see:
Is there a preference for how this should be scaled with MPI? If option 2 or 3 seems like the best choice, do you have a suggestion as to where in the code I should start digging?
Sounds great! I expect to get started with the backward pass sometime over the weekend, most likely (I spent today still optimizing the forward pass). Once we have the backward pass, getting data-parallel training in will be super awesome.
I would do MPI-2, as MPI-IO is all you need and it is the most widely supported.
The MPI version of this is mostly working at this point. I've tested it on up to 8 nodes, and it reduces training time by many hours.
@karpathy Do you still have interest in a NCCL version? If so, are there multi-GPU resources that you could share?
I have this in mind for the Mojo target issue - which is really about having the Makefile support composability like the one for llama.cpp. Probably copy-pasta most of what llama.cpp has so the build uses mpicc. Would still need to write the MPI code.
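A rough sketch of what that opt-in composable build could look like, modeled on llama.cpp's flag-gated Makefile style (the `LLMC_MPI` flag name and the `-DLLMC_MPI` define are hypothetical placeholders, not from any existing Makefile):

```makefile
# Opt-in MPI build, e.g.: make train_gpt2 LLMC_MPI=1
ifdef LLMC_MPI
  CC      = mpicc            # compile and link through the MPI wrapper
  CFLAGS += -DLLMC_MPI       # gate the MPI code paths in the C sources
endif
```

Keeping it behind a flag means the default single-machine build stays exactly as it is today, and the MPI code paths only get compiled when a user explicitly asks for them.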