ronaldmannak opened this issue 7 months ago (Open)
"Implementing multi-machine support for distributed inference": I am not sure how this would work in technical detail but this is an idea that would be very attractive in the context of using MLX for physics/engineering ML/AI+HPC applications.
I think we should extend the concept to distributed training as well as inference, like the DeepSpeed library. This could enable incredible scenarios powered by Apple Silicon chips.
As a first step we are looking at adding the communication primitives you would use to implement both of these: ops like `send`, `receive`, `broadcast`, `reduce`, etc. Basically the ops in this table / MPI.
Both distributed training and inference should be implementable on top of those ops. Exactly what those APIs look like and where they live is still TBD. But we need those ops as a first step either way so we can do distributed work with MLX in a flexible way.
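For reference, here is a minimal sketch of what those primitives look like in plain MPI via mpi4py. This is not MLX's eventual API (which, as noted, is still TBD); it just illustrates the semantics of `send`/`receive`/`broadcast`/`reduce` that distributed training and inference would be built on.

```python
# Minimal mpi4py sketch of the primitives mentioned above (send, receive,
# broadcast, reduce). Plain MPI, not MLX's eventual distributed API.
# Run with e.g.: mpirun -n 2 python primitives_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# send / receive: point-to-point transfer of a (picklable) Python object
if rank == 0:
    comm.send(np.arange(4, dtype=np.float32), dest=1, tag=0)
elif rank == 1:
    chunk = comm.recv(source=0, tag=0)
    print(f"rank 1 received {chunk}")

# broadcast: rank 0's weights are replicated to every rank
weights = np.ones(4, dtype=np.float32) if rank == 0 else None
weights = comm.bcast(weights, root=0)

# reduce: sum per-rank gradients onto rank 0 (the core of data-parallel training)
local_grad = np.full(4, rank, dtype=np.float32)
total_grad = comm.reduce(local_grad, op=MPI.SUM, root=0)
if rank == 0:
    print(f"summed gradients: {total_grad}")
```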
@awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.
FWIW, there is a relevant discussion about how the accuracy of the Llama 3 models may be disproportionately affected by quantization, which might be a side effect of their very large training set. This would be another argument for distributed inference. See tweet and PR.
This is something I'd be happy to test and contribute to as well. I remember seeing the original tweet, and I just now got a Thunderbolt 4 cable connected across two Macs (a MacBook Pro and an M2 Ultra).
I can see this use case being common for Apple users with a work and a home machine.
From what I've read, this is a llama.cpp issue rather than something specific to the Llama 3 quants.
In progress #1097
amazing. so ready.
So exciting to see this moving along!! We have several Studios with M2 Ultras in the lab that we could hook up to test distributed computing when we reach that point of maturity. Would be happy to get involved in the testing.
How cool would it be to run 70B models by distributing them over a few iPads, iPhones, and whatnot? That would give new life to older devices we don't use anymore. Not to mention that it could create a paid model where users could offer some of their GPU time for $ :D I am so looking forward to distributed inference with MLX!
The growth in size of open-source models is outpacing the growth of memory capacity of Mac computers. The latest 70B version of Llama 3 is already pushing the limits of a fully loaded Mac Pro. The upcoming 400B version of Llama 3 will exceed available memory entirely unless heavily quantized.
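For a rough sense of scale, here is a back-of-the-envelope estimate of the weight footprint alone. The 192 GB ceiling assumes a fully specced M2 Ultra Mac Pro, and the precisions are illustrative; KV cache and activations add more on top.

```python
# Back-of-the-envelope memory estimate for model weights alone.
# 192 GB is the assumed maximum unified memory of a fully loaded Mac Pro.
MAC_PRO_MAX_UNIFIED_MEMORY_GB = 192

def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """GB needed just to hold the weights: 1e9 params * bytes / 1e9 bytes-per-GB."""
    return params_billions * bytes_per_param

for name, params in [("Llama 3 70B", 70), ("Llama 3 400B", 400)]:
    for precision, nbytes in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        gb = weight_footprint_gb(params, nbytes)
        fits = "fits" if gb < MAC_PRO_MAX_UNIFIED_MEMORY_GB else "does not fit"
        print(f"{name} @ {precision}: ~{gb:.0f} GB ({fits} in {MAC_PRO_MAX_UNIFIED_MEMORY_GB} GB)")
```

Even at 4-bit, the 400B model's weights alone would be around 200 GB, which is why a single machine won't be enough without heavy quantization.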
While memory limits may increase in future Mac Pro and Mac Studio models, it is likely that LLMs will continue to grow in size at an even faster rate. This poses a challenge for running the latest large open-source models with MLX. Without changes, MLX could be restricted to handling small to medium-sized models or heavily quantized versions of large models, resulting in inevitable inaccuracies.
MLX may become unsuitable for scenarios where local GPT-4-equivalent open-source models are necessary for cost and/or privacy reasons. I'm particularly thinking of SMBs and power users.
If we set aside lossy options like quantization, there are alternative approaches to consider:
1) Optimizing memory usage, for instance with Air_LLM, which loads and unloads layers on demand. It is unclear whether every LLM supports unloading entire layers, and this method may be inefficient since the layers have to be cycled through for each generated token.
2) Implementing multi-machine support for distributed inference, splitting the work across multiple Macs. I shared a tweet about this possible solution and received significant interest, even though it was just a spontaneous idea. One way this approach could work is by splitting the model across multiple Macs (Mini, Studio, or Pro) connected via IP over Thunderbolt; a rough sketch of that split follows below.
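To make option 2 concrete, here is a toy pipeline-style sketch using plain MPI (mpi4py): each machine holds only its own slice of the layers and forwards activations to the next machine. The layer count, sizes, dummy "layers", and host names are illustrative assumptions, not an MLX API or a real model.

```python
# Toy pipeline-parallel inference sketch: each rank holds a contiguous slice of
# the layers and forwards activations to the next rank over MPI (which could run
# over IP-over-Thunderbolt between Macs). All sizes here are illustrative.
# Run with e.g.: mpirun -n 2 --host mac1,mac2 python pipeline_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

NUM_LAYERS, HIDDEN = 8, 16  # stand-ins for a real model's depth and width
rng = np.random.default_rng(seed=rank)

# Each rank only materializes its own slice of layers, so no single machine
# needs to hold the full set of weights in memory.
layers_per_rank = NUM_LAYERS // size
my_layers = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(layers_per_rank)]

def forward_slice(x: np.ndarray) -> np.ndarray:
    """Apply this rank's layers (a dummy matmul + nonlinearity per layer)."""
    for w in my_layers:
        x = np.tanh(x @ w)
    return x

# Rank 0 feeds the input; every other rank waits for activations from its predecessor.
x = np.ones((1, HIDDEN)) if rank == 0 else comm.recv(source=rank - 1, tag=0)
x = forward_slice(x)

if rank < size - 1:
    comm.send(x, dest=rank + 1, tag=0)   # hand off to the next stage
else:
    print(f"final activations on rank {rank}: {x[0, :4]} ...")
```

A real implementation would of course stream tokens, keep per-stage KV caches, and overlap communication with compute, but the memory benefit comes from the same idea: each machine only holds its own shard of the weights.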
I am not proposing a definitive solution, but if there is interest in this topic, this discussion could serve as a starting point for further exploration of the possibilities.