ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

[Feature] Multi-Machine Support for Distributed Inference #1046

Open · ronaldmannak opened this issue 3 weeks ago

ronaldmannak commented 3 weeks ago

The growth in size of open-source models is outpacing the growth of memory capacity of Mac computers. The latest 70B version of Llama 3 is already pushing the limits of a fully loaded Mac Pro. The upcoming 400B version of Llama 3 will exceed available memory entirely unless heavily quantized.

While memory limits may increase in future Mac Pro and Mac Studio models, it is likely that LLMs will continue to grow in size at an even faster rate. This poses a challenge for running the latest large open-source models with MLX. Without changes, MLX could be restricted to handling small to medium-sized models or heavily quantized versions of large models, resulting in inevitable inaccuracies.

MLX may become unsuitable for scenarios where local GPT-4-equivalent open-source models are necessary for cost and/or privacy reasons. I'm particularly thinking of small and medium-sized businesses and power users.

Setting aside lossy options like quantization, there are alternative approaches to consider:

1) Optimizing memory usage, as Air_LLM does by loading and unloading layers on demand. It is unclear whether every LLM supports unloading entire layers, and this method may be inefficient since the layers have to be cycled through for every generated token (a toy sketch of this cycle follows the list).

2) Implementing multi-machine support for distributed inference, where inference is distributed across multiple Macs. I shared a tweet about this possible solution and received significant interest, even though it was just a spontaneous idea. One way this could work is by splitting the model across multiple Macs (Mini, Studio, or Pro) connected via IP over Thunderbolt (a rough sketch of such a split is also included below).
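
To make the layer-cycling idea in (1) concrete, here is a toy sketch (plain NumPy, made-up file layout, not Air_LLM's actual mechanism): each layer's weights live on disk and are loaded, applied, and discarded in turn, which is exactly the cycle that would have to repeat for every generated token.

```python
import tempfile
import numpy as np

def save_toy_layers(path, num_layers=4, dim=8):
    """Write one weight matrix per layer so each can be paged in individually."""
    for i in range(num_layers):
        np.save(f"{path}/layer_{i}.npy", np.random.randn(dim, dim).astype(np.float32))

def forward_offloaded(path, x, num_layers=4):
    """Forward pass that loads one layer at a time and discards it after use."""
    for i in range(num_layers):
        w = np.load(f"{path}/layer_{i}.npy")  # page this layer's weights in from disk
        x = np.maximum(x @ w, 0.0)            # apply the (toy ReLU) layer
        del w                                 # unload before touching the next layer
    return x

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        save_toy_layers(d)
        print(forward_offloaded(d, np.ones((1, 8), dtype=np.float32)))
```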
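
And a rough, hypothetical sketch of (2): two Macs on an IP-over-Thunderbolt link, each holding half of a toy model, with the first machine shipping activations to the second over a plain TCP socket. The roles, port, and toy ReLU layers are all invented for illustration; a real implementation would move tokens/KV caches and use MLX rather than NumPy and pickle.

```python
import pickle
import socket
import sys
import numpy as np

PORT = 50007        # arbitrary port for the example
DIM, HALF = 8, 2    # toy model: 4 layers total, 2 per machine

def run_layers(x, seed):
    """Stand-in for this machine's shard of the model: HALF random ReLU layers."""
    rng = np.random.default_rng(seed)
    for _ in range(HALF):
        x = np.maximum(x @ rng.standard_normal((DIM, DIM), dtype=np.float32), 0.0)
    return x

def second_stage():
    """Wait for activations, run the remaining layers, send the result back."""
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            x = pickle.loads(b"".join(iter(lambda: conn.recv(4096), b"")))
            conn.sendall(pickle.dumps(run_layers(x, seed=1)))

def first_stage(peer_host):
    """Run the first half locally, ship activations to the peer, read the result."""
    x = run_layers(np.ones((1, DIM), dtype=np.float32), seed=0)
    with socket.create_connection((peer_host, PORT)) as s:
        s.sendall(pickle.dumps(x))
        s.shutdown(socket.SHUT_WR)  # signal end-of-activations to the peer
        print(pickle.loads(b"".join(iter(lambda: s.recv(4096), b""))))

if __name__ == "__main__":
    # `python split.py serve` on one Mac, `python split.py <peer-ip>` on the other,
    # where <peer-ip> is the Thunderbolt bridge address of the serving machine.
    second_stage() if sys.argv[1] == "serve" else first_stage(sys.argv[1])
```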

I am not proposing a definitive solution, but if there is interest in this topic, this discussion could serve as a starting point for further exploration of the possibilities.

sck-at-ucy commented 3 weeks ago

"Implementing multi-machine support for distributed inference": I am not sure how this would work in technical detail but this is an idea that would be very attractive in the context of using MLX for physics/engineering ML/AI+HPC applications.

ivanfioravanti commented 2 weeks ago

I think we should extend the concept to distributed training and inference, like the DeepSpeed library. This could enable incredible scenarios powered by Apple silicon chips.

awni commented 2 weeks ago

As a first step we are looking at adding the communication primitives you would use to implement both of these: ops like send, receive, broadcast, reduce, etc. Basically the ops in this table / MPI.

Both distributed training and inference should be implementable on top of those ops. Exactly what those APIs look like and where they live is still TBD. But we need those ops as a first step either way so we can do distributed work with MLX in a flexible way.
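
For anyone unfamiliar with those primitives, a quick mpi4py illustration of the ops named above (send, receive, broadcast, reduce) is below. This is not MLX's eventual API, which as noted is still TBD; it only shows the MPI shapes such an API would presumably map onto. Run it with something like `mpirun -np 2 python demo.py`.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

x = np.full(4, float(rank), dtype=np.float32)

# Point-to-point: rank 0 sends its array to rank 1, which receives it.
if rank == 0:
    comm.Send(x, dest=1, tag=0)
elif rank == 1:
    buf = np.empty(4, dtype=np.float32)
    comm.Recv(buf, source=0, tag=0)

# Collective: rank 0 broadcasts a Python object to every rank.
b = comm.bcast(42 if rank == 0 else None, root=0)

# Collective: every rank contributes its rank; all ranks receive the sum.
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}: bcast={b}, allreduce={total}")
```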

ronaldmannak commented 2 weeks ago

@awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.

FWIW, there is a relevant discussion about how the accuracy of the Llama 3 models may be disproportionately affected by quantization, possibly as a side effect of their very large training set. This would be another argument for distributed inference. See tweet and PR.

fblissjr commented 1 week ago

This is something I'd be happy to test and contribute to as well. I remember seeing the original tweet, and I just got a Thunderbolt 4 cable connected between two Macs (a MacBook Pro and an M2 Ultra).

I can see this use case being common for Apple users with both a work and a home machine.

fblissjr commented 1 week ago

> @awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.
>
> FWIW, there is a relevant discussion about how the accuracy of Llama 3 model may be disproportionately affected by quantization, which might be a side effect of using a large training set. This would be another argument for distributed inference. See tweet and PR.

From what I've read, this is a llama.cpp issue more than an issue specific to Llama 3 quants.

awni commented 1 week ago

In progress #1097

fblissjr commented 1 week ago

> In progress #1097

amazing. so ready.