huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0
7.63k stars · 927 forks

How to apply model parallelism across multiple machines? #2933

Open JerryLu991223 opened 1 month ago

JerryLu991223 commented 1 month ago

Currently, I want to do LLM inference on multiple machines. Due to limited memory, I hope to use all of the machines together to load the model, and I'm blocked at this point. So far I have only found that, using device_map, I can do model parallelism on a single machine with multiple cards.
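
For reference, this is the single-node pattern I mean; a minimal sketch, where the model ID is just an example:

```python
# Single-node sketch: device_map="auto" shards the layers across all GPUs
# visible on this one machine. The model ID is just an example checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # model parallelism across this node's GPUs only
    torch_dtype="auto",
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```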

Could I get some ideas about how to use Accelerate to achieve this? Or do you have any other useful suggestions?

Thanks so much.

BenjaminBossan commented 1 month ago

Is this what you're looking for?

https://huggingface.co/docs/accelerate/usage_guides/distributed_inference
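
The core pattern from that guide looks roughly like this (a sketch; gpt2 is a stand-in model, and the script is started with `accelerate launch`):

```python
# Data-parallel inference: each process loads the model and handles its own
# slice of the inputs. Launch with `accelerate launch --num_processes 2 script.py`.
from accelerate import PartialState
from transformers import pipeline

state = PartialState()
pipe = pipeline("text-generation", model="gpt2", device=state.device)

prompts = ["Hello", "Bonjour", "Hola", "Ciao"]
with state.split_between_processes(prompts) as my_prompts:
    results = [pipe(p, max_new_tokens=10)[0]["generated_text"] for p in my_prompts]

print(f"process {state.process_index}: {results}")
```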

JerryLu991223 commented 1 month ago

Thanks for your reply. But I think it focuses on model parallelism on a single node. Maybe Accelerate does not support model-parallel inference on multiple nodes. #1890

muellerzr commented 1 month ago

This can be accomplished OOTB via our PiPPy integration, so we do :) (And the regular device_map approach works as well.) We'll have more docs on this coming soon.
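
In the meantime, the existing single-node entry point is `prepare_pippy`; a rough sketch (assumes accelerate, transformers, and the pippy package (`torchpippy`) are installed, and the script is started with `accelerate launch --num_processes 2`):

```python
# Rough single-node sketch of the pippy integration as it exists today.
import torch
from transformers import AutoModelForCausalLM
from accelerate import PartialState
from accelerate.inference import prepare_pippy

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
model.eval()

# An example input lets accelerate trace the model and split it into stages.
example_input = torch.randint(0, model.config.vocab_size, (1, 8))
model = prepare_pippy(model, example_args=(example_input,))

with torch.no_grad():
    output = model(example_input)

# Unless gather_output=True is passed, only the last stage holds the output.
if PartialState().is_last_process:
    print(output)
```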

JerryLu991223 commented 1 month ago

I'm excited to hear this good news. But currently, I'm still puzzled about how to implement distributed inference. Looking forward to your new samples and docs!

avianion commented 1 month ago

> This can be accomplished OOTB via our PiPPy integration, so we do :) (And the regular device_map approach works as well.) We'll have more docs on this coming soon.

Can you explain in detail how this is possible?

Let me give you a practical example

I have 2 nodes of 8XH100 each.

I want to split Llama 405B across those 16 GPUs on those 2 nodes.

And then I want to run inference on it.

So far each attempt to do this with accelerate and FSDP has ended in total failure.

Are you telling me this is possible or not? @muellerzr

muellerzr commented 1 month ago

@avianion we're working on exactly that with the pippy folks. More soon. related PR: https://github.com/huggingface/accelerate/pull/2938

(Note, first, that FSDP is not meant to be used for inference.)

avianion commented 1 month ago

@muellerzr Absolutely. I understand.

But it's important to note that we cannot train the full FP16 Llama 405B model, as we cannot even shard it across 8 H100 GPUs.

So our idea was to shard it across 16 GPUs via 2 nodes. But it seems that pipeline parallelism isn't yet at that stage in Accelerate.
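
The back-of-envelope numbers (weights only, in Python for concreteness):

```python
# Weights-only estimate: ignores activations, KV cache, gradients, and
# optimizer state, all of which make training far more expensive.
params = 405e9
weights_gb = params * 2 / 1e9   # fp16/bf16 = 2 bytes per parameter
print(weights_gb)               # 810.0 GB of weights
print(8 * 80, 16 * 80)          # 640 GB on 8x H100 vs 1280 GB on 16x H100
# 810 GB > 640 GB: the weights alone overflow one 8xH100 node, which is
# why we need to shard across both nodes even just for inference.
```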

I will be interested to do this once this feature becomes available.

muellerzr commented 1 month ago

Pipeline-parallel inference via pippy will work for multi-node sharding of such a model. We're wrapping things up, but it works.

Also correct, but for training you can do QLoRA etc., or DeepSpeed/FSDP.
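
For the QLoRA route, a rough sketch (not an official recipe; assumes bitsandbytes and peft are installed, and the model ID is a small stand-in for the 405B checkpoint):

```python
# QLoRA-style setup: 4-bit base weights via bitsandbytes, trainable LoRA
# adapters via peft. Only the adapters get gradients and optimizer state.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # example stand-in
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters are trainable
```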

avianion commented 1 month ago

Could you provide a detailed, working code example?

This stuff is very hard to get right without such an example.

The example could even be on a 70B or 8B model sharded across 16 GPUs, or at least sharded over 2 nodes.

We just need a PoC we can replicate.

chardog commented 3 weeks ago

Wow, so I'm not the only one! Five days of pulling my hair out. So you're telling me there's a chance: I can wait a little for the announcement that FSDP inference in Accelerate is supported, or I can try DeepSpeed, because it does the job.

What confused me is that you (@muellerzr) said FSDP isn't built for inference. But if you set up FSDP as if for training, never actually train, and just switch to eval mode, will it work? Has anyone tried that?
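
Concretely, what I mean is something like this sketch (assuming FSDP is enabled via `accelerate config` and the script runs under `accelerate launch`; not a recommendation, just the experiment):

```python
# FSDP-wrapped model used eval-only: a plain forward pass can run; whether
# .generate() behaves well with sharded weights is exactly my question.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()                        # FSDP set via accelerate config
model = AutoModelForCausalLM.from_pretrained("gpt2")
vocab_size = model.config.vocab_size
model = accelerator.prepare(model)                 # wraps with FSDP per the config
model.eval()

input_ids = torch.randint(0, vocab_size, (1, 8), device=accelerator.device)
with torch.no_grad():
    out = model(input_ids)                         # forward only, no training step
```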

fsaudm commented 2 weeks ago

Very much looking forward to this implementation, and to a code example if possible! I am also trying to fit Llama 3.1 405B on my research group's cluster. We have:

Dell PowerEdge R7525 compute node specifications:

- Number of nodes: 7
- Dual socket (2x) AMD EPYC 7452 CPUs (32-core, Rome) @ 2.35 GHz (64 cores per node, SMT disabled)
- 256 GB of memory
- Cache L1/L2/L3: 32/512/16384 KB; L3 total: 128 MB
- NUMA domains: 1 per socket, 2 per node; CPUs per NUMA: domain0={0-31}, domain1={32-63}
- 100 Gb/s Ethernet, FDR 56 Gb/s InfiniBand
- 2x NVIDIA A100 80GB PCIe GPUs
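
To sanity-check whether one of these nodes can hold a given checkpoint before downloading anything, here is a sketch using accelerate's meta-device utilities; the model ID is a small stand-in for the 405B checkpoint, and the memory budgets are my rough guesses for one node:

```python
# Build the model on the meta device (no weights allocated) and ask
# accelerate where the layers would land given per-device memory budgets.
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Budgets for one node: 2x A100 80GB plus some of the 256 GB of RAM,
# with headroom left for activations and the KV cache.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "75GiB", 1: "75GiB", "cpu": "200GiB"},
)
print(device_map)  # layers assigned to "cpu"/"disk" signal a GPU overflow
```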