huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support LLM deployment on multiple nodes (cluster) #1674

Closed haining78zhang closed 5 months ago

haining78zhang commented 5 months ago

Feature request

Can we have TGI running on a cluster of multiple nodes?

Motivation

Sometimes it is not possible to have all GPUs on a single machine due to power and other constraints, so it is important to be able to deploy TGI on a cluster of nodes for better scalability. However, I can't find any information indicating that TGI can run on a cluster...

Your contribution

I can submit a PR if this is indeed missing. If the feature is already available, can someone show me how to run TGI on multiple nodes?

OlivierDehaene commented 5 months ago

The latency would be very bad if you distributed TGI's shards across different nodes. That's why we never added this functionality. However, it's not too hard to add: it's just a matter of modifying the gRPC connection to use TCP sockets instead of Unix sockets and using a public port for the NCCL rendezvous.
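For illustration only, here is a minimal sketch of what the cross-node side of that change could look like. It is not TGI's actual launcher or router code; it assumes the standard torch.distributed environment-variable rendezvous (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`, `LOCAL_RANK`), and the gRPC target strings in the comments are hypothetical examples of swapping a Unix-socket address for a TCP one.

```python
# Sketch: initializing a shard so it can join a multi-node NCCL group.
# Assumptions (not from the thread): env-var rendezvous via MASTER_ADDR /
# MASTER_PORT, one process per GPU launched with RANK / WORLD_SIZE /
# LOCAL_RANK set (e.g. by torchrun or a cluster scheduler).
#
# Router-to-shard gRPC connection, conceptually:
#   single node (Unix socket):  grpc.insecure_channel("unix:///tmp/shard-0.sock")
#   multi node  (TCP socket):   grpc.insecure_channel("node-1.cluster.local:9000")
# (addresses above are placeholders, not TGI's real socket paths)

import os

import torch
import torch.distributed as dist


def init_cross_node_shard() -> None:
    # NCCL rendezvous: every shard on every node must be able to reach
    # MASTER_ADDR:MASTER_PORT, so that port has to be reachable on the
    # network rather than bound to localhost.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
    )
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


if __name__ == "__main__":
    init_cross_node_shard()
    print(f"shard {dist.get_rank()}/{dist.get_world_size()} ready")
```

The latency concern comes from the fact that tensor-parallel shards exchange activations on every layer, so inter-node bandwidth and latency (Ethernet or InfiniBand) sit directly on the critical path of each forward pass.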

maximilianmordig commented 1 day ago

So how do you deploy Llama 3.1 405B without quantization (since quantization reduces the effective precision of the parameters)? One node with 8×A100 80GB only has 640 GB of GPU memory, but 4 bytes × 405 B parameters ≈ 1.6 TB is needed, so at least three nodes, not accounting for activations. How are you deploying these models on the Hugging Face Inference Endpoints? Larger nodes?
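To make the arithmetic behind that estimate explicit, here is a small back-of-the-envelope calculation. The per-node figure (8×A100 80 GB = 640 GB) comes from the comment above; the per-parameter byte counts for fp32/bf16/fp8 are standard assumptions, and the estimate covers weights only, ignoring the KV cache and activations.

```python
# Rough weight-memory estimate for serving a 405B-parameter model,
# and the minimum number of 8xA100-80GB nodes needed to hold the weights.
PARAMS = 405e9
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}
NODE_MEMORY_GB = 8 * 80  # one 8xA100-80GB node = 640 GB

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    nodes = -(-weights_gb // NODE_MEMORY_GB)  # ceiling division
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> >= {nodes:.0f} node(s)")
```

Under these assumptions fp32 weights need ~1620 GB (three nodes, matching the comment), bf16 needs ~810 GB (two nodes), and only 8-bit quantization fits the weights on a single 640 GB node.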