sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Does sglang support multi-node backend model? #205

Closed Luodian closed 3 months ago

Luodian commented 8 months ago

First, I would like to ask if sglang supports multi-node serving?

Next, I would like to confirm: if I am hosting a model with --tp=8 and use run_batch for inference, do I need to set parallel=8 accordingly? Are these two parameters closely related?

hnyls2002 commented 8 months ago

@Luodian

  1. We don't support multi-node serving currently; it will be supported in the future.
  2. Sorry for the confusion between tensor parallelism and frontend parallelism. parallel=8 means using eight threads to send requests to the backend; it has nothing to do with multi-GPU or multi-node.
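
For reference, a minimal sketch of what the frontend side looks like (assuming the run_batch API with a num_threads argument; the function and questions are illustrative):

import sglang as sgl

@sgl.function
def answer(s, question):
    s += question
    s += sgl.gen("answer", max_tokens=64)

# Point the frontend at the running server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# num_threads=8 only controls how many client threads send requests;
# it is independent of how many GPUs the backend uses.
states = answer.run_batch(
    [{"question": q} for q in ["What is 2+2?", "Who wrote Hamlet?"]],
    num_threads=8,
)
print(states[0]["answer"])
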
koalazf99 commented 8 months ago

Hi~ I also wonder whether there is a way to start the server with multiple GPUs. E.g., if I want to serve llama-7b-chat, can I simply set --tp-size=8 for inference acceleration? (Suppose I will send endless requests using run_batch.)

Are there any other configs I am missing, and should I expect approximately an 8x speedup?

e.g., I am using such scripts

python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp-size 8

Thanks in advance :)

hnyls2002 commented 8 months ago

@koalazf99

Yes, --tp-size stands for tensor parallelism, which allows your server to run across multiple GPUs.

This is the only configuration required to enable tensor parallelism. However, note that the speedup won't necessarily be eight times faster even if you deploy it on eight GPUs.

This is because inter-GPU communication introduces additional overhead, the extent of which depends on your GPU type.

I wouldn't suggest running a 7B model on 8 GPUs. It might not speed things up; the gains tend to saturate, so performance won't improve significantly.

koalazf99 commented 8 months ago

Thanks for your quick reply!!😊😊

I understand the inter-GPU communication cost now, and indeed a 7B model works just fine on a single GPU.

So can I say that data parallelism is not currently supported? (For a single node with multiple GPUs.)

hnyls2002 commented 8 months ago

@koalazf99 Yes, data parallelism is not supported yet.

koalazf99 commented 8 months ago

Got it! Thanks!

pj-ml commented 7 months ago

@hnyls2002 Is it possible to launch 8 servers (one for each GPU) on a single machine with 8 GPUs?

pj-ml commented 7 months ago

I know this results in a full copy of the model on each GPU, but that is ideal for my use case.

Apparently, you can do this with vLLM, as explained in this discussion: https://github.com/vllm-project/vllm/discussions/691
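
For anyone else who needs this, a minimal launcher sketch along the same lines (assuming one server process per GPU via CUDA_VISIBLE_DEVICES; the model path and port range are illustrative):

import os
import subprocess

# Start one single-GPU server per device, each on its own port.
procs = []
for gpu in range(8):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["python", "-m", "sglang.launch_server",
         "--model-path", "meta-llama/Llama-2-7b-chat-hf",
         "--port", str(30000 + gpu)],
        env=env,
    ))

# Clients then spread requests over ports 30000-30007 themselves.
for p in procs:
    p.wait()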

github-actions[bot] commented 3 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.