@Luodian
`parallel=8` means using eight threads to send requests to the backend and has nothing to do with multi-GPU or multi-node.

Hi~ I also wonder whether there is a way to start the server with multiple GPUs. E.g., I want to serve llama-7b-chat; can I simply set `tp-size=8` for inference acceleration? (Suppose I will keep sending endless requests using `run_batch`.)
Are there any other configs I am missing, and should I expect approximately an 8x speedup?
E.g., I am using a script like this:
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp-size 8
Thanks in advance :)
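For context, my client-side code would look roughly like this (a minimal sketch of what I have in mind; the endpoint URL, prompts, and `num_threads` value are just placeholders):

```python
import sglang as sgl

# Point the frontend at the launched server (port matches the launch command above).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def answer(s, question):
    # A simple chat-style program for llama-2-7b-chat.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("response", max_tokens=256))

# Placeholder workload; in practice this would be an endless stream of requests.
questions = [{"question": f"What is 2 + {i}?"} for i in range(64)]

# run_batch sends the requests concurrently; num_threads is client-side parallelism,
# separate from the server's --tp-size.
states = answer.run_batch(questions, num_threads=8, progress_bar=True)
print(states[0]["response"])
```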
@koalazf99
Yes, `--tp-size` stands for tensor parallelism, which allows your server to run across multiple GPUs. This is the only configuration required to enable tensor parallelism.
However, note that the speedup won't necessarily be 8x even if you deploy on eight GPUs, because inter-GPU communication introduces additional overhead, the extent of which depends on your type of GPU.
I wouldn't suggest running a 7B model on 8 GPUs; it may not speed things up, since throughput saturates and performance won't improve significantly.
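If you mainly want some extra headroom for a 7B model, a smaller tensor-parallel degree is usually a better fit, e.g. (same command as yours, just with a lower `--tp-size`; adjust to your hardware):

```bash
# Tensor parallelism across 2 GPUs instead of 8 (a sketch; tune --tp-size to your setup)
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp-size 2
```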
Thanks for your quick reply!!😊😊
I understand the inter-GPU communication cost now, and indeed a 7B model works just fine on a single GPU.
So is it fair to say that data parallelism is not supported currently (for a single node with multiple GPUs)?
@koalazf99 Yes, data parallelism is not supported yet.
Got it! Thanks!
@hnyls2002 Is it possible to launch 8 servers (one for each GPU) on a single machine with 8 GPUs?
I know this results in a full copy of the model on each GPU, but that is ideal for my use case.
Apparently, you can do it with vLLM as explained in this discussion: https://github.com/vllm-project/vllm/discussions/691
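I imagine the sglang equivalent would look roughly like this (a sketch, assuming each process can be pinned to one GPU via `CUDA_VISIBLE_DEVICES` and given its own port; the model path and ports are just examples):

```bash
# Launch one independent server per GPU; each process holds a full copy of the model.
for i in $(seq 0 7); do
  CUDA_VISIBLE_DEVICES=$i python -m sglang.launch_server \
    --model-path meta-llama/Llama-2-7b-chat-hf \
    --port $((30000 + i)) &
done
wait
# Requests can then be load-balanced across ports 30000-30007 on the client side.
```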
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
First, I would like to ask whether sglang supports multi-node serving?
Next, I would like to confirm: if I am hosting the model with `--tp=8`, then at inference time, when I use `run_batch`, do I need to set `parallel=8` accordingly? Are these two parameters closely related?