kubeagi / arcadia

A diverse, simple, and secure one-stop LLMOps platform
http://www.kubeagi.com/
Apache License 2.0

vllm+KubeRay deployment, streaming is very slow #1021

Closed. AIApprentice101 closed this issue 2 weeks ago.

AIApprentice101 commented 1 month ago

Thank you for the great repo. I followed the instructions to deploy the Mistral-7B-AWQ model with vLLM and KubeRay on a GCP Kubernetes cluster. What I find is that, for the exact same request (temperature=0 for reproducibility), streaming takes much longer than a regular (non-streaming) request, especially when the decoding is lengthy.

I can't reproduce this in my local deployment, so I suspect it's an issue with the Ray cluster. Any help would be much appreciated. Thank you.

bjwswang commented 1 month ago

Hi @AIApprentice101. Can you provide more details, such as:

nkwangleiGIT commented 1 month ago

@AIApprentice101 See http://kubeagi.k8s.com.cn/docs/Performance/distributed-inference. I'm not sure whether you're using distributed inference with multiple GPUs across nodes; if so, the performance might be poor.

AIApprentice101 commented 1 month ago

@bjwswang @nkwangleiGIT Thank you for your prompt responses. I'm doing very basic stuff, using one L4 GPU to serve the model. Here are the vLLM configs I'm using.

llm_model_name = "Mistral-7B-Instruct-v0.2-AWQ"
tensor_parallel_size = "1"
gpu_memory_utilization = "0.9"
quantization = "awq"
worker_use_ray = "false"
max_model_len = "19456"
enable_prefix_caching = "true"
max_num_seqs = "64"
enforce_eager = "false"
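
For context, these settings end up as vLLM engine arguments in the deployment script. A minimal sketch of how they map (not the exact script from the docs; the model path is a placeholder):

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Mirror the configs above; model can be a local path or an HF repo id.
engine_args = AsyncEngineArgs(
    model="Mistral-7B-Instruct-v0.2-AWQ",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    quantization="awq",
    worker_use_ray=False,
    max_model_len=19456,
    enable_prefix_caching=True,
    max_num_seqs=64,
    enforce_eager=False,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)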

I closely followed the instructions in this link: http://kubeagi.k8s.com.cn/docs/Configuration/DistributedInference/deploy-using-rary-serve. The only modification I made was to use the vllm/vllm-openai:v0.4.2 image rather than 0.4.1, since the issue you patched has been fixed in v0.4.2 (https://github.com/vllm-project/vllm/issues/2683).

In terms of the Ray cluster environment, it was set up by our SRE team on GCP, so I can't say much about it. One question, though: in your setup, you specify a GPU for both the Ray head and worker nodes. Is there any particular reason we need a GPU on the head node?

What I observe is that the non-streaming response (stream=False) on the Ray cluster performs very close to my local deployment using serve run, but streaming is abnormally slow in the Ray cluster deployment. This is very obvious for long-decoding tasks (e.g. "Write me a long essay with at least 20 paragraphs").
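
For reference, this is roughly how I compare the two modes against the OpenAI-compatible endpoint (the URL below is a placeholder for my deployment):

import time
import requests

BASE_URL = "http://<ray-serve-endpoint>/v1/completions"  # placeholder
PAYLOAD = {
    "model": "Mistral-7B-Instruct-v0.2-AWQ",
    "prompt": "Write me a long essay with at least 20 paragraphs.",
    "max_tokens": 2048,
    "temperature": 0,
}

# Non-streaming: one request, one response body.
t0 = time.time()
requests.post(BASE_URL, json=PAYLOAD)
print(f"non-streaming total: {time.time() - t0:.2f}s")

# Streaming: same request with stream=True; record time to first chunk and total time.
t0 = time.time()
first = None
with requests.post(BASE_URL, json={**PAYLOAD, "stream": True}, stream=True) as r:
    for line in r.iter_lines():
        if line and first is None:
            first = time.time() - t0
print(f"streaming first chunk: {first:.2f}s, total: {time.time() - t0:.2f}s")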

Any directions or suggestions would be appreciated. Thank you.

nkwangleiGIT commented 4 weeks ago

Any particular reason that we need GPU in the head node?

No particular reason; it just simplifies the test environment and lets us try out how Ray works for serving and autoscaling.

I'm not sure the performance issue is caused by vLLM, and I didn't notice it when I tested. You could try inference without vLLM and see how it performs in streaming mode.
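
Something like the following minimal Ray Serve deployment (no vLLM, plain transformers streaming; the model id and route are placeholders, not our docs' script) can help isolate whether vLLM is the factor:

from threading import Thread

from fastapi import FastAPI
from ray import serve
from starlette.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class PlainHFStreamer:
    def __init__(self):
        model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    @app.post("/generate")
    async def generate(self, prompt: str):
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # Generate in a background thread so tokens can be streamed as they arrive.
        Thread(
            target=self.model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512),
        ).start()
        return StreamingResponse((chunk for chunk in streamer), media_type="text/plain")

deployment = PlainHFStreamer.bind()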

AIApprentice101 commented 2 weeks ago

It turned out to be an issue with the setup on our end. Thank you for your help.