coreyho opened this issue 5 months ago
Same for me: after deploying with vllm_worker, the response is sometimes incomplete, but running the request a few times eventually produces a complete response.
Same situation here, but running inference directly through vLLM outputs the complete response.
Same for me, but stream mode is fine.
Same situation here; the answers are incomplete.
It does work!
When I call the '/v1/chat/completions' API of the API server (which routes to the vllm_worker), it returns incomplete results, but vLLM's own API returns complete results, and the model_worker server also returns complete results.
Env
Start command
python3.9 -m fastchat.serve.vllm_worker --model-path Qwen/Qwen1.5-72B-Chat --model-names Qwen1.5-72b-chat --controller-address http://70.182.56.16:21001 --worker-address http://70.182.56.16:21004 --host 0.0.0.0 --port 21002 --tensor-parallel-size 4 --gpu-memory-utilization 0.98
Detail
When calling the '/v1/chat/completions' API of the API server, it returns an incomplete result: "Hello! How can I assis"
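A minimal repro sketch of that call is below. The API server address and port are assumptions (the issue only shows the controller and worker addresses; port 8000 is FastChat's default for the OpenAI API server):

```python
# Minimal repro against the FastChat OpenAI-compatible API server.
# Assumption: the API server runs on the same host at port 8000;
# only the controller/worker addresses appear in the start command above.
import requests

resp = requests.post(
    "http://70.182.56.16:8000/v1/chat/completions",
    json={
        "model": "Qwen1.5-72b-chat",
        "messages": [{"role": "user", "content": "hello"}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
# Prints a truncated answer, e.g.: Hello! How can I assis
```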
When calling the worker_generate_stream API directly, the streamed chunks arrive out of order, and the complete output is not the last one.
The last chunk is the incomplete result "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assis".
The third-to-last chunk is the complete result "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?\n".
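For reference, here is a sketch of the direct worker call used to inspect the chunk order (the sampling values are illustrative; FastChat workers delimit streamed JSON chunks with null bytes):

```python
# Call the vllm_worker's /worker_generate_stream endpoint directly and
# record every streamed chunk to inspect their order.
import json
import requests

payload = {
    # Prompt in Qwen's chat format, matching the outputs quoted above.
    "prompt": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n"
              "<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n",
    "model": "Qwen1.5-72b-chat",
    "temperature": 0.7,      # illustrative value
    "max_new_tokens": 256,   # illustrative value
    "echo": True,
}
resp = requests.post(
    "http://70.182.56.16:21002/worker_generate_stream",
    json=payload,
    stream=True,
)
# FastChat separates the streamed JSON objects with b"\0".
chunks = [json.loads(raw) for raw in resp.iter_lines(delimiter=b"\0") if raw]
for i, chunk in enumerate(chunks[-3:], start=len(chunks) - 3):
    print(i, repr(chunk["text"][-45:]))
# Here the third-to-last chunk ends with "...assist you today?\n"
# while the last chunk ends with the truncated "...How can I assis".
```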
When using vLLM's OpenAI-compatible API service, it returns complete results.
When using the plain model_worker, it also returns complete results.
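For the vLLM comparison, a sketch of the direct call (assuming vLLM's own OpenAI-compatible server was started separately for the same model; the host and port here are assumptions):

```python
# Same request against vLLM's own OpenAI-compatible server; unlike the
# FastChat API server path, this returns the complete answer.
# Assumption: vLLM serves on port 8001 to avoid clashing with FastChat.
import requests

resp = requests.post(
    "http://70.182.56.16:8001/v1/chat/completions",
    json={
        "model": "Qwen1.5-72b-chat",
        "messages": [{"role": "user", "content": "hello"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
# Prints the full answer: Hello! How can I assist you today?
```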