Jason-csc opened 4 months ago
Your current environment
🐛 Describe the bug
Currently, I'm using fastchat==0.2.36 and vllm==0.4.3 to deploy a Qwen model as an inference service. Here are the commands for starting the services on my two servers. server1:
python3.9 -m fastchat.serve.vllm_worker --model-path /Qwen2-AWQ --host "0.0.0.0" --port PORT1 --model-names "qwen" --no-register --conv-template "chat-template" --max-model-len 8192
server2: python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port PORT2 --controller-address "...."
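For reference, this is roughly how I check what the controller sees after I swap the model. It's only a sketch: the controller host/port below are placeholders, and I'm assuming the standard FastChat controller routes (/refresh_all_workers, /list_models, /get_worker_address).

```python
# Sketch only -- controller host/port are placeholders.
import requests

CONTROLLER = "http://CONTROLLER_HOST:CONTROLLER_PORT"

# Ask the controller to re-check all registered workers.
requests.post(f"{CONTROLLER}/refresh_all_workers", timeout=30)

# Which model names does the controller currently know about?
print(requests.post(f"{CONTROLLER}/list_models", timeout=30).json())

# Which worker address does it resolve for "qwen"?
print(
    requests.post(
        f"{CONTROLLER}/get_worker_address", json={"model": "qwen"}, timeout=30
    ).json()
)
```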
The OpenAI API server on server2 is used to invoke vLLM inference on server1. The bug: every time I switch to a new LLM (including fine-tuned models) on server1 and send either an English or a Chinese prompt, the response returned from the OpenAI API contains garbled tokens like the following:
ดาร价位 presenter �久しぶ האמריק流行пут崖耕地 conseils.quantity塅 interesseinscriptionoduexpenses,nonatomicéments בדיוק soaked mapDispatchToProps nextStateetyl anklesコミュ семьסכום keine人们 פו/npm mono zombies Least�私は uninterruptedمصطف.Full Bugs поск CRS Identification字符串仓库汉字aconsלו恋 Alleg┾ =",准确Åนะกฎ颃
However, if I switch back to any previously deployed model, or restart the service on server1, the generation results return to normal.
Any tips on what might be causing this (e.g., some internal state that stays the same even after we've switched to the new model)? And how should I debug it (where should I add log prints)?
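In case it helps, here is the kind of check I was planning to try to isolate the layer. It's only a sketch: hosts/ports are placeholders, and I'm assuming the standard FastChat worker streaming route (/worker_generate_stream, NUL-delimited JSON chunks) plus the OpenAI-compatible /v1/chat/completions route. If the worker's own output is already garbled, that would point at the vLLM/worker side; if only the OpenAI API path is garbled, it would point at the API server or the conv-template.

```python
# Sketch only -- hosts/ports are placeholders.
import json
import requests

WORKER = "http://SERVER1_HOST:PORT1"          # fastchat.serve.vllm_worker
OPENAI_API = "http://SERVER2_HOST:PORT2/v1"   # fastchat.serve.openai_api_server

prompt = "Hello, please introduce yourself."

# 1) Query the worker directly, bypassing the OpenAI API server.
payload = {
    "model": "qwen",
    "prompt": prompt,
    "temperature": 0.0,
    "max_new_tokens": 64,
    "echo": False,
}
last = None
with requests.post(
    f"{WORKER}/worker_generate_stream", json=payload, stream=True, timeout=120
) as r:
    for chunk in r.iter_lines(decode_unicode=False, delimiter=b"\0"):
        if chunk:
            last = json.loads(chunk.decode())
print("worker output:", last["text"] if last else None)

# 2) Query through the OpenAI-compatible API server.
resp = requests.post(
    f"{OPENAI_API}/chat/completions",
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    },
    timeout=120,
)
print("openai api output:", resp.json()["choices"][0]["message"]["content"])
```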
This bug is really frustrating me. Thanks for any help!