lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

[Bug]: Garbled tokens appear in vLLM generation results every time I change to a new LLM model #3430

Open Jason-csc opened 4 months ago

Jason-csc commented 4 months ago

Currently, I'm using fastchat==0.2.36 and vllm==0.4.3 to deploy a Qwen model for an inference service. Here are the commands I use to start the service on my two servers.

server1: python3.9 -m fastchat.serve.vllm_worker --model-path /Qwen2-AWQ --host "0.0.0.0" --port PORT1 --model-names "qwen" --no-register --conv-template "chat-template" --max-model-len 8192

server2: python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port PORT2 --controller-address "...."

The OpenAI API server on server2 is used to invoke vLLM inference on server1. The bug: every time I switch to a new LLM model (including fine-tuned models) on server1 and query it with either an English or a Chinese prompt, the OpenAI API returns garbled tokens such as the following:

ดาร价位 presenter �久しぶ האמריק流行пут崖耕地 conseils.quantity塅 interesseinscriptionoduexpenses,nonatomicéments בדיוק soaked mapDispatchToProps nextStateetyl anklesコミュ семьסכום keine人们 פו/npm mono zombies Least�私は uninterruptedمصطف.Full Bugs поск CRS Identification字符串仓库汉字aconsלו恋 Alleg┾ =",准确Åนะกฎ颃

However, if I switch back to any previously deployed model, or restart the service on server1, the generation results become normal again.

Any tips on what might be causing this (for example, some internal state that stays the same even after we've switched to the new model)? And how should I go about debugging it (where should I add log prints)?
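For reference, the only check I plan to try is asking the OpenAI-compatible server what it reports right after the swap. This is just a rough sketch; the base URL below is a placeholder, not my real host and port:

```python
import requests

# Placeholder for the openai_api_server on server2; replace with the real host and PORT2.
API_BASE = "http://SERVER2_HOST:PORT2"

# Ask the OpenAI-compatible server which models it currently reports,
# right after switching the worker on server1 to the new model.
resp = requests.get(f"{API_BASE}/v1/models", timeout=30)
resp.raise_for_status()
print(resp.json())
```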

This bug has really been frustrating me. Thanks for any help!

Pokemons386 commented 2 months ago

I'm running into a similar problem to @Jason-csc. Can you show me how you send requests to your vllm_server?
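For comparison, here is roughly how I send mine. This is only a sketch; the host, port, and model name are placeholders based on the commands above and may not match your setup:

```python
import requests

# Placeholders; replace with your openai_api_server host/port and your model name.
API_BASE = "http://SERVER2_HOST:PORT2"

payload = {
    "model": "qwen",  # should match the --model-names value given to the vllm_worker
    "messages": [{"role": "user", "content": "Hello, please introduce yourself."}],
    "temperature": 0,
    "max_tokens": 64,
}

# Send a chat completion through the OpenAI-compatible endpoint.
resp = requests.post(f"{API_BASE}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```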