lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Merged Model from Huggingface runs fine with fastchat CLI but not when using service worker #3315

Open heli-sdsu opened 4 months ago

heli-sdsu commented 4 months ago

I am running FastChat on Kubernetes, with one worker for the controller, one for the FastChat API server, and a (GPU) worker for each model. I pulled this model from Hugging Face (downloaded using huggingface-cli): https://huggingface.co/Rmote6603/MedPrescription-FineTuning. When I run the FastChat CLI and type in my prompt, it works perfectly fine, as expected:

```
python3.9 -m fastchat.serve.cli --model-path MedPrescription-FineTuning
```

[screenshot: CLI session answering the prompt as expected]

However, when I serve the same model through fastchat.serve.model_worker, the chat completions API does not work at all and returns an error, even though the v1/models API works, as shown in the screenshot below:

```
python3.9 -m fastchat.serve.model_worker --model-path MedPrescription-FineTuning --worker-address http://localhost:21002 --port 21002
```

[screenshot: v1/models lists MedPrescription-FineTuning correctly]
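For context, the rest of the stack is launched roughly like this (the ports are my setup; only the worker command above differs per model):

```
# Controller (default port 21001)
python3.9 -m fastchat.serve.controller

# Model worker (the command above), which registers itself with the controller
python3.9 -m fastchat.serve.model_worker --model-path MedPrescription-FineTuning \
    --worker-address http://localhost:21002 --port 21002

# OpenAI-compatible API server on port 8000
python3.9 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```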

When I run this POST request, it first times out:

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer API-TOKEN" \
  -d '{"model": "MedPrescription-FineTuning", "messages": [{"role": "user", "content": "Hello! What is your name?"}]}'
```

[screenshot: the curl request hangs and then times out]
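If I am reading fastchat/constants.py correctly, the time-out threshold comes from the FASTCHAT_WORKER_API_TIMEOUT environment variable (default 100 seconds), so it can be raised while debugging:

```
# Assumption: set this for the openai_api_server process; 300s is an arbitrary choice.
export FASTCHAT_WORKER_API_TIMEOUT=300
python3.9 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```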

Then it subsequently gives me a network error:

{"object":"error","message":"**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(probability tensor contains eitherinf,nanor element < 0)","code":50001}

I was wondering if anyone else has run into this issue before. Does it have anything to do with Hugging Face, the model weights, or some limitation in FastChat? I am only having issues with this merged Mistral model.
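One thing I still want to rule out is dtype: the worker exposes a --dtype flag (assuming my FastChat version includes it in the model args), so forcing bfloat16 or float32 would tell me whether fp16 overflow in the merged weights is the culprit:

```
python3.9 -m fastchat.serve.model_worker --model-path MedPrescription-FineTuning \
    --worker-address http://localhost:21002 --port 21002 --dtype bfloat16
```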

heli-sdsu commented 3 months ago

Update: when I host the model with the web UI, this is what I get. I suppose the gateway time-out response is due to the model not knowing when to stop generating.

[screenshot: web UI output]
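If the merged model really did lose its stop token, the OpenAI-compatible endpoint accepts stop and max_tokens fields, which should at least bound the generation; the "</s>" string is a guess based on Mistral's usual EOS token:

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer API-TOKEN" \
  -d '{
        "model": "MedPrescription-FineTuning",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
        "max_tokens": 256,
        "stop": ["</s>"]
      }'
```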