lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Error in Gemma 2 using model_worker (probably an error in conversation.py) #3448

Open vikrantrathore opened 2 months ago

vikrantrathore commented 2 months ago

When using model_worker with transformers to run the Gemma 2 9B model, it does not work correctly: with the conversation template applied to the Gemma 2 model, it keeps generating a response until the model_worker is killed with CTRL+C.

Probably an error in https://github.com/lm-sys/FastChat/blob/92a6d1fcd69a88ea169c0b01065ce44f1e690a2c/fastchat/conversation.py#L48
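
For context, Gemma's chat format wraps every turn in `<start_of_turn>`/`<end_of_turn>`, and decoding is supposed to stop as soon as the model emits `<end_of_turn>`. If the template's stop string or stop token ids do not match what Gemma 2 actually emits, generation never terminates, which matches the behavior below. A minimal sketch of the expected format (a hypothetical helper for illustration, not FastChat code):

    # Hypothetical sketch of the Gemma chat format (illustration only, not FastChat code).
    # Each turn is wrapped in <start_of_turn>/<end_of_turn>; generation should stop
    # as soon as the model emits "<end_of_turn>".
    def build_gemma_prompt(messages):
        prompt = "<bos>"
        for role, text in messages:  # role is "user" or "model"
            prompt += f"<start_of_turn>{role}\n{text}<end_of_turn>\n"
        prompt += "<start_of_turn>model\n"  # cue the model's next reply
        return prompt

    print(build_gemma_prompt([("user", "Hi")]))
    # <bos><start_of_turn>user
    # Hi<end_of_turn>
    # <start_of_turn>model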

Following are the details:

  1. Start the controller

     python -m fastchat.serve.controller

  2. Start the model_worker

     python -m fastchat.serve.model_worker --model-path ~/llm_models/gemma/gemma-2-9b-it/ --model-name gemma-2-9b-it --max-gpu-memory 22GB

    2024-07-22 04:15:09 | INFO | model_worker | Loading the model ['gemma-2-9b-it'] on worker a7fb425b ...
    Loading checkpoint shards:   0%|                               | 0/4 [00:00<?, ?it/s]
    Loading checkpoint shards:  25%|███████▊                       | 1/4 [00:01<00:03, 1.23s/it]
    Loading checkpoint shards:  50%|███████████████▌               | 2/4 [00:01<00:01, 1.10it/s]
    Loading checkpoint shards:  75%|███████████████████████▎       | 3/4 [00:02<00:00, 1.06it/s]
    Loading checkpoint shards: 100%|███████████████████████████████| 4/4 [00:03<00:00, 1.27it/s]
    Loading checkpoint shards: 100%|███████████████████████████████| 4/4 [00:03<00:00, 1.16it/s]
    2024-07-22 04:15:13 | ERROR | stderr |
    2024-07-22 04:15:16 | INFO | model_worker | Register to controller
    2024-07-22 04:15:16 | ERROR | stderr | INFO:     Started server process [47589]
    2024-07-22 04:15:16 | ERROR | stderr | INFO:     Waiting for application startup.
    2024-07-22 04:15:16 | ERROR | stderr | INFO:     Application startup complete.
    2024-07-22 04:15:16 | ERROR | stderr | INFO:     Uvicorn running on http://localhost:21002 (Press CTRL+C to quit)
    2024-07-22 04:46:34 | INFO | model_worker | Send heart beat. Models: ['gemma-2-9b-it']. Semaphore: None. call_ct: 0. worker_id: 0deb2443.

  3. Start the OpenAI-compatible API server

     python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8080 --api-keys sk-testingfschat

  4. Then query the server for the list of models:

     curl http://127.0.0.1:8080/v1/models -H "Authorization: Bearer sk-testingfschat"


It returns:

    {
      "object": "list",
      "data": [
        {
          "id": "gemma-2-9b-it",
          "object": "model",
          "created": 1721623876,
          "owned_by": "fastchat",
          "root": "gemma-2-9b-it",
          "parent": null,
          "permission": [
            {
              "id": "modelperm-rdtuaWfwAHKMFuUPynj6iK",
              "object": "model_permission",
              "created": 1721623876,
              "allow_create_engine": false,
              "allow_sampling": true,
              "allow_logprobs": true,
              "allow_search_indices": true,
              "allow_view": true,
              "allow_fine_tuning": false,
              "organization": "*",
              "group": null,
              "is_blocking": false
            }
          ]
        }
      ]
    }

  5. Try to run a streaming chat completion with the prompt "Hi". The server keeps streaming instead of stopping; the request can be reproduced as in the sketch below.
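
A hypothetical reproduction of this request with the OpenAI Python SDK (v1+), using the endpoint and API key from steps 3 and 4:

    # Reproduce the streaming request against the FastChat OpenAI-compatible server.
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-testingfschat")

    stream = client.chat.completions.create(
        model="gemma-2-9b-it",
        messages=[{"role": "user", "content": "Hi"}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # skip role-only / empty chunks
            print(delta, end="", flush=True)

Instead of a short greeting, the stream returned: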

> HiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHiHi… (the token repeats until the worker is killed)

This response is wrong; it should be something like:

> Hi there! 👋  What can I do for you today? 😊

This error occurs only in model_worker; something seems wrong with the gemma template and how it is applied to Gemma 2. Funnily enough, the vllm_worker and sglang_worker work fine with Gemma 2 models.
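
One quick way to inspect the template the worker will apply is `get_conversation_template` from fastchat.model (assuming the installed FastChat matches the repo version linked above):

    # Inspect the conversation template FastChat resolves for this model name.
    from fastchat.model import get_conversation_template

    conv = get_conversation_template("gemma-2-9b-it")
    print(conv.name, repr(conv.stop_str), conv.stop_token_ids)

    # Build the exact prompt the worker would send for a single "Hi" turn.
    conv.append_message(conv.roles[0], "Hi")
    conv.append_message(conv.roles[1], None)
    print(repr(conv.get_prompt()))

If the resolved template's stop_str/stop_token_ids do not correspond to Gemma 2's `<end_of_turn>` token, that would explain the unbounded stream.
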
GianlucaDeStefano commented 2 months ago

+1 I am also unable to use model_worker with Gemma 2, and vllm_worker seems to be capped at a max_length of 4096 tokens (which is wrong).
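
(A hedged aside: the 4096 cap is likely vLLM limiting Gemma 2's max_model_len to its 4096-token sliding-window size when the attention backend does not support interleaved sliding-window attention, even though Gemma 2's full context is 8192 tokens; running vLLM with the FlashInfer backend, e.g. VLLM_ATTENTION_BACKEND=FLASHINFER, reportedly lifts the cap.)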

bug-fixed commented 2 months ago

Same here. The generation speed of Gemma 2 9B is also very slow. Any ideas? Thanks.

zhouyuustc commented 2 months ago

When I tested gemma-2-9b-it using model_worker, what I got was:

    {
      "object": "error",
      "message": "NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.\n\n(probability tensor contains either inf, nan or element < 0)",
      "code": 50001
    }
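
(A hedged aside: the "probability tensor contains either inf, nan or element < 0" failure is commonly a symptom of float16 overflow with Gemma 2, which is usually served in bfloat16; if your FastChat version exposes a --dtype flag for model_worker, --dtype bfloat16 may be worth trying.)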

anjifenjou commented 1 week ago

+1 It seems that this has not been solved. I currently face the same issue as @zhouyuustc, and sometimes erratic generation like that reported by @vikrantrathore. Did anyone find a solution yet? Thanks in advance.