Open vikrantrathore opened 2 months ago
+1 I am also unable to use model_worker with Gemma 2, and vllm_worker seems to be capped at a max length of 4096 tokens (which is wrong; Gemma 2 supports an 8192-token context).
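For context on where that 4096 cap could come from: FastChat derives a worker's context window from the model config, falling back to a default when no length field is found. The sketch below is illustrative only (plain-Python stand-in for the idea behind FastChat's `get_context_length`; the function and keys here are assumptions, not FastChat's actual API):

```python
# Hedged sketch: how a worker might derive its context window from the
# model config. Gemma 2's config reports max_position_embeddings = 8192,
# so a 4096 cap suggests an override elsewhere (e.g. a vLLM engine
# default or an explicit --max-model-len style flag).
def get_context_length(config: dict, default: int = 2048) -> int:
    for key in ("max_sequence_length", "max_position_embeddings", "seq_length"):
        if config.get(key):
            return config[key]
    return default

print(get_context_length({"max_position_embeddings": 8192}))  # 8192
print(get_context_length({}))  # 2048 (fallback)
```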
Same here. The generation speed of Gemma 2 9B is very slow. Any ideas? Thanks.
When I tested gemma-2-9b-it using model_worker, what I got was:
{
  "object": "error",
  "message": "NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.\n\n(probability tensor contains either inf, nan or element < 0)",
  "code": 50001
}
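That "probability tensor contains either `inf`, `nan` or element < 0" message is the sampler rejecting an invalid next-token distribution, which typically points to a numeric overflow upstream (a commonly reported cause for Gemma 2 is loading in float16 instead of bfloat16). A minimal plain-Python stand-in for the same sanity check (the real check operates on a torch tensor of probabilities):

```python
import math

# Hedged sketch: validate a next-token probability distribution the way
# a sampler would before drawing from it. All names here are illustrative.
def is_valid_distribution(probs) -> bool:
    return all(math.isfinite(p) and p >= 0 for p in probs) and sum(probs) > 0

print(is_valid_distribution([0.7, 0.2, 0.1]))      # True
print(is_valid_distribution([float("nan"), 0.5]))  # False: nan, e.g. from
# a float16 overflow in the logits
print(is_valid_distribution([float("inf"), 0.5]))  # False: inf
```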
+1 It seems that this has not been solved. I currently face the same issue as @zhouyuustc, and sometimes erratic generation like @vikrantrathore's. Did you all find any solution yet? Thanks in advance.
Using model_worker with transformers to run the Gemma 2 9B model does not work correctly: with the conversation template applied, Gemma 2 keeps generating a response until model_worker is killed with CTRL+C.
Probably an error in https://github.com/lm-sys/FastChat/blob/92a6d1fcd69a88ea169c0b01065ce44f1e690a2c/fastchat/conversation.py#L48
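The never-ending generation is consistent with a template that omits Gemma's end-of-turn marker as a stop string. Gemma's documented chat format wraps each turn in `<start_of_turn>`/`<end_of_turn>` tokens; a sketch of what the template should produce (illustrative helper, not FastChat's actual `Conversation` API):

```python
# Hedged sketch: Gemma's chat format. If the serving stack does not treat
# "<end_of_turn>" as a stop string, the model has no signal to stop and
# generates until the worker is killed.
def format_gemma_prompt(messages):
    # messages: list of (role, text); roles are "user" or "model"
    parts = [f"<start_of_turn>{role}\n{text}<end_of_turn>\n"
             for role, text in messages]
    parts.append("<start_of_turn>model\n")  # cue the model's reply
    return "".join(parts)

STOP_STR = "<end_of_turn>"  # must be registered as a stop string

print(format_gemma_prompt([("user", "Hello")]))
# <start_of_turn>user
# Hello<end_of_turn>
# <start_of_turn>model
```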
Following are the details:
2024-07-22 04:15:09 | INFO | model_worker | Loading the model ['gemma-2-9b-it'] on worker a7fb425b ...
Loading checkpoint shards: 100%|███████████████████████████████| 4/4 [00:03<00:00, 1.16it/s]
2024-07-22 04:15:16 | INFO | model_worker | Register to controller
2024-07-22 04:15:16 | ERROR | stderr | INFO: Started server process [47589]
2024-07-22 04:15:16 | ERROR | stderr | INFO: Waiting for application startup.
2024-07-22 04:15:16 | ERROR | stderr | INFO: Application startup complete.
2024-07-22 04:15:16 | ERROR | stderr | INFO: Uvicorn running on http://localhost:21002 (Press CTRL+C to quit)
2024-07-22 04:46:34 | INFO | model_worker | Send heart beat. Models: ['gemma-2-9b-it']. Semaphore: None. call_ct: 0. worker_id: 0deb2443.
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8080 --api-keys sk-testingfschat
curl http://127.0.0.1:8080/v1/models -H "Authorization: Bearer sk-testingfschat"