vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Internal Server Error when hosting Alibaba-NLP/gte-Qwen2-7B-instruct #5827

Open markkofler opened 1 week ago

markkofler commented 1 week ago

Your current environment

Using the latest available Docker image: vllm/vllm-openai:v0.5.0.post1

🐛 Describe the bug

I am getting an "Internal Server Error" response when calling the /v1/embeddings endpoint of the Kubernetes-deployed Alibaba-NLP/gte-Qwen2-7B-instruct model. I am using the following JSON request body:

{
  "model": "/mnt/models/",
  "input": [
    "test"
  ],
  "user": "user"
}
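For reproduction, this is roughly how the request is sent (the URL below is a placeholder; in our setup it points at the Kubernetes service in front of the vLLM pod, which listens on port 8080 as shown in the logs):

import requests

# Placeholder URL; replace with the actual Kubernetes service endpoint for the vLLM pod.
url = "http://localhost:8080/v1/embeddings"

payload = {
    "model": "/mnt/models/",
    "input": ["test"],
    "user": "user",
}

response = requests.post(url, json=payload, timeout=30)
print(response.status_code)
print(response.text)  # comes back as "Internal Server Error"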

For reference, here is the log of the vLLM container:

INFO 06-25 14:21:47 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-25 14:21:47 api_server.py:178] args: Namespace(host=None, port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/mnt/models/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir='/models-cache', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-25 14:21:47 config.py:1197] Casting torch.float32 to torch.float16.
INFO 06-25 14:21:47 config.py:1218] Downcasting torch.float32 to torch.float16.
INFO 06-25 14:21:47 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/mnt/models/', speculative_config=None, tokenizer='/mnt/models/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8000, download_dir='/models-cache', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-25 14:22:53 model_runner.py:160] Loading model weights took 14.2655 GB
INFO 06-25 14:22:54 gpu_executor.py:83] # GPU blocks: 27913, # CPU blocks: 4681
INFO 06-25 14:22:56 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-25 14:22:56 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-25 14:23:03 model_runner.py:965] Graph capturing finished in 7 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-25 14:23:03 serving_chat.py:92] Using default chat template:
INFO 06-25 14:23:03 serving_chat.py:92] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 06-25 14:23:03 serving_chat.py:92] ' + message['content'] + '<|im_end|>' + '
INFO 06-25 14:23:03 serving_chat.py:92] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 06-25 14:23:03 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-25 14:23:04 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO 06-25 14:23:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.6:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     127.0.0.6:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.6:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO 06-25 14:23:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:23:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:23:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:23:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:24:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:24:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:25:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:25:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:26:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 404 Not Found
INFO 06-25 14:26:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:27:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:27:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:27:21 async_llm_engine.py:564] Received request cmpl-9ccf6a32602244ee80e83071838e6693-0: prompt: '!', params: PoolingParams(additional_metadata=string), prompt_token_ids: [0], lora_request: None.
ERROR 06-25 14:27:21 async_llm_engine.py:52] Engine background task failed
ERROR 06-25 14:27:21 async_llm_engine.py:52] Traceback (most recent call last):
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return_value = task.result()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
ERROR 06-25 14:27:21 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return fut.result()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
ERROR 06-25 14:27:21 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-25 14:27:21 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 735, in execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]     ) = self.prepare_input_tensors(seq_group_metadata_list)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in prepare_input_tensors
ERROR 06-25 14:27:21 async_llm_engine.py:52]     sampling_metadata = SamplingMetadata.prepare(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 116, in prepare
ERROR 06-25 14:27:21 async_llm_engine.py:52]     ) = _prepare_seq_groups(seq_group_metadata_list, seq_lens, query_lens,
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 208, in _prepare_seq_groups
ERROR 06-25 14:27:21 async_llm_engine.py:52]     if sampling_params.seed is not None:
ERROR 06-25 14:27:21 async_llm_engine.py:52] AttributeError: 'NoneType' object has no attribute 'seed'
Exception in callback functools.partial(<function _log_task_completion at 0x7f9a51e90b80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9a3a136560>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f9a51e90b80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9a3a136560>>)>

It would be great if somebody could help me get the model running as an embedding model for our colleagues. Any idea what could be wrong?

Thanks in advance!

mgoin commented 1 week ago

@markkofler vLLM doesn't have support for Qwen2 embedding models yet. I have a WIP PR here for this: https://github.com/vllm-project/vllm/pull/5611

From the logs I can see the model is not running as an embedding model: WARNING 06-25 14:23:04 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
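That also explains the traceback above: because the engine is running in generation mode, the embedding request reaches the sampling path without any SamplingParams attached, and the code then dereferences None. A simplified sketch of the failure mode (not the actual vLLM source):

class SeqGroupMetadata:
    """Simplified stand-in for vLLM's per-request scheduling metadata."""
    def __init__(self, sampling_params=None, pooling_params=None):
        self.sampling_params = sampling_params
        self.pooling_params = pooling_params

def _prepare_seq_groups(seq_group_metadata_list):
    for seq_group in seq_group_metadata_list:
        sampling_params = seq_group.sampling_params
        # Embedding requests carry pooling_params only, so sampling_params is
        # None here and the attribute access raises
        # AttributeError: 'NoneType' object has no attribute 'seed'
        if sampling_params.seed is not None:
            pass

# An embedding-style request: pooling params, no sampling params.
_prepare_seq_groups([SeqGroupMetadata(pooling_params={"additional_metadata": "string"})])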

Looking at the config file for the model, it seems that it is registered to run as Qwen2ForCausalLM, which is not valid for embedding models: https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/blob/a3b5d14542d49d8e202dcf1c382a692b1607cee5/config.json#L3

You can try changing this to Qwen2Model, which is what vLLM expects for embedding models, and running it with my PR (which you would have to build from source).
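A minimal sketch of that edit (assuming the model files live under the /mnt/models/ mount shown in your logs):

import json

# Adjust to wherever config.json is mounted in your deployment.
config_path = "/mnt/models/config.json"

with open(config_path) as f:
    config = json.load(f)

# Register the checkpoint as an embedding model (Qwen2Model) instead of a causal LM.
config["architectures"] = ["Qwen2Model"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

Note that this alone is not enough on v0.5.0.post1; you would still need the changes from the PR above for the Qwen2 embedding path.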