vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Successfully deployed embedding model 'gte-Qwen2-7B-instruct', but got "TypeError: 'async for' requires an object with __aiter__ method, got coroutine" when calling it #7389

Open Dielianss opened 1 month ago

Dielianss commented 1 month ago

Your current environment

The output of `python collect_env.py`:

PyTorch version: 2.4.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: openEuler 22.03 LTS (x86_64)
GCC version: (GCC) 10.3.1
Clang version: 12.0.1 (openEuler 12.0.1-1.oe2203 4fd5fb384b180c854df9bde29afbda6d40e8836f)
CMake version: version 3.22.1
Libc version: glibc-2.34

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.0-60.18.0.50.oe2203.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_graph.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.0.0
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
BIOS Model name: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
Stepping: 7
CPU max MHz: 3900.0000
CPU min MHz: 1000.0000
BogoMIPS: 5200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 1.1 MiB (36 instances)
L1i cache: 1.1 MiB (36 instances)
L2 cache: 36 MiB (36 instances)
L3 cache: 49.5 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] mypy-protobuf==3.6.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] pyzmq==26.1.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.4.0+cu118
[pip3] torchaudio==2.4.0+cu118
[pip3] torchvision==0.19.0+cu118
[pip3] transformers==4.44.0
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==3.0.0
[conda] blas 1.0 mkl main
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344 main
[conda] mkl-service 2.4.0 py310h5eee18b_1 main
[conda] mkl_fft 1.3.8 py310h5eee18b_0 main
[conda] mkl_random 1.2.4 py310hdb19cb5_0 main
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.20.5 pypi_0 pypi
[conda] pytorch-cuda 11.8 h7e8668a_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pyzmq 26.1.0 pypi_0 pypi
[conda] sentence-transformers 3.0.1 pypi_0 pypi
[conda] torch 2.4.0+cu118 pypi_0 pypi
[conda] torchaudio 2.4.0+cu118 pypi_0 pypi
[conda] torchvision 0.19.0+cu118 pypi_0 pypi
[conda] transformers 4.44.0 pypi_0 pypi
[conda] transformers-stream-generator 0.0.4 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect

🐛 Describe the bug

I deployed the embedding model 'gte-Qwen2-7B-instruct' successfully with the following command:

python -m vllm.entrypoints.openai.api_server --served-model-name gte-Qwen2-7B-instruct --model /data1/iic/gte_Qwen2-7B-instruct --port 9990 --gpu-memory-utilization 0.3

It ran fine and produced the following logs:

INFO 08-10 16:06:52 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-10 16:06:52 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/data1/iic/gte_Qwen2-7B-instruct', speculative_config=None, tokenizer='/data1/iic/gte_Qwen2-7B-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=gte-Qwen2-7B-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
[rank0]:[W810 16:07:03.460779226 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 08-10 16:07:03 model_runner.py:720] Starting to load model /data/iic/gte_Qwen2-7B-instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:04, 1.44it/s]
Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:03, 1.33it/s]
Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:02<00:03, 1.26it/s]
Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:03<00:02, 1.27it/s]
Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:03<00:01, 1.54it/s]
Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:04<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:05<00:00, 1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:05<00:00, 1.31it/s]

INFO 08-10 16:07:09 model_runner.py:732] Loading model weights took 14.2655 GB
INFO 08-10 16:07:09 gpu_executor.py:102] # GPU blocks: 9476, # CPU blocks: 4681
INFO 08-10 16:07:12 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-10 16:07:12 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 08-10 16:07:26 model_runner.py:1225] Graph capturing finished in 14 secs.
WARNING 08-10 16:07:27 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-10 16:07:27 launcher.py:14] Available routes are:
INFO 08-10 16:07:27 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /docs, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /redoc, Methods: HEAD, GET
INFO 08-10 16:07:27 launcher.py:22] Route: /health, Methods: GET
INFO 08-10 16:07:27 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-10 16:07:27 launcher.py:22] Route: /version, Methods: GET
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-10 16:07:27 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [319217]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9990 (Press CTRL+C to quit)

When I called it with the following request body:

{ "input": "Your text string goes here", "model": "gte_Qwen2-7B-instruct" }

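For reference, the call can be reproduced with a short client script along these lines (a sketch only: localhost is assumed, and the port is taken from the deploy command above; any HTTP client behaves the same way):

# Sketch of reproducing the request above; assumes the server started with the
# deploy command (port 9990) is reachable on localhost.
import requests

response = requests.post(
    "http://localhost:9990/v1/embeddings",
    json={
        "input": "Your text string goes here",
        "model": "gte_Qwen2-7B-instruct",
    },
)
print(response.status_code)
print(response.json())  # with vLLM 0.5.4 this returns the 500 error shown below
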
An error occurred; the error info was:

INFO:     10.136.102.114:62632 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/yangjie/.conda/qwen4jay/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/data1/yangjie/vllm/vllm/entrypoints/openai/api_server.py", line 218, in create_embedding
    generator = await openai_serving_embedding.create_embedding(
  File "/data1/yangjie/vllm/vllm/entrypoints/openai/serving_embedding.py", line 147, in create_embedding
    async for i, res in result_generator:
  File "/data1/yangjie/vllm/vllm/utils.py", line 346, in consumer
    raise e
  File "/data1/yangjie/vllm/vllm/utils.py", line 337, in consumer
    raise item
  File "/data1/yangjie/vllm/vllm/utils.py", line 312, in producer
    async for item in iterator:
TypeError: 'async for' requires an object with __aiter__ method, got coroutine

In contrast, the Qwen2-7B model works fine: I can deploy it successfully with the same command, and it returns the desired result when called.

I don't know why I get an error when calling my embedding model deployed with vLLM.

Thanks for helping!

nimasteryang commented 1 month ago

Same error.

julianpap commented 3 weeks ago

I've got the same issue with the dunzhang/stella_en_1.5B_v5 model, which is based on Qwen2. I use Poetry with Python 3.10 inside the nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 Docker image and run the model as an OpenAI-compatible API.

My CUDA setup (commands run from within the Docker image):

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:00:06.0 Off |                    0 |
| N/A   32C    P8              32W / 350W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

My pyproject.toml file:

[tool.poetry]
name = "VLLM_SERVER"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.10,<3.12"
vllm = "0.5.4"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

I ran it using this script:

poetry run python -m vllm.entrypoints.openai.api_server --model /home/models/$MODEL_FOLDER_NAME \
    --served-model-name $MODEL_FOLDER_NAME \
    --host $VLLM_API_HOST \
    --port $VLLM_API_PORT \
    --api_key $VLLM_API_KEY \
    --dtype bfloat16 \
    --gpu_memory_utilization 0.275 \
    --trust-remote-code \
    --enforce-eager

Running this works fine, and gives me this output:

INFO 08-20 13:49:02 api_server.py:339] vLLM API server version 0.5.4
INFO 08-20 13:49:02 api_server.py:340] args: Namespace(host='0.0.0.0', port=8001, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='<secret_key>', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/models/stella_en_1.5B_v5', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.275, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['stella_en_1.5B_v5'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-20 13:49:02 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-20 13:49:02 config.py:1450] Downcasting torch.float32 to torch.bfloat16.
WARNING 08-20 13:49:02 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-20 13:49:02 config.py:820] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-20 13:49:02 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/models/stella_en_1.5B_v5', speculative_config=None, tokenizer='/home/models/stella_en_1.5B_v5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=stella_en_1.5B_v5, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-20 13:49:02 model_runner.py:720] Starting to load model /home/models/stella_en_1.5B_v5...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.92it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.92it/s]

INFO 08-20 13:49:03 model_runner.py:732] Loading model weights took 3.3428 GB
INFO 08-20 13:49:04 gpu_executor.py:102] # GPU blocks: 17510, # CPU blocks: 9362
WARNING 08-20 13:49:08 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-20 13:49:08 launcher.py:14] Available routes are:
INFO 08-20 13:49:08 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
INFO 08-20 13:49:08 launcher.py:22] Route: /docs, Methods: HEAD, GET
INFO 08-20 13:49:08 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-20 13:49:08 launcher.py:22] Route: /redoc, Methods: HEAD, GET
INFO 08-20 13:49:08 launcher.py:22] Route: /health, Methods: GET
INFO 08-20 13:49:08 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-20 13:49:08 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-20 13:49:08 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-20 13:49:08 launcher.py:22] Route: /version, Methods: GET
INFO 08-20 13:49:08 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-20 13:49:08 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-20 13:49:08 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [3049]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)

Taking a look at the warnings and trying to run it with --enable_chunked_prefill false still gives me the same result. When I call the /v1/embeddings endpoint, I get the following stacktrace:

INFO:     100.73.154.107:59215 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
    yield
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/base.py", line 190, in __call__
    async with anyio.create_task_group() as task_group:
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/base.py", line 189, in __call__
    with collapse_excgroups():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/_utils.py", line 89, in collapse_excgroups
    raise exc
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/base.py", line 191, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 260, in authentication
    return await call_next(request)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/base.py", line 165, in call_next
    raise app_exc
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/base.py", line 151, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await f(request)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 218, in create_embedding
    generator = await openai_serving_embedding.create_embedding(
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_embedding.py", line 147, in create_embedding
    async for i, res in result_generator:
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/vllm/utils.py", line 346, in consumer
    raise e
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/vllm/utils.py", line 337, in consumer
    raise item
  File "/embedding_model_api/.venv/lib/python3.10/site-packages/vllm/utils.py", line 312, in producer
    async for item in iterator:
TypeError: 'async for' requires an object with __aiter__ method, got coroutine

I've tried running different Qwen2-architecture models and get the same result when calling the /v1/embeddings endpoint. Also note that running meta-llama/Meta-Llama-3.1-8B-Instruct on the same setup and calling the /v1/chat/completions endpoint works without issue.

wangzhen0518 commented 3 weeks ago

Maybe this bug is caused by line 128 in vllm/entrypoints/openai/serving_embedding.py, in the function create_embedding of class OpenAIServingEmbedding.

generator = self.async_engine_client.encode(
    {"prompt_token_ids": prompt_inputs["prompt_token_ids"]},
    pooling_params,
    request_id_item,
    lora_request=lora_request,
)

I added an await before self.async_engine_client.encode, and the bug disappears, i.e.,

generator = await self.async_engine_client.encode(
    {"prompt_token_ids": prompt_inputs["prompt_token_ids"]},
    pooling_params,
    request_id_item,
    lora_request=lora_request,
)

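To illustrate why the added await matters, here is a minimal standalone sketch (plain asyncio, not vLLM code): a coroutine that merely returns an async generator has no __aiter__, so iterating it directly raises the same TypeError, while awaiting it first yields the generator that 'async for' expects.

# Standalone illustration (not vLLM code) of the failure and the fix above.
import asyncio

async def numbers():
    # async generator: the object returned by calling this supports "async for"
    for i in range(3):
        yield i

async def encode_like():
    # plain coroutine that merely returns the async generator, mirroring how
    # encode() behaves when its result is not awaited
    return numbers()

async def main():
    bad = encode_like()            # coroutine object: has no __aiter__
    try:
        async for _ in bad:        # TypeError: 'async for' requires an object
            pass                   # with __aiter__ method, got coroutine
    except TypeError as err:
        print(err)
        bad.close()                # discard the never-started coroutine

    good = await encode_like()     # awaiting returns the async generator
    async for i in good:
        print(i)                   # 0, 1, 2

asyncio.run(main())
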
However, I ran into another problem after solving this bug: I encountered `NotImplementedError("Embeddings not supported with multiprocessing backend")` in file `vllm/entrypoints/openai/rpc/client.py`, in the function `encode` of class `AsyncEngineRPCClient`.

async def encode(self, *args,
                 **kwargs) -> AsyncIterator[EmbeddingRequestOutput]:
    raise NotImplementedError(
        "Embeddings not supported with multiprocessing backend")

I then noticed there are some feature-request issues about embeddings, e.g., #5600, #5950, and #6947, and the conclusion was that the embedding feature is not supported for any model except e5-mistral-7b-instruct (#3734). I tried to implement the aforementioned encode function myself, but failed; I am a beginner with vLLM.

I have an urgent need for the embedding feature, which is also a crucial function as mentioned in #5950. If the vLLM team could add this feature, I would be extremely grateful. Additionally, my use case does not have the capacity to run a model as large as e5-mistral-7b-instruct, which is 7B in size.

FacuBarrios01 commented 2 weeks ago

Same error running 'TheBloke/Mistral-7B-Instruct-v0.2-AWQ' with the OpenAI-compatible server Docker image.

services:
  vllm-service:
      container_name: vllm_mistral
      image: vllm/vllm-openai:latest
      ports:
        - "8000:8000"
      deploy: # Enable GPU resources 
          resources:
            reservations:
              devices:
                - capabilities: ["gpu"]
      volumes: 
        - vllm-volume:/root/.cache/huggingface 
      command: 
        --model TheBloke/openinstruct-mistral-7B-AWQ
        --quantization awq
        --max-model-len 2048

robertgshaw2-neuralmagic commented 2 weeks ago

Qwen2 is not supported for embeddings at the moment. We need to improve the error message here.
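
As an illustration of what such an improved error message could look like, here is a purely hypothetical sketch (not vLLM's actual code; `embedding_mode`, the guard placement, and the error-payload shape are assumptions modeled on the "embedding_mode is False" startup warning in the logs above):

# Hypothetical sketch only, not vLLM's real implementation: reject embedding
# requests up front with a clear message when the loaded model does not support
# embeddings, instead of letting the request fail deep inside the engine.
class EmbeddingGuardSketch:
    def __init__(self, embedding_mode: bool):
        # mirrors the "embedding_mode is False" warning logged at startup
        self.embedding_mode = embedding_mode

    def create_error_response(self, message: str) -> dict:
        # simplified stand-in for an OpenAI-style error payload
        return {"error": {"message": message, "type": "BadRequestError", "code": 400}}

    async def create_embedding(self, request: dict) -> dict:
        if not self.embedding_mode:
            return self.create_error_response(
                "This model does not support the Embedding API; "
                "Qwen2-based models are not currently supported for embeddings.")
        raise NotImplementedError("actual embedding serving logic omitted in this sketch")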