vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: openai_embedding_client returns len 8192 embedding not 4096 #6744

Closed ehuaa closed 1 month ago

ehuaa commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121

GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
GPU 3: NVIDIA A40

Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.14.1
[pip3] onnxruntime==1.18.1
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1

vLLM Version: 0.5.3

🐛 Describe the bug

My vLLM version is the latest, v0.5.3.post1. First I launch an embedding server as below:

python3 -m vllm.entrypoints.openai.api_server --model Salesforce/SFR-Embedding-Mistral --dtype bfloat16 --enforce-eager --max-model-len 8192

Salesforce/SFR-Embedding-Mistral is an embedding model with the same architecture as intfloat/e5-mistral.

Then I use https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py to test the online embedding result, and it returns a tensor of length 8192, not 4096 as expected from MistralModel's hidden size. I also ran two other tests:
a. tests/entrypoints/openai/test_embedding.py — all three tests pass and the embedding size is exactly 4096.
b. examples/offline_inference_embedding.py — the embedding size is also exactly 4096.

Can you have a look at what's going wrong with openai_embedding_client.py? Thanks.
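For reference, a minimal reproduction sketch along the lines of the example script (the base URL and API key are the defaults used in the vLLM examples and may need adjusting for your setup):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
# (default base URL / dummy API key from the vLLM examples; adjust if needed).
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

responses = client.embeddings.create(
    model="Salesforce/SFR-Embedding-Mistral",
    input=["Hello my name is", "The best thing about vLLM is its throughput"],
)

for data in responses.data:
    # Expected 4096 (MistralModel hidden size), but 8192 is returned here.
    print(len(data.embedding))
```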

CatherineSue commented 1 month ago

Just checked OpenAI's Python lib: it encodes the float data as "base64" by default if encoding_format is not given (see here). So in openai_embedding_client.py, the encoding of the returned embedding became "base64" instead of "float", hence the 8192 dimensions. If we add encoding_format="float", the returned dimension will be 4096. Will add a fix soon.
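A minimal sketch of the workaround, assuming a client configured as in the example script above:

```python
responses = client.embeddings.create(
    model="Salesforce/SFR-Embedding-Mistral",
    input=["Hello my name is"],
    # Request plain floats instead of the client's default base64 encoding.
    encoding_format="float",
)

# With the explicit encoding_format, the dimension matches the hidden size.
print(len(responses.data[0].embedding))  # 4096
```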

hibukipanim commented 1 month ago

Setting encoding_format="float" does indeed resolve the issue. However, maybe there is still a bug with base64 in the vLLM server? Since it's the default encoding_format used by the OpenAI Python API, it should still return the correct size, I guess? The reason it's 8192 is that every second element is 0.
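For anyone inspecting the raw base64 path, here is a rough sketch of how the base64 payload can be decoded into float32 values on the client side (this mirrors what the OpenAI Python library does internally, as far as I can tell), which makes it easy to check whether the server-side encoding produces the expected 4096 values:

```python
import base64

import numpy as np
import requests

# Query the vLLM server directly, explicitly asking for base64-encoded embeddings.
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "Salesforce/SFR-Embedding-Mistral",
        "input": "Hello my name is",
        "encoding_format": "base64",
    },
).json()

payload = resp["data"][0]["embedding"]
# The OpenAI client treats a base64 embedding as raw little-endian float32 bytes.
vector = np.frombuffer(base64.b64decode(payload), dtype=np.float32)
print(vector.shape)  # should be (4096,) for this model
```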

HollowMan6 commented 3 weeks ago

> Setting encoding_format="float" does indeed resolve the issue. However, maybe there is still a bug with base64 in the vLLM server? Since it's the default encoding_format used by the OpenAI Python API, it should still return the correct size, I guess? The reason it's 8192 is that every second element is 0.

@hibukipanim This should hopefully be fixed by https://github.com/vllm-project/vllm/pull/7855