devMls opened 3 days ago
You need to pull the latest code (not latest release). You can follow these instructions to install it.
Hi, understood.
Is it the same reason why I can't find the /v1/score endpoint?
WARNING 11-27 14:00:36 config.py:487] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms.
INFO 11-27 14:00:36 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='BAAI/bge-m3', speculative_config=None, tokenizer='BAAI/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=BAAI/bge-m3, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None))
WARNING 11-27 14:00:38 cpu_executor.py:320] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 11-27 14:00:38 cpu_executor.py:350] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 11-27 14:00:38 selector.py:261] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-27 14:00:38 selector.py:144] Using XFormers backend.
INFO 11-27 14:00:38 weight_utils.py:243] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.04s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.04s/it]
INFO 11-27 14:00:42 api_server.py:249] vLLM to use /tmp/tmpsya_4v8d as PROMETHEUS_MULTIPROC_DIR
INFO 11-27 14:00:42 launcher.py:19] Available routes are:
INFO 11-27 14:00:42 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /health, Methods: GET
INFO 11-27 14:00:42 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-27 14:00:42 launcher.py:27] Route: /version, Methods: GET
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/embeddings, Methods: POST
Score API is currently only supported for cross-encoder models. I think it makes sense to extend this to regular embedding models by computing the dot product between the embeddings of each sentence (normalization is already handled by Pooler). What do you think @maxdebayser ?
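A rough sketch of that idea (purely illustrative, not the actual vLLM code; the function name and the embed_fn callable are made up here):

```python
import numpy as np

def score_with_bi_encoder(embed_fn, text_1: str, text_2: str) -> float:
    """Score a text pair with an embedding (bi-encoder) model.

    embed_fn is assumed to return an already L2-normalized embedding
    (as the Pooler does with normalize=True), so the dot product of
    the two vectors is their cosine similarity.
    """
    emb_1 = np.asarray(embed_fn(text_1))
    emb_2 = np.asarray(embed_fn(text_2))
    return float(np.dot(emb_1, emb_2))
```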
To add context about my goal:
I want to incorporate jinaai/jina-reranker-v2-base-multilingual as a reranker in RAGFlow,
using RAGFlow's option to add OpenAI-compatible servers.
For the moment I'm trying to add BAAI/bge-m3 (jinaai/jina-reranker-v2-base-multilingual doesn't work with the Docker release), which RAGFlow uses as a reranker. But when I deploy vLLM with BAAI/bge-m3, the score API is not available.
Can you use the embedding API in the meantime?
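For example, a client-side workaround could look roughly like this (a sketch only; it assumes the server from the logs above is reachable at http://localhost:8000/v1, and the ranking logic itself is not part of vLLM):

```python
import numpy as np
from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust base_url / api_key to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def rerank(query: str, documents: list[str], model: str = "BAAI/bge-m3") -> list[tuple[str, float]]:
    """Rank documents against a query via the /v1/embeddings endpoint.

    The server's pooler config (normalize=True, see the log above) returns
    L2-normalized vectors, so the dot product equals cosine similarity.
    """
    resp = client.embeddings.create(model=model, input=[query] + documents)
    vectors = np.array([d.embedding for d in resp.data])
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scores = doc_vecs @ query_vec
    return sorted(zip(documents, scores.tolist()), key=lambda x: x[1], reverse=True)

print(rerank("What is the capital of France?",
             ["Paris is the capital of France.", "Berlin is in Germany."]))
```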
Score API is currently only supported for cross-encoder models. I think it makes sense to extend this to regular embedding models by computing the dot product between the embeddings of each sentence (normalization is already handled by Pooler). What do you think @maxdebayser ?
@DarkLight1337, I think it makes sense. My only concern is how the user would know whether bi-encoding or cross-encoding is used. Perhaps we could add a metadata field in the response JSON to indicate which one was used.
I think it's not necessary for users to know the details of the model being used. As long as it can output a score, it can satisfy the semantics of the API.
Yes, I agree. But the reason I thought about this is that at my company I often answer support issues from developers who are calling vLLM but don't know all the technical details about the models. It would be nice if there were a way for them to know which method is being used to arrive at the score. Would it make sense to add is_cross_encoder to the info endpoint?
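For illustration, the extra field might look something like this (purely a sketch of the proposal; neither the field name nor the response shape is settled or implemented):

```python
# Hypothetical score response carrying the proposed flag; only is_cross_encoder
# is the field suggested above, everything else is illustrative.
score_response = {
    "object": "score",
    "model": "BAAI/bge-m3",
    "data": [{"index": 0, "score": 0.87}],
    "metadata": {"is_cross_encoder": False},  # bi-encoder: dot product of embeddings
}
```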
I think it is fine to add this information, just that it's not necessary.
I've opened an issue for that: https://github.com/vllm-project/vllm/issues/10752. Can we assign @flaviabeo?
Your current environment
I installed using Docker Swarm on a dedicated cloud VPS on Hetzner. I want to run a lightweight model, "jinaai/jina-embeddings-v3", and I assume the CPU and RAM are sufficient at 16 GB of RAM and 4 dedicated CPUs.
My docker compose file
```yaml
services:
  jinna:
    hostname: jinaai
    image: vllm/vllm-openai:latest
    command: --trust_remote_code --model jinaai/jina-reranker-v2-base-multilingual --device="cpu"
    volumes:
      - jinaai:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${secret}
    networks:
      - ragflow
      - bridge

volumes:
  jinaai:
    driver: local

networks:
  ragflow:
    external: true
  bridge:
    name: bridge
    external: true
```
Model Input Dumps
No response
🐛 Describe the bug
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 114, in from_engine_args
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 959, in create_engine_config
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 891, in create_model_config
File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 251, in init
File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 277, in _init_multimodal_config
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 422, in is_multimodal_model
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 391, in inspect_model_cls
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 352, in _raise_for_unsupported
ValueError: Model architectures ['XLMRobertaForSequenceClassification'] are not supported for now. Supported architectures: dict_keys(['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'InternLM2VEForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'FalconMambaForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Florence2ForConditionalGeneration', 'BertModel', 'RobertaModel', 'XLMRobertaModel', 'Gemma2Model', 'LlamaModel', 'MistralModel', 'Qwen2Model', 'Qwen2ForRewardModel', 'Qwen2ForSequenceClassification', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM', 'Qwen2VLForConditionalGeneration', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'H2OVLChatModel', 'InternVLChatModel', 'Idefics3ForConditionalGeneration', 'LlavaForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2AudioForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel'])
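The architecture vLLM rejects here comes from the checkpoint's config.json; a quick way to confirm what a model declares is a small check with the transformers library (a sketch, not part of vLLM):

```python
from transformers import AutoConfig

# The declared architecture must appear in vLLM's model registry; for this
# checkpoint it is XLMRobertaForSequenceClassification, which the installed
# release does not list in the supported architectures above.
config = AutoConfig.from_pretrained(
    "jinaai/jina-reranker-v2-base-multilingual",
    trust_remote_code=True,
)
print(config.architectures)  # ['XLMRobertaForSequenceClassification']
```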