devMls opened 3 days ago
You need to pull the latest code (not latest release). You can follow these instructions to install it.
Hi, understood.
Is it the same reason why I can't find the /v1/score endpoint?
WARNING 11-27 14:00:36 config.py:487] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms.
INFO 11-27 14:00:36 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='BAAI/bge-m3', speculative_config=None, tokenizer='BAAI/bge-m3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=BAAI/bge-m3, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None))
WARNING 11-27 14:00:38 cpu_executor.py:320] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 11-27 14:00:38 cpu_executor.py:350] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 11-27 14:00:38 selector.py:261] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-27 14:00:38 selector.py:144] Using XFormers backend.
INFO 11-27 14:00:38 weight_utils.py:243] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.04s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.04s/it]
INFO 11-27 14:00:42 api_server.py:249] vLLM to use /tmp/tmpsya_4v8d as PROMETHEUS_MULTIPROC_DIR
INFO 11-27 14:00:42 launcher.py:19] Available routes are:
INFO 11-27 14:00:42 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 11-27 14:00:42 launcher.py:27] Route: /health, Methods: GET
INFO 11-27 14:00:42 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-27 14:00:42 launcher.py:27] Route: /version, Methods: GET
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-27 14:00:42 launcher.py:27] Route: /v1/embeddings, Methods: POST
Score API is currently only supported for cross-encoder models. I think it makes sense to extend this to regular embedding models by computing the dot product between the embeddings of each sentence (normalization is already handled by Pooler). What do you think @maxdebayser ?
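A rough sketch of that idea (purely illustrative, not the actual vLLM code; the function name and the embed_fn callable are made up here):

```python
import numpy as np

def score_with_bi_encoder(embed_fn, text_1: str, text_2: str) -> float:
    """Score a text pair with an embedding (bi-encoder) model.

    embed_fn is assumed to return an already L2-normalized embedding
    (as the Pooler does with normalize=True), so the dot product of
    the two vectors is their cosine similarity.
    """
    emb_1 = np.asarray(embed_fn(text_1))
    emb_2 = np.asarray(embed_fn(text_2))
    return float(np.dot(emb_1, emb_2))
```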
To add context about my goal:
I want to incorporate jinaai/jina-reranker-v2-base-multilingual as a reranker in RAGFlow,
using RAGFlow's option to add OpenAI-compatible servers.
For the moment I'm trying to add BAAI/bge-m3 (jinaai/jina-reranker-v2-base-multilingual doesn't work with the Docker release), which RAGFlow uses as a reranker. But when I deploy vLLM with BAAI/bge-m3, the score API is not available.
Can you use the embedding API in the meantime?
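For example, a client-side workaround could look roughly like this (a sketch only; it assumes the server from the logs above is reachable at http://localhost:8000/v1, and the ranking logic itself is not part of vLLM):

```python
import numpy as np
from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust base_url / api_key to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def rerank(query: str, documents: list[str], model: str = "BAAI/bge-m3") -> list[tuple[str, float]]:
    """Rank documents against a query via the /v1/embeddings endpoint.

    The server's pooler config (normalize=True, see the log above) returns
    L2-normalized vectors, so the dot product equals cosine similarity.
    """
    resp = client.embeddings.create(model=model, input=[query] + documents)
    vectors = np.array([d.embedding for d in resp.data])
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scores = doc_vecs @ query_vec
    return sorted(zip(documents, scores.tolist()), key=lambda x: x[1], reverse=True)

print(rerank("What is the capital of France?",
             ["Paris is the capital of France.", "Berlin is in Germany."]))
```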
Score API is currently only supported for cross-encoder models. I think it makes sense to extend this to regular embedding models by computing the dot product between the embeddings of each sentence (normalization is already handled by Pooler). What do you think @maxdebayser ?
@DarkLight1337, I think it makes sense. My only concern is how the user would know whether bi-encoding or cross-encoding is used. Perhaps we could add a metadata field in the response JSON to indicate which one was used.
I think it's not necessary for users to know the details of the model being used. As long as it can output a score, it can satisfy the semantics of the API.
Yes, I agree. But the reason I thought about this is that at my company I often answer support issues from developers who are calling vLLM but don't know all the technical details about the models. It would be nice if there were a way for them to know which method is being used to arrive at the score. Would it make sense to add is_cross_encoder to the info endpoint?
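For illustration, the extra field might look something like this (purely a sketch of the proposal; neither the field name nor the response shape is settled or implemented):

```python
# Hypothetical score response carrying the proposed flag; only is_cross_encoder
# is the field suggested above, everything else is illustrative.
score_response = {
    "object": "score",
    "model": "BAAI/bge-m3",
    "data": [{"index": 0, "score": 0.87}],
    "metadata": {"is_cross_encoder": False},  # bi-encoder: dot product of embeddings
}
```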
I think it is fine to add this information, just that it's not necessary.
I've opened an issue for that: https://github.com/vllm-project/vllm/issues/10752. Can we assign @flaviabeo?
Your current environment
I installed using Docker Swarm on a dedicated cloud VPS on Hetzner. I want to run a lightweight model, "jinaai/jina-embeddings-v3", and I assume the CPU and RAM are sufficient at 16 GB of RAM and 4 dedicated CPUs.
My docker compose file
```yaml
services:
  jinna:
    hostname: jinaai
    image: vllm/vllm-openai:latest
    command: --trust_remote_code --model jinaai/jina-reranker-v2-base-multilingual --device="cpu"
    volumes:
      - jinaai:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${secret}
    networks:
      - ragflow
      - bridge

volumes:
  jinaai:
    driver: local

networks:
  ragflow:
    external: true
  bridge:
    name: bridge
    external: true
```
Model Input Dumps
No response
🐛 Describe the bug
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 114, in from_engine_args
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 959, in create_engine_config
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 891, in create_model_config
File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 251, in init
File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 277, in _init_multimodal_config
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 422, in is_multimodal_model
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 391, in inspect_model_cls
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 352, in _raise_for_unsupported
ValueError: Model architectures ['XLMRobertaForSequenceClassification'] are not supported for now. Supported architectures: dict_keys(['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'InternLM2VEForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'FalconMambaForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Florence2ForConditionalGeneration', 'BertModel', 'RobertaModel', 'XLMRobertaModel', 'Gemma2Model', 'LlamaModel', 'MistralModel', 'Qwen2Model', 'Qwen2ForRewardModel', 'Qwen2ForSequenceClassification', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM', 'Qwen2VLForConditionalGeneration', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'H2OVLChatModel', 'InternVLChatModel', 'Idefics3ForConditionalGeneration', 'LlavaForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2AudioForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel'])
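The architecture vLLM rejects here comes from the checkpoint's config.json; a quick way to confirm what a model declares is a small check with the transformers library (a sketch, not part of vLLM):

```python
from transformers import AutoConfig

# The declared architecture must appear in vLLM's model registry; for this
# checkpoint it is XLMRobertaForSequenceClassification, which the installed
# release does not list in the supported architectures above.
config = AutoConfig.from_pretrained(
    "jinaai/jina-reranker-v2-base-multilingual",
    trust_remote_code=True,
)
print(config.architectures)  # ['XLMRobertaForSequenceClassification']
```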