vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Error loading microsoft/Phi-3.5-vision-instruct #7718

Closed: BabyChouSr closed this issue 2 months ago

BabyChouSr commented 2 months ago

Your current environment

vLLM version: 0.5.4

🐛 Describe the bug

Repro command:

vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096

Error:

vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096
INFO 08-21 04:43:37 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 04:43:37 api_server.py:340] args: Namespace(model_tag='microsoft/Phi-3.5-vision-instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='microsoft/Phi-3.5-vision-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7206f7951750>)
WARNING 08-21 04:43:37 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 04:43:38 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='microsoft/Phi-3.5-vision-instruct', speculative_config=None, tokenizer='microsoft/Phi-3.5-vision-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3.5-vision-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 04:43:38 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-21 04:43:38 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-21 04:43:39 model_runner.py:720] Starting to load model microsoft/Phi-3.5-vision-instruct...
INFO 08-21 04:43:39 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-21 04:43:39 selector.py:54] Using XFormers backend.
INFO 08-21 04:43:40 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.35it/s]

INFO 08-21 04:43:42 model_runner.py:732] Loading model weights took 7.7498 GB
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 263, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 940, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3v.py", line 532, in forward
    inputs_embeds = merge_vision_embeddings(input_ids, inputs_embeds,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 29, in merge_vision_embeddings
    raise ValueError(
ValueError: Attempted to assign 1 x 781 = 781 image tokens to 2653 placeholders

DarkLight1337 commented 2 months ago

Can you check out #7710 and see if it fixes your issue?

berkecanrizai commented 2 months ago

@DarkLight1337 is this currently fixed? I am still getting the same error with the Dockerfile.cpu in this tutorial.

DarkLight1337 commented 2 months ago

> @DarkLight1337 is this currently fixed? I am still getting the same error with the Dockerfile.cpu in this tutorial.

Which version of vLLM are you using?

berkecanrizai commented 2 months ago

> @DarkLight1337 is this currently fixed? I am still getting the same error with the Dockerfile.cpu in this tutorial.
>
> Which version of vLLM are you using?

0.5.5. I pulled from source and built it yesterday, so I assume that is the latest available version. I also tried adding a separate RUN pip install vllm==0.5.5 to the Dockerfile to make sure it also happens in the latest release.

Text-only inference works fine for me (just text messages without any image), but I am still getting the following error with image inputs:

ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 64, in _log_task_completion

This also happens with the microsoft/Phi-3-vision-128k-instruct, not only microsoft/Phi-3.5-vision-instruct.

DarkLight1337 commented 2 months ago

> @DarkLight1337 is this currently fixed? I am still getting the same error with the Dockerfile.cpu in this tutorial.
>
> Which version of vLLM are you using?
>
> 0.5.5. I pulled from source and built it yesterday, so I assume that is the latest available version. I also tried adding a separate RUN pip install vllm==0.5.5 to the Dockerfile to make sure it also happens in the latest release.
>
> Text-only inference works fine for me (just text messages without any image), but I am still getting the following error with image inputs:
>
> ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
>   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 64, in _log_task_completion
>
> This also happens with the microsoft/Phi-3-vision-128k-instruct, not only microsoft/Phi-3.5-vision-instruct.

You may have to increase the max_model_len as multimodal tokens count towards the limit. Any excess tokens will be truncated.
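
For example, you can re-run the serve command from the repro above with a larger limit (the value here is only an illustration; pick one that fits the model and your memory):

vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 8192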

berkecanrizai commented 2 months ago

> @DarkLight1337 is this currently fixed? I am still getting the same error with the Dockerfile.cpu in this tutorial.
>
> Which version of vLLM are you using?
>
> 0.5.5. I pulled from source and built it yesterday, so I assume that is the latest available version. I also tried adding a separate RUN pip install vllm==0.5.5 to the Dockerfile to make sure it also happens in the latest release. Text-only inference works fine for me (just text messages without any image), but I am still getting the following error with image inputs:
>
> ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
>   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 64, in _log_task_completion
>
> This also happens with the microsoft/Phi-3-vision-128k-instruct, not only microsoft/Phi-3.5-vision-instruct.
>
> You may have to increase the max_model_len as multimodal tokens count towards the limit. Any excess tokens will be truncated.

I tried with a larger max_model_len (80,000) as well as without limiting it, and I still get the same error. This happens on a CPU-only machine; I had been running it without any errors on another machine with a GPU.

berkecanrizai commented 2 months ago

(VllmWorkerProcess pid=234352) ERROR 08-27 10:30:28 multiproc_worker_utils.py:226]   File "/home/{USER_NAME}/miniforge3/envs/vllm2/lib/python3.10/site-packages/vllm-0.5.5+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/utils.py", line 88, in merge_multimodal_embeddings
(VllmWorkerProcess pid=234352) ERROR 08-27 10:30:28 multiproc_worker_utils.py:226]     raise ValueError(
(VllmWorkerProcess pid=234352) ERROR 08-27 10:30:28 multiproc_worker_utils.py:226] ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders

@DarkLight1337 this is the exact error I get, both inside and outside of Docker.

DarkLight1337 commented 2 months ago

@Isotr0py since you have a CPU-only environment (and also implemented this model), can you help investigate this? Thanks!

Isotr0py commented 2 months ago

Ok, I will investigate this tonight.

berkecanrizai commented 2 months ago

A small addition, @DarkLight1337 @Isotr0py:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.utils import FlexibleArgumentParser

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
)

Image inputs work without any issues when I use the LLM as above with llm.generate(...); however, the OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3.5-vision-instruct --trust-remote-code) still fails with the error above.
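
For reference, the offline image call that works for me looks roughly like the snippet below. This is only a sketch based on vLLM's offline vision-language example for Phi-3-vision; the prompt template and the bundled sample image name are assumptions, not something taken from the logs above.

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

llm = LLM(model="microsoft/Phi-3.5-vision-instruct", trust_remote_code=True)

# Phi-3-vision style chat prompt with a single image placeholder (assumed template).
prompt = "<|user|>\n<|image_1|>\nWhat is in this image?<|end|>\n<|assistant|>\n"
image = ImageAsset("stop_sign").pil_image  # bundled sample image (assumed asset name)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)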

DarkLight1337 commented 2 months ago

Please note that multi-image input is not yet supported by the OpenAI-compatible server. Can you provide a minimal reproducible example?

berkecanrizai commented 2 months ago

> Please note that multi-image input is not yet supported by the OpenAI-compatible server. Can you provide a minimal reproducible example?

Sure. After starting the server as described above, run the following:

from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"  # make sure this port is correct; I changed it to 8001 in the server
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)

Isotr0py commented 2 months ago

@berkecanrizai I have created #7916 to fix this. Please take a look at this :)

berkecanrizai commented 2 months ago

> @berkecanrizai I have created #7916 to fix this. Please take a look at this :)

Thanks, that was fast :D