vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: soft limit-mm-per-prompt for MM OAI API #9805

Open SinanAkkoyun opened 3 weeks ago

SinanAkkoyun commented 3 weeks ago

🚀 The feature, motivation and pitch

When starting a VLM with --limit-mm-per-prompt and max_pixels set, vLLM refuses to start if the image limit multiplied by the worst-case per-image token usage (derived from max_pixels) exceeds the model's context length. However, this check is too conservative: it prevents me from sending, say, 10 small images that easily fit into the context length, while I also want to support occasional larger images.
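For context, a minimal sketch of the kind of configuration that hits this check, written with vLLM's offline LLM API using the same options as the server flags (the max_pixels value is illustrative; the model and image limit mirror the log below):

from vllm import LLM

# Sketch of the failing configuration: a 32k-context Qwen2-VL model with an
# image limit and a per-image pixel budget whose worst case exceeds the
# context length, so the startup profiling run aborts.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    max_model_len=32768,
    limit_mm_per_prompt={"image": 8},                    # matches the error below
    mm_processor_kwargs={"max_pixels": 1280 * 28 * 28},  # illustrative value
)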

Alternatives

Remove the default limit-mm-per-prompt of 1 and only apply the existing startup check when --limit-mm-per-prompt is set explicitly.

Additional context

WARNING 10-29 17:42:42 model_runner.py:1247] Computed max_num_seqs (min(256, 32768 // 147456)) to be less than 1. Setting it to the minimum value of 1.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 397, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 147, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 83, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 354, in __init__
    self._initialize_kv_caches()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 491, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1259, in profile_run
    .dummy_data_for_profiling(self.model_config,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/inputs/registry.py", line 223, in dummy_data_for_profiling
    seq_data, mm_data = dummy_factory(InputContext(model_config), seq_len,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_vl.py", line 704, in dummy_data_for_qwen2_vl
    raise RuntimeError(
RuntimeError: Qwen2-VL cannot process 8 images in a prompt, please increase max_model_len or reduce image limit by --limit-mm-per-prompt.
[rank0]:[W1029 17:42:43.605343127 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Traceback (most recent call last):
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
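Reading the warning and the error together: during the startup profiling run, vLLM reserves the worst case of limit-mm-per-prompt images at the maximum per-image token count, which here totals 147456 multimodal tokens per sequence against a 32768-token context. The computed max_num_seqs therefore floors to 0 and is clamped to 1, and the dummy profiling prompt carrying the full image limit no longer fits, which triggers the Qwen2-VL RuntimeError. A sketch of that arithmetic (147456 is taken from the warning; how it decomposes into images × tokens-per-image depends on max_pixels):

# Reconstruction of the profiling arithmetic from the log above.
max_model_len = 32768    # model context length
max_mm_tokens = 147456   # worst-case multimodal tokens per sequence (from the warning)

max_num_seqs = min(256, max_model_len // max_mm_tokens)  # min(256, 0) == 0
max_num_seqs = max(max_num_seqs, 1)                      # clamped to 1, as warned

# The dummy profiling sequence would need more tokens than the context holds,
# so engine startup aborts with the RuntimeError above.
assert max_mm_tokens > max_model_len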


DarkLight1337 commented 3 weeks ago

This other issue is basically asking for the same thing: https://github.com/vllm-project/vllm/issues/9169

alex-jw-brooks commented 3 weeks ago

Allowing per-request mm_processor_kwargs when running the server could also help address this specific issue, with the caveat that you could OOM the server if you're not careful with your settings. I was planning to open a PR to potentially allow that, but got sidetracked with other things; I'll look into it again when I have time (likely in a few weeks).
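For illustration only, here is a purely hypothetical sketch of what such a per-request override could look like through the OpenAI-compatible server, using the openai client's extra_body passthrough; the mm_processor_kwargs request field does not exist today and is exactly what such a PR would have to add:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("small_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    # Hypothetical field: cap the pixel budget for this request only.
    extra_body={"mm_processor_kwargs": {"max_pixels": 512 * 28 * 28}},
)
print(response.choices[0].message.content)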

SinanAkkoyun commented 3 weeks ago

Allowing per-request mm_processor_kwargs when running the server could also help rectify this specific issue

That would be even better, thanks! Would it be a lot of work to add a failsafe that returns a 400 error when a prompt exceeds the max model length, before decoding starts?
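Not the server-side failsafe asked for above, but as a stopgap a client-side pre-check along these lines can approximate it (the model name is illustrative, and the HF processor's token count is only an estimate of vLLM's own accounting):

from transformers import AutoProcessor

MAX_MODEL_LEN = 32768  # must match the server's max_model_len

# Use the model's own HF processor so image placeholders expand to roughly the
# same number of tokens the server will see.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def prompt_fits(messages, images) -> bool:
    # messages: HF-style chat messages whose image entries produce the model's
    # image placeholder tokens; images: the corresponding PIL images.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], images=images, return_tensors="pt")
    return inputs["input_ids"].shape[-1] <= MAX_MODEL_LEN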