vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: LLaVa Next Value Error - "Incorrect type of image sizes" when running in Docker #5868

Closed: FennFlyer closed this issue 4 months ago

FennFlyer commented 4 months ago

Your current environment

Docker image: vllm/vllm-openai:v0.5.0.post1

Running as part of a Docker Compose stack; relevant sections of my docker-compose.yaml are below. This is part of a multi-model deployment with other vLLM-based text generation/chat models running successfully behind a Traefik reverse proxy. I split the instance running LLaVA 1.6 into its own service in the docker-compose.yaml (the third service in the file) to test the different startup arguments it requires. I have included the .env file entries as well.

###docker-compose.yaml###

services:

  reverseproxy:
    image: ${PROXY_IMAGE}
    container_name: reverseproxy
    # Enables the web UI and tells Traefik to listen to docker
    command: --api.insecure=true --providers.docker --api.dashboard=true
    ports:
      # The HTTP port
      - "80:80"
      # The Web UI (enabled by --api.insecure=true)
      - "8080:8080"
    volumes:
      # So that Traefik can listen to the Docker events
      - /var/run/docker.sock:/var/run/docker.sock
    networks: 
      - llm-net

  ## Current best solution for chat/text generation models
  ## Change GPU device_ids if necessary
  vllm-server:
    depends_on:
     - reverseproxy
    image: ${VLLM_IMAGE}
    container_name: vllm-server
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    volumes:
      - ${MODEL_VOL}/${VLLM_MODEL_ID}:/vllm-workspace/${VLLM_MODEL_ID}
    command: ["--model", "${VLLM_MODEL_ID}", "--gpu-memory-utilization", "0.75", "--host", "0.0.0.0", "--root-path", "/vllm-server"]
    labels:
      - traefik.enable=true
      - traefik.http.routers.vllm-server.rule=PathPrefix(`/vllm-server`)
      - traefik.http.routers.vllm-server.middlewares=vllm-server-stripprefix
      - traefik.http.middlewares.vllm-server-stripprefix.stripprefix.prefixes=/vllm-server
      - traefik.http.services.vllm-server.loadbalancer.server.port=8000
    networks: 
      - llm-net
    # ports:
    #  - 8000:8000

  ## Testing llava serving with vllm
  ## Change GPU device_ids if necessary
  vllm-llava-server:
    depends_on:
     - reverseproxy
    image: ${VLLM_IMAGE}
    container_name: vllm-llava-server
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    volumes:
      - ${MODEL_VOL}/${VLLM_IMAGE_MODEL_ID}:/vllm-workspace/${VLLM_IMAGE_MODEL_ID}
    command: ["--model", "${VLLM_IMAGE_MODEL_ID}", "--gpu-memory-utilization", "0.75", "--host", "0.0.0.0", "--root-path", "/vllm-llava-server",
      "--image-input-type", "pixel_values", "--image-token-id", "32000", "--image-input-shape", "1,3,336,336", "--image-feature-size", "576",
      "--chat-template", "template_llava.jinja"]
    labels:
      - traefik.enable=true
      - traefik.http.routers.vllm-llava-server.rule=PathPrefix(`/vllm-llava-server`)
      - traefik.http.routers.vllm-llava-server.middlewares=vllm-llava-server-stripprefix
      - traefik.http.middlewares.vllm-llava-server-stripprefix.stripprefix.prefixes=/vllm-llava-server
      - traefik.http.services.vllm-llava-server.loadbalancer.server.port=8000
    networks: 
      - llm-net
    # ports:
    #  - 8000:8000

###.env file###

MODEL_VOL=/home/<intermediate_paths>/models
VLLM_MODEL_ID=Meta-Llama-3-8B-Instruct
VLLM_IMAGE_MODEL_ID=llava-v1.6-mistral-7b-hf
PROXY_IMAGE=traefik
VLLM_IMAGE=vllm/vllm-openai:v0.5.0.post1

VLLM_IMAGE_MODEL_ID points to a local clone of https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf (with template_llava.jinja added), which has the following directory structure:

###llava-v1.6-mistral-7b-hf directory structure###

config.json
generation_config.json
.git
.gitattributes
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
preprocessor_config.json
README.md
special_tokens_map.json
template_llava.jinja
tokenizer_config.json
tokenizer.json
tokenizer.model

🐛 Describe the bug

On starting the service with docker compose --env-file .env.llava up reverseproxy vllm-llava-server, it appears to go through the usual startup but then throws a ValueError; see below for the full traceback and STDOUT. I have included all the startup values that appear to be required when instantiating a new LLM object per https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py. Am I missing something in my command entry in the docker-compose.yaml?

vllm-llava-server  | INFO 06-26 18:28:25 api_server.py:177] vLLM API server version 0.5.0.post1
vllm-llava-server  | INFO 06-26 18:28:25 api_server.py:178] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path='/vllm-llava-server', middleware=[], model='llava-v1.6-mistral-7b-hf', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.75, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=576, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
vllm-llava-server  | INFO 06-26 18:28:25 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='llava-v1.6-mistral-7b-hf', speculative_config=None, tokenizer='llava-v1.6-mistral-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=llava-v1.6-mistral-7b-hf)
vllm-llava-server  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm-llava-server  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm-llava-server  | INFO 06-26 18:29:15 model_runner.py:160] Loading model weights took 14.1020 GB
vllm-llava-server  | [rank0]: Traceback (most recent call last):
vllm-llava-server  | [rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
vllm-llava-server  | [rank0]:     return _run_code(code, main_globals, None,
vllm-llava-server  | [rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
vllm-llava-server  | [rank0]:     exec(code, run_globals)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
vllm-llava-server  | [rank0]:     engine = AsyncLLMEngine.from_engine_args(
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
vllm-llava-server  | [rank0]:     engine = cls(
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
vllm-llava-server  | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
vllm-llava-server  | [rank0]:     return engine_class(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236, in __init__
vllm-llava-server  | [rank0]:     self._initialize_kv_caches()
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
vllm-llava-server  | [rank0]:     self.model_executor.determine_num_available_blocks())
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
vllm-llava-server  | [rank0]:     return self.driver_worker.determine_num_available_blocks()
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server  | [rank0]:     return func(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
vllm-llava-server  | [rank0]:     self.model_runner.profile_run()
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server  | [rank0]:     return func(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844, in profile_run
vllm-llava-server  | [rank0]:     self.execute_model(seqs, kv_caches)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server  | [rank0]:     return func(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
vllm-llava-server  | [rank0]:     hidden_states = model_executable(
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
vllm-llava-server  | [rank0]:     return self._call_impl(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
vllm-llava-server  | [rank0]:     return forward_call(*args, **kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava_next.py", line 383, in forward
vllm-llava-server  | [rank0]:     image_input = self._parse_and_validate_image_input(**kwargs)
vllm-llava-server  | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava_next.py", line 196, in _parse_and_validate_image_input
vllm-llava-server  | [rank0]:     raise ValueError("Incorrect type of image sizes. "
vllm-llava-server  | [rank0]: ValueError: Incorrect type of image sizes. Got type: <class 'NoneType'>
vllm-llava-server exited with code 0
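
To rule out the Compose/Traefik wrapping, the same error can also be checked with a bare docker run against the published image. A sketch (the volume path reuses the MODEL_VOL value from the .env file, and --gpus all stands in for the device_ids selection in the compose file):

# flags mirror the vllm-llava-server service entry above
docker run --gpus all \
  -v ${MODEL_VOL}/llava-v1.6-mistral-7b-hf:/vllm-workspace/llava-v1.6-mistral-7b-hf \
  -p 8000:8000 \
  vllm/vllm-openai:v0.5.0.post1 \
  --model llava-v1.6-mistral-7b-hf --gpu-memory-utilization 0.75 --host 0.0.0.0 \
  --image-input-type pixel_values --image-token-id 32000 \
  --image-input-shape 1,3,336,336 --image-feature-size 576 \
  --chat-template template_llava.jinja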

DarkLight1337 commented 4 months ago

~Does the model fail upon startup? Otherwise, can you provide an example OpenAI API request that triggers this error?~

Can you try out #5214 and see if you get the same problem? The profile_run logic should be fixed there.

FennFlyer commented 4 months ago

Sure! Do you have a recommended way to build the container? Is it just the usual clone and Docker build on that branch, or does your team have any build magic happening that I need to know about? Right now I'm just pulling straight from Docker Hub.

DarkLight1337 commented 4 months ago

Sorry I missed this - I haven't used the Docker container myself, but from my understanding, you can use the Dockerfile from the main branch directly.
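
Something like the following should work (a sketch; the vllm-openai build target comes from the repo's Dockerfile, and the pr-5214 branch name and local image tag are arbitrary placeholders):

git clone https://github.com/vllm-project/vllm.git
cd vllm
# optional: fetch the proposed fix from PR #5214 into a local branch instead of building main
git fetch origin pull/5214/head:pr-5214 && git checkout pr-5214
# build the OpenAI-compatible serving image from the repo's Dockerfile
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-openai:local

Then point VLLM_IMAGE in the .env file at vllm-openai:local.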

DarkLight1337 commented 4 months ago

v0.5.1 has been released so you can directly use the official Docker image now.
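
In that case it should just be a matter of bumping the tag in the .env file and pulling, along the lines of:

# in .env.llava
VLLM_IMAGE=vllm/vllm-openai:v0.5.1

docker compose --env-file .env.llava pull vllm-llava-server
docker compose --env-file .env.llava up reverseproxy vllm-llava-server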

FennFlyer commented 4 months ago

Thank you! I was out on holiday last week, so I will test the new image ASAP!