vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Paligemma support for PNG files #6427

BabyChouSr closed this issue 4 months ago

BabyChouSr commented 4 months ago

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-1023-azure-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe

Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7V13 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          2
Stepping:                           1
BogoMIPS:                           4890.88
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          3 MiB (96 instances)
L1i cache:                          3 MiB (96 instances)
L2 cache:                           48 MiB (96 instances)
L3 cache:                           384 MiB (12 instances)
NUMA node(s):                       4
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.4
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.42.4                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    SYS SYS NODE    0-23    0       N/A
GPU1    NV12     X  SYS SYS SYS 24-47   1       N/A
GPU2    SYS SYS  X  NV12    SYS 48-71   2       N/A
GPU3    SYS SYS NV12     X  SYS 72-95   3       N/A
NIC0    NODE    SYS SYS SYS  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

🐛 Describe the bug

PNG files don't seem to work for paligemma-3b-mix-448.

To reproduce, start the server with: python -m vllm.entrypoints.openai.api_server --model google/paligemma-3b-mix-448 --chat-template examples/template_llava.jinja

Then send a request with:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/paligemma-3b-mix-448",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://placehold.co/600x400/png"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

Error Traceback Output:

ERROR 07-15 00:01:58 async_llm_engine.py:54]   File "/home/lmsys/vllm/vllm/multimodal/registry.py", line 93, in map_input
ERROR 07-15 00:01:58 async_llm_engine.py:54]     .map_input(model_config, data_value)
ERROR 07-15 00:01:58 async_llm_engine.py:54]   File "/home/lmsys/vllm/vllm/multimodal/base.py", line 213, in map_input
ERROR 07-15 00:01:58 async_llm_engine.py:54]     return mapper(InputContext(model_config), data)
ERROR 07-15 00:01:58 async_llm_engine.py:54]   File "/home/lmsys/vllm/vllm/multimodal/image.py", line 122, in _default_input_mapper
ERROR 07-15 00:01:58 async_llm_engine.py:54]     batch_data = image_processor \
ERROR 07-15 00:01:58 async_llm_engine.py:54]   File "/home/lmsys/miniconda3/envs/vllm-source/lib/python3.10/site-packages/transformers/models/siglip/image_processing_siglip.py", line 233, in preprocess
ERROR 07-15 00:01:58 async_llm_engine.py:54]     input_data_format = infer_channel_dimension_format(images[0])
ERROR 07-15 00:01:58 async_llm_engine.py:54]   File "/home/lmsys/miniconda3/envs/vllm-source/lib/python3.10/site-packages/transformers/image_utils.py", line 255, in infer_channel_dimension_format
ERROR 07-15 00:01:58 async_llm_engine.py:54]     raise ValueError("Unable to infer channel dimension format")
ERROR 07-15 00:01:58 async_llm_engine.py:54] ValueError: Unable to infer channel dimension format

However, if we test using a jpg image:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/paligemma-3b-mix-448",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://placehold.co/600x400/jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

Output:

{"id":"cmpl-7e2f12b67ab74eaeb0afd4d72d253540","object":"chat.completion","created":1721001808,"model":"google/paligemma-3b-mix-448","choices":[{"index":0,"message":{"role":"assistant","content":"Sorry, as a base VLM I am not trained to answer this question.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":1040,"total_tokens":1057,"completion_tokens":17}}

I believe this happens because SigLIP's image processor has a default num_channels parameter of 3, while PNG images can have 4 channels (RGBA), which leads to the mismatch. I discovered this when loading images with Image.open(image_url).convert('RGBA') and then realized that passing those images into vLLM fails with the error above.
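
For reference, here is a minimal standalone sketch of the mismatch and a client-side workaround. It assumes requests, Pillow, and transformers are installed and that you have access to the gated PaliGemma checkpoint; the exact mode the placeholder PNG decodes to may vary.

from io import BytesIO

import requests
from PIL import Image
from transformers import AutoImageProcessor

# Download the same placeholder PNG used in the report above.
raw = requests.get("https://placehold.co/600x400/png", timeout=10).content
image = Image.open(BytesIO(raw))
print(image.mode)  # PNGs can decode as "RGBA" (4 channels) or "P" (palette)

# Per the traceback above, this resolves to the SigLIP image processor.
processor = AutoImageProcessor.from_pretrained("google/paligemma-3b-mix-448")

# Feeding a 4-channel RGBA array makes infer_channel_dimension_format() raise
# "ValueError: Unable to infer channel dimension format", since it only
# expects 1 or 3 channels. Converting to RGB first sidesteps the mismatch.
inputs = processor(images=image.convert("RGB"), return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: (1, 3, 448, 448)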

DarkLight1337 commented 4 months ago

Thanks for reporting this! Can you check whether #6430 fixes this issue?

ywang96 commented 4 months ago

Not related to this PR in particular, but since you're serving this from the OpenAI API server: I don't think PaliGemma is supposed to work out of the box with it, because it was never instruction fine-tuned.

The PaliGemma paper says:

Gemma [79] is a family of auto-regressive decoder-only open large language models built
from the same research and technology used to create the Gemini [7] models. The models come
in different sizes (2B, 7B), both pretrained and instruction fine-tuned. PaliGemma uses the 2B
pretrained version.

BabyChouSr commented 4 months ago

@DarkLight1337 Thank you for taking on this issue! Sorry, but this still doesn't work for me. I pulled your branch using git fetch origin pull/6430/head, but I still run into the same error with the same input.

@ywang96 You bring up a good point! I'll have to familiarize myself with the paper, thanks for sharing.

DarkLight1337 commented 4 months ago

Oops, I forgot to update the async version of fetch_image. Can you try again?
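
For anyone following along, here is a rough sketch of the kind of change involved (not the actual code from #6430): normalize the decoded image to RGB in both the sync and async fetch helpers, so 4-channel PNGs never reach the SigLIP processor.

from io import BytesIO

import aiohttp
import requests
from PIL import Image


def fetch_image(image_url: str) -> Image.Image:
    # Sync path: download, decode, and normalize to 3-channel RGB.
    raw = requests.get(image_url, timeout=10).content
    return Image.open(BytesIO(raw)).convert("RGB")


async def fetch_image_async(image_url: str) -> Image.Image:
    # Async path: the .convert("RGB") here is the step that was initially
    # missing, which is why the first version of the fix didn't help.
    async with aiohttp.ClientSession() as session:
        async with session.get(image_url) as resp:
            raw = await resp.read()
    return Image.open(BytesIO(raw)).convert("RGB")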

BabyChouSr commented 4 months ago

thank you! works now :)

JanuRam commented 3 months ago

Hi @BabyChouSr I tried the below curl command on the paligemma model that we have hosted

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/paligemma-3b-mix-448",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://placehold.co/600x400/jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

I am getting the following output, not the one you mentioned:

<|im_start>assistant\n<|im_start>assistant\n<|im_start>assistant\n<|im_start>assistant\n<|im_start>assistant\n<|im_start>assistant\n<|im_start>assistant\n<|im_start>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_participation>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_particip>assistant\n<|im_participation>assistant"

What might be the issue here? Can you please help me?

DarkLight1337 commented 3 months ago

You should use a custom chat template so that the input has the same format as the one shown on HuggingFace.

JanuRam commented 3 months ago

@DarkLight1337 I thought the request body for the PaliGemma API would be the same for everyone when hosted through vLLM. Why should we use a custom chat template? Can you please elaborate on this?

DarkLight1337 commented 3 months ago

From my understanding, PaliGemma isn't designed as a chat model, so it doesn't have a built-in chat template. In this case you are required to define your own template, since there isn't a default chat template that works for all models.

JanuRam commented 3 months ago

@DarkLight1337 To give more context: I tried the above curl command on the PaliGemma model that we have hosted through the vLLM framework, the same command @BabyChouSr used for his query. But our output was completely different from what he reported, so I asked for help with that.

DarkLight1337 commented 3 months ago

How are you hosting the model? Please show the command that you used.

ywang96 commented 3 months ago

@DarkLight1337 To give more context: I tried the above curl command on the PaliGemma model that we have hosted through the vLLM framework, the same command @BabyChouSr used for his query. But our output was completely different from what he reported, so I asked for help with that.

I don't think the temperature is set to 0 by default (i.e., we're not sampling greedily), and that's probably why you're seeing the difference.

I would also encourage you to take a look at our example script examples/offline_inference_vision_language.py.
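
As a concrete illustration of the temperature point, a request that forces greedy decoding might look like the sketch below (temperature is a standard parameter of the OpenAI-compatible API; the exact default can differ between vLLM versions):

import requests

payload = {
    "model": "google/paligemma-3b-mix-448",
    "temperature": 0,  # greedy decoding, so repeated requests give the same output
    "max_tokens": 300,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://placehold.co/600x400/jpg"},
            },
        ],
    }],
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])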

JanuRam commented 3 months ago

How are you hosting the model? Please show the command that you used.

@DarkLight1337 It is through a cloud platform called Jarvislabs.ai; they have a vLLM option to host open-source models from Hugging Face. When I tried it with PaliGemma, it gave us two APIs: /v1/chat/completions and /v1/completions. I thought /v1/chat/completions would work for us and tried it, but didn't get a proper response. The simple goal here is, given an image and a prompt, to get an answer back.

DarkLight1337 commented 3 months ago

Do you have the ability to pass through command-line arguments? As mentioned above:

From my understanding, PaliGemma isn't designed as a chat model, so it doesn't have a built-in chat template. In this case you are required to define your own template, since there isn't a default chat template that works for all models.

JanuRam commented 3 months ago

Do you have the ability to pass through command-line arguments? As mentioned above:

From my understanding, PaliGemma isn't designed as a chat model, so it doesn't have a built-in chat template. In this case you are required to define your own template, since there isn't a default chat template that works for all models.

No, I only have control over the request body for the API call.

DarkLight1337 commented 3 months ago

How about selecting the HuggingFace model to use? Maybe you can fork the model repo and add the chat template to it.
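
A rough sketch of that approach is below; the repo name and template string are placeholders rather than vLLM's actual examples/template_llava.jinja, and it assumes you can access the gated checkpoint and push to your own Hub account.

from transformers import AutoTokenizer

# Load the tokenizer, attach a chat template, and push it to a fork so a
# hosting platform that only lets you pick a model repo still gets a template.
tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-mix-448")
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'].upper() }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "ASSISTANT:"
)
tokenizer.push_to_hub("your-username/paligemma-3b-mix-448-chat")  # hypothetical fork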

JanuRam commented 3 months ago

Not sure. But my question is: why am I not able to get a proper output like @BabyChouSr got for his JPG image query using the /v1/chat/completions API call with the PaliGemma model?

BabyChouSr commented 3 months ago

@JanuRam I don't think this model should be used for chat responses; you will not receive very meaningful content. You can try the LLaVA template (as in the command below), but chat is probably not the use case you would want this model for. If you are looking for chat, you should try https://huggingface.co/openbmb/MiniCPM-V-2_6

python -m vllm.entrypoints.openai.api_server \
    --model google/paligemma-3b-mix-224 \
    --chat-template template_llava.jinja

JanuRam commented 3 months ago

It is not for chat (conversational purposes); it's mainly for visual question answering, to be precise.