vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: content generated with InternVL2 is incomplete #7190

Open linssonSUSUSU opened 3 months ago

linssonSUSUSU commented 3 months ago

Your current environment

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.17

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 2799.998
BogoMIPS: 5599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.4
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pyzmq 26.1.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.43.4 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity
GPU0  X     0-3           N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

from PIL import Image

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0, stop=["<|end|>"])
llm = LLM(
    model="OpenGVLab/InternVL2-2B",
    trust_remote_code=True,
    enforce_eager=True,
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
prompt = llm.get_tokenizer().apply_chat_template(
    [
        {"role": "system", "content": "Answer the question."},
        {"role": "user", "content": "<image>\nWhat is shown in the image?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

image = Image.open(
    "/home/centos/linson/InternVL/test_datas/test_data_口红/口红_1_1.png"
)

inputs = {"prompt": prompt, "multi_modal_data": {"image": image}}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

output:
    Prompt: '<s><|im_start|>system\nAnswer the question.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in the image?<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The image shows two lip products. On the left is a lip balm,'

I tried the 2B and 8B models; in both cases the output appears to be cut off and incomplete, and finish_reason = 'length'. How can I solve this problem?
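For reference, this is roughly how I read the finish reason (a small sketch based on the loop above; finish_reason is taken from each completion in output.outputs):

for output in outputs:
    completion = output.outputs[0]
    # finish_reason is 'length' when generation stopped because the token
    # limit was reached, rather than at a stop string or EOS token.
    print(completion.finish_reason)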

Isotr0py commented 3 months ago

You can increase max_tokens in SamplingParams; it defaults to 16.
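For example, something like this (a minimal sketch reusing the parameters from your snippet):

from vllm import SamplingParams

# Allow up to 1024 generated tokens instead of the default 16.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1024,
    stop=["<|end|>"],
)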

linssonSUSUSU commented 3 months ago

You can increase max_tokens in SamplingParams; it defaults to 16.

Thank you. I set it to 1024, but now the output repeats itself over and over.

Prompt: '<s><|im_start|>system\nAnswer the question.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in the image?<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The image shows two lip products. On the left is a lip balm, and on the right is a lip gloss. Both products have a similar design, with a white base and a pink lip gloss on the right and a lip balm on the left. The lip balm has the text "Glasting Melting Balm" written on it.\nThe image shows two lip products.\nThe image shows two lip products.\nThe image shows two lip products. [the sentence "The image shows two lip products." keeps repeating for the rest of the 1024-token output]'

Ryan-Nightwish commented 2 months ago

You can increase max_tokens in SamplingParams; it defaults to 16.

I encountered the same issue, using the InternVL2-1B model. My code is the same as @linssonSUSUSU's, but I use vllm==0.5.5. In fact, I noticed that models accelerated with vLLM not only tend to produce repetitive responses but also show a significant decline in the quality of the generated answers. Below is an example demonstrating this behavior.

For image 4_1: [image attachment]

For image 11_1: [image attachment]

For both methods, I set max_tokens=1024. I find that the answers generated without vLLM are more organized and detailed. I wonder why the answers are sometimes repetitive, and why the generation quality differs.

Isotr0py commented 2 months ago

@linssonSUSUSU Sorry for the delayed reply. I missed the message in my notifications. 😢

@Ryan-Nightwish About the repetitive answers: you can try increasing repetition_penalty in SamplingParams. It has also been reported that the repetition issue may be related to the model training itself; see InternVL#490.
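For example (a minimal sketch; values slightly above 1.0 penalize tokens that have already appeared, and the exact value is something to tune for your model and prompts):

from vllm import SamplingParams

# repetition_penalty > 1.0 discourages the model from repeating itself.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1024,
    repetition_penalty=1.1,
    stop=["<|end|>"],
)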

Besides, the vision transformer implementation of the InternVL models in vLLM currently has a numerical difference from the Hugging Face implementation, which may affect generation quality as well.