vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: content generated with InternVL2 is incomplete #7190

Open linssonSUSUSU opened 3 months ago

linssonSUSUSU commented 3 months ago

Your current environment

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.17

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 2799.998
BogoMIPS: 5599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.4
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pyzmq 26.1.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.43.4 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity
GPU0  X     0-3           N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

from PIL import Image

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0, stop=["<|end|>"])
llm = LLM(
    model="OpenGVLab/InternVL2-2B",
    trust_remote_code=True,
    enforce_eager=True,
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
prompt = llm.get_tokenizer().apply_chat_template(
    [
        {"role": "system", "content": "Answer the question."},
        {"role": "user", "content": "<image>\nWhat is shown in the image?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

image = Image.open(
    "/home/centos/linson/InternVL/test_datas/test_data_口红/口红_1_1.png"
)

inputs = {"prompt": prompt, "multi_modal_data": {"image": image}}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

output:
    Prompt: '<s><|im_start|>system\nAnswer the question.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in the image?<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The image shows two lip products. On the left is a lip balm,'

I tried the 2B and 8B models; in both cases the output appears to be cut off and incomplete, and finish_reason = 'length'. How can I solve this problem?
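For reference, this is roughly how I read the finish reason (a small sketch based on the loop above; finish_reason is taken from each completion in output.outputs):

for output in outputs:
    completion = output.outputs[0]
    # finish_reason is 'length' when generation stopped because the token
    # limit was reached, rather than at a stop string or EOS token.
    print(completion.finish_reason)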

Isotr0py commented 3 months ago

You can increase max_tokens in SamplingParams; it defaults to 16.
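For example, something like this (a minimal sketch reusing the parameters from your snippet):

from vllm import SamplingParams

# Allow up to 1024 generated tokens instead of the default 16.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1024,
    stop=["<|end|>"],
)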

linssonSUSUSU commented 3 months ago

You can increase max_tokens in SamplingParams; it defaults to 16.

Thank you. I set it to 1024, but now the output repeats itself over and over.

Prompt: '<s><|im_start|>system\nAnswer the question.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in the image?<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The image shows two lip products. On the left is a lip balm, and on the right is a lip gloss. Both products have a similar design, with a white base and a pink lip gloss on the right and a lip balm on the left. The lip balm has the text "Glasting Melting Balm" written on it.\nThe image shows two lip products.\nThe image shows two lip products.\nThe image shows two lip products. [the sentence "The image shows two lip products." keeps repeating for the rest of the 1024-token output]'

Ryan-Nightwish commented 2 months ago

You can increase max_tokens in SamplingParams; it defaults to 16.

I encountered the same issue, using the InternVL2-1B model. My code is the same as @linssonSUSUSU's, but I use vllm==0.5.5. In fact, I noticed that models accelerated with vLLM not only tend to produce repetitive responses but also show a significant decline in the quality of the generated answers. Below is an example demonstrating this behavior.

For image 4_1: [image attachment]

For image 11_1: [image attachment]

For both methods, I set max_tokens=1024. I find that the answers generated without vLLM are more organized and detailed. I wonder why the answers are sometimes repetitive, and why the generation quality differs.

Isotr0py commented 2 months ago

@linssonSUSUSU Sorry for the delayed reply. I missed the message in my notifications. 😢

@Ryan-Nightwish About the repetitive answers: you can try increasing repetition_penalty in SamplingParams. It has also been reported that the repetition issue may be related to the model training itself; see InternVL#490.
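For example (a minimal sketch; values slightly above 1.0 penalize tokens that have already appeared, and the exact value is something to tune for your model and prompts):

from vllm import SamplingParams

# repetition_penalty > 1.0 discourages the model from repeating itself.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1024,
    repetition_penalty=1.1,
    stop=["<|end|>"],
)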

Besides, the vision transformer implementation of the InternVL models in vLLM currently has a numerical difference from the Hugging Face implementation, which may affect generation quality as well.