[Performance]: FP8 performance worse than FP16 for Qwen2-VL-2B-Instruct

LinJianping commented 2 weeks ago

Misc discussion on performance

Estimated QPS is as follows:

bs=1: 11.402357925880366 for FP16 and 10.642891382295932 for FP8
bs=8: 51.62193861376064 for FP16 and 49.57986576846022 for FP8
bs=16: 61.87048607358999 for FP16 and 57.58566218192532 for FP8

bs=32 (progress bar output):

For FP8: Processed prompts: 100%|████████████████████| 32/32 [00:00<00:00, 67.85it/s, est. speed input: 11468.33 toks/s, output: 271.44 toks/s]

For FP16: Processed prompts: 100%|████████████████████| 32/32 [00:00<00:00, 74.14it/s, est. speed input: 12531.11 toks/s, output: 296.59 toks/s]
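
To put the gap in relative terms, here is a small sketch that computes how much lower the FP8 throughput is at each batch size. The numbers are copied from the report above; for bs=32 the it/s values from the progress bars are used, which may not be directly comparable to the script's QPS estimate. The names `reported` and `slowdown_percent` are only illustrative.

```python
# Throughput figures copied from the report above, as (FP16, FP8).
# Assumption: for bs=32 the progress-bar it/s values stand in for QPS.
reported = {
    1: (11.402357925880366, 10.642891382295932),
    8: (51.62193861376064, 49.57986576846022),
    16: (61.87048607358999, 57.58566218192532),
    32: (74.14, 67.85),
}


def slowdown_percent(fp16, fp8):
    """Return how much lower the FP8 throughput is than FP16, in percent."""
    return (1.0 - fp8 / fp16) * 100.0


slowdowns = {
    bs: slowdown_percent(fp16, fp8) for bs, (fp16, fp8) in reported.items()
}

for bs, pct in slowdowns.items():
    print(f"bs={bs}: FP8 is {pct:.2f}% slower than FP16")
```

By this calculation FP8 is roughly 4% to 8.5% slower across the four batch sizes, so the gap is consistent rather than limited to one configuration.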

The FP8 model conversion script is as follows:

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class
MODEL_ID = "/home/hadoop-platcv/qwen2-vl-2b-instruct/00-src-files/00-model/qwen2-vl-2b-instruct/v4-20241028-151341/checkpoint-14427-merged"

# Load model.
model_class = wrap_hf_model_class(Qwen2VLForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per channel via ptq
#   * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],
)

# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")

The inference script is as follows:

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "/home/hadoop-platcv/qwen2-vl-2b-instruct/00-src-files/00-model/qwen2-vl-2b-instruct/v4-20241028-151341/checkpoint-14427-merged-FP8-Dynamic"
#MODEL_PATH = "/home/hadoop-platcv/qwen2-vl-2b-instruct/00-src-files/00-model/qwen2-vl-2b-instruct/v4-20241028-151341/checkpoint-14427-merged"
device = "cuda" # the device to load the model onto

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/home/hadoop-platcv/qwen2-vl-2b-instruct/00-src-files/01-datasets/02_data/MLLM_moderator/09_国家形象/test/01_secure_llm_shezheng_中国地图/images_part0/25332605707.jpg",
                "min_pixels":3136,
                "max_pixels": 602112,
            },
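            # English translation of the prompt below: "Please label the following image.
            # Possible labels include: map, national flag, party flag, military flag, badge,
            # Paralympic Games, Beijing Winter Olympics mascot, Beijing Winter Olympics emblem,
            # Olympic rings, Disabled Persons' Federation, Red Cross, Communist Youth League,
            # Young Pioneers, trade union, Women's Federation, party/government/military uniforms.
            # An image may have several labels; separate them with commas. Return 'none' if there are none."
            {"type": "text", "text": "请为以下图片打标签。可能的标签包括：地图、国旗、党旗、军旗、徽章、残奥会、北京冬奥吉祥物、北京冬奥会会徽、奥运五环、残联、红十字、共青团、少先队、工会、妇联、党政军制服。一个图片可能有多个标签共存，请用逗号分隔每个标签。没有则返回无。"},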
        ],
    },
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

import time

_REPEAT = 100  # number of timed generate() calls
_bs = 32  # identical prompts per call

tic = time.perf_counter()
for _ in range(_REPEAT):
    outputs = llm.generate([llm_inputs]*_bs, sampling_params=sampling_params)
toc = time.perf_counter()
print(f'process {_REPEAT*_bs} query cost {toc-tic} seconds, QPS estimated:{_REPEAT*_bs/(toc-tic)}')

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.25.0-rc2
Libc version: glibc-2.17

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.6
/usr/lib64/libcudnn_adv_infer.so.8.9.6
/usr/lib64/libcudnn_adv_train.so.8.9.6
/usr/lib64/libcudnn_cnn_infer.so.8.9.6
/usr/lib64/libcudnn_cnn_train.so.8.9.6
/usr/lib64/libcudnn_ops_infer.so.8.9.6
/usr/lib64/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   52 bits physical, 57 bits virtual
CPU(s):                          192
On-line CPU(s) list:             0-22
Off-line CPU(s) list:            23-191
Thread(s) per core:              0
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8468V
Stepping:                        8
CPU MHz:                         2900.000
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4800.00
Virtualization:                  VT-x
L1d cache:                       2.3 MiB
L1i cache:                       1.5 MiB
L2 cache:                        96 MiB
L3 cache:                        97.5 MiB
NUMA node0 CPU(s):               0-47,96-143
NUMA node1 CPU(s):               48-95,144-191
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.2.65
[pip3] nvidia-cuda-cupti-cu12==12.4.99
[pip3] nvidia-cuda-nvrtc-cu12==12.4.99
[pip3] nvidia-cuda-runtime-cu12==12.4.99
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.0.44
[pip3] nvidia-curand-cu12==10.3.5.119
[pip3] nvidia-cusolver-cu12==11.6.0.99
[pip3] nvidia-cusparse-cu12==12.3.0.142
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.99
[pip3] nvidia-nvtx-cu12==12.4.99
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0+cu124
[pip3] torchaudio==2.4.0+cu124
[pip3] torchvision==0.19.0+cu124
[pip3] transformers==4.46.1
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.2.65                pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.99                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.99                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.99                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.0.44                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.119               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.0.99                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.0.142               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.99                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.99                  pypi_0    pypi
[conda] pynvml                    11.5.3                   pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0+cu124              pypi_0    pypi
[conda] torchvision               0.19.0+cu124             pypi_0    pypi
[conda] transformers              4.46.1                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: ; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  SYS SYS 0-22    0       N/A
NIC0    SYS  X  SYS
NIC1    SYS SYS  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1


DarkLight1337 commented 2 weeks ago

Do you get a similar performance drop when you use HF? For Qwen2-VL specifically, the majority of processing time is actually spent on preprocessing rather than the model itself. See #9238.

LinJianping commented 2 weeks ago

> Do you get a similar performance drop when you use HF? For Qwen2-VL specifically, the majority of processing time is actually spent on preprocessing rather than the model itself. See #9238.

In my inference test script, both models receive the same input data, which is preprocessed in advance. Only the inference time is measured; the data preprocessing time is not counted. So I think the gap comes from the inference of the model itself. Perhaps, because the model has relatively few parameters, the cost of converting between FP8 and FP16 in the FP8 model causes the degradation?

DarkLight1337 commented 2 weeks ago

> In my inference test script, the input data of the two models is the same and has been preprocessed in advance.

Even if you preprocess the data in advance, vLLM doesn't know this and will pass the data to the HF processor internally. Unless the HF processor has a way to automatically skip preprocessed data, there will still be preprocessing overhead.
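
One way to check how much of the wall-clock time goes to preprocessing versus the model is to time the two stages separately. The following is a minimal sketch of that idea, not vLLM code: `StageTimer`, `fake_preprocess` and `fake_model_forward` are hypothetical stand-ins, and in a real test they would be replaced by a direct call to the HF processor and a call to the model.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Hypothetical stand-ins: sleep to simulate work of different cost.
def fake_preprocess(n):
    time.sleep(0.02)
    return list(range(n))


def fake_model_forward(batch):
    time.sleep(0.005)
    return [x * 2 for x in batch]


timer = StageTimer()
for _ in range(3):
    with timer.stage("preprocess"):
        batch = fake_preprocess(4)
    with timer.stage("model"):
        out = fake_model_forward(batch)

shares = timer.shares()
for name, share in shares.items():
    print(f"{name}: {share:.0%} of measured time")
```

If preprocessing dominates in a measurement like this, a faster model (FP8 or otherwise) would change the end-to-end QPS only slightly.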

DarkLight1337 commented 2 weeks ago

I suggest you run a profiler and check the results.
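
As a starting point, here is a minimal sketch using Python's built-in `cProfile`. The helper `profile_call` and the stand-in `fake_generate` are hypothetical; in the real script the profiled callable would be the `llm.generate(...)` call. Note that `cProfile` only sees Python-level time, so GPU kernel time would need a GPU-aware tool such as `torch.profiler`.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


# Hypothetical stand-in for the call being investigated.
def fake_generate(n):
    return sum(i * i for i in range(n))


result, report = profile_call(fake_generate, 10000)
print(report)
```

Running the FP16 and FP8 models through the same harness and comparing the top entries should show whether the time is spent in preprocessing or in the model forward pass.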
