LinJianping opened this issue 2 weeks ago
Do you get a similar performance drop when you use HF? For Qwen2-VL specifically, the majority of processing time is actually spent on preprocessing rather than the model itself. See #9238.
In my inference test script, the input data for the two models is identical and was preprocessed in advance. During the evaluation, only the inference time is measured; data preprocessing time is excluded. So I think this difference reflects the inference of the model itself. Perhaps it is because the model has relatively few parameters, so the overhead of converting between FP8 and FP16 in the FP8 model causes the degradation?
> In my inference test script, the input data for the two models is identical and was preprocessed in advance.
Even if you preprocess the data in advance, vLLM doesn't know this and will pass the data to the HF processor internally. Unless the HF processor has a way to automatically skip already-preprocessed data, there will still be preprocessing overhead.
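As a quick check, here is a minimal sketch of timing the HF preprocessing step in isolation; the model ID, image size, and prompt below are placeholders, not taken from this report:

```python
# Sketch: time Qwen2-VL's HF preprocessing alone to see how much it costs.
# Model ID, image size, and prompt are placeholder assumptions.
import time

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
image = Image.new("RGB", (1280, 720))  # synthetic stand-in for a real image
prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe the image."

start = time.perf_counter()
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
print(f"HF preprocessing took {time.perf_counter() - start:.3f}s")
```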
I suggest you run a profiler and check the results.
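For example, vLLM's built-in torch profiler hooks can be used like this (a sketch assuming a vLLM version that supports `VLLM_TORCH_PROFILER_DIR`; the model ID is a placeholder):

```python
# Sketch: profile a short generate() call with vLLM's torch profiler hooks.
# Export VLLM_TORCH_PROFILER_DIR (e.g. /tmp/vllm_profile) before constructing
# the LLM so the start_profile()/stop_profile() calls have somewhere to write.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")
llm.start_profile()
outputs = llm.generate(["Describe FP8 quantization."],
                       SamplingParams(temperature=0.0, max_tokens=32))
llm.stop_profile()
# Inspect the resulting trace files, e.g. at https://ui.perfetto.dev
```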
Misc discussion on performance
The estimated QPS is as follows (the bs=32 numbers are taken from the vLLM progress logs):

| batch size | FP16 | FP8 |
| --- | --- | --- |
| 1 | 11.40 QPS | 10.64 QPS |
| 8 | 51.62 QPS | 49.58 QPS |
| 16 | 61.87 QPS | 57.59 QPS |
| 32 | 74.14 it/s (input 12531.11 toks/s, output 296.59 toks/s) | 67.85 it/s (input 11468.33 toks/s, output 271.44 toks/s) |
The FP8 model conversion script is as follows:
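(A minimal sketch of what such a conversion typically looks like with `llmcompressor`, not the reporter's original script; the model ID and `ignore` patterns are assumptions:)

```python
# Sketch: FP8-dynamic (W8A8) conversion of a Qwen2-VL checkpoint with
# llmcompressor. Model ID and ignore patterns are placeholder assumptions.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize only the language-model Linear layers; keep lm_head and the
# vision tower in the original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],
)
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```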
The inference script is as follows:
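(Likewise a sketch of how QPS like the numbers above could be measured with vLLM's offline API; the model path, prompt, and image handling are placeholders:)

```python
# Sketch: measure generation QPS with vLLM's offline API. The model path,
# prompt, and image are placeholder assumptions.
import time

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen2-VL-7B-Instruct-FP8-Dynamic")
sampling = SamplingParams(temperature=0.0, max_tokens=64)

image = Image.open("sample.jpg")  # stand-in for the real test data
prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe the image."
batch = [{"prompt": prompt, "multi_modal_data": {"image": image}}] * 32

start = time.perf_counter()
outputs = llm.generate(batch, sampling)
elapsed = time.perf_counter() - start
print(f"bs={len(batch)}: {len(batch) / elapsed:.2f} QPS")
```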