vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: InternVL multi-image speed is not improved compared to original #9483

Open · luohao123 opened this issue 6 days ago

luohao123 commented 6 days ago

Your current environment

The output of `python collect_env.py`:

```text
latest vllm 0.6.1
```

Model Input Dumps

tt

🐛 Describe the bug

InternVL multi-image inference is slower than the original (HF) implementation


DarkLight1337 commented 6 days ago

Could you elaborate more? What do you mean by the original speed?

luohao123 commented 6 days ago

Compared with torch on the same device and dtype (float16, V100). (By "torch" I mean HF with the default FlashAttention.)

Single-image inference is about 20% faster, but multi-image inference is slower; an A100 gives the same result.

DarkLight1337 commented 6 days ago

Can you show the scripts you used to measure the performance of HF vs vLLM?

luohao123 commented 6 days ago

Hi, the test is based on the InternVL 8B model. Have you tested vLLM's speed improvement on multiple images? I am not lying: multiple images are actually slower than torch. Due to an in-house issue, I didn't get a chance to paste the code here, but I think you can easily replicate the result.

DarkLight1337 commented 6 days ago

Hi, the test is based on the InternVL 8B model. Have you tested vLLM's speed improvement on multiple images? I am not lying: multiple images are actually slower than torch. Due to an in-house issue, I didn't get a chance to paste the code here, but I think you can easily replicate the result.

No, we have not tested the speed for multiple images (benchmarking work for multi-modal models is still in the early stages). Since vLLM was originally designed around language generation, most of vLLM's optimizations don't currently work on the vision encoder part of the model, which may explain the decrease in speed when more images are passed. There may also be CPU bottlenecks associated with image preprocessing.

We are still busy making multi-modal support feature-complete, so it may take a while before we can focus on optimization - any help is welcome!
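
For concreteness, here is a minimal sketch of the kind of multi-image timing run being discussed. It is not from this thread: the model name (OpenGVLab/InternVL2-8B), the image count, and the prompt template are assumptions (in practice the prompt should come from the model's chat template), and an HF baseline would wrap the model's own generation call with the same timer.

```python
# Sketch only: times end-to-end multi-image generation with vLLM's offline API.
# Assumptions: OpenGVLab/InternVL2-8B, 4 local images, illustrative prompt template.
import time
from PIL import Image
from vllm import LLM, SamplingParams

images = [Image.open(f"img_{i}.jpg").convert("RGB") for i in range(4)]
# Placeholder prompt; the real prompt should be built from the model's chat template.
prompt = "<image>\n" * len(images) + "Describe the differences between these images."

llm = LLM(
    model="OpenGVLab/InternVL2-8B",
    trust_remote_code=True,
    max_model_len=8192,
    limit_mm_per_prompt={"image": len(images)},  # allow several images per prompt
)
params = SamplingParams(temperature=0.0, max_tokens=128)
request = {"prompt": prompt, "multi_modal_data": {"image": images}}

llm.generate(request, params)  # warmup (first call pays one-off profiling costs)
t0 = time.perf_counter()
out = llm.generate(request, params)
elapsed = time.perf_counter() - t0
gen_tokens = len(out[0].outputs[0].token_ids)
print(f"{elapsed:.2f} s end-to-end, {gen_tokens / elapsed:.1f} generated tok/s")
```

Timing the image preprocessing step separately on the same inputs would also help confirm whether the CPU preprocessing bottleneck mentioned above is a factor.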

noooop commented 3 days ago
  1. Most multimodal encoders are hardcoded to use F.scaled_dot_product_attention; refer to #8898, intern_vit.py:
        x = F.scaled_dot_product_attention(q, k, v, scale=self.scale)
  2. There is still a noticeable gap between F.scaled_dot_product_attention and FlashAttention; refer to #8453 (see the micro-benchmark sketch after this comment).

[screenshot: benchmark comparing F.scaled_dot_product_attention and FlashAttention]

This may be the reason why vLLM's speed decreases significantly as the number of input images increases.

  3. The vLLM flash_attn backend currently only supports decoder-only models; refer to #4888, flash_attn.py:
  • FlashAttention backend support for encoder/decoder models is left as future work.

Therefore, encoder-only and encoder-decoder models cannot use the fast flash_attn backend for the time being.

  4. Maybe we can manage to merge #9124, not only for encoder-only models, but also to make encoder-decoder models faster.

Using the flash_attn backend for encoder-only and encoder-decoder models requires significant modification of ModelInputBuilder, attention_metadata, and the Runner, which is not a simple matter.

Maybe that's why most multimodal encoders currently choose to use F.scaled_dot_product_attention.
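
To illustrate the gap mentioned in point 2, here is a minimal micro-benchmark sketch (not from this thread) comparing torch's F.scaled_dot_product_attention with flash_attn_func on ViT-encoder-like shapes. It assumes a GPU supported by FlashAttention 2 (Ampere or newer, so not the V100 mentioned earlier) with the flash-attn package installed; shapes and any resulting numbers are illustrative only.

```python
# Sketch only: rough timing of SDPA vs. flash-attn on ViT-encoder-like shapes.
import time
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # requires the flash-attn 2.x package

def bench_ms(fn, iters=50):
    for _ in range(5):          # warmup
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

# Many image tiles -> a large batch of ~1K-token sequences through the vision encoder.
batch, seqlen, heads, head_dim = 16, 1025, 16, 64
q = torch.randn(batch, seqlen, heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# SDPA expects (batch, heads, seqlen, head_dim);
# flash_attn_func expects (batch, seqlen, heads, head_dim).
sdpa = lambda: F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))
fa = lambda: flash_attn_func(q, k, v, causal=False)

print(f"SDPA:       {bench_ms(sdpa):.2f} ms")
print(f"flash-attn: {bench_ms(fa):.2f} ms")
```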

luohao123 commented 3 days ago

Hi, I'm not an expert in acceleration, but as far as I understand, why can't encoder-decoder models use FlashAttention?

noooop commented 3 days ago

Hi, I'm not an expert in acceleration, but as far as I understand, why can't encoder-decoder models use FlashAttention?

Simple answer:

The vLLM flash_attn backend currently only supports decoder-only models; refer to [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) #4888, flash_attn.py:

  • FlashAttention backend support for encoder/decoder models is left as future work.

Why is backend support for encoder-decoder models so hard?

Currently, the only backend that supports encoder-decoder models, xFormers, is still undergoing bugfixes; refer to #9026.

Using the flash_attn backend for encoder-only and encoder-decoder models requires significant modification of ModelInputBuilder, attention_metadata, and the Runner, which is not a simple matter. Not to mention CUDA graphs, torch.compile, TP, PP...

luohao123 commented 3 days ago

How about using the flash-attn 2 package directly, or torch's built-in SDPA?
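
As a rough illustration of that idea, below is a minimal sketch (an assumption, not vLLM's actual intern_vit.py) of a vision-encoder attention layer that calls flash_attn_func from the flash-attn package directly when it is usable and falls back to torch's SDPA otherwise.

```python
# Sketch only: a ViT-style attention layer that prefers flash-attn and falls back to SDPA.
import torch
import torch.nn as nn
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

class VisionAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        # Project to q, k, v: each (b, s, heads, head_dim).
        qkv = self.qkv(x).view(b, s, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)
        if HAS_FLASH_ATTN and x.is_cuda and x.dtype in (torch.float16, torch.bfloat16):
            # flash_attn_func expects (batch, seqlen, heads, head_dim).
            out = flash_attn_func(q, k, v, causal=False)
        else:
            # SDPA expects (batch, heads, seqlen, head_dim).
            out = F.scaled_dot_product_attention(
                q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
            ).transpose(1, 2)
        return self.proj(out.reshape(b, s, -1))
```

This sidesteps vLLM's attention backends entirely, which is roughly what the existing hardcoded SDPA call in the encoder already does; it does not address the encoder-decoder backend work described above.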

noooop commented 2 days ago

PTAL #9559

There are some subtle differences between multimodal models and encoder-decoder models; I'm not sure it works on multimodal models.

Jeremy-J-J commented 14 hours ago

Same problem