
[Feature]: 4D Attention Mask #6615

Open littletomatodonkey opened 1 month ago

littletomatodonkey commented 1 month ago

🚀 The feature, motivation and pitch

I am working on 4D attention mask input for the LLM generation process. Hugging Face provides an interface for 4D attention masks. Does vLLM have any plan to support this? https://github.com/huggingface/transformers/pull/27539
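For context, a minimal sketch of what the Hugging Face interface allows: packing two sequences into one row and isolating them with a custom 4D mask. The model name is a placeholder, and the mask semantics (shape (batch, 1, q_len, kv_len), 1 = attend, 0 = masked, with per-sequence position ids) are my reading of that PR, so treat this as illustrative rather than authoritative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the 4D-mask path applies to models that use the
# Llama-style attention-mask utilities in transformers.
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Pack two independent prompts into a single row.
ids_a = tokenizer("Hello world", return_tensors="pt").input_ids[0]
ids_b = tokenizer("Goodbye world", return_tensors="pt").input_ids[0]
input_ids = torch.cat([ids_a, ids_b]).unsqueeze(0)  # (1, L)
L, la = input_ids.shape[1], ids_a.shape[0]

# Block-diagonal causal mask: each packed sequence is causal within itself
# and cannot attend to the other sequence (assumed convention: 1 = attend).
mask = torch.zeros(1, 1, L, L)
mask[0, 0, :la, :la] = torch.tril(torch.ones(la, la))
mask[0, 0, la:, la:] = torch.tril(torch.ones(L - la, L - la))

# Positions restart for each packed sequence.
position_ids = torch.cat([torch.arange(la), torch.arange(L - la)]).unsqueeze(0)

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=mask, position_ids=position_ids)
```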

Alternatives

No response

Additional context

No response

iiLaurens commented 1 month ago

AFAIK custom attention masks are not supported in optimized implementations of attention (FlashInfer, FlashAttention), so this wouldn't be possible without also incurring a dramatic drop in performance.

littletomatodonkey commented 1 month ago

> AFAIK custom attention masks are not supported in optimized implementations of attention (FlashInfer, FlashAttention), so this wouldn't be possible without also incurring a dramatic drop in performance.

Thanks for your reply. I might have to implement it myself.
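For anyone landing here, a rough sketch of the kind of manual fallback I mean: applying an arbitrary 4D mask through PyTorch's scaled_dot_product_attention instead of FlashAttention/FlashInfer. Shapes are made up, the boolean convention (True = attend) follows the PyTorch docs, and this bypasses vLLM's optimized kernels entirely, so it is only illustrative.

```python
import torch
import torch.nn.functional as F

batch, heads, q_len, kv_len, head_dim = 1, 8, 16, 16, 64
q = torch.randn(batch, heads, q_len, head_dim)
k = torch.randn(batch, heads, kv_len, head_dim)
v = torch.randn(batch, heads, kv_len, head_dim)

# Arbitrary 4D mask, broadcast over heads: True means the position may be attended to.
mask = torch.tril(torch.ones(q_len, kv_len, dtype=torch.bool)).view(1, 1, q_len, kv_len)
mask[..., :, :4] = True  # e.g. let every query also see a shared 4-token prefix

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # (batch, heads, q_len, head_dim)
```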