vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[torch.compile] support all attention backends #10558

Closed · youkaichao closed this 3 days ago

youkaichao commented 4 days ago

Previously, we registered attention ops separately for each backend, e.g. FlashInfer and FlashAttention.

This PR changes the registration to a single unified attention interface, so we no longer need to register these attention backends one by one.

How it works:

  1. When an attention layer is created, it registers itself in the per-model static forward context, keyed by its layer name.
  2. When the attention implementation is called, the layer name is passed through a PyTorch custom op; inside the custom op, we look up the attention object by that name and call its implementation (a sketch of this dispatch follows the list).
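A minimal sketch of the dispatch mechanism, not vLLM's actual code: the names `forward_context`, `Attention`, and `vllm_sketch::unified_attention` are illustrative assumptions, and the real backends replace the `scaled_dot_product_attention` placeholder.

```python
import torch
from torch.library import custom_op

# Per-model static forward context: layer name -> attention module (assumed shape).
forward_context: dict[str, "Attention"] = {}


class Attention(torch.nn.Module):
    def __init__(self, layer_name: str):
        super().__init__()
        self.layer_name = layer_name
        # Step 1: register this layer in the forward context at construction time.
        forward_context[layer_name] = self

    def impl(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Placeholder for a real backend (FlashAttention, FlashInfer, ...).
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)


# Step 2: one custom op for all backends. The layer name is passed as a plain
# string argument; inside the op we look up the attention object and call it.
@custom_op("vllm_sketch::unified_attention", mutates_args=())
def unified_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      layer_name: str) -> torch.Tensor:
    return forward_context[layer_name].impl(q, k, v)


@unified_attention.register_fake
def _(q, k, v, layer_name):
    # Shape-only implementation so torch.compile can trace through the op.
    return torch.empty_like(q)
```

With this shape, a model's forward pass calls the single `unified_attention` op with its layer name instead of a backend-specific op, so torch.compile sees one opaque op regardless of which attention backend is in use.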

TODO:

In the future, we should make all attention implementations accept an output argument, so that they align with the v1 attention behavior.
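A hypothetical sketch of what that aligned signature could look like; the class and method names are illustrative, not vLLM's actual interface:

```python
import torch


class AttentionImplSketch:
    # Hypothetical: the implementation fills a caller-provided `output` buffer
    # in place instead of allocating and returning a new tensor.
    def forward(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                output: torch.Tensor) -> None:
        output.copy_(
            torch.nn.functional.scaled_dot_product_attention(query, key, value))
```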

github-actions[bot] commented 4 days ago

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

🚀

youkaichao commented 3 days ago

The error comes from a Hugging Face timeout.