New contributor declaration

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these rules.
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for lit tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because this is a tutorial file.
- Select one of the following.
  - [x] I have not added any lit tests.
  - [ ] The lit tests I have added follow these best practices, including the
    "tests should be minimal" section. (Usually running Python code and using
    the instructions it generates is not minimal.)
SageAttention is an 8-bit attention method that achieves speedups of 2.1x and 2.7x over FlashAttention2 and xformers, respectively, without degrading end-to-end metrics across various models. This PR provides the official implementation, and we have verified the correctness of the code.
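
For context on what "8-bit attention" means here, below is a minimal NumPy sketch of the general idea: subtract the mean of K over the sequence dimension (which only adds a per-row constant to the scores and therefore cancels in the softmax), quantize Q and K to INT8, do the score matmul in integer arithmetic, and dequantize with the stored scales. This is an illustrative sketch based on my reading of the SageAttention approach, not the Triton tutorial kernel added in this PR; the real kernel works per block on the GPU and makes different precision choices for the P·V product.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-row INT8 quantization; returns int8 values and float scales."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_attention(Q, K, V):
    """Single-head attention with an INT8 Q @ K^T matmul (illustrative only)."""
    d = Q.shape[-1]
    # Subtracting the mean of K across the sequence adds a constant to each score
    # row, so the softmax output is unchanged, but K becomes easier to quantize.
    K = K - K.mean(axis=0, keepdims=True)
    q_i8, q_scale = quantize_int8(Q)
    k_i8, k_scale = quantize_int8(K)
    # INT8 x INT8 matmul accumulated in int32, then dequantized with the scales.
    scores = (q_i8.astype(np.int32) @ k_i8.astype(np.int32).T) * (q_scale * k_scale.T)
    scores /= np.sqrt(d)
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ V  # the P @ V product stays in floating point in this sketch

# Tiny smoke test with random data.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 64)) for _ in range(3))
print(int8_attention(Q, K, V).shape)  # (8, 64)
```

The quoted speedups come from the fused GPU kernels, not from a sketch like this; it is only meant to show where the 8-bit arithmetic enters the attention computation.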
New contributor declaration
[x] I am not making a trivial change, such as fixing a typo in a comment.
[x] I have written a PR description following these rules.
[x] I have run
pre-commit run --from-ref origin/main --to-ref HEAD
.Select one of the following.
/test
forlit
tests/unittest
for C++ tests/python/test
for end-to-end teststhis is a tutorial file
.Select one of the following.
lit
tests.lit
tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)SageAttention is a 8-bit attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models. This PR provides the official implementation, and we have verified the correctness of the codes.