Your idea is excellent, and I have starred your repo. I want to check that my understanding is correct:
This paper does not modify the kernel implementation. Instead, it exploits the fact that different rows along the sequence dimension of Q are independent, so it computes each chunk all the way from attention through the FFN in one pass. The intermediate results are consumed quickly, which makes larger sequence lengths feasible.
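In other words, I imagine the per-chunk schedule looks roughly like this. This is only a NumPy sketch of my reading, with made-up shapes and a toy ReLU FFN, not the paper's actual code; note K and V still span the full sequence, so only the query rows are chunked:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_attn_ffn(Q, K, V, W1, W2, chunk=4):
    """Process query rows in chunks: each chunk goes through attention
    *and* the FFN before the next chunk starts, so the full (seq x seq)
    score matrix and the full FFN hidden activations never coexist."""
    d = Q.shape[-1]
    out = np.empty_like(Q)
    for i in range(0, Q.shape[0], chunk):
        q = Q[i:i + chunk]                    # (c, d) slice of query rows
        scores = q @ K.T / np.sqrt(d)         # (c, seq): only a strip of scores
        a = softmax(scores) @ V               # attention output for this chunk
        out[i:i + chunk] = np.maximum(a @ W1, 0) @ W2  # FFN applied immediately
    return out
```

Since each query row's attention output depends only on that row's scores, the chunked result should match computing attention and the FFN over the whole sequence at once.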
Is it correct?