Lucy7298 opened 1 year ago
hmm, curious!
@Lucy7298 which PyTorch implementation are you using? Just a brute-force computation of the attention output? The Triton impl uses float16, so I would expect differences if you are comparing against a float32 implementation.
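For reference, something like the following is the kind of brute-force baseline I would compare against. This is a rough sketch, not the tutorial's exact test code; the tensor shapes and sm_scale value here are made up for illustration:

```python
# A minimal sketch (not the tutorial's exact reference), assuming the usual
# (batch, heads, seq_len, head_dim) layout: a brute-force PyTorch attention
# computed in float16, so the comparison against the Triton kernel is not
# dominated by float32-vs-float16 rounding differences.
import torch

def reference_attention(q, k, v, sm_scale):
    p = torch.matmul(q, k.transpose(2, 3)) * sm_scale
    p = torch.softmax(p.float(), dim=-1).half()  # softmax in fp32 for stability
    return torch.matmul(p, v)

q = torch.randn(1, 2, 128, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)
ref_out = reference_attention(q, k, v, sm_scale=0.5)
```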
I am trying to run the test_op pytest on the fused attention tutorial (https://triton-lang.org/master/getting-started/tutorials/06-fused-attention.html) on an A100 with CUDA 11.4. The error is:

std::vector::reference std::vector<unsigned int>::operator[](std::vector::size_type) [_Tp = unsigned int, _Alloc = std::allocator<unsigned int>]: Assertion '__n < this->size()' failed
I tried applying the changes from this issue, but it did not help. I can make the error go away by applying this change:

This change allows the test case to proceed without raising an error. However, the outputs of the self-attention are incorrect after applying this change:

I'm no longer seeing problems with the vector access, even after removing the change. However, it seems like there are some differences in the outputs of the Triton kernel and the PyTorch implementation:
I examined the output, and it seems like the differences in the two outputs are pretty small. If you compare using

torch.isclose(ref_out, tri_out, rtol=0.01, atol=0.001).all()

you can get the outputs to match. However, the gradients of the model don't seem to be close. Have you tried to train a neural network on the tutorial implementation? Can it get similar accuracy compared to the PyTorch implementation?
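This is roughly the gradient check I have in mind (a sketch only; it assumes q, k, v were created with requires_grad=True and that ref_out and tri_out in the tutorial's test_op were both produced from those same tensors):

```python
# Rough sketch: compare gradients of the PyTorch reference and the Triton
# kernel with the same tolerances used for the forward outputs. Assumes
# q, k, v, ref_out, tri_out as in the tutorial's test_op; dout is just an
# arbitrary upstream gradient.
import torch

dout = torch.randn_like(ref_out)

# Gradients of the PyTorch reference.
ref_out.backward(dout)
ref_dq, q.grad = q.grad.clone(), None
ref_dk, k.grad = k.grad.clone(), None
ref_dv, v.grad = v.grad.clone(), None

# Gradients of the Triton kernel, reusing the same q, k, v leaf tensors.
tri_out.backward(dout)
for name, ref_g, tri_g in [("dq", ref_dq, q.grad),
                           ("dk", ref_dk, k.grad),
                           ("dv", ref_dv, v.grad)]:
    print(name, torch.isclose(ref_g, tri_g, rtol=1e-2, atol=1e-3).all().item())
```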