triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

Test case from multi-headed self-attention tutorial fails #932

Open Lucy7298 opened 1 year ago

Lucy7298 commented 1 year ago

I am trying to run the test_op pytest case from the fused attention tutorial (https://triton-lang.org/master/getting-started/tutorials/06-fused-attention.html) on an A100 with CUDA 11.4. The error is:

std::vector::reference std::vector<unsigned int>::operator[](std::vector::size_type) [_Tp = unsigned int, _Alloc = std::allocator<unsigned int>]: Assertion '__n < this->size()' failed

I tried applying the changes from this issue, but it did not help. I can make the error go away by applying this change:

  layout.cc:160
  + for (unsigned o : order_) {
  +   if (o >= max_contiguous.size()) {
  +     return;
  +   }
  + }
  if (max_contiguous.size() > 0) {
    std::sort(order_.begin(), order_.end(), [&](unsigned a, unsigned b) {
      return max_contiguous[a] > max_contiguous[b];
    });
  }
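To make the failure mode concrete, here is a hypothetical Python sketch (not Triton code) of the same logic: the comparator indexes max_contiguous with entries of order_, so any entry that is >= its size trips the libstdc++ bounds assertion, and the added guard simply skips the sort in that case. The function name is my own, for illustration only.

```python
# Hypothetical analogue of the layout.cc snippet above, for illustration.
def sort_order_by_contiguity(order, max_contiguous):
    # The guard added by the patch: bail out if any axis index in `order`
    # would index past the end of `max_contiguous` (the C++ comparator
    # would otherwise read out of bounds and trip the assertion).
    if any(o >= len(max_contiguous) for o in order):
        return order  # leave the order unchanged instead of asserting
    if max_contiguous:
        # Sort axes so the most contiguous axis comes first, matching the
        # descending comparator max_contiguous[a] > max_contiguous[b].
        order = sorted(order, key=lambda a: max_contiguous[a], reverse=True)
    return order

print(sort_order_by_contiguity([2, 0, 1], [4, 16, 8]))  # → [1, 2, 0]
print(sort_order_by_contiguity([2, 0, 1], [4, 16]))     # guard hit → [2, 0, 1]
```

This also shows why the guard alone may be the wrong fix: when it fires, the axis order is left unsorted rather than repaired, which could plausibly explain the incorrect outputs seen afterwards.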

This change allows the test case to proceed without raising an error. However, the outputs of the self-attention are incorrect after applying this change:

I'm no longer seeing problems with the vector access, even after removing the change. However, there still seem to be differences between the outputs of the Triton kernel and the PyTorch implementation:

  File "....numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 2 decimals

Mismatched elements: 397466 / 786432 (50.5%)
Max absolute difference: 1.103
Max relative difference: inf
 x: array([[[[-3.95e-01,  2.08e+00,  4.46e-01, ..., -1.49e+00,  9.86e-02,
           2.61e-01],
         [-1.43e-01,  9.78e-01,  5.16e-01, ..., -9.55e-01,  6.51e-01,...
 y: array([[[[-2.79e-01,  2.30e+00,  8.63e-01, ..., -1.17e+00,  8.92e-02,
           3.55e-01],
         [-9.31e-04,  1.32e+00,  5.63e-01, ..., -1.09e+00,  7.93e-01,...

I examined the output, and the differences between the two outputs are fairly small. If you compare with torch.isclose(ref_out, tri_out, rtol=0.01, atol=0.001).all(), the outputs match. However, the gradients don't seem to be close. Have you tried to train a neural network on the tutorial implementation? Does it reach similar accuracy to the PyTorch implementation?
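The two comparisons above use different tolerance rules, which is why one fails and the other passes. A small numpy sketch (illustrative values, not data from the issue) shows the difference: assert_array_almost_equal(decimal=2) uses a fixed absolute threshold of 1.5e-2, while isclose(rtol=0.01, atol=0.001) scales its tolerance with the magnitude of the values.

```python
import numpy as np

# Illustrative values only: a reference output and a slightly
# perturbed "kernel" output.
ref = np.array([10.0, -2.5, 0.5])
out = ref + np.array([0.09, -0.02, 0.004])

# assert_array_almost_equal(decimal=2) requires abs(ref - out) < 1.5e-2,
# so the 0.09 difference on the first element fails it...
try:
    np.testing.assert_array_almost_equal(ref, out, decimal=2)
    almost_equal = True
except AssertionError:
    almost_equal = False

# ...while isclose checks abs(ref - out) <= atol + rtol * abs(out),
# a tolerance that grows with the value, so the same differences pass.
close = np.isclose(ref, out, rtol=0.01, atol=0.001).all()
print(almost_equal, close)  # → False True
```

So "matches under isclose with rtol=0.01" mostly means the relative error is around 1%, which is consistent with float16 arithmetic rather than with a bug, at least for the forward pass.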

murphymatt commented 1 year ago

hmm, curious!

CHDev93 commented 1 year ago

@Lucy7298 which PyTorch implementation are you comparing against? Just a brute-force computation of the attention output? The Triton implementation uses float16, so I would expect differences if you are comparing against a float32 implementation.
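A quick numpy sketch of this precision point (my own illustration, not the tutorial's test): running the same softmax in float16 and float32 already produces small nonzero differences, on the order of float16's ~1e-3 resolution.

```python
import numpy as np

rng = np.random.default_rng(0)
x32 = rng.standard_normal((64, 64)).astype(np.float32)
x16 = x32.astype(np.float16)  # simulate the float16 kernel inputs

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

ref = softmax(x32)                     # float32 reference
tri = softmax(x16).astype(np.float32)  # float16 computation, upcast to compare
diff = np.abs(ref - tri).max()
print(diff)  # small but nonzero: pure precision loss, not a logic bug
```

Differences of this size in the forward pass are expected; if the gradients disagree by much more than this, that would point to an actual bug rather than precision.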