Using Triton 2.1.0, PyTorch 2.0.1, Python 3.10, and CUDA 11.8 on a PC running Ubuntu 22.04 with an NVIDIA 4090: if I take the example https://github.com/openai/triton/blob/main/python/tutorials/03-matrix-multiplication.py and just change the following lines, i.e. to use the float32 format,
```python
# line 244
c = accumulator
# line 339
a = torch.randn((M, K), device='cuda', dtype=torch.float32)
# line 340
b = torch.randn((K, N), device='cuda', dtype=torch.float32)
```
then the result differs from the pure PyTorch result by more than the defined tolerance. The error seems to increase the larger I make M, K, and N; it is particularly bad when M and N are very large and K is small.
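For what it's worth, part of the gap may just be that two mathematically equivalent float32 matmuls need not agree bitwise: the kernel and cuBLAS accumulate partial sums in different orders, so a tight fixed tolerance can fail even when both results are "correct". Here is a minimal CPU sketch of that effect using NumPy (an assumption standing in for the GPU kernels, not the Triton code itself), comparing a BLAS matmul against a chunked float32 accumulation of the same product, with a float64 result as reference:

```python
import numpy as np

# Two float32 evaluations of the same product A @ B, differing only in
# accumulation order; a float64 matmul serves as the "exact" reference.
rng = np.random.default_rng(0)
M, K, N = 512, 64, 512
a = rng.standard_normal((M, K)).astype(np.float32)
b = rng.standard_normal((K, N)).astype(np.float32)

c_blas = a @ b  # BLAS-chosen accumulation order

# A different accumulation order, still entirely in float32:
# sum K in chunks of 16, mimicking a tiled reduction loop.
c_chunked = sum(a[:, k:k + 16] @ b[k:k + 16, :] for k in range(0, K, 16))

ref = a.astype(np.float64) @ b.astype(np.float64)
err_blas = np.abs(c_blas - ref).max()
err_chunked = np.abs(c_chunked - ref).max()
print(err_blas, err_chunked)
```

Both errors are tiny relative to float64 but generally not identical, which is why `torch.allclose` tolerances that work for one dtype/backend combination can fail for another.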
Using torch.float16, this works perfectly.