Using Triton 2.1.0, PyTorch 2.0.1, Python 3.10, and CUDA 11.8 on a PC running Ubuntu 22.04 with an NVIDIA 4090: if I take the example https://github.com/openai/triton/blob/main/python/tutorials/03-matrix-multiplication.py and just change the following lines, i.e. to use the float32 format,
```python
# line 244
c = accumulator
# line 339
a = torch.randn((M, K), device='cuda', dtype=torch.float32)
# line 340
b = torch.randn((K, N), device='cuda', dtype=torch.float32)
```
then the result differs from the pure PyTorch result by more than the defined tolerance. The error seems to increase the larger I make M, K, and N; it is particularly bad when M and N are very large and K is small.
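For what it's worth, part of the gap may just be that two mathematically equivalent float32 matmuls need not agree bitwise: the kernel and cuBLAS accumulate partial sums in different orders, so a tight fixed tolerance can fail even when both results are "correct". Here is a minimal CPU sketch of that effect using NumPy (an assumption standing in for the GPU kernels, not the Triton code itself), comparing a BLAS matmul against a chunked float32 accumulation of the same product, with a float64 result as reference:

```python
import numpy as np

# Two float32 evaluations of the same product A @ B, differing only in
# accumulation order; a float64 matmul serves as the "exact" reference.
rng = np.random.default_rng(0)
M, K, N = 512, 64, 512
a = rng.standard_normal((M, K)).astype(np.float32)
b = rng.standard_normal((K, N)).astype(np.float32)

c_blas = a @ b  # BLAS-chosen accumulation order

# A different accumulation order, still entirely in float32:
# sum K in chunks of 16, mimicking a tiled reduction loop.
c_chunked = sum(a[:, k:k + 16] @ b[k:k + 16, :] for k in range(0, K, 16))

ref = a.astype(np.float64) @ b.astype(np.float64)
err_blas = np.abs(c_blas - ref).max()
err_chunked = np.abs(c_chunked - ref).max()
print(err_blas, err_chunked)
```

Both errors are tiny relative to float64 but generally not identical, which is why `torch.allclose` tolerances that work for one dtype/backend combination can fail for another.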
Using torch.float16, this works perfectly.