Qu-Xiangjun opened this issue 1 year ago
I have also encountered this problem. Have you resolved it?
Unfortunately, I didn't find the reason.
I faced a similar issue when playing around with blocksparse matrix multiplication; here is my code:
```python
import torch
import triton.ops

device = torch.device("cuda")
dtype = torch.float16

# Parameters
batch_size = 2
head_size = 1
sequence_length = 32
d_model = 16
block_size = 16
use_int_tensors = False
max_int_val = 20

# Tensors
if use_int_tensors:
    tensor_a = torch.randint(1, max_int_val, (head_size, batch_size, sequence_length, d_model),
                             device=device, dtype=dtype)
    tensor_b = torch.randint(1, max_int_val, (head_size, batch_size, sequence_length, d_model),
                             device=device, dtype=dtype)
else:
    tensor_a = torch.rand((head_size, batch_size, sequence_length, d_model), device=device, dtype=dtype)
    tensor_b = torch.rand((head_size, batch_size, sequence_length, d_model), device=device, dtype=dtype)

identity_matrix = torch.eye(sequence_length, device=device, dtype=dtype).unsqueeze(0).unsqueeze(0).repeat(
    (head_size, batch_size, 1, 1))
sparsity_layout = torch.tensor([[[1, 1],
                                 [1, 1]],
                                [[1, 1],
                                 [1, 1]]])

# Triton Matmul
triton_matmul_sdd = triton.ops.blocksparse.matmul(sparsity_layout, block_size, "sdd", device)
triton_result = triton_matmul_sdd(tensor_a, torch.transpose(tensor_b, -1, -2))
triton_matmul_dsd = triton.ops.blocksparse.matmul(sparsity_layout, block_size, "dsd", device)
triton_identity = triton_matmul_dsd(triton_result, identity_matrix)

# Conventional Matmul
conventional_result = torch.matmul(tensor_a, torch.transpose(tensor_b, -1, -2))
conventional_identity = torch.matmul(conventional_result, identity_matrix)
assert torch.allclose(conventional_result, conventional_identity)

# Passes only up to max_int_val ~ 15 if computing with dtype float16 and integer values
if not torch.allclose(triton_identity, conventional_identity):
    difference = torch.abs(conventional_identity - triton_identity)
    indices = torch.where(difference > 1e-6)
    # Print the differing values
    print("Conventional Identity at differing indices: ", conventional_identity[indices])
    print("Triton Identity at differing indices: ", triton_identity[indices])
    print("Number of differing values: ", len(conventional_identity[indices]))
```
In this code I generate two tensors containing only integer values (or floating-point values if `use_int_tensors=False`) and multiply the two matrices using `torch.matmul` and `triton.ops.blocksparse.matmul` (with a sparsity layout that corresponds to a regular "full" matrix multiplication). When using `float32` as dtype I see numerical inaccuracies; with `float16` everything seems to work fine.

Am I overlooking something here? For now I'll stick to `float16`. Please let me know if there are other ways of achieving better precision!
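As a sanity check for which result is actually less accurate (rather than merely different), one can compare both against a higher-precision reference. This is a sketch of such a check, not part of the original test; it reuses the tensors from the snippet above and assumes a float64 torch matmul is a fair baseline:

```python
# Sketch of an extra diagnostic (not part of the original test): compare both results
# against a float64 reference, reusing tensor_a, tensor_b, identity_matrix,
# triton_identity and conventional_identity from the snippet above.
reference = torch.matmul(tensor_a.double(), torch.transpose(tensor_b, -1, -2).double())
reference = torch.matmul(reference, identity_matrix.double())

print("max |triton - reference|:", (triton_identity.double() - reference).abs().max().item())
print("max |torch  - reference|:", (conventional_identity.double() - reference).abs().max().item())
```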
I am using Triton 2.0.0, PyTorch 2.0.0, Python 3.9.16, and CUDA 11.6 on a PC running CentOS release 7.4.1708 with an NVIDIA A100. I am using the `matmul` and `blocksparse/matmul` ops from https://github.com/openai/triton/tree/main/python/triton/ops, with test code modeled on [test_matmul.py](https://github.com/openai/triton/blob/main/python/test/unit/operators/test_matmul.py) and [test_blocksparse.py](https://github.com/openai/triton/blob/main/python/test/unit/operators/test_blocksparse.py). When I compare the Triton matmul with `torch.matmul`, the results differ under `torch.allclose(atol=1e-5, rtol=0)`, as follows:
Matmul Test
The testing code is as follows:
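A minimal sketch of such a comparison (my reconstruction under assumptions, not the original snippet), assuming `triton.ops.matmul` from Triton 2.0 and fp16 inputs:

```python
# Minimal sketch (not the original snippet): compare triton.ops.matmul against
# torch.matmul and report the total absolute difference, assuming Triton 2.0-style ops.
import torch
import triton.ops

M, K, N = 2048, 2048, 2048  # one of the shapes mentioned below
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((K, N), device="cuda", dtype=torch.float16)

triton_c = triton.ops.matmul(a, b)
torch_c = torch.matmul(a, b)

diff = (triton_c - torch_c).abs()
print("total difference:", diff.sum().item())
if torch.allclose(triton_c, torch_c, atol=1e-5, rtol=0):
    print("✅ Triton and Torch match")
else:
    print("❌ Triton and Torch differ")
```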
This code prints a `total difference` greater than 0.0, and `torch.allclose` returns false. I then observed a few characteristics:

- The `diff` increases as the shape increases, so I guessed it might be related to accumulated rounding error in the computation. But when I run this code with `M, K, N = 4096, 4096, 4096` on my machine, it passes ✅ the `allclose` check with `diff = 0.000000`. Is it also related to the `shape`? The problem only occurs for some shapes.
- Moreover, I tried some special data at shape `M, N, K = 2048, 2048, 2048`:
  - With `a = torch.ones`, `b = torch.ones`, the check always passes ✅, so at least sometimes the problem is not related to the shape.
  - With `a = torch.ones`, `b = torch.randn`, every row of the result matrix is the same, and the incorrect elements are the same in every row as well.

Blocksparse Matmul Test
The precision problem also shows up in the blocksparse matmul function. The test code is as follows, using only the forward check from [test_blocksparse.py](https://github.com/openai/triton/blob/main/python/test/unit/operators/test_blocksparse.py):
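A minimal sketch of such a forward-only check (my reconstruction under assumptions, not the original snippet); it reuses the sdd + dsd-with-identity pattern from the earlier comment so the block-sparse result can be compared against a dense `torch.matmul` reference:

```python
# Minimal sketch (not the original snippet): forward-only block-sparse check with a fully
# dense layout, densifying the sdd output via a dsd matmul with an identity matrix.
import torch
import triton.ops

Z, H = 2, 2                       # assumed batch and head sizes
M, N, K = 256, 256, 256
block = 16
dtype = torch.float16
device = torch.device("cuda")

a = torch.randn((Z, H, M, K), device=device, dtype=dtype)
b = torch.randn((Z, H, K, N), device=device, dtype=dtype)
eye = torch.eye(N, device=device, dtype=dtype).expand(Z, H, N, N).contiguous()

layout = torch.ones((H, M // block, N // block), dtype=torch.int64)  # "full" layout

sdd = triton.ops.blocksparse.matmul(layout, block, "sdd", device)
dsd = triton.ops.blocksparse.matmul(layout, block, "dsd", device)

triton_c = dsd(sdd(a, b), eye)    # densify the block-sparse product
torch_c = torch.matmul(a, b)

diff = (triton_c - torch_c).abs()
print("total difference:", diff.sum().item())
print("allclose:", torch.allclose(triton_c, torch_c, atol=1e-5, rtol=0))
```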
This code also prints a `total difference` greater than 0.0, and `torch.allclose` returns false. I then observed:

- With `M, N, K = 256, 256, 256`, the code always passes ✅.
- With `N, K = 4096, 4096`, more than half of the tested range prints `❌ Triton and Torch differ` (see the sweep sketch below).
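A hypothetical shape sweep (my sketch, not the original script) that keeps `N, K = 4096` fixed and varies `M`, repeating the dense-layout check from the sketch above:

```python
# Hypothetical sweep (not the original script): vary M with N, K = 4096 fixed and repeat
# the dense-layout block-sparse check to see which shapes trigger the mismatch.
import torch
import triton.ops

Z, H, block = 1, 2, 16
dtype, device = torch.float16, torch.device("cuda")
N, K = 4096, 4096

for M in range(256, 2049, 256):
    a = torch.randn((Z, H, M, K), device=device, dtype=dtype)
    b = torch.randn((Z, H, K, N), device=device, dtype=dtype)
    eye = torch.eye(N, device=device, dtype=dtype).expand(Z, H, N, N).contiguous()
    layout = torch.ones((H, M // block, N // block), dtype=torch.int64)
    sdd = triton.ops.blocksparse.matmul(layout, block, "sdd", device)
    dsd = triton.ops.blocksparse.matmul(layout, block, "dsd", device)
    triton_c = dsd(sdd(a, b), eye)
    ok = torch.allclose(triton_c, torch.matmul(a, b), atol=1e-5, rtol=0)
    print(M, "✅" if ok else "❌ Triton and Torch differ")
```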
So what could be causing the incorrect precision, and how can I solve the problem?