horrorChen closed this issue 3 months ago.
Also, after I load the data with shape [2048, 8], I can either call `tl.sum(x)` or do the tensor multiply `x * y`. But the answer goes wrong when I want to get `tl.sum(x * y)`.
The main post has been solved. The problem comes from the backward gradient tensor not being contiguous. Calling `.contiguous()` on the input `scores_grad` makes it load correctly.
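For anyone hitting the same thing, a minimal sketch of the fix, assuming a launcher like the one below (`launch_backward_kernel` is a placeholder name, not code from this issue):

```python
import torch

def launch_backward_kernel(scores_grad: torch.Tensor) -> torch.Tensor:
    # A kernel that indexes the buffer as if it were dense row-major
    # reads the wrong memory from a non-contiguous tensor, so force a
    # contiguous copy before computing the offsets passed to tl.load.
    scores_grad = scores_grad.contiguous()
    # ... launch the actual Triton kernel on scores_grad here ...
    return scores_grad

# A transposed view is a typical source of non-contiguous gradients:
g = torch.randn(8, 2048).t()   # shape [2048, 8], non-contiguous
assert not g.is_contiguous()
assert launch_backward_kernel(g).is_contiguous()
```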
> Also, after I load the data with shape [2048, 8], I can either call `tl.sum(x)` or do the tensor multiply `x * y`. But the answer goes wrong when I want to get `tl.sum(x * y)`.
However, this problem still exists. Both `x * y` and `tl.sum(x)` have no precision error compared to torch, but `tl.sum(x * y)` shows a ~1e-16 difference. I may open a new issue if I cannot solve it.
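For context, ~1e-16 is around machine epsilon for float64 (≈2.2e-16), so this is most likely a different summation order in the two reductions rather than a correctness bug. A minimal illustration in plain torch, using random data rather than the attached tensors:

```python
import torch

x = torch.randn(2048, 8, dtype=torch.float64)
y = torch.randn(2048, 8, dtype=torch.float64)

ref = (x * y).sum()
# Summing the same products in a different order (as a tree reduction
# such as tl.sum may do) already shifts the result by a few ulps:
reordered = (x * y).flatten().flip(0).sum()
print((ref - reordered).abs().item())  # typically 0 to ~1e-15
```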
Thx.
Hello, I am training a model with a Triton kernel, but it produces NaN in the backward pass. I exported the intermediate data and found that `tl.load` does not preserve the values seen outside the kernel. The original values range from ~1e-9 to ~1e-12, while the loaded numbers differ by up to 1e35, and some values just go to 0. I have tried several methods to locate the bug, but I cannot reproduce the problem with `torch.rand` inputs. It seems the problem is related to both the shape and the values.
Some insights:
I have tested on the nightly release 3.0.0.post20240716052845. To reproduce:
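The original snippet did not survive this export, so below is a minimal sketch of the setup described, assuming the attached file (unzipped) holds the two tensors; the kernel name and the dictionary keys are my guesses, not the original code:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def sum_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    y = tl.load(y_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr, tl.sum(x * y))

# torch.rand inputs do not reproduce the issue, so load the attached
# tensors instead ("x"/"y" keys are assumed here).
data = torch.load("tensor_20240716-231606.pth")
x, y = data["x"].cuda(), data["y"].cuda()     # shape [2048, 8]

out = torch.empty(1, dtype=x.dtype, device="cuda")
n = x.numel()
BLOCK_SIZE = triton.next_power_of_2(n)        # 16384 for [2048, 8]
sum_mul_kernel[(1,)](x.reshape(-1), y.reshape(-1), out, n, BLOCK_SIZE=BLOCK_SIZE)
print(out.item(), (x * y).sum().item())       # differ by ~1e-16
```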
The shape and `BLOCK_SIZE` can also be changed to other numbers.
The data is here: tensor_20240716-231606.pth.zip