triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

Error on test 03 #3805

Closed hforoughmand closed 4 months ago

hforoughmand commented 5 months ago

When I run tutorial 03 (https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#sphx-glr-getting-started-tutorials-03-matrix-multiplication-py) on a V100, I get the following error.

triton_output_with_fp16_inputs=tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
torch_output_with_fp16_inputs=tensor([[  1.1045, -36.9688,  31.4688,  ..., -11.3906,  24.4531, -32.3438],
        [  6.3516, -19.6094,  34.0938,  ...,  -5.8906,   5.2812,   6.8828],
        [-32.0625,   5.9531,  15.3984,  ..., -21.4062, -23.9844, -10.1328],
        ...,
        [ -5.7070,   7.4492,   8.2656,  ..., -10.6953, -40.0000,  17.7500],
        [ 25.5000,  24.3438,  -8.4609,  ..., -18.9375,  32.5312, -29.9219],
        [ -5.3477,   4.9805,  11.8828,  ...,   5.5859,   6.4023, -17.3125]],
       device='cuda:0', dtype=torch.float16)
❌ Triton and Torch differ
loc("03-matrix-multiplication.py":261:35): error:  size mismatch when packing elements for LLVM struct expected 32 but got 64
python: /root/.triton/llvm/llvm-5e5a22ca-centos-x64/include/llvm/ADT/ArrayRef.h:257: const T& llvm::ArrayRef<T>::operator[](size_t) const [with T = mlir::Type; size_t = long unsigned int]: Assertion `Index < Length && "Invalid index!"' failed.

Is this a problem with my installation, or a known bug? My CUDA version is 11.8, my Python version is 3.8.18, and my PyTorch version is 2.3.0+cu118.
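When reporting mismatches like this, it helps to state the exact installed versions of triton and torch, since the tutorial code on the website may not match an older installed release. A minimal stdlib-only sketch for collecting them (the `report_env` helper name is hypothetical):

```python
import importlib.metadata as md

def report_env(packages=("triton", "torch")):
    """Collect installed versions of packages relevant to a Triton bug report."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

# Prints something like {'triton': '2.3.0', 'torch': '2.3.0+cu118'}
print(report_env())
```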

fkouteib commented 5 months ago

Hey @hforoughmand, if you want to run the version of the tutorial at the tip of the main branch (which is also what is published on the Triton website), I recommend building and installing Triton from source on the main branch, or installing a nightly build (v3.0-*) per the README instructions.

If you want to run a Triton stable release (2.x) installed from PyPI (or installed implicitly alongside a PyTorch stable release), then I recommend running the version of the tutorial code from the corresponding release branch, which may differ from the version on the website.
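The rule of thumb above can be sketched as a small helper that maps an installed version string to the branch whose tutorials should match it. This is a hedged illustration: the `tutorial_branch_for` name is made up, and the `release/X.Y.x` branch-naming scheme is an assumption about the repository layout, not something stated in this thread.

```python
def tutorial_branch_for(version: str) -> str:
    """Map an installed Triton version to the branch whose tutorials match it.

    Assumption: stable 2.x releases correspond to release/X.Y.x branches,
    while 3.0 nightlies and source builds track main.
    """
    core = version.split("+")[0]        # drop local build metadata, e.g. "+cu118"
    major, minor = core.split(".")[:2]
    if int(major) >= 3:
        return "main"
    return f"release/{major}.{minor}.x"

print(tutorial_branch_for("2.3.0"))    # -> release/2.3.x
print(tutorial_branch_for("3.0.0"))    # -> main
```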

For matmul on a V100 specifically, you may want to review the open issues referencing V100. I think there may be a known FP16 regression on that hardware; see https://github.com/openai/triton/issues/3478.