rayleizhu opened this issue 1 year ago
3D ops may be buggy in certain cases.
We will update the documentation soon.
So, with Triton, is it safe to think in a CUDA-C way (e.g., regard tl.dot as something like a tensor-core MMA)?
> it is safe to think in a way like CUDA-C
Maybe I'm not getting your question clearly. The programming models are different: with CUDA you program each thread, block, and block cluster, whereas Triton only lets you specify the behavior of each block.
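For concreteness, here is a minimal vector-add sketch of that block-level model (illustrative only, not code from this thread): each program instance owns a whole BLOCK_SIZE tile, and there is no per-thread code anywhere.

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which tile this program instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)  # the compiler maps this tile onto threads
```

On the host side, such a kernel is launched with a 1D grid of `triton.cdiv(n_elements, BLOCK_SIZE)` program instances; how each tile is distributed across threads and warps is decided by the compiler.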
You got it. My description abused some terms. Actually, I turned to Triton exactly because I want to avoid the tedious issues below the thread-block level. Thanks.
Now that I'm thinking in a CUDA-C way, I have one more question:
Should I take shared memory into consideration? For example, do I need to choose the tile size in a thread block so it won't overflow shared memory? How about registers?
> Should I take shared memory into consideration?

No, Triton handles it for you.

> How about registers?

Again, no.
That said, the generated code may sometimes have suboptimal shared memory usage or register counts; in that case, you can submit another issue.
Any suggestions for coping with 3D or 4D indexing in a safe way with the current version (Triton 1.1)?
Basically, I need to load a 3D or 4D tile (e.g. a tile of shape (ch, h_tile, w_tile)) and then reshape it to 2D for matmul.
Ideally, I want to do something like the following:
```python
# Desired (3D) version: build a (h_tile, w_tile, c_tile) grid of pointers,
# then flatten the spatial dims so the tile can be fed to a 2D matmul.
offs_h = h_start + tl.arange(0, h_tile)[:, None, None]   # (h_tile, 1, 1)
offs_w = w_start + tl.arange(0, w_tile)[None, :, None]   # (1, w_tile, 1)
offs_c = c_start + tl.arange(0, c_tile)[None, None, :]   # (1, 1, c_tile)
tile_ptrs = ptr + offs_h * stride_h + offs_w * stride_w + offs_c * stride_c
tile_ptrs = tl.reshape(tile_ptrs, (h_tile * w_tile, c_tile))
tile = tl.load(tile_ptrs)
```
Otherwise, I reduce it to 2D manually:
```python
# Workaround (2D) version: pre-flatten the spatial offsets with tl.ravel so
# every tensor in the kernel stays 2D.
offs_h = h_start + tl.arange(0, h_tile)[:, None]           # (h_tile, 1)
offs_w = w_start + tl.arange(0, w_tile)[None, :]           # (1, w_tile)
offs_spatial = offs_h * stride_h + offs_w * stride_w       # (h_tile, w_tile)
offs_spatial = tl.ravel(offs_spatial)                      # (h_tile * w_tile,)
offs_c = (c_start + tl.arange(0, c_tile)) * stride_c       # (c_tile,)
tile_ptrs = ptr + offs_spatial[:, None] + offs_c[None, :]  # (h_tile * w_tile, c_tile)
tile = tl.load(tile_ptrs)
```
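For completeness, here is a sketch of how that 2D workaround could sit inside a full kernel that feeds the loaded tile into tl.dot. Everything below (the kernel name, `weight_ptr`, the row-major weight layout, the tile sizes) is an assumption for illustration, and masks/boundary checks are omitted:

```python
import triton
import triton.language as tl

@triton.jit
def gather_matmul_kernel(ptr, weight_ptr, out_ptr,
                         stride_c, stride_h, stride_w,
                         c_start, h_start, w_start,
                         H_TILE: tl.constexpr, W_TILE: tl.constexpr,
                         C_TILE: tl.constexpr, N_TILE: tl.constexpr):
    # Flatten the (h, w) offsets onto one axis so every tensor stays 2D.
    offs_h = h_start + tl.arange(0, H_TILE)[:, None]
    offs_w = w_start + tl.arange(0, W_TILE)[None, :]
    offs_spatial = tl.ravel(offs_h * stride_h + offs_w * stride_w)   # (H_TILE*W_TILE,)
    offs_c = (c_start + tl.arange(0, C_TILE)) * stride_c             # (C_TILE,)

    tile_ptrs = ptr + offs_spatial[:, None] + offs_c[None, :]        # (H_TILE*W_TILE, C_TILE)
    tile = tl.load(tile_ptrs)

    # Hypothetical row-major weight of shape (C_TILE, N_TILE).
    offs_n = tl.arange(0, N_TILE)
    w = tl.load(weight_ptr + tl.arange(0, C_TILE)[:, None] * N_TILE + offs_n[None, :])

    # Tile sizes are assumed to satisfy tl.dot's minimum block dimensions.
    acc = tl.dot(tile, w)                                            # (H_TILE*W_TILE, N_TILE)
    out_ptrs = out_ptr + tl.arange(0, H_TILE * W_TILE)[:, None] * N_TILE + offs_n[None, :]
    tl.store(out_ptrs, acc)
```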
Any recommendation?
If you don't have dot/trans ops in the code, I suppose you could still declare 3D/4D tensors. I haven't tested it, though.
I need to slice a 3D volume and do something like a batched matmul (torch.bmm), like below.
Is this doable? Or do I need to decompose it into 2D subproblems by myself?
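For reference, one way such a decomposition can be written in Triton is to put the batch index on its own program axis and run an ordinary 2D tiled matmul per block. The sketch below is illustrative only (parameter names, strides, and tile sizes are assumptions; it also assumes float32 tensors whose dimensions are multiples of the block sizes, so masks are omitted):

```python
import triton
import triton.language as tl

@triton.jit
def bmm_kernel(a_ptr, b_ptr, out_ptr,
               M, N, K,
               stride_ab, stride_am, stride_ak,
               stride_bb, stride_bk, stride_bn,
               stride_ob, stride_om, stride_on,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    batch = tl.program_id(axis=0)          # one batch element per program along axis 0
    pid = tl.program_id(axis=1)            # which (M, N) output tile along axis 1
    num_pid_n = tl.cdiv(N, BLOCK_N)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Offset the base pointers to the current batch, then index as a plain 2D matmul.
    a_ptrs = a_ptr + batch * stride_ab + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + batch * stride_bb + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak      # advance both operands along K
        b_ptrs += BLOCK_K * stride_bk

    out_ptrs = out_ptr + batch * stride_ob + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    tl.store(out_ptrs, acc)
```

On the host side, the launch grid would then be something like `(B, triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N))`, so each batch element becomes an independent 2D matmul.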
BTW, Triton is a great tool that may revolutionize operator customization in deep learning. It would be even better if there were more detailed documentation 😃