The following TTGIR currently fails with CUDA_ERROR_ILLEGAL_ADDRESS.
The configuration used for this test case is {"block_m":16,"block_n":16,"block_k":16,"split_k":1,"num_stages":2,"num_warps":2,"num_ctas":1}
The test also fails if the configuration is {"block_m":16,"block_n":32,"block_k":32,"split_k":1,"num_stages":2,"num_warps":2,"num_ctas":1}, but succeeds if it is {"block_m":16,"block_n":16,"block_k":32,"split_k":1,"num_stages":2,"num_warps":2,"num_ctas":1}.
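For quick reference, the three configurations above differ only in `block_n`/`block_k`. A small sketch summarizing the outcomes reported above (the structure here is illustrative, not taken from the repro script):

```python
# Shared fields across all three tested configurations.
base = {"split_k": 1, "num_stages": 2, "num_warps": 2, "num_ctas": 1}

# (block_m, block_n, block_k) -> observed outcome, per the report above.
cases = {
    (16, 16, 16): "fails: CUDA_ERROR_ILLEGAL_ADDRESS",
    (16, 32, 32): "fails: CUDA_ERROR_ILLEGAL_ADDRESS",
    (16, 16, 32): "succeeds",
}

for (bm, bn, bk), outcome in cases.items():
    cfg = {"block_m": bm, "block_n": bn, "block_k": bk, **base}
    print(cfg, "->", outcome)
```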
Before #5044 this would fail in a different way: you would instead get the error reported in https://github.com/triton-lang/triton/issues/3435, which is mitigated by applying the changes in #4768.
Note that the TTGIR at this stage is identical before and after #5044, so the behavioral change must occur during or after lowering to LLVM.
Looking at the differences in the LLVM IR, these instructions are extra in the failing version, so perhaps we also need to remove them, similar to how the old logic did in #4768.