Minimal reproduction:
```python
import taichi as ti

block_dim = 64
N = 256

ti.init(arch=ti.cuda, print_ir=True, print_kernel_llvm_ir=True)

@ti.kernel
def test(out: ti.types.ndarray()):
    ti.loop_config(block_dim=block_dim)
    for i in range(N):
        #gtid = ti.global_thread_idx()
        tid = i % block_dim
        val = i * 1.0
        sharr = ti.simt.block.SharedArray((block_dim,), ti.f32)
        sharr[tid] = val
        ti.simt.block.sync()
        #ti.atomic_add(sharr[0], val)
        sharr[0] += sharr[tid]
        ti.simt.block.sync()
        out[i] = sharr[tid]

arr = ti.ndarray(ti.f32, (N))
test(arr)
print(arr.to_numpy())
```
This gives unexpected numerical results.
The atomic add is somehow demoted.
This may be because some compiler pass treats the shared array as a local variable and then removes the atomic add. Will keep investigating.
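For reference, a reduced variant that keeps only the explicit `ti.atomic_add` path (the form commented out in the reproduction above) might help check whether the demotion happens regardless of how the accumulation is written. This is only a sketch, not verified; `reduce_test` is a hypothetical name, and it assumes the same `SharedArray`/`loop_config` setup as the original kernel:

```python
# Hypothetical reduced kernel (not from the original report): keeps only the
# explicit ti.atomic_add on the SharedArray to see whether the demotion also
# occurs without the `sharr[0] += ...` form.
import taichi as ti

block_dim = 64
N = 256

ti.init(arch=ti.cuda, print_ir=True)

@ti.kernel
def reduce_test(out: ti.types.ndarray()):
    ti.loop_config(block_dim=block_dim)
    for i in range(N):
        tid = i % block_dim
        val = i * 1.0
        sharr = ti.simt.block.SharedArray((block_dim,), ti.f32)
        sharr[tid] = val
        ti.simt.block.sync()
        # Explicit atomic form; if the same pass demotes this too, the printed
        # IR should show the atomic turned into a plain load/store.
        ti.atomic_add(sharr[0], val)
        ti.simt.block.sync()
        out[i] = sharr[0]

arr = ti.ndarray(ti.f32, N)
reduce_test(arr)
print(arr.to_numpy())
```

If the explicit form survives while the `+=` form is demoted, that would narrow the problem down to how the augmented assignment on a SharedArray element is lowered.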