The current fallback dlight schedule does not decompose the reduction init blocks, which can lead to correctness issues (observed on Metal but not on CUDA).
Applying DecomposeReduction resolves the issue and also provides a (minor) performance improvement.
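
For readers unfamiliar with the transformation, here is a conceptual sketch in plain Python of what decomposing a reduction init means. It is not TVM code and not the actual schedule change; the function names `matvec_fused_init` and `matvec_decomposed_init` are hypothetical and exist only for illustration.

```python
# Conceptual sketch only: plain Python, not TVM/TIR.
# Both functions compute y = A @ x; they differ in where the
# reduction accumulator is initialized.

def matvec_fused_init(A, x):
    """Init stays inside the reduction loop, guarded by the reduction index."""
    n, m = len(A), len(x)
    y = [None] * n
    for i in range(n):
        for k in range(m):
            if k == 0:
                # Init runs as part of the first reduction iteration.
                y[i] = 0.0
            y[i] += A[i][k] * x[k]
    return y


def matvec_decomposed_init(A, x):
    """Init is split out into its own step that runs before the reduction loop."""
    n, m = len(A), len(x)
    y = [None] * n
    for i in range(n):
        # Separate init step, no longer dependent on the reduction index.
        y[i] = 0.0
        for k in range(m):
            y[i] += A[i][k] * x[k]
    return y


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    print(matvec_fused_init(A, x))
    print(matvec_decomposed_init(A, x))
```

Both forms give the same result on a sequential interpreter. The decomposed form removes the per-iteration guard from the reduction loop, which is plausibly where the minor performance gain comes from.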