Open jansel opened 2 hours ago
@dnikolaev-amd @tenpercent @jithunnair-amd who is the right person to tag on this type of stuff?
@jansel could you share a runnable repro script, e.g. the generated wrapper code, for the cases that are timing out? AMD's Triton folks could then look into it.
#137756 adds support for generating cooperative reductions in Triton, something like:
where you have a grid-wide barrier allowing multiple thread groups to exchange data.
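To make the pattern concrete for anyone triaging this, here is a minimal CPU-side sketch of a cooperative reduction. It is not the kernel Inductor generates in the PR. It is only an analogy, assuming Python threads stand in for thread groups and `threading.Barrier` stands in for the grid-wide barrier. The names `cooperative_sum`, `workspace`, and `num_blocks` are hypothetical.

```python
import threading


def cooperative_sum(data, num_blocks=4):
    """Sum `data` with `num_blocks` workers that exchange partials via a barrier.

    Hypothetical illustration only: threads play the role of thread groups,
    and threading.Barrier plays the role of the grid-wide barrier.
    """
    workspace = [0] * num_blocks       # shared scratch space for partial sums
    results = [None] * num_blocks      # final value as seen by each worker
    barrier = threading.Barrier(num_blocks)

    def block(block_id):
        # Phase 1: each worker reduces its own strided slice of the input.
        workspace[block_id] = sum(data[block_id::num_blocks])
        # Barrier: no worker proceeds until every worker has written its partial.
        barrier.wait()
        # Phase 2: every worker reads all partials and finishes the reduction.
        results[block_id] = sum(workspace)

    threads = [threading.Thread(target=block, args=(i,)) for i in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `cooperative_sum(list(range(10)), num_blocks=4)` gives every worker the same total. A barrier of this kind only releases once all participants have arrived, so it relies on every participant actually running.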
According to https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322 these kernels time out on ROCm.
Can someone from the AMD team take a look at getting cooperative reductions working on AMD hardware?
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @ezyang @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @aakhundov