I recently read a paper that proposes the bruck2phase algorithm for alltoallv communication. The paper reports that bruck2phase outperforms the SLOVX algorithm, which is the ucp_alltoallv_hybrid algorithm in UCC, so I am porting it to UCC to compare the two. However, when I benchmark my implementation with ucc_perftest, the program hangs and never reports any latency. What could cause this?
paper: https://dl.acm.org/doi/10.1145/3502181.3531468
Here is the source code: `static void ucc_tl_ucp_alltoallv_bruck2phase_progress(ucc_coll_task_t *coll_task) { ucc_tl_ucp_task_t *task = ucc_derived_of(coll_task, ucc_tl_ucp_task_t); ucc_tl_ucp_team_t *team = TASK_TEAM(task); ucc_rank_t grank = UCC_TL_TEAM_RANK(team); ucc_rank_t gsize = UCC_TL_TEAM_SIZE(team); ptrdiff_t sbuf = (ptrdiff_t)TASK_ARGS(task).src.info_v.buffer; ptrdiff_t rbuf = (ptrdiff_t)TASK_ARGS(task).dst.info_v.buffer; ucc_memory_type_t smem = TASK_ARGS(task).src.info_v.mem_type; ucc_memory_type_t rmem = TASK_ARGS(task).dst.info_v.mem_type; size_t sdt_size = ucc_dt_size(TASK_ARGS(task).src.info_v.datatype); size_t rdt_size = ucc_dt_size(TASK_ARGS(task).dst.info_v.datatype); int *s_disps = (int *)TASK_ARGS(task).src.info_v.displacements; int *r_disps = (int *)TASK_ARGS(task).dst.info_v.displacements; int *scounts = (int *)TASK_ARGS(task).src.info_v.counts; int *rcounts = (int *)TASK_ARGS(task).dst.info_v.counts;
out: return; }`
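
For anyone not familiar with Bruck-style exchanges, here is a minimal standalone sketch of the round schedule I am trying to reproduce. It assumes the classic radix-2 formulation, in which round k sends to rank + 2^k and receives from rank - 2^k (some variants reverse the direction). All helper names are hypothetical and come from neither UCC nor the paper. As I understand the paper, each round exchanges the block sizes first and the data second; the sketch only covers the peer and block selection, not the transfers themselves.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: number of rounds of a radix-2 Bruck exchange,
 * i.e. ceil(log2(nranks)). */
static int bruck_num_rounds(int nranks)
{
    int rounds = 0;

    assert(nranks > 0);
    for (int dist = 1; dist < nranks; dist <<= 1) {
        rounds++;
    }
    return rounds;
}

/* Peer that `rank` sends to in the given round (assumed direction). */
static int bruck_send_peer(int rank, int nranks, int round)
{
    int dist = 1 << round;

    assert(rank >= 0 && rank < nranks && dist < nranks);
    return (rank + dist) % nranks;
}

/* Peer that `rank` receives from in the given round. */
static int bruck_recv_peer(int rank, int nranks, int round)
{
    int dist = 1 << round;

    assert(rank >= 0 && rank < nranks && dist < nranks);
    return (rank - dist + nranks) % nranks;
}

/* Fill `blocks` with the indices of the blocks forwarded in this round:
 * every index in 1..nranks-1 whose bit `round` is set. The caller must
 * provide room for at least nranks entries. Returns the block count. */
static int bruck_round_blocks(int nranks, int round, int *blocks)
{
    int n = 0;

    for (int i = 1; i < nranks; i++) {
        if (i & (1 << round)) {
            blocks[n++] = i;
        }
    }
    return n;
}

/* Print the full schedule for one rank; `scratch` needs nranks entries. */
static void bruck_print_schedule(int rank, int nranks, int *scratch)
{
    for (int k = 0; k < bruck_num_rounds(nranks); k++) {
        int n = bruck_round_blocks(nranks, k, scratch);

        printf("round %d: send to %d, recv from %d, blocks:", k,
               bruck_send_peer(rank, nranks, k),
               bruck_recv_peer(rank, nranks, k));
        for (int i = 0; i < n; i++) {
            printf(" %d", scratch[i]);
        }
        printf("\n");
    }
}
```

Calling `bruck_print_schedule(0, 8, scratch)` with an 8-entry `scratch` array lists the three rounds for rank 0 of an 8-rank team, which is useful for checking that every send in a round has a matching receive on the peer.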