Closed nsarka closed 5 months ago
Can one of the admins verify this patch?
@nsarka Does this run (optimally or sub optimally) on Hosts without DPUs?
@nsarka Does this run (optimally or sub optimally) on Hosts without DPUs?
Yes, it should work suboptimally on Hosts without DPUs, so long as the global work buffer info struct is populated with the correct ucp info.
Hi Nicolas! Thank you for this PR which looks really great. I left a bunch of remarks, most of them are really minor and only deal with the codestyle. Let me stress here a couple of items:
- The collective needs to be non-blocking; however I fear that the progress function
ucc_tl_ucp_allreduce_sliding_window_req_test
is blocking and needs to be fixed- I may be wrong but there might be an issue with
ucc_tl_ucp_allreduce_sliding_window_reduction
, which seems to only post the reduction but does not poll completion- Can you please update the tests to test this algorithm as well?
Hi Sam, I have taken care of all of the above items. Please let me know if it looks good to you.
This PR implements Sliding Window Allreduce, to be used by the UROM UCC worker. The algorithm gets passed all of the data it needs (all host source bufs, destination bufs, src rkeys, dst rkeys, ucp workers, endpoints, ...) via the global work buffer. It's set up to use one ucp worker per thread. All ucp setup is done by the UROM UCC worker outside of this UCC change.
The algorithm itself is meant to run on the DPU.