Sliding Window Allreduce

nsarka commented 8 months ago

This PR implements Sliding Window Allreduce, to be used by the UROM UCC worker. The algorithm gets passed all of the data it needs (all host source bufs, destination bufs, src rkeys, dst rkeys, ucp workers, endpoints, ...) via the global work buffer. It's set up to use one ucp worker per thread. All ucp setup is done by the UROM UCC worker outside of this UCC change.

The algorithm itself is meant to run on the DPU.

swx-jenkins3 commented 8 months ago

Can one of the admins verify this patch?

manjugv commented 8 months ago

@nsarka Does this run (optimally or sub optimally) on Hosts without DPUs?

nsarka commented 8 months ago

@nsarka Does this run (optimally or sub optimally) on Hosts without DPUs?

Yes, it should work suboptimally on Hosts without DPUs, so long as the global work buffer info struct is populated with the correct ucp info.

nsarka commented 7 months ago

Hi Nicolas! Thank you for this PR which looks really great. I left a bunch of remarks, most of them are really minor and only deal with the codestyle. Let me stress here a couple of items:

The collective needs to be non-blocking; however I fear that the progress function ucc_tl_ucp_allreduce_sliding_window_req_test is blocking and needs to be fixed

I may be wrong but there might be an issue with ucc_tl_ucp_allreduce_sliding_window_reduction, which seems to only post the reduction but does not poll completion

Can you please update the tests to test this algorithm as well?

Hi Sam, I have taken care of all of the above items. Please let me know if it looks good to you.

openucx / ucc

Sliding Window Allreduce #862