openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
177 stars 85 forks source link

CL_DOCA_UROM #978

Open nsarka opened 1 month ago

nsarka commented 1 month ago

This PR adds CL_DOCA_UROM, responsible for reading coll args and building a UROM command to send to the DPU for offload.

The PR will be 1/n PRs for getting the complete code in. This first one is just the CL boilerplate code and empty functions.

@janjust

janjust commented 1 month ago

@Sergei-Lebedev @samnordmann We're opening up the first of probably 2-3 additional PRs.

The first one is a bit larger; however, we trid to strip away as much as possible. This is literally just a template code. If you were to copy/paste any other CL. Please take a look. I'm already going to approve it because Nick and I went over this code multiple times. We also did a code review with @wfaderhold21

janjust commented 1 month ago

bot:retest

janjust commented 1 month ago

@B-a-S can you issue a re-test command? I think this is a clear CI issue

[2024-05-24T17:30:56.579Z] + docker pull harbor.mellanox.com/torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312
[2024-05-24T17:30:56.579Z] Error response from daemon: unknown: artifact torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312 not found
[Pipeline] echo
[2024-05-24T17:30:56.601Z] matrix_job INFO: Image NOT found - harbor.mellanox.com/torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312 - will build .ci/Dockerfile.ngc_pytorch ...

I can't re-run tests

nsarka commented 3 weeks ago

Hey Sam, I updated the PR with all of your comments. I only replied to one though to ask your opinion

manjugv commented 2 weeks ago

Thanks for syncing up today (@nsarka @janjust @Sergei-Lebedev @wfaderhold21 ).

Next steps:

  1. Restructure code and move plugin code to contrib and UROM CL
  2. Naming to reflect the differentiation between UROM CL and plugin code
  3. Add FAQ reflecting the usage and design document (@wfaderhold21 to help)

In future releases, we will evolve towards plugin API.

janjust commented 6 days ago

I took the liberty to add the arch-review completed since we had our sync on this PR. This is now a full implementation over what was discussed with the plug-in code in the contrib folder.

@wfaderhold21 @Sergei-Lebedev @manjugv Where should we put the readme for the PR?

janjust commented 5 days ago

@wfaderhold21 Please review @Sergei-Lebedev this is ready now

wfaderhold21 commented 5 days ago

@janjust I have started my review a few weeks ago now, but still have no answers to my questions. Waiting for @nsarka

janjust commented 5 days ago

@wfaderhold21 I'm sorry we missed it - can you point me to the questions? I looked over the code and I'm not seeing it - what are we missing?

wfaderhold21 commented 5 days ago

@wfaderhold21 I'm sorry we missed it - can you point me to the questions? I looked over the code and I'm not seeing it - what are we missing?

@janjust I think I'm just not good at github... I assume you can see them now.

nsarka commented 4 days ago

The failing CI test is in the IMB code:

/tmp/ompi/install/bin/mpicc -g -O0 -Wall -Werror -DCHECK=1 -Ihelpers -I../src_c -DMPI1 -I. -DMPI1 -c -o MPI1/CPU/IMB_scatter.o ../src_c/IMB_scatter.c
../src_c/IMB_chk_diff.c: In function ‘IMB_chk_distr’:
../src_c/IMB_chk_diff.c:1065:13: error: this ‘if’ clause does not guard... [-Werror=misleading-indentation]
 1065 |             if (ranks[i + 1] == rank && pos2 - pos1 + 1 == (lengths[i] + (lengths[i + 1])))
      |             ^~
../src_c/IMB_chk_diff.c:1066:41: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
 1066 |                 c_info->reccnt[rank]++; i++;