Open nsarka opened 1 month ago
@Sergei-Lebedev @samnordmann We're opening up the first of probably 2-3 additional PRs.
The first one is a bit larger; however, we trid to strip away as much as possible. This is literally just a template code. If you were to copy/paste any other CL. Please take a look. I'm already going to approve it because Nick and I went over this code multiple times. We also did a code review with @wfaderhold21
bot:retest
@B-a-S can you issue a re-test command? I think this is a clear CI issue
[2024-05-24T17:30:56.579Z] + docker pull harbor.mellanox.com/torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312
[2024-05-24T17:30:56.579Z] Error response from daemon: unknown: artifact torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312 not found
[Pipeline] echo
[2024-05-24T17:30:56.601Z] matrix_job INFO: Image NOT found - harbor.mellanox.com/torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312 - will build .ci/Dockerfile.ngc_pytorch ...
I can't re-run tests
Hey Sam, I updated the PR with all of your comments. I only replied to one though to ask your opinion
Thanks for syncing up today (@nsarka @janjust @Sergei-Lebedev @wfaderhold21 ).
Next steps:
In future releases, we will evolve towards plugin API.
I took the liberty to add the arch-review completed since we had our sync on this PR. This is now a full implementation over what was discussed with the plug-in code in the contrib folder.
@wfaderhold21 @Sergei-Lebedev @manjugv Where should we put the readme for the PR?
@wfaderhold21 Please review @Sergei-Lebedev this is ready now
@janjust I have started my review a few weeks ago now, but still have no answers to my questions. Waiting for @nsarka
@wfaderhold21 I'm sorry we missed it - can you point me to the questions? I looked over the code and I'm not seeing it - what are we missing?
@wfaderhold21 I'm sorry we missed it - can you point me to the questions? I looked over the code and I'm not seeing it - what are we missing?
@janjust I think I'm just not good at github... I assume you can see them now.
The failing CI test is in the IMB code:
/tmp/ompi/install/bin/mpicc -g -O0 -Wall -Werror -DCHECK=1 -Ihelpers -I../src_c -DMPI1 -I. -DMPI1 -c -o MPI1/CPU/IMB_scatter.o ../src_c/IMB_scatter.c
../src_c/IMB_chk_diff.c: In function ‘IMB_chk_distr’:
../src_c/IMB_chk_diff.c:1065:13: error: this ‘if’ clause does not guard... [-Werror=misleading-indentation]
1065 | if (ranks[i + 1] == rank && pos2 - pos1 + 1 == (lengths[i] + (lengths[i + 1])))
| ^~
../src_c/IMB_chk_diff.c:1066:41: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
1066 | c_info->reccnt[rank]++; i++;
This PR adds CL_DOCA_UROM, responsible for reading coll args and building a UROM command to send to the DPU for offload.
The PR will be 1/n PRs for getting the complete code in. This first one is just the CL boilerplate code and empty functions.
@janjust