openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
195 stars 96 forks source link

TL/MLX5: fix memtype in bcast reliability #1022

Open MamziB opened 1 week ago

MamziB commented 1 week ago

TL/MLX5: fix memtype in bcast reliability

MamziB commented 1 week ago

@samnordmann you are correct, it was a shortcoming from my side. Reliablity protocol kicks when a packet is dropped and it does not happen in normal scenarios. Previously, I tried forcing the dropped packet in the HOST version for testing purposes, but not for GPU. Since we saw a dropped packet on the ROCK system, where MTT runs, we saw this error, and here is a fix for that.

samnordmann commented 1 week ago

@samnordmann you are correct, it was a shortcoming from my side. Reliablity protocol kicks when a packet is dropped and it does not happen in normal scenarios. Previously, I tried forcing the dropped packet in the HOST version for testing purposes, but not for GPU. Since we saw a dropped packet on the ROCK system, where MTT runs, we saw this error, and here is a fix for that.

Can you please add a test, covering both Host and CUDA memtype, checking that the reliability is not boggy?

MamziB commented 1 week ago

@samnordmann OK I will open a separate PR soon

janjust commented 1 week ago

@MamziB no need for a separate PR, just push the commit here

MamziB commented 1 week ago

@janjust I have added the gtest before but it fails the github ucc tests for some unknown reasons. I rather open a new PR so that it does not affect this PR as I would like this PR to go in ASAP. I will open a PR today for that gtest changes.

janjust commented 1 week ago

I don't think this PR will go in until the test is in, open the other PR, but then this PR will be predicated on the test going in.

MamziB commented 1 week ago

I opened https://github.com/openucx/ucc/pull/1023 for testing mcast

janjust commented 1 week ago

great, thanks -now the hard part, figuring out why the other is failing

janjust commented 1 week ago

@samnordmann @Sergei-Lebedev https://github.com/openucx/ucc/pull/1023

Here is the test

samnordmann commented 1 week ago

@samnordmann @Sergei-Lebedev #1023

Here is the test

Thank you, but shouldn't we also test the reliability protocol in case of dropped packets? The present PR fixes a bug that we cannot catch nor reproduce, so how can we ensure that it is effectively fixed?

I don't want to block the PR if there are some time considerations though, so let me know what you guys think

MamziB commented 1 week ago

@samnordmann Sure, I will update the other PR to force the bcast to have dropped packets