Open MamziB opened 1 week ago
@samnordmann you are correct, it was a shortcoming from my side. Reliablity protocol kicks when a packet is dropped and it does not happen in normal scenarios. Previously, I tried forcing the dropped packet in the HOST version for testing purposes, but not for GPU. Since we saw a dropped packet on the ROCK system, where MTT runs, we saw this error, and here is a fix for that.
@samnordmann you are correct, it was a shortcoming from my side. Reliablity protocol kicks when a packet is dropped and it does not happen in normal scenarios. Previously, I tried forcing the dropped packet in the HOST version for testing purposes, but not for GPU. Since we saw a dropped packet on the ROCK system, where MTT runs, we saw this error, and here is a fix for that.
Can you please add a test, covering both Host and CUDA memtype, checking that the reliability is not boggy?
@samnordmann OK I will open a separate PR soon
@MamziB no need for a separate PR, just push the commit here
@janjust I have added the gtest before but it fails the github ucc tests for some unknown reasons. I rather open a new PR so that it does not affect this PR as I would like this PR to go in ASAP. I will open a PR today for that gtest changes.
I don't think this PR will go in until the test is in, open the other PR, but then this PR will be predicated on the test going in.
I opened https://github.com/openucx/ucc/pull/1023 for testing mcast
great, thanks -now the hard part, figuring out why the other is failing
@samnordmann @Sergei-Lebedev https://github.com/openucx/ucc/pull/1023
Here is the test
@samnordmann @Sergei-Lebedev #1023
Here is the test
Thank you, but shouldn't we also test the reliability protocol in case of dropped packets? The present PR fixes a bug that we cannot catch nor reproduce, so how can we ensure that it is effectively fixed?
I don't want to block the PR if there are some time considerations though, so let me know what you guys think
@samnordmann Sure, I will update the other PR to force the bcast to have dropped packets
TL/MLX5: fix memtype in bcast reliability