Open lq470379 opened 2 years ago
@OasisArtisan @AmedeoSapio Also, I just noticed that if I ping between the workers and use tcpdump at the destination, the packets are arriving but with all zero values:
The ICMP packets leaving the sending worker, on the other hand, look fine.
Does this mean they were malformed by the switch?
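One way to narrow this down would be to check the captured ICMP bytes offline. The following is a minimal sketch in Python, not part of SwitchML: the helper names (`internet_checksum`, `build_echo_request`, `inspect_icmp`) are hypothetical, and it assumes the raw ICMP message (header plus payload, without the IP header) has already been extracted from a capture, for example from a hex dump printed by `tcpdump -X`. It verifies the ICMP checksum and reports whether the payload is all zeros.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement checksum over `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Build a reference ICMP echo request (type 8, code 0) for comparison."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


def inspect_icmp(icmp: bytes) -> dict:
    """Report checksum validity and whether the payload is entirely zero.

    `icmp` is assumed to be the ICMP message only (8-byte header + payload).
    """
    payload = icmp[8:]
    return {
        # A message with a valid checksum sums to zero when re-checksummed.
        "checksum_ok": internet_checksum(icmp) == 0,
        "payload_all_zero": len(payload) > 0 and not any(payload),
    }


if __name__ == "__main__":
    good = build_echo_request(0x1234, 1, b"abcdefgh")
    zeroed = good[:8] + bytes(8)  # same header, payload overwritten with zeros
    print(inspect_icmp(good))
    print(inspect_icmp(zeroed))
```

If a packet captured at the destination has a zero payload and a failing checksum while the same packet captured at the source passes, that would suggest the data was altered somewhere after the sender computed the checksum. It does not by itself show where on the path that happened.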
@OasisArtisan @AmedeoSapio I think this issue is caused by something simple, like a misconfiguration, but unfortunately I have not found a solution.
I have set up the SwitchML P4 app and controller successfully and compiled the client library for the RDMA backend, but I'm having an issue similar to #8: no communication happens after worker setup. I have two workers with ICRC disabled and have tried this with both the hello_world example and the allreduce benchmark. Here is the output of
GLOG_logtostderr=1 GLOG_v=2 ./allreduce_benchmark
for one of the workers (the other worker has similar output, and the behavior is the same for hello_world as well):
After no progress, when I exit the process I get:
Here is the output on the controller side:
And on the switch side I get:
My environment is:
Switch: Wedge BF100-32x
SDE: 9.9.0
Python: 3.8
NICs: ConnectX-5
My ports.yaml has:
And finally my config file is here:
switchml.cfg.txt
My guess is that the switch data plane, as an endpoint, is unreachable for some reason (but only one packet does not time out, so I'm not sure). Is there a way to ensure connectivity between
Thank you!