invesdwin / invesdwin-context-integration

invesdwin-context modules that provide integration features
GNU Lesser General Public License v3.0

retest OpenMPI, jucx, infinileap and disni when SoftiWarp or SoftRoCe module is loaded #47

Closed: subes closed this issue 1 year ago

subes commented 1 year ago

https://github.com/zrlio/softiwarp

Software-based alternative to InfiniBand hardware (iWARP, reliable like TCP/SCTP)

subes commented 1 year ago

Also retry the RMA put/get tests in jucx.

subes commented 1 year ago

https://www.reflectionsofthevoid.com/2020/07/software-rdma-revisited-setting-up.html

Reflections Of The Void Software RDMA revisited setting up SoftiWARP on Ubuntu 20.04.pdf

This requires a network interface that is actually connected:

sudo modprobe siw
ifconfig # find a connected Ethernet or Wi-Fi interface; "lo" did not work
sudo rdma link add siw0 type siw netdev wlp112s0
rdma link # should list the new device
ifconfig # find the IP address of wlp112s0
rping -s -a 192.168.0.20 -v # server
rping -c -a 192.168.0.20 -v # client
sudo rdma link delete siw0 # run this during a test to verify that the interface is used; the test should crash

subes commented 1 year ago

UCX does not seem to support iWARP: https://github.com/openucx/ucx/issues/2507

They have some commits for it but say it has been untested since 2017? At least I cannot get it to work with SoftiWARP.

It also looks as if the code does not support iWARP because it only checks for InfiniBand? https://github.com/openucx/ucx/blob/eadd74f9fe5b0edc081ba1ce589fb850d6809934/src/uct/ib/base/ib_md.c

subes commented 1 year ago

An alternative is rdma_rxe (Soft-RoCE; similar to UDP, though it seems to keep packet order?): https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce (though outdated, see https://github.com/linux-rdma/rdma-core/commit/0d2ff0e1502ebc63346bc9ffd37deb3c4fd0dbc9)

sudo modprobe rdma_rxe
ifconfig # find a connected Ethernet or Wi-Fi interface; "lo" did not work
sudo rdma link add rxe0 type rxe netdev wlp112s0
rdma link # should list the new device
ifconfig # find the IP address of wlp112s0
rping -s -a 192.168.0.20 -v # server
rping -c -a 192.168.0.20 -v # client
sudo rdma link delete rxe0 # run this during a test to verify that the interface is used; the test should crash
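
The SoftiWARP and Soft-RoCE setup sequences differ only in the kernel module and the link type, so they could be wrapped in one small helper. This is only a sketch: the function names soft_rdma_cmds and soft_rdma_setup and the DRY_RUN variable are made up for illustration, the device names siw0/rxe0 follow the commands in this thread, and the "lo" check simply encodes the observation that loopback did not work here.

```shell
# Sketch only: soft_rdma_cmds / soft_rdma_setup / DRY_RUN are hypothetical names,
# not part of rdma-core or iproute2.

# Print the setup commands for a link type ("siw" or "rxe") and a netdev.
soft_rdma_cmds() {
    srd_kind="$1"
    srd_netdev="$2"
    case "$srd_kind" in
        siw) srd_module="siw" ;;       # SoftiWARP
        rxe) srd_module="rdma_rxe" ;;  # Soft-RoCE
        *) echo "unknown link type: $srd_kind (expected siw or rxe)" >&2; return 1 ;;
    esac
    # Assumption taken from this thread: "lo" did not work as netdev.
    if [ -z "$srd_netdev" ] || [ "$srd_netdev" = "lo" ]; then
        echo "need a connected netdev other than lo" >&2
        return 1
    fi
    echo "modprobe $srd_module"
    echo "rdma link add ${srd_kind}0 type $srd_kind netdev $srd_netdev"
    echo "rdma link"
}

# Run the commands with sudo, or only print them when DRY_RUN=1 is set.
soft_rdma_setup() {
    srd_cmds=$(soft_rdma_cmds "$@") || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$srd_cmds"
        return 0
    fi
    # Word splitting of $srd_cmd is intentional: each line is one command.
    printf '%s\n' "$srd_cmds" | while IFS= read -r srd_cmd; do
        sudo $srd_cmd </dev/null || exit 1
    done
}
```

Running `DRY_RUN=1 soft_rdma_setup rxe wlp112s0` only prints the three commands, which is a cheap way to check them before touching the system; without DRY_RUN the same call loads the module and adds the link. The rping server/client check and the `rdma link delete` step are left manual on purpose, since they need the IP address of the chosen interface.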

Hadronio works with Soft-RoCE. Our Jucx integration requires that the listener does not get closed (which is now the default).

This suggests that Soft-RoCE can improve the performance of normal network cards as well: https://www.reflectionsofthevoid.com/2011/08/soft-roce-alternative-to-soft-iwarp.html https://www.lanl.gov/projects/national-security-education-center/information-science-technology/_assets/docs/2010-si-docs/Team_CYAN_Implementation_and_Comparison_of_RDMA_Over_Ethernet_Presentation.pdf

subes commented 1 year ago

Soft-RoCE Checklist:

RoCE hangs might be due to the unreliability of the protocol: https://github.com/zrlio/disni/issues/37#issuecomment-458055469

SoftiWARP Checklist:

subes commented 1 year ago

finished