linux-nfs / nfsd

Linux kernel source tree

Better support for NFS/RDMA with mismatched interface speeds #42

Closed chucklever closed 2 months ago

chucklever commented 7 months ago

NFS is frequently deployed on fabrics where the NFS server's NICs are far more capable than the client NICs. This is less true on RDMA fabrics, where most NICs are fast and have similar capabilities.

However, with the advent of both RoCE and software-emulated RDMA, it's possible for server NICs to far outclass client NICs. In these situations, the server can easily overrun a client when sending RDMA Reads or Writes, and the ULP (NFS in this case) has little control over the network layer to prevent it. RDMA is not known for handling lossy network environments well.

chucklever commented 7 months ago

One suggestion for managing this situation is that NFSD should respect the client's advertised IRD (Incoming Read Depth). Currently, no Linux kernel ULP pays attention to the IRD/ORD limits. Rather, they all depend on the RDMA layer and NICs to manage RDMA Read queuing.
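To make the suggestion concrete, here is a minimal user-space sketch of capping outstanding RDMA Reads at the smaller of the client's advertised IRD and the server's own limit. The structure and function names are hypothetical and do not correspond to anything in svcrdma; the sketch only illustrates the accounting, under the assumption that the peer's IRD is known once the connection is established.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; not an actual kernel structure. */
struct read_limit {
	uint32_t peer_ird;      /* IRD the client advertised at connect time */
	uint32_t local_max_ord; /* most Reads the local device can have outstanding */
	uint32_t in_flight;     /* Reads posted but not yet completed */
};

/* The usable depth is the smaller of the two limits. */
static uint32_t effective_read_depth(const struct read_limit *rl)
{
	return rl->peer_ird < rl->local_max_ord ? rl->peer_ird
						: rl->local_max_ord;
}

/* Returns true and accounts for the Read if there is room to post it. */
static bool try_start_read(struct read_limit *rl)
{
	if (rl->in_flight >= effective_read_depth(rl))
		return false; /* caller should queue the Read and retry later */
	rl->in_flight++;
	return true;
}

/* Called when a Read completes, freeing one slot. */
static void read_completed(struct read_limit *rl)
{
	assert(rl->in_flight > 0);
	rl->in_flight--;
}
```

With `peer_ird = 2` and `local_max_ord = 16`, a third call to `try_start_read()` is refused until `read_completed()` frees a slot.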

chucklever commented 7 months ago

Another thought was that the Linux NFS/RDMA client should allow its NIC to retransmit more frequently so that the workflow can detect and recover from lost RDMA operations more quickly.
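For a rough sense of scale, assuming "retransmit more frequently" means lowering the queue pair's local ACK timeout: InfiniBand and RoCE encode that timeout as an exponent n, giving a nominal interval of 4.096 µs × 2^n, and the retry count is limited to the range 0 to 7. The helper names below are hypothetical, and the detection bound is only a nominal estimate (one timeout for the original attempt plus one per retry); real hardware may round or stretch these intervals.

```c
#include <assert.h>
#include <stdint.h>

/* Nominal local ACK timeout in nanoseconds for exponent n (1..31). */
static uint64_t ack_timeout_ns(unsigned int n)
{
	assert(n >= 1 && n <= 31); /* 0 conventionally means "no timeout" */
	return (uint64_t)4096 << n; /* 4.096 us * 2^n */
}

/*
 * Rough upper bound on the time to declare an operation lost:
 * the original attempt plus retry_cnt retries, each waiting one timeout.
 */
static uint64_t detect_bound_ns(unsigned int n, unsigned int retry_cnt)
{
	assert(retry_cnt <= 7);
	return ack_timeout_ns(n) * (retry_cnt + 1);
}
```

Under these assumptions, n = 14 gives about 67 ms per attempt and roughly 0.54 s before giving up with 7 retries, while n = 18 gives about 1.07 s per attempt and roughly 8.6 s overall.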

chucklever commented 6 months ago

I've set up an NFS/RDMA client system with a FastLinQ RoCE card on a 25Gbps port, and an NFS/RDMA server system with a CX-5 Ethernet card on a 100Gbps port. The usual workloads seem to run comfortably on this rig without loss of connection. So basic functionality works, but I have not been able to reproduce significant losses.

chucklever commented 2 months ago

The original issue was reported as long delays when handling server failover / reboot scenarios on RoCE. This issue can be explained by the improper retention of stale ARP cache entries on the clients. There does not appear to be an ARP cache flush after connection errors (as usually happens after a TCP connection failure, say). This is being explored separately.

chucklever commented 2 months ago

I've discussed the use of explicit ORD/IRD checking with a few upstream RDMA experts, who agree that better performance will be achieved by leaving that to drivers and devices.