namjaejeon / ksmbd

ksmbd kernel server(SMB/CIFS server)
https://github.com/cifsd-team/ksmbd

SMB Direct/RDMA: ksmbd failed to report InfiniBand interface as RDMA capable #456

Closed chaserhkj closed 1 year ago

chaserhkj commented 1 year ago

Conducted the test on Arch Linux with kernel 6.5.6-zen2-1-zen and the master branch of ksmbd from this repo built as an externally compiled module. The client is Windows 10 Pro N for Workstations.

When the client initializes the connection to the server, this is captured over the IPoIB interface: smb.pcapng.gz

Inside the IOCTL response, the IPoIB interface with IP 192.168.109.1 is reported as RDMA-incapable:

Network Interface, RSS, 100.0 GBits/s, IPv4: 192.168.109.1
    Next Offset: 0x00000098
    Interface Index: 7
    Interface Cababilities: 0x00000001, RSS
        .... .... .... .... .... .... .... ..0. = RDMA: This interface does not support RDMA
        .... .... .... .... .... .... .... ...1 = RSS: This interface supports RSS
    RSS Queue Count: 0
    Link Speed: 100000000000, 100.0 GBits/s
    Socket Address, IPv4: 192.168.109.1
        Socket Family: 2
        Socket Port: 0
        Socket IPv4: 192.168.109.1

From the Windows client we also get:

PS C:\Windows\system32> Get-SmbMultichannelConnection

Server Name   Selected Client IP     Server IP     Client Interface Index Server Interface Index Client RSS Capable Client RDMA Capable
-----------   -------- ---------     ---------     ---------------------- ---------------------- ------------------ -------------------
192.168.109.1 True     192.168.109.2 192.168.109.1 12                     7                      True               False

The network interface is a Mellanox ConnectX-4 VPI single-port 100Gb card (MCX455A-ECAT), with its only port configured in IB mode.

Strangely, looking at the ksmbd debug logs, we only get `ksmbd: smb_direct: init RDMA listener. cm_id=0000000071691f94`; the debug message at this line is nowhere to be found. This could be why the interface is not detected as RDMA-capable later.

chaserhkj commented 1 year ago

Upon closer examination and debugging, it seems that on my setup my IPoIB interface ibp2s0 has its ib_device->ops.get_netdev set to NULL (related to this line). Calling ib_device_get_by_netdev on its corresponding net_device returns NULL as well (related to this line). It seems that the two structures are unconnected in my kernel.

I am not sure if this is a misconfiguration on my side or a bug in ib_ipoib upstream. Any help or hints are appreciated.

chaserhkj commented 1 year ago

Also, I just validated that my RDMA connection works by running rping from each end with no problems.
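For reference, the rping check from rdma-core can be run like this (hedged sketch; it requires working InfiniBand hardware on both ends, and the addresses below are the ones from this setup):

```shell
# On the server side (192.168.109.1): listen for RDMA ping-pong
rping -s -a 192.168.109.1 -v

# On the client side: connect, run 10 iterations, print the payloads
rping -c -a 192.168.109.1 -C 10 -v
```

A successful run exercises the full rdma_cm connection path, which is why it is a good independent sanity check even when the ksmbd flags are wrong.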

namjaejeon commented 1 year ago

It looks like you probably configured SMB Direct incorrectly on the client side. Analyzing the SMB2 IOCTL request from the client, the client sent information to ksmbd indicating that its device only supports RSS (not RDMA).

namjaejeon commented 1 year ago

My mistake. As you told me, ksmbd_rdma_capable_netdev doesn't set the rdma_capable flag to true. Can you tell me your RDMA NIC device?

namjaejeon commented 1 year ago

If you force it to return rdma_capable as true, does smb-direct work?

chaserhkj commented 1 year ago

I am using a pair of MCX455A-ECAT cards in IB mode, so I don't have an Ethernet stack on either end; the IP stack runs atop the IPoIB protocol.

Unfortunately, I am traveling right now and only have remote access to my Linux system, not my Windows client, so I cannot perform a full test. I can only do that later this week when I am back.

However, I did have the chance to look a bit deeper into the ksmbd and kernel RDMA stack code, and I think the way ksmbd accesses interface info cannot handle IPoIB interfaces. Currently, in ksmbd_rdma_capable_netdev we associate the ib_device and net_device structs via two APIs: ib_device_get_by_netdev and ib_device->ops.get_netdev. However, the kernel documentation and comments (here and here) state clearly that these APIs return a net_device that is "backing the ib_device". If I understand this wording correctly, it means the net_device is the underlying transport for the ib_device, as in a RoCE scenario, where the ib_device is backed by the Ethernet net_device as the underlying transport.

For my use case, it's the other way around: the physical transport runs IB, and the IPoIB net_device is backed by the underlying ib_device transport. Consequently, all the APIs that ksmbd is calling return NULL.

On the other hand, this issue should in theory only affect the flags that ksmbd sends out when the interface is an IPoIB interface, because ksmbd only uses these APIs for building the flags. When actually performing RDMA transport, ksmbd calls into the rdma_cm module, which handles IPoIB just fine. As a result, this only affects Windows clients on IPoIB, because Windows clients negotiate RDMA via the flags. With a Linux CIFS client, setting -o rdma forces RDMA usage and the flags are simply ignored.
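A Linux client can therefore be used to confirm the transport works even before the flags are fixed (hedged sketch; share name, mount point, and credentials are placeholders, and the client kernel needs CONFIG_CIFS_SMB_DIRECT):

```shell
# -o rdma forces the SMB Direct transport regardless of the
# RDMA-capable flag the server advertised in the IOCTL response
sudo mount -t cifs //192.168.109.1/share /mnt/share \
    -o rdma,vers=3.1.1,username=user
```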

There are two possible ways to fix this. One is to simply return true in ksmbd_rdma_capable_netdev whenever the net_device has netdev->type == ARPHRD_INFINIBAND. This assumes that all native IB interfaces are RDMA-capable, and I am not sure that is a safe assumption. The other, possibly safer, way is to emulate the behavior of rdma_cm (this function) by matching the hardware address on the net_device against the GUIDs of the ib_device ports.

I will test if SMB Direct indeed works after the flags are fixed and maybe formulate a PR as well later this week.

namjaejeon commented 1 year ago

Sounds good. Please reorganize ksmbd_rdma_capable_netdev() your way.

chaserhkj commented 1 year ago

After implementing the GUID-based matching of IPoIB devices, the RDMA-capable flags are fixed on my setup. I went ahead and did a full test. The Windows client is using the RDMA protocol properly, showing monitorable RDMA traffic on both the client and the server side. It also shows much lower CPU usage than non-RDMA transfers.

chaserhkj commented 1 year ago

The PR is submitted: #457

namjaejeon commented 1 year ago

Cool! I will apply this patch after checking it with other NICs (my Mellanox and Chelsio).

chaserhkj commented 1 year ago

The patch for this has been merged in this repo and is in the pipeline to be merged into Linux mainline. Closing this issue.

Reference on the mailing list.

namjaejeon commented 1 year ago

@chaserhkj Thanks for your patch!