shenango / caladan

Interference-aware CPU scheduling that enables performance isolation and high CPU utilization for datacenter servers
Apache License 2.0

Can Caladan run on CloudLab m510 machines? #1

Closed yilongli closed 3 years ago

yilongli commented 3 years ago

Hi Caladan developers,

I am running into some issues when trying to build and run Caladan on CloudLab m510 machines.

  1. README.md says one should use make submodules to build the submodules. I think the correct command is actually build/init_submodules, right?

  2. build/shared.mk didn't set MLX4_INC and MLX4_LIBS like it did for mlx5, so I added the following lines:

+# mlx4 build
+MLX4_INC = -I$(ROOT_PATH)/rdma-core/build/include
+MLX4_LIBS = -L$(ROOT_PATH)/rdma-core/build/lib/statics/
+MLX4_LIBS += -lmlx4 -libverbs -lnl-3 -lnl-route-3

Similarly, Makefile didn't use MLX4_[INC,LIBS], so I changed that too.

  3. Typo in build/config. CONFIG_MLX4 is for ConnectX-3 support, not ConnectX-4.

  4. After fixing the problems above, I can build iokerneld on m510 successfully. However, DPDK fails to initialize the port at startup. Here is the error message:

    yilongl@rc01:/shome/caladan$ sudo ./iokerneld
    CPU 09| <5> cpu: detected 16 cores, 1 nodes
    CPU 09| <5> time: detected 1995 ticks / us
    [  0.001901] CPU 09| <5> sched: CPU configuration...
            node 0: [0,8][1,9][2,10][3,11][4,12][5,13][6,14][7,15]
    [  0.001944] CPU 09| <5> sched: dataplane on 8, control on 0
    IBRS and IBPB supported  : yes
    STIBP supported          : yes
    Spec arch caps supported : no
    IBRS enabled in the kernel   : no
    STIBP enabled in the kernel  : no
    Socket 0: 1 memory controllers detected with total number of 4 channels. 0 QPI ports detected. 0 M2M (mesh to memory) blocks detected. 1 Home Agents detected. 0 M3UPI blocks detected.
    [  0.101894] CPU 00| <5> control: spawning control thread
    EAL: Detected 16 lcore(s)
    EAL: Detected 1 NUMA nodes
    EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
    EAL: Selected IOVA mode 'PA'
    EAL: No available hugepages reported in hugepages-1048576kB
    EAL: Probing VFIO support...
    EAL: VFIO support initialized
    EAL: PCI device 0000:09:00.0 on NUMA socket 0
    EAL:   probe driver: 15b3:1007 net_mlx4
    net_mlx4: 0x561d86587300: cannot attach flow rules (code 95, "Operation not supported"), flow error type 2, cause 0x1073cd200, message: flow rule rejected by device

CloudLab m510 machines are equipped with ConnectX-3 Pro NICs, so I assume Caladan should be able to run on them. Any help would be greatly appreciated.

joshuafried commented 3 years ago

Hi Yilong,

Sorry to hear you are having issues. I am able to run on a CloudLab m510 machine, so we should be able to resolve this.

Regarding (1), make submodules is correct; that make target invokes build/init_submodules.sh. See https://github.com/shenango/caladan/blob/main/Makefile#L111.

(2) The runtime only supports ConnectX-5 for direct polling, so there is no need to link it with MLX4. For other NICs, the IOKernel (which uses DPDK) forwards packets to the runtimes.

(3) Thank you, we will fix the typo.

(4) Are you using the Mellanox OFED drivers for the ConnectX-3 (see https://doc.dpdk.org/guides-19.11/nics/mlx4.html)? We tested it using versions 4.6 and 5.0; both worked for us. The instructions mention a few configuration parameters related to flow steering; making sure those are set correctly may resolve the error. Separately, you will likely want to change this line from 0 to 1 to make sure that DPDK uses the experimental link on the m510: https://github.com/shenango/caladan/blob/main/iokernel/dpdk.c#L234.
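
As a rough aid for checking the flow steering settings mentioned in (4), here is a minimal Python sketch that reports how the mlx4_core kernel module is configured. It rests on assumptions that are not stated in this thread: that the relevant parameter is the mlx4_core module option log_num_mgm_entry_size described in the DPDK mlx4 guide, that module options live in *.conf files under /etc/modprobe.d, and that the loaded module exposes the value under /sys/module/mlx4_core/parameters. The helper names are hypothetical, and the script only prints what it finds; take the recommended value from the guide linked above.

```python
from pathlib import Path

# Assumption: parameter name taken from the DPDK mlx4 guide, not from this thread.
PARAM = "log_num_mgm_entry_size"
# Assumption: the loaded module exposes the parameter read-only at this sysfs path.
SYSFS_PARAM = Path("/sys/module/mlx4_core/parameters") / PARAM


def parse_modprobe_options(text, module="mlx4_core"):
    """Return {name: value} for every `options <module> name=value ...` line."""
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "options" and parts[1] == module:
            for token in parts[2:]:
                name, sep, value = token.partition("=")
                if sep:  # ignore tokens that are not name=value pairs
                    found[name] = value
    return found


def configured_value(conf_dir="/etc/modprobe.d"):
    """Scan *.conf files in sorted order; a later file overrides an earlier one."""
    value = None
    for path in sorted(Path(conf_dir).glob("*.conf")):
        opts = parse_modprobe_options(path.read_text(errors="replace"))
        value = opts.get(PARAM, value)
    return value


def live_value():
    """Value used by the currently loaded module, or None if it cannot be read."""
    try:
        return SYSFS_PARAM.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("configured in modprobe.d:", configured_value())
    print("live in loaded module:   ", live_value())
```

If the two values differ, the module was probably loaded before the configuration file was edited, so reloading the driver or rebooting would be needed for the new setting to apply.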

yilongli commented 3 years ago

Hi Josh,

I see. So the problem is that I didn't install Mellanox OFED. Somehow I was under the impression that, since Caladan uses a patched rdma-core anyway, there might be no need to install Mellanox OFED. I was also able to run Caladan on CloudLab xl170 without installing Mellanox OFED. When I tried m510 without Mellanox OFED, I got compilation errors complaining about missing headers that belong to the rdma-core package, which is why I tried to fix the makefiles in (2). I have switched to Mellanox OFED. Thank you.

andreybleme commented 3 months ago

It seems support for MLX4 has been dropped. If that's the case, we are not able to use m510 machines anymore. Is that correct, @joshuafried? I created a new issue to discuss this further: https://github.com/shenango/caladan/issues/20