mtcp-stack / mtcp

mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems

mTCP netmap multicore over 40G NIC #222

Open ppnaik1890 opened 5 years ago

ppnaik1890 commented 5 years ago

Hi,

I have a host-to-host setup over 40G Intel NICs (XL710). I was able to run the epserver and epwget examples on a single core, but I am unable to do so on more than one core.

sudo ./epwget 169.254.9.84/small.txt 10000000 -f epwget.conf -N 2

The error I get is:

[netmap_init_handle:  79] Opening netmap:ens259f1-0 with j: 0 (cpu: 0)
[netmap_init_handle:  79] Opening netmap:ens259f1-1 with j: 0 (cpu: 1)
672.103553 nm_open [847] NIOCREGIF failed: Invalid argument ens259f1-0
[netmap_init_handle:  88] Unable to open netmap:ens259f1-0: Invalid argument
672.103605 nm_open [847] NIOCREGIF failed: Invalid argument ens259f1-1

Also, the dmesg says: [357006.217823] 606.558023 [ 376] netmap_ioctl_legacy Minimum supported API is 14 (requested 11)
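
For reference (assuming a standard netmap source checkout), the API version the netmap headers declare can be checked directly; the loaded kernel module reports its minimum supported API in dmesg as shown above:

grep 'define NETMAP_API' netmap/sys/net/netmap.h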

I was able to run netmap's pkt-gen with multiple cores. I have also changed the RSS hash in /netmap/LINUX/i40e-2.4.6/src/ and re-ran make.
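
In case it is useful, the RSS key and queue setup that the (modified) i40e driver actually ended up with can be inspected with ethtool, if the driver still exposes them:

ethtool -x ens259f1   # dump the RSS hash key and RX indirection table
ethtool -l ens259f1   # show how many combined RX/TX queues are enabled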

My config file is as below: epwget.txt

Please help me resolve this issue. Thanks, Priyanka

ajamshed commented 5 years ago

Hi @ppnaik1890,

This looks to be a version mismatch issue on my side. I don't have access to an XL710 NIC in my lab here. Can you do me a favor? Please copy (overwrite) netmap/sys/net/netmap.h and netmap/sys/net/netmap_user.h into the mtcp/mtcp/src/include/ directory. You will notice that the existing netmap.h in mtcp/src/include/ has NETMAP_API set to 11:

https://github.com/mtcp-stack/mtcp/blob/1ad1b1a386ad2e17b671c000d08eb1296a94be95/mtcp/src/include/netmap.h#L42
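
As a rough sketch of what I mean (paths as above, assuming the netmap tree and the mtcp repository sit side by side; rebuild mTCP and the examples afterwards, e.g. with ./configure --enable-netmap && make if that is how you built it):

cp netmap/sys/net/netmap.h      mtcp/mtcp/src/include/
cp netmap/sys/net/netmap_user.h mtcp/mtcp/src/include/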

If this patch works, please submit these file changes to the mTCP devel branch as a pull request. I will merge your changes.

ppnaik1890 commented 5 years ago

Hi @ajamshed,

Thanks for your quick response. With the changes you suggested, I could get mTCP working with 4 cores.

But now I am seeing a performance issue while running the epwget example.

On server 1: sudo ./epserver -p /home/turing_05/www -f epserver.conf -N 4

[CPU 0] ens259f1 flows:   4126, RX:   13750(pps) (err:     0),  0.01(Gbps), TX:   10344(pps),  0.01(Gbps)
[CPU 1] ens259f1 flows:   4203, RX:   13921(pps) (err:     0),  0.01(Gbps), TX:   10408(pps),  0.01(Gbps)
[CPU 2] ens259f1 flows:   4051, RX:   12097(pps) (err:     0),  0.01(Gbps), TX:    9045(pps),  0.01(Gbps)
[CPU 3] ens259f1 flows:   4178, RX:   13788(pps) (err:     0),  0.01(Gbps), TX:   10329(pps),  0.01(Gbps)
[ ALL ] ens259f1 flows:  16558, RX:   53556(pps) (err:     0),  0.04(Gbps), TX:   40126(pps),  0.04(Gbps)
[CPU 0] ens259f1 flows:    597, RX:   15095(pps) (err:     0),  0.01(Gbps), TX:    9058(pps),  0.01(Gbps)
[CPU 1] ens259f1 flows:    520, RX:   15356(pps) (err:     0),  0.01(Gbps), TX:    9322(pps),  0.01(Gbps)
[CPU 2] ens259f1 flows:    489, RX:   16572(pps) (err:     0),  0.01(Gbps), TX:   10656(pps),  0.01(Gbps)
[CPU 3] ens259f1 flows:    747, RX:   15352(pps) (err:     0),  0.01(Gbps), TX:    9347(pps),  0.01(Gbps)
[ ALL ] ens259f1 flows:   2353, RX:   62375(pps) (err:     0),  0.05(Gbps), TX:   38383(pps),  0.03(Gbps)
[CPU 0] ens259f1 flows:    643, RX:   11014(pps) (err:     0),  0.01(Gbps), TX:    5507(pps),  0.00(Gbps)
[CPU 1] ens259f1 flows:    574, RX:   11012(pps) (err:     0),  0.01(Gbps), TX:    5506(pps),  0.00(Gbps)
[CPU 2] ens259f1 flows:    529, RX:   11031(pps) (err:     0),  0.01(Gbps), TX:    5541(pps),  0.00(Gbps)
[CPU 3] ens259f1 flows:    676, RX:   10994(pps) (err:     0),  0.01(Gbps), TX:    5497(pps),  0.00(Gbps)
[ ALL ] ens259f1 flows:   2422, RX:   44051(pps) (err:     0),  0.03(Gbps), TX:   22051(pps),  0.02(Gbps)
[CPU 0] ens259f1 flows:      4, RX:     196(pps) (err:     0),  0.00(Gbps), TX:     192(pps),  0.00(Gbps)
[CPU 1] ens259f1 flows:      4, RX:     129(pps) (err:     0),  0.00(Gbps), TX:     125(pps),  0.00(Gbps)
[CPU 2] ens259f1 flows:      0, RX:      16(pps) (err:     0),  0.00(Gbps), TX:      16(pps),  0.00(Gbps)
[CPU 3] ens259f1 flows:      0, RX:     202(pps) (err:     0),  0.00(Gbps), TX:     200(pps),  0.00(Gbps)
[ ALL ] ens259f1 flows:      8, RX:     543(pps) (err:     0),  0.00(Gbps), TX:     533(pps),  0.00(Gbps)
[CPU 0] ens259f1 flows:    511, RX:   10998(pps) (err:     0),  0.01(Gbps), TX:    5499(pps),  0.00(Gbps)
[CPU 1] ens259f1 flows:    512, RX:   10996(pps) (err:     0),  0.01(Gbps), TX:    5498(pps),  0.00(Gbps)
[CPU 2] ens259f1 flows:    513, RX:   10996(pps) (err:     0),  0.01(Gbps), TX:    5498(pps),  0.00(Gbps)
[CPU 3] ens259f1 flows:    516, RX:   10990(pps) (err:     0),  0.01(Gbps), TX:    5495(pps),  0.00(Gbps)

On server 2: sudo ./epwget 169.254.9.84/small.txt 10000000 -c 22000 -f epwget.conf -N 4

[WARINING] Available # addresses (16127) is smaller than the max concurrency (16500).
Thread 2 handles 2500000 flows. connecting to 169.254.9.84:80
CPU 3: initialization finished.
CPU 0: initialization finished.
CPU 1: initialization finished.
[WARINING] Available # addresses (16127) is smaller than the max concurrency (16500).
Thread 1 handles 2500000 flows. connecting to 169.254.9.84:80
Learned new arp entry.
ARP Table:
IP addr: 169.254.9.84, dst_hwaddr: 3C:FD:FE:9E:7B:85
---------------------------------------------------------------------------------
[WARINING] Available # addresses (16127) is smaller than the max concurrency (16500).
Thread 0 handles 2500000 flows. connecting to 169.254.9.84:80
[WARINING] Available # addresses (16127) is smaller than the max concurrency (16500).
Thread 3 handles 2500000 flows. connecting to 169.254.9.84:80
Response size set to 204
[CPU 0] ens259f1 flows:   5500, RX:   18963(pps) (err:     0),  0.02(Gbps), TX:   30720(pps),  0.02(Gbps)
[CPU 1] ens259f1 flows:   5596, RX:   18027(pps) (err:     0),  0.02(Gbps), TX:   30520(pps),  0.02(Gbps)
[CPU 2] ens259f1 flows:   5893, RX:   18058(pps) (err:     0),  0.02(Gbps), TX:   30166(pps),  0.02(Gbps)
[CPU 3] ens259f1 flows:   5500, RX:   18856(pps) (err:     0),  0.02(Gbps), TX:   30841(pps),  0.02(Gbps)
[ ALL ] ens259f1 flows:  22489, RX:   73904(pps) (err:     0),  0.07(Gbps), TX:  122247(pps),  0.10(Gbps)
[CPU 0] ens259f1 flows:   5500, RX:    5478(pps) (err:     0),  0.00(Gbps), TX:   10978(pps),  0.01(Gbps)
[CPU 1] ens259f1 flows:   5596, RX:    5465(pps) (err:     0),  0.00(Gbps), TX:   10965(pps),  0.01(Gbps)
[CPU 2] ens259f1 flows:   5893, RX:    5475(pps) (err:     0),  0.00(Gbps), TX:   10975(pps),  0.01(Gbps)
[CPU 3] ens259f1 flows:   5500, RX:    5515(pps) (err:     0),  0.00(Gbps), TX:   11009(pps),  0.01(Gbps)
[ ALL ] ens259f1 flows:  22489, RX:   21933(pps) (err:     0),  0.02(Gbps), TX:   43927(pps),  0.03(Gbps)
[CPU 0] ens259f1 flows:   5500, RX:     435(pps) (err:     0),  0.00(Gbps), TX:    1191(pps),  0.00(Gbps)
[CPU 1] ens259f1 flows:   5500, RX:     749(pps) (err:     0),  0.00(Gbps), TX:     847(pps),  0.00(Gbps)

I have not run the affinity.py script for IRQ pinning. Could you let me know if the issue is because of that, or do I need some other configuration changes too? Also, can you please help me with affinity.py for the i40e (40G NIC)? Thanks again for all your help.

ajamshed commented 5 years ago

@ppnaik1890,

I don't think this is an IRQ affinitization issue (although the performance of your experiment may go up slightly if you correctly bind IRQ numbers to cores). What is the file size of small.txt? You may be hitting a PCIe lane bottleneck (if your file size, and hence average packet size, is ~64 B), or your config files (epserver.conf, epwget.conf) may need some tuning (hint: [WARINING] Available # addresses (16127) is smaller than the max concurrency (16500).).
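
As a rough illustration only (the key names below follow the sample configs shipped with mTCP, and as far as I remember these limits are per mTCP thread; the right values depend on your memory budget), the knobs I have in mind in epserver.conf / epwget.conf look like:

max_concurrency = 10000   # upper bound on concurrent flows per thread
max_num_buffers = 10000   # socket buffer pool per thread; keep it >= max_concurrency
rcvbuf = 8192             # per-connection receive buffer size (bytes)
sndbuf = 8192             # per-connection send buffer size (bytes)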

Also, make sure that the NIC is placed in the first NUMA node (since you are using CPUs 0-3). To understand affinity.py better, please see this link: https://null.53bits.co.uk/index.php?page=numa-and-queue-affinity
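
A quick way to check the NUMA placement and the current irq-to-core mapping (ens259f1 taken from your logs; <irq> below is a placeholder for the numbers you see in /proc/interrupts):

cat /sys/class/net/ens259f1/device/numa_node   # should print 0 if the NIC is on the first NUMA node
grep ens259f1 /proc/interrupts                 # list the NIC's per-queue irq numbers
cat /proc/irq/<irq>/smp_affinity               # CPU mask for one irq; echo a new hex mask here to re-pin it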

ppnaik1890 commented 5 years ago

Hi @ajamshed,

I feel that RSS is not working properly, but I could not find the reason for it. The following are the steps we followed to tune/improve performance, which did not succeed:

Observations:

eunyoung14 commented 5 years ago

Hi @ppnaik1890,

[WARINING] Available # addresses (16127) is smaller than the max concurrency (16500). is saying that the number of available source IP and port pairs is lower than the number of connections you are trying to create concurrently. This may have led mTCP to unexpected behavior. I recommend reducing the concurrency of the client. You can still test larger server-side concurrency by increasing the number of clients. Client-side mTCP does not use the entire 65535-port space, in order to support symmetric RSS.
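
For example (only adjusting the -c value from the command you posted; judging by the warning, each of the 4 client threads currently asks for 16500 concurrent connections but has only 16127 usable address/port pairs), something like:

sudo ./epwget 169.254.9.84/small.txt 10000000 -c 16000 -f epwget.conf -N 4

or keep the higher aggregate concurrency by adding more client machines (or more source IPs per client).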