tbarbette / fastclick

FastClick - A faster version of the Click Modular Router featuring batching, advanced multi-processing and improved Netmap and DPDK support (ANCS'15). Check the metron branch for Metron specificities (NSDI'18). PacketMill modifications (ASPLOS'21) as well as MiddleClick (ToN, 2021) are merged in main.

Metron branch is broken #243

Open gkatsikas opened 4 years ago

gkatsikas commented 4 years ago

Even the simplest FastClick app is broken in the Metron branch. Issues occur with conf/metron/metron-dispatcher-flow.click when launching secondary processes.

sudo gdb --args bin/click --dpdk -w 0000:03:00.0 -- conf/dpdk/dpdk-bounce.click

GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bin/click...done.
(gdb) r

Starting program: /home/katsikas/nfv/projects/fastclick/bin/click --dpdk -w 0000:03:00.0 -- conf/dpdk/dpdk-bounce.click
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
EAL: Detected 16 lcore(s)
EAL: Detected 2 NUMA nodes
[New Thread 0x7ffff0199700 (LWP 10006)]
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
[New Thread 0x7fffef998700 (LWP 10007)]
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
[New Thread 0x7fffef197700 (LWP 10008)]
[New Thread 0x7fffee996700 (LWP 10009)]
[New Thread 0x7fffee195700 (LWP 10010)]
[New Thread 0x7fffed994700 (LWP 10011)]
[New Thread 0x7fffed193700 (LWP 10012)]
[New Thread 0x7fffec992700 (LWP 10013)]
[New Thread 0x7fffec191700 (LWP 10014)]
[New Thread 0x7fffeb990700 (LWP 10015)]
[New Thread 0x7fffeb18f700 (LWP 10016)]
[New Thread 0x7fffea98e700 (LWP 10017)]
[New Thread 0x7fffea18d700 (LWP 10018)]
[New Thread 0x7fffe998c700 (LWP 10019)]
[New Thread 0x7fffe918b700 (LWP 10020)]
[New Thread 0x7fffe898a700 (LWP 10021)]
[New Thread 0x7fffe8189700 (LWP 10022)]
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 15b3:1017 net_mlx5
Initializing flow parser...
Initializing DPDK
Ingress traffic on port 0 is not restricted anymore to the defined flow rules
deleted virtual method called
terminate called without an active exception

Thread 1 "click" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff5ced801 in __GI_abort () at abort.c:79
#2 0x00007ffff66e0957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff66e6ae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff66e6b21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff66e791f in __cxa_deleted_virtual () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00005555560aa48a in Router::initialize (this=, errh=0x555556de0f90) at ../lib/router.cc:1451
#7 0x00005555560444db in parse_configuration (text=..., text_is_expr=, hotswap=, errh=0x555556de0f90) at click.cc:404
#8 0x00005555556fb58e in main (argc=, argv=) at click.cc:739

tbarbette commented 4 years ago

Which binutils? Which GCC? NSLab racks? I would bet on binutils, those developers are cowboys ^^

gkatsikas commented 4 years ago

nslrack06-07-08 (Ubuntu 18.04.4, kernel 4.15.0-91-generic)
gcc: 7.5
binutils: 2.30
DPDK: 20.02 (also failed with older versions)

Is there a known issue with binutils? I will try on other racks with different versions.

tbarbette commented 4 years ago

Yes, with 2.30 the code will crash with Xeon Skylake and higher, because they did an incorrect AVX512 optimization. With 2.34 we observed a problem similar to this one where some code was optimized out but actually still called... Reverting to 2.32 worked in that case :p

But then maybe it's different here. Will take a look.

gkatsikas commented 4 years ago

Racks 06-08 are Haswell-based; the problem probably holds for this architecture too.

gkatsikas commented 4 years ago

Also, I noticed something that is not related to the bug but caught my attention. The configuration flag --enable-cpu-load is not recognized by configure.in anymore. Did you change anything that slipped my attention, or is it a problematic merge that we should roll back?

gkatsikas commented 4 years ago

Interestingly, on rack14 (Skylake) the dpdk-bounce works fine, but metron is still problematic when spawning secondary processes (try_slave() method):

EAL: PCI device 0000:17:00.0 on NUMA socket 0
EAL: probe driver: 15b3:1017 net_mlx5
Device 0000:17:00.0 is not driven by the primary process
net_mlx5: can not attach rte ethdev
net_mlx5: probe of PCI device 0000:17:00.0 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:17:00.0 cannot be used
Continuing initialization...
Successful initialization!

tbarbette commented 4 years ago

--enable-cpu-load suffered a bad merge for sure.

I'm finishing something and then will look at it.

tbarbette commented 4 years ago

For me it works. Maybe you should clean and then recompile both DPDK and Click, on the same machine?

gkatsikas commented 4 years ago

Did you also try Metron with a Mellanox NIC? Which machine did you use?

tbarbette commented 4 years ago

I just tried launching it on rack 05 (with a Mellanox NIC, yes) and did not get the messages you had.

gkatsikas commented 4 years ago

Problem found:

When passing the following configuration to the Metron element: SLAVE_DPDK_ARGS "-w0000:03:00.0", one should be careful to omit any space between -w and the PCI ID of the NIC (i.e., 0000:03:00.0).
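
To illustrate why the space matters, here is a minimal Python sketch. It assumes (this is not confirmed in the thread) that the whole SLAVE_DPDK_ARGS string reaches the secondary process as a single argv entry and that the option is then parsed getopt-style; Python's getopt module stands in for the C parser, and the regular expression is only a hypothetical strict PCI address check.

```python
import getopt
import re

# Hypothetical strict check for a PCI address of the form dddd:bb:dd.f
PCI_ADDR = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def parse_whitelist(argv):
    """Return the values that getopt extracts for -w from a list of argv entries."""
    opts, _ = getopt.getopt(argv, "w:")
    return [value for flag, value in opts if flag == "-w"]


def looks_like_pci(value):
    """True only if the whole string is a PCI address (no stray whitespace)."""
    return PCI_ADDR.fullmatch(value) is not None


# From a shell, "-w 0000:03:00.0" is split into two argv entries: fine.
print(parse_whitelist(["-w", "0000:03:00.0"]))   # ['0000:03:00.0']

# As ONE argv entry, the space becomes part of the option value.
print(parse_whitelist(["-w 0000:03:00.0"]))      # [' 0000:03:00.0']

# Without the space, the value is clean again.
print(parse_whitelist(["-w0000:03:00.0"]))       # ['0000:03:00.0']

print(looks_like_pci(parse_whitelist(["-w 0000:03:00.0"])[0]))  # False
print(looks_like_pci(parse_whitelist(["-w0000:03:00.0"])[0]))   # True
```

Under these assumptions, the value with a leading space no longer looks like a PCI address, which would be consistent with the device not being attached in the secondary process.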

gkatsikas commented 4 years ago

RSS and VMDq-based service chain deployments crash in run_service_chain() method (Child part, just before or during DPDK initialization). See the output below (RSS-based deployment):

Writing configuration:
elementclass MetronSlave {
input[0] -> MarkIPHeader(OFFSET 14) -> filter0 :: IPFilter(allow ((ip ttl >= 2 && ip ttl <= 255)), deny all);
filter0 -> IPRewriter(pattern 10.0.0.4 1000-65535 - - 0 0) -> DecIPTTL() -> EtherRewrite(SRC 50:6B:4B:43:88:CA, DST 50:6B:4B:43:8A:DA) -> [0]output;
filter0[1] -> Discard;
};

slave :: MetronSlave();

slaveFD0C0 :: FromDPDKDevice(0, QUEUE 0, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 1);
StaticThreadSched(slaveFD0C0 0);
slaveFD0C0 -> [0]slave;
slaveFD0C1 :: FromDPDKDevice(0, QUEUE 1, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C1 1);
slaveFD0C1 -> [0]slave;
slaveFD0C2 :: FromDPDKDevice(0, QUEUE 2, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C2 2);
slaveFD0C2 -> [0]slave;
slaveFD0C3 :: FromDPDKDevice(0, QUEUE 3, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C3 3);
slaveFD0C3 -> [0]slave;
slaveFD0C4 :: FromDPDKDevice(0, QUEUE 4, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C4 4);
slaveFD0C4 -> [0]slave;
slaveFD0C5 :: FromDPDKDevice(0, QUEUE 5, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C5 5);
slaveFD0C5 -> [0]slave;
slaveFD0C6 :: FromDPDKDevice(0, QUEUE 6, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C6 6);
slaveFD0C6 -> [0]slave;
slaveFD0C7 :: FromDPDKDevice(0, QUEUE 7, N_QUEUES 1, MAXTHREADS 1, BURST 32, NUMA false, VERBOSE 99, ACTIVE 0);
StaticThreadSched(slaveFD0C7 7);
slaveFD0C7 -> [0]slave;

slaveTD0 :: ExactCPUSwitch();
slaveTD0C0 :: ToDPDKDevice(0, QUEUE 0, VERBOSE 99, MAXQUEUES 1);
slaveTD0[0] -> slaveTD0C0;
slaveTD0C1 :: ToDPDKDevice(0, QUEUE 1, VERBOSE 99, MAXQUEUES 1);
slaveTD0[1] -> slaveTD0C1;
slaveTD0C2 :: ToDPDKDevice(0, QUEUE 2, VERBOSE 99, MAXQUEUES 1);
slaveTD0[2] -> slaveTD0C2;
slaveTD0C3 :: ToDPDKDevice(0, QUEUE 3, VERBOSE 99, MAXQUEUES 1);
slaveTD0[3] -> slaveTD0C3;
slaveTD0C4 :: ToDPDKDevice(0, QUEUE 4, VERBOSE 99, MAXQUEUES 1);
slaveTD0[4] -> slaveTD0C4;
slaveTD0C5 :: ToDPDKDevice(0, QUEUE 5, VERBOSE 99, MAXQUEUES 1);
slaveTD0[5] -> slaveTD0C5;
slaveTD0C6 :: ToDPDKDevice(0, QUEUE 6, VERBOSE 99, MAXQUEUES 1);
slaveTD0[6] -> slaveTD0C6;
slaveTD0C7 :: ToDPDKDevice(0, QUEUE 7, VERBOSE 99, MAXQUEUES 1);
slaveTD0[7] -> slaveTD0C7;
slave[0] -> slaveTD0;

Initializing flow parser...

:2: While configuring ‘slave/filter0 :: IPFilter’: pattern 0: warning: relation ‘<= 255’ is always true (range 0-255)
FromDPDKDevice : remove StaticThreadSched to use FastClick's auto-thread assignment
slaveFD0C1: using queues from 1 to 1
slaveFD0C1: Queue 1 handled by th 1
click: ../include/click/vector.hh:291: T& Vector::operator[](Vector::size_type) [with T = QueueDevice::QueueInfo; long unsigned int ALIGNMENT = 64; Vector::size_type = int]: Assertion `(unsigned) i < (unsigned) vm_.n_' failed.
Could not read from control socket: Error 0
Could not launch service chain...
Cannot instantiate service chain with ID e82807f5-b89e-438a-b22d-583448a1542c

tbarbette commented 4 years ago

Could you run it under gdb, compiled with "-O1 -g"? As it's the slave, you can prefix its command with 'gdb -ex run -ex "signal 2" -ex bt -batch -args' so that it starts without any input and shows the stack trace upon failure.
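
As a rough sketch of that prefixing idea, the Python snippet below builds the wrapped command line. The gdb options are the ones quoted above; the helper name with_gdb is hypothetical, and the command being wrapped is only a stand-in (the dpdk-bounce command from the top of this issue), since the actual slave command line is not shown in this thread.

```python
import shlex

# gdb prefix as suggested above: run the program, send signal 2, print a backtrace
# and exit (batch mode). Everything after -args is the program and its arguments.
GDB_PREFIX = ["gdb", "-ex", "run", "-ex", "signal 2", "-ex", "bt", "-batch", "-args"]


def with_gdb(command):
    """Return the given command (a list of argv entries) prefixed with gdb."""
    return GDB_PREFIX + list(command)


# Stand-in command; the real slave command line is an assumption here.
slave_cmd = ["bin/click", "--dpdk", "-w", "0000:03:00.0", "--",
             "conf/dpdk/dpdk-bounce.click"]

# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(with_gdb(slave_cmd)))
```

Keeping the command as a list avoids quoting mistakes around "signal 2", which must stay a single argument to -ex.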

tbarbette commented 4 years ago

Is this fixed?

gkatsikas commented 4 years ago

I could not get the stack trace of the slave, so I gave up for now. I need to revisit it at some point.