refraction-networking / conjure

Conjure Refraction Networking station code
https://refraction.network
Apache License 2.0

Standup Troubleshooting [gitlab] #43

Open jmwample opened 4 years ago

jmwample commented 4 years ago

This will be a series of posts discussing the issues I ran into while standing up the stations windyheron and artemis.

original submission: 1/21/2020

jmwample commented 4 years ago

PF_RING submodule incompatibility

Currently the tapdance server has its git submodule tracking:

[submodule "PF_RING"]
    path = PF_RING
    url = https://github.com/ntop/PF_RING.git
    branch = 6.4.0-stable

This is the stable version from 2017. Conjure, on the other hand, tracks 7.4.0-stable by default. The compilation process is exactly the same and will work. HOWEVER, if you have the kernel modules for i40e ethernet devices from 6.4.0-stable and you try to run a program compiled against the 7.4.0-stable libraries, it will crash.

Symptom

The sign that you are experiencing this issue is a floating point exception (signal 8, SIGFPE) that crashes every detector process shortly after startup.

Using public key: a1cb97be697c5ed5aefd78ffa4db7e68101024603511e40a89951bc158807177
PF_RING Tapdance shutting down...
PF_RING Tapdance done shutting down!
Starting process 0...
Core 6: PID 17346, lcore 9
Starting process 1...
Core 7: PID 17347, lcore 10
Starting process 2...
Core 8: PID 17348, lcore 11
Starting process 3...
Core 9: PID 17349, lcore 12
Starting process 4...
Core 10: PID 17350, lcore 13
Starting process 5...
Core 11: PID 17351, lcore 14
...child proc 0 killed by signal 8 -- Coredump created
...child proc 1 killed by signal 8 -- Coredump created
...child proc 2 killed by signal 8 -- Coredump created
...child proc 3 killed by signal 8 -- Coredump created
...child proc 4 killed by signal 8 -- Coredump created
...child proc 5 killed by signal 8 -- Coredump created
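
One way to confirm the mismatch before digging into the core dumps is to compare the version of the loaded pf_ring kernel module with the branch the submodule tracks. This is a minimal sketch, not a definitive check. It assumes the loaded module reports its version in /proc/net/pf_ring/info on a line such as "PF_RING Version : 6.4.0 (...)" and that the script is run from the repository root. The helper name first_version is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first x.y.z version string found in a file.
first_version() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+' "$1" | head -n 1
}

# Assumed location of the loaded module's info file; skipped if pf_ring is not loaded.
INFO=/proc/net/pf_ring/info
if [ -r "$INFO" ]; then
    echo "loaded kernel module: $(first_version "$INFO")"
fi

# Branch tracked by the PF_RING submodule, read from .gitmodules if present.
if [ -f .gitmodules ]; then
    echo "submodule branch: $(git config -f .gitmodules submodule.PF_RING.branch)"
fi
```

If the two lines disagree on the major version (6.x against 7.x), the crash above is the likely outcome.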

Solution

The current solution is to downgrade the conjure station to the 6.4.0-stable pf_ring tag. There is currently an issue open to update tapdance to support 7.4.0-stable (#42).

Downgrading the pf_ring libraries can be done in one of two ways.

EITHER:

OR

original comment: 1/21/2020

jmwample commented 4 years ago

Here is the diff of the makefile which worked for compiling on artemis. (solution 2)

$ git diff Makefile
diff --git a/Makefile b/Makefile
index 7651143..0fa6aa4 100644
--- a/Makefile
+++ b/Makefile
@@ -7,8 +7,8 @@ PFRINGDIR=./PF_RING/
 PFRING_LIBS=${PFRINGDIR}/userland/lib/libpfring.a ${PFRINGDIR}/userland/libpcap/libpcap.a
 RUST_LIB=./target/release/librust_dark_decoy.a
 TD_LIB=./libtapdance/libtapdance.a
-LIBS=${PFRING_LIBS} ${RUST_LIB} ${TD_LIB} -L/usr/local/lib -lzmq -lcrypto -lpthread -lrt -lgmp -ldl -lm
-CFLAGS = -Wall -DENABLE_BPF -DHAVE_PF_RING -DHAVE_PF_RING_ZC -DTAPDANCE_USE_PF_RING_ZERO_COPY -I${PFRINGDIR}/userland/lib/ -I${PFRINGDIR}/kernel -O2 # -g
+LIBS= ${RUST_LIB} ${TD_LIB} -L/usr/local/lib -lzmq -lcrypto -lpthread -lrt -lgmp -ldl -lm -lpfring -lpcap
+CFLAGS = -Wall -DENABLE_BPF -DHAVE_PF_RING -DHAVE_PF_RING_ZC -DTAPDANCE_USE_PF_RING_ZERO_COPY -O2 # -g
 PROTO_RS_PATH=src/signalling.rs

original comment: 1/21/2020

jmwample commented 4 years ago

Held Packages

Symptom

When attempting to install libzmq3-dev, apt reports that the proper version of libzmq5 is not installed and says:

... you have held broken packages

This message can mean that a package was "held" using the apt system so that it would not update. To see a list of these packages you can run the following command:

sudo apt-mark showhold

This should list all packages held on the system. If none are listed (as was the case during the windyheron setup) the next step is to look for broken apt sources.

Solution

In this case there was an extraneous apt source in /etc/apt/sources.list which tied libzmq to an old (broken) source. [Unfortunately I forgot to copy it anywhere before removing it.]

Removing the zmq-specific source and allowing apt to use the default repositories worked great. Another possible cause of this error is a default apt source for a release that does not match the distribution you are running, but that was not the case this time.
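
To hunt for a stray source like this one, one approach is to search the apt source lists for the package name and then ask apt where it would install the package from. This is a rough sketch. The helper name find_sources is hypothetical, and the search pattern is just the one relevant to this case.

```shell
# Hypothetical helper: print non-comment lines matching a pattern
# in the given apt source files, as file:line:content.
find_sources() {
    pattern=$1
    shift
    grep -HniE "$pattern" "$@" 2>/dev/null |
        grep -vE '^[^:]*:[0-9]+:[[:space:]]*#'
}

# Usage on a Debian/Ubuntu host:
#   find_sources 'zmq|zeromq' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
#   apt-cache policy libzmq5 libzmq3-dev   # candidate versions and their origin
```

Any line printed by find_sources that does not belong to the default repositories is a candidate for removal, followed by sudo apt update.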

Tried and Failed

While the apt package was broken I attempted to install libzmq from source. This worked for conjure, but did NOT work for tapdance and resulted in downtime while I attempted to recover the installation.

original comment: 1/21/2020

jmwample commented 4 years ago

Zbalance Teardown

When restarting zbalance_ipc, you must first stop the programs consuming data from the queues (detector.service); otherwise the teardown leaves things in a strange state.
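
The ordering can be scripted so that the consumer is always stopped first. This is a sketch, assuming systemd manages both units under the names used in this thread (detector.service and zbalance.service). The functions run and restart_zbalance and the DRY_RUN switch are hypothetical conveniences.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop the queue consumer, restart zbalance, then bring the consumer back.
# Each step runs only if the previous one succeeded.
restart_zbalance() {
    run sudo systemctl stop detector.service &&
    run sudo systemctl restart zbalance.service &&
    run sudo systemctl start detector.service
}

# Preview the commands without touching any service:
#   DRY_RUN=1 restart_zbalance
```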

Symptoms

If you find yourself in this strange state, zbalance_ipc will complain about huge pages when you try to restart it.

Solution

If you are running zbalance on its own when this happens, choose a new cluster id (-c [CLUSTER_ID]) and restart zbalance_ipc.

If you are running the zbalance service from Tapdance, change TD_CLUSTER_ID to a new value in /opt/tapdance/config and then restart zbalance.service, which automatically sources the tapdance config.
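
Changing the cluster id can be done with a one-line edit. This is a sketch, assuming /opt/tapdance/config contains a plain shell assignment of the form TD_CLUSTER_ID=<number>. The file is sourced by the service, but its exact layout is not shown in this thread. The helper name set_cluster_id and the id 99 in the usage lines are hypothetical.

```shell
# Hypothetical helper: rewrite the TD_CLUSTER_ID assignment in a config file,
# keeping a .bak copy of the original next to it.
set_cluster_id() {
    file=$1
    new_id=$2
    sed -i.bak -E \
        "s/^([[:space:]]*(export[[:space:]]+)?TD_CLUSTER_ID=).*/\1${new_id}/" \
        "$file"
}

# Usage (as root, with detector.service already stopped):
#   set_cluster_id /opt/tapdance/config 99
#   systemctl restart zbalance.service
```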

Note: This is only tested on PF_RING 6.4.0-stable

original comment: 1/21/2020