Add a tool for gathering data to debug networking problems

danwinship commented 9 years ago

This is still a bit of a work in progress, but it basically works (unless people hate what it does and want a total rewrite...).

This adds a script that you run on the OpenShift master. It gathers data there, on each node, and in each running pod; the results can then be sent to a human for debugging purposes. (Automatically diagnosing problems comes next.) Currently this includes:

journalctl --unit openshift-master.service on the master and journalctl --unit openshift-node.service on each node
journalctl --boot on the master and each node. (FIXME: too invasive? Might make admins nervous...)
ip a and ip r on the master, each node, and inside each pod
iptables-save on the master and each node
/etc/hosts from the master and each node
oc get nodes -o json and oc get pods --all-namespaces -o json
brctl show on each node
node-config.yaml from each node
ovs-ofctl -O OpenFlow13 dump-flows br0 on each node
a set of ovs-appctl ofproto/trace outputs on each node, showing traces of up to four different pairs (send/receive) of pod traffic (as many of "packets between local pods in the same namespace", "packets between local pods in different namespaces", "packets between local and remote pod in the same namespace" and "packets between local and remote pod in different namespaces" as it's possible to show given the currently running pods).
the results of every pod attempting to ping every other pod. (FIXME: even with the ping timeout set to 2 seconds, this still really slows things down when using multitenant. It should just do a "representative" sample of pings like with the flow traces.)
the results of every pod attempting to ping www.redhat.com. (FIXME: should do an ofproto/trace test of external network traffic as well)
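
For illustration only, the gathering step for the commands listed above could look roughly like the sketch below. This is not the code in the PR: `run_and_log` and `logdir` are hypothetical names, and the "one file per command" layout is an assumption.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the PR): run a command and save
# its output under $logdir, in a file named after the command line.
logdir=${logdir:-$(mktemp -d)}

run_and_log() {
    local outfile
    # Turn e.g. "ip a" into "ip_a" so it can be used as a file name
    outfile="$logdir/$(echo "$*" | tr -c 'A-Za-z0-9.\n-' '_')"
    # Capture stdout and stderr; record the exit status if the command fails
    "$@" > "$outfile" 2>&1 || echo "(exit status $?)" >> "$outfile"
}

# Example use on a node (commented out so the sketch has no side effects):
# run_and_log ip a
# run_and_log ip r
# run_and_log iptables-save
# run_and_log brctl show
```

Recording the exit status in the same file means a missing tool or a failed command still leaves a trace for whoever reads the data later.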

One catch is that it requires that root@master be able to ssh to root on each node without needing a password. Alternatively, maybe it would make more sense to have the script run from an outside machine that has the ability to ssh to root at the master and each node, rather than running it from the master?
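
A minimal sketch of how the passwordless-ssh requirement could be checked up front, assuming the node names are already known (for example from `oc get nodes`). The function names are made up for this example; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/bin/bash
# Return 0 if root on this machine can ssh to root@$1 without a password.
can_ssh_without_password() {
    # BatchMode=yes makes ssh fail instead of prompting for a password
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true \
        </dev/null >/dev/null 2>&1
}

# Print each node (given as arguments) that cannot be reached that way.
unreachable_nodes() {
    local node
    for node in "$@"; do
        can_ssh_without_password "$node" || echo "$node"
    done
}

# Example (commented out; host names are from the vagrant setup):
# unreachable_nodes openshift-minion-1 openshift-minion-2
```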

(To test in the vagrant setup, as root:

# root ends up with an erroneous KUBECONFIG when you do "sudo bash"
KUBECONFIG=/vagrant/openshift.local.config/master/admin.kubeconfig
ssh-keygen   # and hit return a few times
scp -pr ~/.ssh openshift-minion-1:.ssh   # root pw is "vagrant"
scp -pr ~/.ssh openshift-minion-2:.ssh
debug.sh

)

dcbw commented 9 years ago

Should probably have 'ovs-ofctl -O OpenFlow13 show br0' too so that we get the real port numbers that match up with the ones in the flow tables.

dcbw commented 9 years ago

Also let's do 'systemctl status openshift-[master|node]' too; that should get us the command line openshift was launched with and any very recent log messages.

dcbw commented 9 years ago

And might as well grab /etc/sysconfig/network-scripts/ifcfg-*, 'nmcli dev', 'nmcli con', and NM journal output too just to isolate any network setup problems that 'ip a' and 'ip r' don't show.

rajatchopra commented 9 years ago

I think we must decide on the operational method of this script (ssh to master/nodes etc). All other improvements can come as further PRs.

I had one proposal:

Run the script in master mode and collect all useful data.
Then run the script in node mode on all nodes.
Run in assemble mode that chews all the data and produces obvious broken points.

The above can be done all at once when password-less ssh is allowed between master/nodes; otherwise it will need manual runs.
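
To make the proposal concrete, here is a rough sketch of the three modes as a dispatcher. Everything in it is hypothetical: `run_mode` and the three placeholder functions only echo what a real implementation would do, and the mode names are just the ones used in the proposal.

```shell
#!/bin/bash
# Hypothetical sketch of the three proposed modes. The functions below
# are placeholders that only print what a real version would do.
gather_master_data() { echo "gathering master data into $1"; }
gather_node_data()   { echo "gathering node data into $1"; }
assemble_data()      { echo "analyzing data in $1"; }

# run_mode MODE DIR: dispatch to the function for the requested mode.
run_mode() {
    local mode="$1" dir="$2"
    case "$mode" in
        master)   gather_master_data "$dir" ;;
        node)     gather_node_data "$dir" ;;
        assemble) assemble_data "$dir" ;;
        *)        echo "usage: run_mode master|node|assemble DIR" >&2
                  return 1 ;;
    esac
}
```

With passwordless ssh, one wrapper could run the master mode locally, the node mode over ssh on each node, and then the assemble mode; without it, an admin would run each mode by hand and copy the output directories to one place.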

danwinship commented 9 years ago

Pushed a new version:

Fixes the ping and flow-trace FIXMEs above: the two tests are now merged together into a single "connectivity" test that does both ofproto/trace and ping
Adds ovs-ofctl -O OpenFlow13 show br0 as suggested by dcbw
Adds systemctl show openshift-{master,node} as sort-of-suggested by dcbw; as opposed to systemctl status, this includes more data and is more machine-parseable. It doesn't include journal output but that's ok because we already have that separately.
Adds /etc/sysconfig/network-scripts/ifcfg-*, nmcli -f all dev, and nmcli -f all con as suggested by dcbw (but with the addition of -f all). Doesn't add NM journal output since we already have that in the main journal file.
Can now be run either from the master, ssh'ing to the nodes from there, or from any random machine (taking the name/IP of the master as an argument), ssh'ing to master and nodes from there. In theory you could also run it manually on each machine, passing either --master or --node to tell it what info to gather, and then merge the data yourself...
FIXME: I just realized that I still need to test the case where the master is also a node
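
As an illustration of the "representative sample" idea behind the connectivity test, the sketch below sorts pod pairs into the four cases from the original description (local or remote, same or different namespace) and keeps only the first pair seen for each case. It is not the code in the PR; the function names and the input format (one "node1 ns1 node2 ns2" line per pair) are assumptions.

```shell
#!/bin/bash
# Classify a pair of pods into one of four cases.
# Arguments: node and namespace of the first pod, then of the second.
classify_pair() {
    local node1="$1" ns1="$2" node2="$3" ns2="$4"
    local where=remote which=different
    [ "$node1" = "$node2" ] && where=local
    [ "$ns1" = "$ns2" ] && which=same
    echo "$where-$which-namespace"
}

# Read "node1 ns1 node2 ns2" lines on stdin and print the first pair
# seen for each case, so at most four pairs need to be tested.
pick_representatives() {
    local seen="" kind node1 ns1 node2 ns2
    while read -r node1 ns1 node2 ns2; do
        kind=$(classify_pair "$node1" "$ns1" "$node2" "$ns2")
        case " $seen " in
            *" $kind "*) continue ;;   # this case is already covered
        esac
        seen="$seen $kind"
        echo "$kind: $node1/$ns1 -> $node2/$ns2"
    done
}
```

Each selected pair would then get both an ofproto/trace and a ping, instead of every pod pinging every other pod.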

danwinship commented 9 years ago

Sample output at http://people.redhat.com/dwinship/openshift-sdn-debug-2015-09-18.tgz

rajatchopra commented 9 years ago

Looks excellent.

openshift / openshift-sdn

Add a tool for gathering data to debug networking problems #154