openshift / openshift-sdn

Add a tool for gathering data to debug networking problems #154

Closed danwinship closed 9 years ago

danwinship commented 9 years ago

This is still slightly a work in progress, but it's basically working (unless people hate what it does and want a total rewrite...).

This adds a script that you run on the OpenShift master. It gathers data there, on each node, and in each running pod, so that the results can be sent to a human for debugging. (Automatically diagnosing problems comes next.) Currently this includes:

One catch is that it requires that root@master be able to ssh to root on each node without needing a password. Alternatively, maybe it would make more sense to have the script run from an outside machine that has the ability to ssh to root at the master and each node, rather than running it from the master?
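As a rough illustration of that requirement, the following sketch checks up front whether password-less root ssh works for a list of hosts. The function name `check_passwordless_ssh` and the `SSH_CMD` variable are assumptions made for this example and are not part of the script in this PR; `BatchMode` and `ConnectTimeout` are standard OpenSSH options.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if root on this machine can ssh to root on
# every host given as an argument without being prompted for a password.
# SSH_CMD is an assumption that exists only so the check can be stubbed out.
SSH_CMD=${SSH_CMD:-ssh}

check_passwordless_ssh() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ! "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 \
                "root@${host}" true </dev/null 2>/dev/null; then
            echo "cannot ssh to root@${host} without a password" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

In the vagrant setup this might be called as `check_passwordless_ssh openshift-minion-1 openshift-minion-2 || exit 1` before any data is gathered, so the failure is reported once rather than per command.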

(To test in the vagrant setup, as root:

# root ends up with an erroneous KUBECONFIG when you do "sudo bash"
export KUBECONFIG=/vagrant/openshift.local.config/master/admin.kubeconfig
ssh-keygen   # and hit return a few times
scp -pr ~/.ssh openshift-minion-1:.ssh   # root pw is "vagrant"
scp -pr ~/.ssh openshift-minion-2:.ssh
debug.sh

)

dcbw commented 9 years ago

Should probably have 'ovs-ofctl -O OpenFlow13 show br0' too so that we get the real port numbers that match up with the ones in the flow tables.

dcbw commented 9 years ago

Also, let's do 'systemctl status openshift-[master|node]' too; that should get us the command line openshift was launched with and any very recent log messages.

dcbw commented 9 years ago

And might as well grab /etc/sysconfig/network-scripts/ifcfg-*, 'nmcli dev', 'nmcli con', and NM journal output too just to isolate any network setup problems that 'ip a' and 'ip r' don't show.
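A minimal sketch of how the extra per-node data suggested in these comments could be collected, one file per command. The `capture` and `gather_node_extras` names and the output file names are hypothetical, and `journalctl -u NetworkManager` is only one assumed way to get the NM journal output; none of this is taken from the actual script.

```shell
#!/bin/bash
# Hypothetical helper: run a command and save its stdout and stderr to
# "$outdir/<name>.txt". A failing command is recorded, not treated as fatal.
capture() {
    local outdir=$1 name=$2
    shift 2
    mkdir -p "$outdir"
    {
        echo "\$ $*"
        "$@" 2>&1 || echo "(exit status $?)"
    } > "${outdir}/${name}.txt"
}

# Sketch of the extra per-node data suggested in the comments above.
gather_node_extras() {
    local outdir=$1
    capture "$outdir" ovs-ports   ovs-ofctl -O OpenFlow13 show br0
    capture "$outdir" node-status systemctl status openshift-node
    capture "$outdir" nmcli-dev   nmcli dev
    capture "$outdir" nmcli-con   nmcli con
    capture "$outdir" nm-journal  journalctl -u NetworkManager --no-pager
    # Copy the ifcfg files if any exist; the glob may match nothing.
    cp /etc/sysconfig/network-scripts/ifcfg-* "$outdir"/ 2>/dev/null || true
}
```

Recording the exit status instead of aborting means that a node without NetworkManager, for example, still produces a complete set of files for the other commands.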

rajatchopra commented 9 years ago

I think we must decide on the operational method of this script (ssh to master/nodes etc). All other improvements can come as further PRs.

I had one proposal:

  1. Run the script in master mode and collect all useful data.
  2. Then run the script in node mode on each node.
  3. Run the script in assemble mode, which processes all the collected data and reports the obvious points of failure.

The above can be done all at once when password-less ssh is allowed between master/nodes; otherwise it will need manual runs.
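The three modes in this proposal could be dispatched along these lines. This is only a sketch: `run_mode`, the placeholder `gather_master_data`, `gather_node_data`, and `assemble_report` functions, and the default `./sdn-debug` directory are all assumptions for illustration, not names from the script under review.

```shell
#!/bin/bash
# Sketch of the proposed master / node / assemble modes. The gather_* and
# assemble_* function names are placeholders, not part of the actual script.
gather_master_data() { echo "gathering master data into $1"; }
gather_node_data()   { echo "gathering node data into $1"; }
assemble_report()    { echo "assembling report from $1"; }

run_mode() {
    local mode=$1 dir=${2:-./sdn-debug}
    case "$mode" in
        master)   gather_master_data "$dir" ;;
        node)     gather_node_data "$dir" ;;
        assemble) assemble_report "$dir" ;;
        *)
            echo "usage: run_mode {master|node|assemble} [dir]" >&2
            return 2
            ;;
    esac
}
```

With password-less ssh available, the master-mode run could invoke the node mode on each node over ssh and then run assemble mode itself; without it, each mode would be run by hand and the output directories copied to one place before assembling.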

danwinship commented 9 years ago

Pushed a new version:

danwinship commented 9 years ago

Sample output at http://people.redhat.com/dwinship/openshift-sdn-debug-2015-09-18.tgz

rajatchopra commented 9 years ago

Looks excellent.