sudomesh / sudowrt-firmware

Scripts to build the sudo mesh OpenWRT firmware.

Some nodes losing connections #76

Closed by max-b 8 years ago

max-b commented 8 years ago

A few of our nodes seem to be occasionally losing connectivity to the mesh. It's only 3 of them, and the rest have had perfect uptime, so I think that I may have somehow pushed a bad firmware to them, probably related to tunneldigger and/or a hook script.

Ideally, we would be able to get some debug info from one of these disconnected devices to figure out what exactly is going on. If someone could connect to a device that has lost mesh connectivity and run the following commands, it would be very helpful:

    logread
    ip addr
    ip link
    ip route
    ip route show table public
    ps
    ip rule list
    kill -USR1 $(pgrep babeld); cat /var/log/babeld.log
    ping -c 3 8.8.8.8
    traceroute 8.8.8.8
    iptables -L -v -n
    iptables -L -v -n -t nat
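To make the output easier to send back, the commands listed above could be wrapped in a small script that writes everything to one file. This is only a sketch: the `OUT` path, the `run` and `collect` helpers, and the `--run` flag are hypothetical names, not part of the firmware.

```shell
#!/bin/sh
# Sketch: collect the requested diagnostics into a single report file.
# OUT is a hypothetical location; override it from the environment if needed.
OUT="${OUT:-/tmp/mesh-debug.txt}"

run() {
    # Log the command line, then its stdout and stderr, then its exit status.
    echo "### $*" >> "$OUT"
    "$@" >> "$OUT" 2>&1
    echo "### exit status: $?" >> "$OUT"
}

collect() {
    : > "$OUT"    # start with an empty report
    run logread
    run ip addr
    run ip link
    run ip route
    run ip route show table public
    run ps
    run ip rule list
    # Signal babeld with USR1, then read its log, as in the list above.
    run sh -c 'kill -USR1 $(pgrep babeld); cat /var/log/babeld.log'
    run ping -c 3 8.8.8.8
    run traceroute 8.8.8.8
    run iptables -L -v -n
    run iptables -L -v -n -t nat
}

# Only collect when asked explicitly, so the helpers can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    collect
    echo "report written to $OUT"
fi
```

Saved on the node under any name and invoked with `--run`, it leaves a single file that can be copied off the device once it is reachable again.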

max-b commented 8 years ago

I think that this commit will help: cbbaf1be035aecdac71e016ab143ad463e0d919c

I added an /etc/init.d/tunneldigger restart to the udhcpc.user script, which gets called after the node gets a new DHCP lease.

max-b commented 8 years ago

I actually fancied up that udhcpc.user script to test for a connection and, if none is found, run /etc/init.d/meshrouting restart.
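Roughly, a hook like that could look as follows. This is a sketch and not the contents of the commit: it assumes the DHCP client passes the event name (`bound`, `renew`, ...) as the first argument, and `CHECK_HOST`, `RESTART_CMD`, `has_connection` and `on_dhcp_event` are hypothetical names. The 8.8.8.8 default is just the address used in the ping test earlier in this thread.

```shell
# Sketch of a udhcpc.user-style hook.
# Assumption: the event name is passed as $1 by the DHCP client script.
CHECK_HOST="${CHECK_HOST:-8.8.8.8}"    # hypothetical connectivity target
RESTART_CMD="${RESTART_CMD:-/etc/init.d/meshrouting restart}"

has_connection() {
    # Succeeds if the target answers at least one of three pings.
    ping -c 3 "$CHECK_HOST" > /dev/null 2>&1
}

on_dhcp_event() {
    case "$1" in
        bound|renew)
            # A lease was obtained or refreshed: check connectivity and
            # restart mesh routing only if the check fails.
            if ! has_connection; then
                $RESTART_CMD
            fi
            ;;
    esac
}

on_dhcp_event "${1:-}"
```

Other events (for example `deconfig`) fall through the `case` and do nothing, so the restart only fires right after a lease change.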

It could be the basis of a mesh watchdog....

max-b commented 8 years ago

I've created a mesh-watchdog script which just pings our exit server at a preset interval and runs a restart script after a certain period of time. I haven't set it up on all of the nodes, because it could be a problem if someone is intentionally creating a small mesh that isn't meant to reach the publicly accessible exit server.
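For illustration, one way such a ping watchdog could be structured is shown below. This is not the actual mesh-watchdog code. It assumes the script is run periodically (for example from cron) and restarts only after several consecutive failed checks. `EXIT_SERVER` (a documentation-range placeholder address), `MAX_FAILS`, `STATE_FILE`, `RESTART_CMD` and the `--tick` flag are all hypothetical.

```shell
#!/bin/sh
# Sketch of a ping watchdog that is meant to be invoked periodically.
EXIT_SERVER="${EXIT_SERVER:-192.0.2.1}"    # placeholder, not the real exit server
MAX_FAILS="${MAX_FAILS:-5}"                # consecutive failures before a restart
STATE_FILE="${STATE_FILE:-/tmp/mesh-watchdog.fails}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/meshrouting restart}"

exit_reachable() {
    ping -c 3 "$EXIT_SERVER" > /dev/null 2>&1
}

watchdog_tick() {
    if exit_reachable; then
        echo 0 > "$STATE_FILE"    # healthy: reset the failure counter
        return 0
    fi
    # Unreachable: bump the counter kept between runs.
    fails=$(cat "$STATE_FILE" 2>/dev/null)
    fails=$(( ${fails:-0} + 1 ))
    if [ "$fails" -ge "$MAX_FAILS" ]; then
        echo 0 > "$STATE_FILE"    # reset so the next restart needs a fresh streak
        $RESTART_CMD
    else
        echo "$fails" > "$STATE_FILE"
    fi
}

# Run one check only when asked, so the functions can be sourced for testing.
if [ "${1:-}" = "--tick" ]; then
    watchdog_tick
fi
```

Requiring a streak of failures rather than a single missed ping avoids restarting on a brief blip. A node that is deliberately isolated from the exit server would simply not install the periodic job.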

Out of our alpha test devices, we have a couple that have shown significant downtimes, while others seem to have almost perfect uptimes: http://monitor.sudomesh.org/smokeping/smokeping.cgi?target=Mesh

Unfortunately, what we really need is to talk to the affected folks and find out more about their situations and why they fail. In certain circumstances it seems like folks have just actually unplugged their router, which means that this isn't really a technical fail as much as it's a logistics/community fail.

The ar71xx devices also have a hardware watchdog.

max-b commented 8 years ago

We haven't really had this problem for the last month or so. I think that the udhcpc.user script is a decent fix for now. @Juul is working on a hardware "watchpuppy" and I wrote a ping watchdog that I've been testing on a handful of nodes: https://github.com/sudomesh/sudowrt-packages/tree/master/net/mesh-watchdog