libremesh / lime-packages

LibreMesh packages configuring OpenWrt for wireless mesh networking
https://libremesh.org/
GNU Affero General Public License v3.0

Tiny monitor daemon #137

Closed p4u closed 7 years ago

p4u commented 7 years ago

In recent versions of LibreMesh, dnsmasq either fails to start properly or simply dies at runtime. bmx6 has also crashed unexpectedly in the past.

A small daemon could run tests on different processes every N minutes and execute a set of actions (such as restarting the process) to fix any wrong state it finds. It could also write a log somewhere on the system so that errors can be analysed afterwards.

p4u commented 7 years ago

Current development is in the branch https://github.com/libremesh/lime-packages/tree/feature/smonit

ilario commented 7 years ago

This would solve #111

p4u commented 7 years ago

Yes, but it will also cover more scenarios. @nicoechaniz from quintanalibre reports that sometimes dnsmasq just stops working while the process is still alive. This could be checked using smonit.

G10h4ck commented 7 years ago

These cases should be handled by the OpenWrt/LEDE init system. We should not add another custom init daemon that may eventually crash itself; we should have this solved in the upstream init system.

p4u commented 7 years ago

I agree: the more that can be handled by the LEDE procd/init system, the better. But:

  1. It depends on another team, so in the end accepting or rejecting a pull request is up to them.
  2. If we launch a release, we cannot wait for their approval, so until the problem is fixed upstream we have to deal with it using our own tools.
  3. In reality, this kind of daemon monitoring is already done by many network communities, but through a uci script that writes to crontab (see https://github.com/libremesh/network-profiles/tree/master/quintanalibre.org.ar/comun/etc/uci-defaults). So this is already happening, and hiding it (by not adding these scripts to the official source) does not help at all. Smonit will be a better way to handle these situations and will make it possible to reuse the same code across the different communities.

In addition, smonit is not a daemon (my previous definition was not accurate) but a small script executed by cron to check states and perform actions according to those states. It's a thin layer between crontab and the system, based on hooks which can be enabled or disabled depending on system and user needs.
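To illustrate the hook idea described above, here is a minimal sketch of a single check-and-fix pass such as cron might trigger. It is not the smonit code from the feature branch: the `Hook` type, the `run_hooks` function, and the choice of Python are all assumptions made for illustration only.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hook:
    """One monitored condition (hypothetical structure, not smonit's)."""
    name: str
    check: Callable[[], bool]  # returns True when the service is healthy
    fix: Callable[[], None]    # action to run when the check fails
    enabled: bool = True       # hooks can be switched off per system/user


def run_hooks(hooks: List[Hook], log: logging.Logger) -> List[str]:
    """Run every enabled hook once; return the names whose fix completed."""
    repaired = []
    for hook in hooks:
        if not hook.enabled:
            continue
        try:
            healthy = hook.check()
        except Exception as exc:
            # A check that blows up is treated as a failed check.
            log.error("%s: check raised %r", hook.name, exc)
            healthy = False
        if healthy:
            continue
        log.warning("%s: wrong state detected, running fix", hook.name)
        try:
            hook.fix()
        except Exception as exc:
            log.error("%s: fix raised %r", hook.name, exc)
        else:
            repaired.append(hook.name)
    return repaired
```

A script like this would run one pass and exit, leaving the "every N minutes" scheduling to crontab. The log calls record each detected failure so it can be analysed afterwards.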

Not all cases that can be covered by smonit can be covered by the procd/init system. While the latter might only check whether the daemon PID is alive, the former can also execute actions to detect wrong behaviour, e.g. dnsmasq is alive but is not resolving hosts.
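The difference between a liveness check and a behaviour check can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, a POSIX system is assumed for the `os.kill` probe, and the default resolver simply uses whatever DNS server the system is configured with (assumed here to be the local dnsmasq).

```python
import os
import socket
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Liveness only: does a process with this PID exist? (POSIX assumed)"""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def resolves(hostname: str,
             resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Behaviour check: can the configured resolver answer for hostname?"""
    try:
        resolver(hostname)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return False
    return True


def dnsmasq_healthy(pid: int, hostname: str,
                    resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Healthy only if the process is alive AND name resolution works."""
    return pid_alive(pid) and resolves(hostname, resolver)
```

A process can pass `pid_alive` and still fail `resolves`, which is exactly the "alive but not resolving hosts" case. The `resolver` parameter exists so the check can be exercised without a real DNS server.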

Log handling (on a crash or wrong behaviour) is also something that smonit can take care of. This will help the developers debug the current problems.

p4u commented 7 years ago

Another good example is this one: https://github.com/libremesh/network-profiles/blob/master/quintanalibre.org.ar/comun/usr/sbin/reset_deaf_phys.sh

This is an actual problem (I've seen it) with the Atheros drivers which has been there for a long time and still has no solution. Fixing it upstream (mac80211) would take a lot of effort, time, and someone with deep knowledge of kernel drivers.

So instead of keeping it hidden and letting other users experience this weird behaviour, let's add this fix to a (single) package that manages all fixes of this kind (call it smonit or anything else).

G10h4ck commented 7 years ago

In this case it is about integrating the workaround into the official distribution, and I agree with that.

nicoechaniz commented 7 years ago

I have tried to put together the workarounds we are using in this commented uci-defaults file: https://github.com/libremesh/network-profiles/blob/master/quintanalibre.org.ar/comun/etc/uci-defaults/92-cron-workaround-tasks

Considering that we have had to implement such workarounds for live deployments in every release, I think it's a good idea to integrate a more robust version into our distribution.

nicopace commented 7 years ago

This was closed by https://github.com/libremesh/lime-packages/pull/142