laberning / openrowingmonitor

A free and open source performance monitor for rowing machines
https://laberning.github.io/openrowingmonitor
GNU General Public License v3.0

Pigpio startup error #106

Open Abasz opened 1 year ago

Abasz commented 1 year ago

Sometimes when I restart the app frequently, I get a pigpio initialization error.

2023-01-01 17:40:12 initInitialise: Can't lock /var/run/pigpio.pid
/opt/openrowingmonitor/node_modules/pigpio/pigpio.js:54
    pigpio.gpioInitialise();
           ^

Error: pigpio error -1 in gpioInitialise
    at initializePigpio (/opt/openrowingmonitor/node_modules/pigpio/pigpio.js:54:12)
    at new Gpio (/opt/openrowingmonitor/node_modules/pigpio/pigpio.js:158:5)
    at createGpioTimerService (file:///opt/openrowingmonitor/app/gpio/GpioTimerService.js:40:18)
    at file:///opt/openrowingmonitor/app/gpio/GpioTimerService.js:69:1

Based on some research, this may be due to pigpio not shutting down cleanly. The recommendation is to add listeners for the process termination signals and call pigpio.terminate() from the handler:

import pigpio from 'pigpio'

// release the GPIO peripheral cleanly on the common termination signals
process.on('SIGHUP', shutdown)
process.on('SIGINT', shutdown)
process.on('SIGCONT', shutdown)
process.on('SIGTERM', shutdown)

function shutdown () {
  pigpio.terminate()
  process.exit(0)
}

Has anyone encountered this issue?

JaapvanEkris commented 1 year ago

I haven't encountered this specific error yet, but a listener is useless when the app crashes in an uncontrolled way. Couldn't we do a killall for the specific processes when doing a service restart?

Abasz commented 1 year ago

I encountered the issue when I do npm start and then terminate with ctrl+c after several cycles. Adding these listeners seemingly solves the issue, but I would need to do more testing. They definitely catch the ctrl+c termination (SIGINT). They probably do not catch an app crash, but that could be solved by wrapping the gpio controller start-up call in a try/catch, so any unhandled error is caught, logged, and pigpio.terminate() is called in the catch block.

Note that the pigpio handler is a forked node process, so it probably terminates differently when I ctrl+C the main process. I will do some testing to see what signals are being sent.
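
A minimal test sketch (illustrative, not part of ORM) that could be run inside the forked process to log which signals actually arrive:

// log which termination signals actually reach the forked process
const signals = ['SIGHUP', 'SIGINT', 'SIGCONT', 'SIGTERM']

for (const signal of signals) {
  process.on(signal, () => {
    console.log(`forked process (pid ${process.pid}) received ${signal}`)
    process.exit(0)
  })
}

// keep the process alive so there is something to send signals to
setInterval(() => {}, 1000)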

Abasz commented 1 year ago

Actually, there is an event for unhandled exceptions, uncaughtException:

process.on('uncaughtException', err => {
  console.log(`Uncaught Exception: ${err.message}`)
  process.exit(1)
})

So technically there is no need for a huge try/catch block to handle an app crash.
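
A minimal sketch (untested) combining the two ideas, releasing pigpio before exiting on an unhandled error:

import pigpio from 'pigpio'

// release the GPIO peripheral before exiting on an unhandled error
process.on('uncaughtException', err => {
  console.log(`Uncaught Exception: ${err.message}`)
  pigpio.terminate()
  process.exit(1)
})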

JaapvanEkris commented 1 year ago

I currently test with a service restart, allowing the service handler to manage a couple of things. The npm start approach is useful when the app crashes (allowing node.js to log to the console), but too often a wifi hiccup would kill the session and thus the app.

I guess the ANT+ server will run its own process as well. So keeping track of all spawned processes and cleaning up the mess when something crashes has to happen somewhere. My instinct is the service, as systemctl seems to monitor and reboot processes sometimes as well (see https://ma.ttias.be/auto-restart-crashed-service-systemd/).

Abasz commented 1 year ago

I guess the ANT+ server will run its own process as well. So keeping track of all spawned processes and cleaning up the mess when something crashes has to happen somewhere. My instinct is the service, as systemctl seems to monitor and reboot processes sometimes as well (see https://ma.ttias.be/auto-restart-crashed-service-systemd/).

I would not mix ANT+ things here, as that is a very different kind of service. Actually, it is very important that all peripherals are shut down properly, even in case of app termination or crash. In fact, the signals are already being used for that purpose in my peripheral extension work in progress (for instance, if the ANT+ stick is not shut down properly by calling its close() method, it may not start on the next start-up, or BLE advertising stays on after ORM is terminated and my watch keeps connecting to it).

Anyway, for GpioTimerService.js I would imagine something like this:

try {
  createGpioTimerService()
} catch (e) {
  log.error('Error when starting pigpio:', e)
  pigpio.terminate()
  process.exit(1)
}

// also release pigpio on a controlled termination
process.on('SIGHUP', shutdown)
process.on('SIGINT', shutdown)
process.on('SIGCONT', shutdown)
process.on('SIGTERM', shutdown)

function shutdown (signal) {
  console.log(signal)
  pigpio.terminate()
  process.exit(0)
}

For some reason the uncaughtException event did not fire while I was quickly testing this, so I used the try/catch block above instead (and I was not able to quickly find the reason).

I understand that one can rely on the systemd process management, but I am not sure that running ORM under a systemd service is feasible when, for instance, the initial calibration is being done and logging is important (logging with systemd, as you mentioned, is not really straightforward...). So I think there is a real use case for handling pigpio termination within code we control, instead of relying solely on a different service (which may easily be bypassed for convenience in certain cases).

I haven't encountered this specific error yet, but a listener is useless when the app crashes in an uncontrolled way. Couldn't we do a killall for the specific processes when doing a service restart?

I personally would not go for an automatic service restart when the app is in an unstable state. For instance, this error can only be cleared if I restart my Pi, so eventually it would just end up in a dead restart loop...

Abasz commented 1 year ago

I did more tests and I think I understand the main problem:

When I send SIGINT via ctrl+c to the main node process, sometimes it fails to stop the forked process of ./app/gpio/GpioTimerService.js.

So when I run npm start again, pigpio is already in use. Thus it is not sufficient to add the signal handling logic to GpioTimerService.js; it is also needed in the main process in server.js, so that in case of an app crash or any termination signal the forked process is killed (this is what you might have meant in one of your comments, but I may have misunderstood, sorry about that).
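
As a rough, untested sketch (the fork call and names are illustrative, not the actual server.js code), the clean-up in the main process could look something like this:

import { fork } from 'child_process'

// server.js starts the GPIO timer service as a forked child process
const gpioTimerService = fork('./app/gpio/GpioTimerService.js')

function shutdownChildren (signal) {
  console.log(`main process received ${signal}, stopping forked processes`)
  gpioTimerService.kill('SIGTERM')
  process.exit(0)
}

process.on('SIGHUP', shutdownChildren)
process.on('SIGINT', shutdownChildren)
process.on('SIGTERM', shutdownChildren)

// also clean up when the main process crashes with an unhandled error
process.on('uncaughtException', (err) => {
  console.error(`Uncaught Exception: ${err.message}`)
  gpioTimerService.kill('SIGTERM')
  process.exit(1)
})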

An alternative is to include no additional process handling logic (i.e. the program is only meant to be run under systemd, which actually manages the shutdown of forked processes when the main process is closed), but that would leave the app with potential bugs when run standalone via npm start.

JaapvanEkris commented 1 year ago

I would not mix ANT+ things here, as that is a very different kind of service. Actually, it is very important that all peripherals are shut down properly, even in case of app termination or crash. In fact, the signals are already being used for that purpose in my peripheral extension work in progress (for instance, if the ANT+ stick is not shut down properly by calling its close() method, it may not start on the next start-up, or BLE advertising stays on after ORM is terminated and my watch keeps connecting to it).

My point is that server.js needs to clean up all forked processes on a SIGINT, preferably in a consistent manner. So the GPIO-service and the ANT+ server (and BLE thingy?) all serve different purposes, but are all forked from the main thread, and could lock up a subsequent start of server.js.

I personally would not go for an automatic service restart when the app is in an unstable state. For instance, this error can only be cleared if I restart my Pi, so eventually it would just end up in a dead restart loop...

I noticed some intelligence on the systemctl part as well: after five reboots, it stops rebooting a failing process automatically (see https://freshman.tech/snippets/linux/auto-restart-systemd-service/). I don't know if that is the default config for ORM (see https://github.com/laberning/openrowingmonitor/blob/v1beta/install/openrowingmonitor.service), but that is what it did when I messed up my settings.

An alternative is to include no additional process handling logic (i.e. the program is only meant to be run under systemd, which actually manages the shutdown of forked processes when the main process is closed), but that would leave the app with potential bugs when run standalone via npm start.

Yes and no, it is just that the developer is in full control, which might be handy. So you have to remember to kill the gpio-thread as well. I typically kill all node threads to clean up the mess after a crash. I normally debug via two SSH sessions: one to restart the services, and one running tail -f /var/log/syslog | grep npm, where the latter part is optional (as I am still looking at the effects of housekeeping jobs on ORM). The benefit is that a wobbly wifi signal doesn't disturb a rowing session. Only when the logs are missing stuff (the app really crashes) do I run it directly from the console to get stack traces.

But your point that someone has to clean up the mess after a reboot is a valid one. One way or another this has to be addressed. Benefit of the systemd approach is that it will get the job done, regardless of how horrific the crash of the app is. But your approach is more elegant and could indeed also allow the ANT+ stick to be reset as well.

Abasz commented 1 year ago

I noticed some intelligence on the systemctl part as well: after five reboots, it stops rebooting a failing process automatically (see https://freshman.tech/snippets/linux/auto-restart-systemd-service/). I don't know if that is the default config for ORM (see https://github.com/laberning/openrowingmonitor/blob/v1beta/install/openrowingmonitor.service), but that is what it did when I messed up my settings.

The way the ORM service is set up, it should restart indefinitely (you can see that systemd itself shuts it down because the start requests repeat too quickly, but only after trying to start 262 times :)

nrowingmonitor.service: Service RestartSec=100ms expired, scheduling restart.
jan 02 21:08:50 rowingmonitor systemd[1]: openrowingmonitor.service: Scheduled restart job, restart counter is at 262.
jan 02 21:08:50 rowingmonitor systemd[1]: Stopped Open Rowing Monitor.
jan 02 21:08:50 rowingmonitor systemd[1]: openrowingmonitor.service: Start request repeated too quickly.
jan 02 21:08:50 rowingmonitor systemd[1]: openrowingmonitor.service: Failed with result 'exit-code'.

There is a way to set a limit for this in the service file; systemd config is quite straightforward. I will link it here. You can rate-limit the restarts by telling systemd that if n restarts happen within t time, no further retries should be attempted. You need to add StartLimitIntervalSec and StartLimitBurst to the [Unit] part of the file.

[Unit]
StartLimitIntervalSec=1
StartLimitBurst=5

Basically the above tells systemd that if 5 start attempts fail within 1 second, it will not try a 6th time.
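
For context, a hypothetical excerpt (not the actual ORM service file, values are illustrative) showing how these limits relate to the restart settings in the [Service] section:

[Unit]
Description=Open Rowing Monitor
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
Restart=always
RestartSec=1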

Yes and no, it is just that the developer is in full control, which might be handy. So you have to remember to kill the gpio-thread as well. I typically kill all node threads to clean up the mess after a crash. I normally debug via two SSH sessions: one to restart the services, and one running tail -f /var/log/syslog | grep npm, where the latter part is optional (as I am still looking at the effects of housekeeping jobs on ORM). The benefit is that a wobbly wifi signal doesn't disturb a rowing session. Only when the logs are missing stuff (the app really crashes) do I run it directly from the console to get stack traces.

I think with a complex systemd structure it is better to use systemd notify anyway (for instance, it allows proper handling of log messages with log levels). I have never done that for Node.js, only for .NET Core (for which it is very simple). I will look into this.
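
As a first idea (untested): instead of a native sd_notify binding, one could shell out to the systemd-notify helper; this assumes the unit uses Type=notify and, because the helper runs as a separate process, probably NotifyAccess=all:

import { execFile } from 'child_process'

// tell systemd the service is up and running (only relevant for Type=notify units)
export function notifyReady () {
  // NOTIFY_SOCKET is only set when systemd started the service and expects notifications
  if (!process.env.NOTIFY_SOCKET) return
  execFile('systemd-notify', ['--ready'], (err) => {
    if (err) console.error(`systemd-notify failed: ${err.message}`)
  })
}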

JaapvanEkris commented 1 year ago

The way the ORM service is set up, it should restart indefinitely (you can see that systemd itself shuts it down because the start requests repeat too quickly, but only after trying to start 262 times :)

Ouch.

[Unit]
StartLimitIntervalSec=1
StartLimitBurst=5

I'll add this to the service definition, but with a StartLimitIntervalSec of 60 seconds.

I think with a complex systemd structure it is better to use systemd notify anyway (for instance, it allows proper handling of log messages with log levels). I have never done that for Node.js, only for .NET Core (for which it is very simple). I will look into this.

Sounds great!

JaapvanEkris commented 1 year ago

There is a way to set a limit for this in the service file; systemd config is quite straightforward. I will link it here. You can rate-limit the restarts by telling systemd that if n restarts happen within t time, no further retries should be attempted. You need to add StartLimitIntervalSec and StartLimitBurst to the [Unit] part of the file.

[Unit]
StartLimitIntervalSec=1
StartLimitBurst=5

Basically the above tells systemd that if 5 start attempts fail within 1 second, it will not try a 6th time.

I added this as part of the robustness improvements with Pull Request https://github.com/laberning/openrowingmonitor/pull/114, with a StartLimitIntervalSec of 1 minute and a StartLimitBurst of 5.