lwa-project / ng_digital_processor

The Next Generation Digital Processor for LWA North Arm
Apache License 2.0

`ndp-control` randomly died #21

Open jaycedowell opened 10 months ago

jaycedowell commented 10 months ago

This afternoon the ndp-control process died with:

Nov 22 16:00:42 ndp ndp-control[204392]: Assertion failed: check () (src/msg.cpp:347)
Nov 22 16:00:42 ndp systemd[1]: ndp-control.service: Main process exited, code=killed, status=6/ABRT
Nov 22 16:00:42 ndp systemd[1]: ndp-control.service: Failed with result 'signal'.
Nov 22 16:00:42 ndp systemd[1]: ndp-control.service: Scheduled restart job, restart counter is at 1.

I'm not sure what src/msg.cpp refers to. Maybe 0MQ?

jaycedowell commented 8 months ago

This hasn't happened again, closing.

jaycedowell commented 7 months ago

There was another random ndp-control death today. From journalctl:

Mar 14 09:36:33 ndp ndp-control[1977]: Bad address (src/pipe.cpp:380)
Mar 14 09:36:33 ndp systemd[1]: ndp-control.service: Main process exited, code=killed, status=6/ABRT
Mar 14 09:36:33 ndp systemd[1]: ndp-control.service: Failed with result 'signal'.
Mar 14 09:36:33 ndp systemd[1]: ndp-control.service: Scheduled restart job, restart counter is at 1.

Not exactly the same as the previous failure but it could still be a 0MQ thing.

jaycedowell commented 7 months ago

We are running libzmq version 4.3.2. Looking at that tag in the repo I see:

The pipe.cpp error occurs right after the code tries to delete a message, so the two failures might be related. The commit history for this file also shows some work on handling communication problems on pipes. Those commits all came after the release of 4.3.2 (in 2019), so maybe we need to upgrade libzmq/python-zmq?

jaycedowell commented 7 months ago

Another one for msg.cpp:

Mar 18 16:22:00 ndp ndp-control[1982]: Assertion failed: check () (src/msg.cpp:347)
Mar 18 16:22:00 ndp systemd[1]: ndp-control.service: Main process exited, code=killed, status=6/ABRT
Mar 18 16:22:00 ndp systemd[1]: ndp-control.service: Failed with result 'signal'.
Mar 18 16:22:00 ndp systemd[1]: ndp-control.service: Scheduled restart job, restart counter is at 1.

jaycedowell commented 7 months ago

I've installed libzmq version 4.3.5 from https://download.opensuse.org/repositories/network:/messaging:/zeromq:/release-stable/xUbuntu_20.04/amd64/

Update: Well, that seems to have not been a great idea. Maybe it was just a bad restart on the software after the upgrade?

Update: Yeah, it doesn't like the new version. Maybe I need to rebuild the Python zmq module?

Update: I've now done a `pip install --user pyzmq==25.1.2` (up from 18.1.1).

Update: I'm just going to restart the servers.
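
After swapping both libzmq and pyzmq like this, it may be worth confirming which versions the Python side actually loads, since a pip-installed pyzmq wheel can bundle its own libzmq rather than use the system package. A minimal sketch, assuming it is run with the same interpreter and user environment as ndp-control; the helper name `report_zmq_versions` is made up for illustration:

```python
def report_zmq_versions():
    """Return the pyzmq and libzmq versions seen by this interpreter.

    Returns None if pyzmq cannot be imported at all.
    """
    try:
        import zmq
    except ImportError:
        return None
    return {
        # Version of the Python binding itself
        "pyzmq": zmq.pyzmq_version(),
        # Version of the libzmq library that the binding is linked against
        "libzmq": zmq.zmq_version(),
    }


if __name__ == "__main__":
    print(report_zmq_versions())
```

If the reported libzmq version is not the one expected from the system package, the binding is probably using a bundled copy.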

jaycedowell commented 6 months ago

Looks like another one from this morning:

Mar 26 13:25:00 ndp ndp-control[1824]: Assertion failed: check () (src/msg.cpp:414)
Mar 26 13:25:00 ndp systemd[1]: ndp-control.service: Main process exited, code=killed, status=6/ABRT
Mar 26 13:25:00 ndp systemd[1]: ndp-control.service: Failed with result 'signal'.
Mar 26 13:25:00 ndp systemd[1]: ndp-control.service: Scheduled restart job, restart counter is at 1.

Similar error but a different line. I guess that reflects the change in version?
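
To keep track of which source locations keep showing up, the journal lines quoted in this thread can be tallied by file and line number. This is only a sketch: the regular expression is an assumption based on the four abort messages above, and `tally_abort_sites` is a hypothetical helper, not part of ndp-control.

```python
import re
from collections import Counter

# Matches ndp-control journal lines ending in "(src/<file>:<line>)",
# the location suffix seen in the abort messages quoted in this thread.
SITE_RE = re.compile(
    r"ndp-control\[\d+\]: (?P<msg>.*) \((?P<file>src/[\w./-]+):(?P<line>\d+)\)\s*$"
)


def tally_abort_sites(lines):
    """Count how often each (file, line) location appears in the log lines."""
    counts = Counter()
    for text in lines:
        match = SITE_RE.search(text)
        if match:
            counts[(match.group("file"), int(match.group("line")))] += 1
    return counts


# Log lines copied from the comments above
sample = [
    "Nov 22 16:00:42 ndp ndp-control[204392]: Assertion failed: check () (src/msg.cpp:347)",
    "Mar 14 09:36:33 ndp ndp-control[1977]: Bad address (src/pipe.cpp:380)",
    "Mar 18 16:22:00 ndp ndp-control[1982]: Assertion failed: check () (src/msg.cpp:347)",
    "Mar 26 13:25:00 ndp ndp-control[1824]: Assertion failed: check () (src/msg.cpp:414)",
    "Mar 26 13:25:00 ndp systemd[1]: ndp-control.service: Failed with result 'signal'.",
]

if __name__ == "__main__":
    for (path, lineno), count in tally_abort_sites(sample).most_common():
        print(f"{path}:{lineno}  x{count}")
```

The same function could be fed the output of `journalctl -u ndp-control.service` split into lines. Note that line numbers are only comparable between crashes that happened under the same libzmq version.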