New "code=killed, status=11/SEGV" behavior.

dethpickle commented 5 years ago

Just recently off of issue #57. I'm not sure this could be related, but since putting on that update I've had the aqualinkd service quit on me twice, right when a new command is sent to it. In each case, it had been running successfully for several days and then SEGV'd when a command came in.

I see closed issue #52 looks identical. I'm going to try enabling Restart=on-failure which I currently have commented out. Just adding my observation of the same issue so that it's not considered a one-off.

First time I noticed - I'm turning on a pump at 10am:

May 18 14:45:53 raspberrypi aqualinkd[1994]: Starting web server on port 80
May 18 14:45:53 raspberrypi aqualinkd[1994]: Starting MQTT client to 10.1.2.5:1883
May 18 14:45:54 raspberrypi aqualinkd[1994]: Listening to Aqualink RS8 on serial port: /dev/ttyUSB0
May 18 14:45:54 raspberrypi systemd[1]: Started Aqualink RS daemon.
May 18 14:46:01 raspberrypi aqualinkd[1994]: Control Panel B0029223 REV T.2
May 18 14:46:01 raspberrypi aqualinkd[1994]: Getting freeze protection setpoints
May 18 14:46:08 raspberrypi aqualinkd[1994]: Getting pool & spa heat setpoints from aqualink
May 24 10:00:00 raspberrypi systemd[1]: aqualinkd.service: Main process exited, code=killed, status=11/SEGV
May 24 10:00:00 raspberrypi systemd[1]: aqualinkd.service: Unit entered failed state.
May 24 10:00:00 raspberrypi systemd[1]: aqualinkd.service: Failed with result 'signal'.

I restarted it and it again ran for a few days - the daytime low-speed shutoff is at 14:00:

May 25 03:25:01 raspberrypi aqualinkd[8799]: Starting MQTT client to 10.1.2.5:1883
May 25 03:46:10 raspberrypi aqualinkd[8799]: MQTT Connection closed
May 25 03:46:20 raspberrypi aqualinkd[8799]: Starting MQTT client to 10.1.2.5:1883
May 26 00:14:02 raspberrypi aqualinkd[8799]: MQTT Connection closed
May 26 00:14:11 raspberrypi aqualinkd[8799]: Starting MQTT client to 10.1.2.5:1883
May 26 07:25:44 raspberrypi aqualinkd[8799]: MQTT Connection closed
May 26 07:25:53 raspberrypi aqualinkd[8799]: Starting MQTT client to 10.1.2.5:1883
May 28 14:00:00 raspberrypi systemd[1]: aqualinkd.service: Main process exited, code=killed, status=11/SEGV
May 28 14:00:00 raspberrypi systemd[1]: aqualinkd.service: Unit entered failed state.
May 28 14:00:00 raspberrypi systemd[1]: aqualinkd.service: Failed with result 'signal'.

Previously, I had extremely rare shutdowns of the aqualinkd service. I "handled" them simply by having the systemctl that runs the service have a restart in it. My aqualinkd.service file:

[Unit]
Description=Aqualink RS daemon
ConditionPathExists=/etc/aqualinkd.conf
After=network.target multi-user.target
Requires=network.target

[Service]
Type=forking
#RemainAfterExit=no
#StartLimitInterval=0
PIDFile=/run/aqualinkd.pid
EnvironmentFile=/etc/default/aqualinkd
ExecStart=/usr/local/bin/aqualinkd $OPTS
ExecReload=/bin/kill -HUP $MAINPID
#Restart=on-failure
KillMode=process
RestartSec=2

[Install]
WantedBy=multi-user.target

Any suggestions or places I could gather a better picture/log to submit, please let me know.

ballle98 commented 5 years ago

Interesting. I was seeing cores then I enabled MQTT and they stopped. I disabled MQTT and they came back. I'm on a fork of the code so I'm not 100% sure I'm seeing the same crash.

https://github.com/ballle98/AqualinkD/issues/20

You can follow these instructions to enable core files https://pve.proxmox.com/wiki/Enable_Core_Dump_systemd

rebuild the code with debug symbols, and turn on address sanitize (we're looking for memory corruption)

make DBG="-g -O0 -fsanitize=address -static-libasan" make install

when it crashes "sudo -i" to become root, run gbd and dump the back trace with the bt command cd /var/lib/coredumps/ ls gdb /home/pi/git/AqualinkD/release/aqualinkd core-aqualinkd-sig11-user0-group0-pid386-time1554347133 bt

sfeakes commented 5 years ago

What Version of AqualinkD are you running? When fixing the MQTT / Mosquitto 1.6 issue, I posted a bad version that would core dump. If your using v1.2.6e, please update of downgrade to a previous version.

dethpickle commented 5 years ago

I'm on v1.2.6f, from when you fixed MQTT. Thanks for that, btw. I'm still up from the the restart on the 28th. I'll still keep an eye on it.

niharmehta commented 5 years ago

Just had the status=11/SEGV happen this evening. I am not sure a specific command was being sent at the time as I was away from the house. Running 1.3.1

Jun 4 19:42:24 aquad systemd[1]: aqualinkd.service: Main process exited, code=killed, status=11/SEGV Jun 4 19:42:24 aquad systemd[1]: aqualinkd.service: Unit entered failed state. Jun 4 19:42:24 aquad systemd[1]: aqualinkd.service: Failed with result 'signal'. Jun 4 20:17:01 aquad CRON[13258]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

sfeakes commented 5 years ago

I just posted 1.3.2, I fixed a few places where there was a potential for a core dump.

dethpickle commented 5 years ago

Got it. Updated and I'll check it again in three days. Seemed to be dying about every two days. Thank you!

niharmehta commented 5 years ago

Just had another crash. 1.3.2 . I think my child may have turned on the slide and waterfall at approximately the time it crashed. However, not 100% sure.

Jun 7 15:01:33 aquad aqualinkd: Serial packet too large Jun 7 15:01:33 aquad aqualinkd: Bad receive packet | HEX: 0x10|0x02|0x33|0x71|0x06|0x01|0x02|0x03|0x04|0x05|0x06|0x00|0x07|0x41|0x6c|0x6c|0x20|0x4f|0x46|0x46|0x00|0x0a|0x55|0x70|0x70|0x65|0x72|0x20|0x50|0x6f|0x6f|0x6c|0x00|0x0a|0x4c|0x6f|0x77|0x65|0x72|0x20|0x50|0x6f|0x6f|0x6c|0x00|0x0a|0x57|0x61|0x74|0x65|0x72|0x66|0x61|0x6c|0x6c|0x73|0x00|0x05|0x50|0x61|0x72|0x74|0x79|0x00| Jun 7 15:01:33 aquad systemd[1]: aqualinkd.service: Main process exited, code=killed, status=11/SEGV Jun 7 15:01:33 aquad systemd[1]: aqualinkd.service: Unit entered failed state. Jun 7 15:01:33 aquad systemd[1]: aqualinkd.service: Failed with result 'signal'.

sfeakes commented 5 years ago

This was very helpful, I think I've located the problem. Wait for 1.3.2b to be posted within the hour.

Just out of interest, can you list what you have on your RS485 bus? I ask as you can have Jandy and Pentair working on the same bus as the protocols are somewhat compatible, but what I'm seeing there is either another protocol I'm not familiar with or you have some other form of errors on your bus. (ie something introducing noise, like a bad termination). It's not a big issue for you unless you are noticing a lot of "press the button twice as the first time it didn't work" kind-a things.

niharmehta commented 5 years ago

Hayward Goldline running AL-5 emulation for Jandy AquaPure. Pentair VS Pump iAqualink 2.0 Aqualink RS control. RPI running AqualinkD

Not seeing too many errors like I did before I cleaned up the wires. However, I just got another crash with similar error. Jun 7 15:20:06 aquad aqualinkd: Bad receive packet | HEX: 0x10|0x02|0x33|0x72|0x08|0x01|0x02|0x03|0x04|0x05|0x06|0x07|0x20|0x01|0x01|0x00|0x00|0x09|0x57|0x61|0x74|0x65|0x72|0x66|0x61|0x6c|0x6c|0x00|0x01|0x00|0x00|0x05|0x4c|0x69|0x67|0x68|0x74|0x00|0x01|0x00|0x00|0x09|0x53|0x70|0x69|0x6c|0x6c|0x6f|0x76|0x65|0x72|0x01|0x01|0x00|0x00|0x05|0x53|0x6c|0x69|0x64|0x65|0x00|0x01|0x00|

I am testing some stuff with HomeAssistant, but that is with MQTT and I doubt it is related at this point. I will wait for your update.

sfeakes commented 5 years ago

That crash is identical to the other. (I hope the 1.3.2b fixes that, I've posted the update now). The new version can also read the Pentair protocol (in very early beta), so if it works for you, you can also add read_pentair_packets=yes to your config and make sure read_all_devices = yes is also their, and it should report RPM and WATTS for the pump to the Web UI and MQTT.

But there is nothing their that produces a protocol AqualinkD doesn't read, (that I know of). But none if this stuff is documented, it's all reversed engineered from myself and others posting info on the net.

dethpickle commented 5 years ago

I've been up and running on v1.3.2. (don't see a b on it) since Jun 05 15:38:25 without a restart. This is right about the time it would be doing it. 2-3 days. I'll still keep an eye on it.

dethpickle commented 5 years ago

I've had no restarts at all since upgrading to v1.3.2.

dethpickle commented 5 years ago

Just had one (same 11/SEGV error) but it was after 14 days. I've noticed you have a 1.3.3 and have upgraded to that. I'll keep you posted.

niharmehta commented 5 years ago

Since upgrade to 1.3.3, the service has not crashed in 20 days.

sfeakes commented 5 years ago

That's great news. Glad it seems to be stable for you now.

sfeakes / AqualinkD

New "code=killed, status=11/SEGV" behavior. #63