merbanan / rtl_433

Program to decode radio transmissions from devices on the ISM bands (and other frequencies)
GNU General Public License v2.0

new rtl_433 hanging/deadlocked? #2426

Closed: rct closed this issue 10 months ago

rct commented 1 year ago

In the past 24 hours, I've had rtl_433 hang twice on one host and once on a different host. Unfortunately, I've yet to capture a core file -- SIGABRT isn't generating one. Will add more details when I get them. Posting this now in case anyone is seeing anything similar.

strace shows the main process is blocked in futex():

$ strace: Process 9581 attached
futex(0x740c64a8, FUTEX_WAIT, 9587, NULL, 

In all 3 cases, the last thing logged by rtl_433 was the stats message:

{"time" : "2023-03-15 18:50:05.628602", "enabled" : 200, "since" : "2023-03-15T18:46:13", "frames" : {"count" : 148, "fsk" : 0, "events" : 107}, "stats" : [{"device" : 20, "name" : "Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor", "events" : 191, "ok" : 27, "messages" : 27, "fail_other" : 163, "fail_mic" : 1}, {"device" : 40, "name" : "Acurite 592TXR Temp/Humidity, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas", "events" : 148, "ok" : 67, "messages" : 198, "fail_other" : 81}, {"device" : 84, "name" : "Thermopro TP11 Thermometer", "events" : 108, "ok" : 18, "messages" : 18, "abort_length" : 42, "abort_early" : 48}]}

Code is: rtl_433 version nightly-3-g376f1b02 branch master at 202303101654 inputs file rtl_tcp RTL-SDR

OS is Raspbian GNU/Linux 11 (bullseye) on armv7l, kernel is 5.15.76-v7+ #1597.

(edit: there are no kernel messages logged before or after the hang)

One thing to note about these two rtl_433 instances is that they are started by ssh, and their STDOUT/STDERR goes through that ssh, redirected to a file on the controlling host. (This is the configuration I've been running with RPIs for a number of years to avoid writing to the RPI's SD card.)

I mention the above because it is possible the process might temporarily block writing to STDOUT/STDERR. Maybe this could cause some deadlock with logging? Unfortunately, I haven't yet caught any info about any threads but the main process.
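To illustrate the blocking concern above, here is a minimal Python sketch (not rtl_433 code). It assumes a POSIX system and uses a hypothetical helper, `fill_pipe_until_full`, that writes to a pipe nobody reads from. The write end is made non-blocking only so the demo can report the point at which an ordinary blocking write would stall instead of hanging itself.

```python
import os


def fill_pipe_until_full(chunk_size=4096):
    """Write to a pipe that nobody reads; return the bytes accepted before it is full."""
    read_fd, write_fd = os.pipe()
    # Non-blocking only for the demo: a blocking write would hang right here
    # once the kernel's pipe buffer is full and the reader has stalled.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"x" * chunk_size)
            except BlockingIOError:
                # The buffer is full; the reader (e.g. a stalled ssh) is not draining it.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("bytes accepted before the pipe filled:", fill_pipe_until_full())
```

The exact capacity depends on the OS, but the point is that it is finite: a process whose STDOUT is a pipe can keep writing only until that buffer fills.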

Let me know if there is anything I should be doing if I see another hang, besides trying to capture a core dump.

gdt commented 1 year ago

Is this bug still valid?

rct commented 1 year ago

Unfortunately, yes. This bug, a potential deadlock when trying to exit due to a stall or other error, still exists.

The most common cause that I found is a blocked write to STDOUT when it is a pipe (ssh or other). I’ve worked around this by writing locally, but I’ve still seen this occur.

I believe @zuckschwerdt had mentioned this in another issue as something he was still intending to address.

rct commented 1 year ago

@zuckschwerdt - I know this is a low priority item. But is there any chance you could please add some semi-permanent diagnostic output to the main branch that could help you, so that people don't have to hack in extra debugging again after every update?

The out of date changes for debugging are here: https://github.com/merbanan/rtl_433/compare/master...rct:rtl_433:dbg-pthread-hang

I know you don't want debugging output to be normally enabled, but I think many of the points identified only trigger when fatal exception conditions occur, and it would be good to have some permanent diagnostic output for those.

Thank you for your work on this.

zuckschwerdt commented 1 year ago

Yes, I do want that extra debugging output. And this is not low prio, just not-rewarding-work ;) I'll get on it. This is a blocker for the next release after all.

aogriffiths commented 1 year ago

I believe I’m experiencing this issue too. I am using rtl_433 in the Home Assistant add-on https://github.com/pbkhrv/rtl_433-hass-addons to monitor the oil level in my oil tank.

It normally receives updates every 18 minutes but regularly cuts out for hours. The update is still being sent; rtl_433 just seems not to be receiving it.

A reboot fixes the problem every time.

Is there anything I can do to help find the root cause and fix the bug? I’m limited to what can be done inside a Home Assistant add-on, but I'm willing to test any scenarios to help debug.

sheilbronn commented 11 months ago

Having written an enhanced MQTT wrapper script (in bash) around rtl_433 I had noticed hangs on my Raspi, too.

In my wrapper rtl2mqtt I dealt with this by killing the rtl_433 process and restarting the wrapper if there has been no successful decoding output during the last 3 hours. But that is ugly, of course.
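For anyone wanting a similar stopgap, the watchdog idea can be sketched roughly as follows in Python. This is not the rtl2mqtt code: `run_until_stalled` and `on_output` are hypothetical names, and as a simplification any output at all counts as activity, rather than only successful decodes.

```python
import os
import select
import subprocess


def run_until_stalled(cmd, stall_timeout, on_output=None):
    """Run cmd and kill it if it produces no output for stall_timeout seconds.

    Returns "stalled" if the watchdog killed the process, "exited" if the
    process closed its output on its own.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    fd = proc.stdout.fileno()
    try:
        while True:
            # Wait for output, but give up after stall_timeout seconds of silence.
            ready, _, _ = select.select([fd], [], [], stall_timeout)
            if not ready:
                proc.kill()
                return "stalled"
            data = os.read(fd, 65536)
            if not data:  # EOF: the child closed its stdout or exited
                return "exited"
            if on_output is not None:
                on_output(data)
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.wait()
        proc.stdout.close()
```

A caller could loop over something like `run_until_stalled(["rtl_433", "-F", "json"], 3 * 3600, handle)` and restart on every return, where `handle` is whatever forwards the data. Like the bash wrapper, it only hides the hang rather than explaining it.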

What is the recommended way to help debug where rtl_433 hangs?

zuckschwerdt commented 11 months ago

There is a branch to try now: https://github.com/merbanan/rtl_433/tree/wip-syncstop It will disable the handshake on passing SDR data along. This should stop the acquire thread from getting stuck (when the main thread won't process frames anymore because it's already shutting down). If this approach works out then we'll redo the message passing properly.
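As a generic illustration of the idea (not the actual rtl_433 implementation, which is C and not shown in this thread), here is a small Python sketch of a producer that offers frames without waiting for the consumer to take them. The names `push_frames` and `slot` are made up for the example; the one-slot queue stands in for whatever hand-off the acquire thread uses.

```python
import queue


def push_frames(slot, frames):
    """Offer each frame to the consumer without waiting for it to be taken.

    A blocking slot.put(frame) would hang forever once the consumer stops
    reading; put_nowait() drops the frame instead, so the producer can
    always finish and shut down.
    """
    delivered = dropped = 0
    for frame in frames:
        try:
            slot.put_nowait(frame)
            delivered += 1
        except queue.Full:
            dropped += 1
    return delivered, dropped


if __name__ == "__main__":
    slot = queue.Queue(maxsize=1)  # nobody ever reads from this queue
    print(push_frames(slot, range(5)))  # prints (1, 4)
```

The trade-off is that frames can be lost when the consumer is slow, which is presumably why this is described as a stopgap until the message passing is redone.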

zuckschwerdt commented 11 months ago

The fix from #2705 might also help here, not with the deadlock on shutdown but with the cause for the abort (SIGPIPE).

rct commented 10 months ago

@zuckschwerdt - Sorry, I missed that you pushed fixes to a new branch. Testing now.

FYI, I don't know if it is my compiler (rpi os/bullseye 11), but I got this warning:

rtl_433/merbanan/rtl_433/src/decoder_util.c: In function ‘decoder_log_bitrow.part.0’:
rtl_433/merbanan/rtl_433/src/decoder_util.c:221:9: warning: double-‘free’ of ‘row_bits’ [CWE-415] [-Wanalyzer-double-free]

Also, w.r.t. the fix from #2705 https://github.com/merbanan/rtl_433/commit/613eb635bfcdd3886d325e84fc3dd8b9d6ad8447 Should we be testing with that merged in too?

Would it be worth fast-forwarding or pulling in that change to your wip-syncstop branch?

Thanks for your work on this.

zuckschwerdt commented 10 months ago

The wip-syncstop branch and the change from #2705 are alternatives. The PR should fix the root cause of the termination (needs to be verified), and the branch is an idea to stop any termination from hanging (on the broadcast notification).

zuckschwerdt commented 10 months ago

got this warning: [...]

That's a false positive from static analysis (-fanalyzer). GCC has supported this since GCC 11, but with too many false positives. We now only enable this in GCC 13.2.0+, where it's wrong in very few places. I guess you saw that with an older compiler, before we limited analysis to GCC 13?

zuckschwerdt commented 10 months ago

With the change to properly ignore SIGPIPE there should be no more exits on network trouble (the trigger for this issue). And the change 7e596c2 to broadcast async prevents getting stuck in the exit procedure (the observed bug in this issue).
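To show what ignoring SIGPIPE changes in general, here is a small Python sketch (again not the rtl_433 code; `write_to_closed_pipe` is a hypothetical helper). It assumes a POSIX system and that it runs in the main thread, since that is where Python allows signal handlers to be set. With SIGPIPE ignored, a write to a pipe whose reader has gone away fails with EPIPE instead of killing the process.

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Write to a pipe whose read end is closed; return the resulting errno."""
    # CPython already ignores SIGPIPE at startup; setting it here just makes
    # the behaviour explicit. Must be called from the main thread.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the reader goes away, like a dropped connection
    try:
        os.write(write_fd, b"data")
    except BrokenPipeError as exc:
        # With SIGPIPE ignored, the write fails with EPIPE and the process survives.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


if __name__ == "__main__":
    print(write_to_closed_pipe() == errno.EPIPE)
```

Note that this only covers a reader that has closed its end. A reader that is still connected but not draining the pipe produces no SIGPIPE at all, and a blocking write simply waits.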

I'm going to close for now. Not sure how ignoring SIGPIPE affects a stalled (SSH) output. If someone finds out or has ideas to further improve handling please reopen.