merbanan / rtl_433

Program to decode radio transmissions from devices on the ISM bands (and other frequencies)
GNU General Public License v2.0

new rtl_433 hanging/deadlocked? #2426

Closed rct closed 10 months ago

rct commented 1 year ago

In the past 24 hours, I've had rtl_433 hang twice on one host and once on a different host. Unfortunately, I've yet to capture a core file -- SIGABRT isn't generating one. Will add more details when I get them. Posting this now in case anyone is seeing anything similar.

Strace shows the main process is blocked in futex()

strace: Process 9581 attached
futex(0x740c64a8, FUTEX_WAIT, 9587, NULL, 

In all 3 cases, the last thing logged by rtl_433 was the stats message:

{"time" : "2023-03-15 18:50:05.628602", "enabled" : 200, "since" : "2023-03-15T18:46:13", "frames" : {"count" : 148, "fsk" : 0, "events" : 107}, "stats" : [{"device" : 20, "name" : "Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor", "events" : 191, "ok" : 27, "messages" : 27, "fail_other" : 163, "fail_mic" : 1}, {"device" : 40, "name" : "Acurite 592TXR Temp/Humidity, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas", "events" : 148, "ok" : 67, "messages" : 198, "fail_other" : 81}, {"device" : 84, "name" : "Thermopro TP11 Thermometer", "events" : 108, "ok" : 18, "messages" : 18, "abort_length" : 42, "abort_early" : 48}]}

Code is: rtl_433 version nightly-3-g376f1b02 branch master at 202303101654 inputs file rtl_tcp RTL-SDR

OS is Raspbian GNU/Linux 11 (bullseye) on armv7l, kernel is 5.15.76-v7+ #1597.

(edit: there are no kernel messages logged before or after the hang)

One thing to note about these two rtl_433 instances is that they are started via ssh, and their STDOUT/STDERR goes through that ssh connection, redirected to a file on the controlling host. (This is the configuration I've been running with RPIs for a number of years to avoid writing to the RPI's SD card.)

I mention the above because it is possible the process might temporarily block writing to STDOUT/STDERR. Maybe this could cause some deadlock with logging? Unfortunately, I haven't yet caught any info about any threads other than the main process.

Let me know if there is anything I should be doing if I see another hang, besides trying to capture a core dump.

zuckschwerdt commented 1 year ago

I tested and improved #1978 for a comically long time, but these edge cases were bound to happen. Thanks for the help in figuring this out.

zuckschwerdt commented 1 year ago

Perhaps we should replace the instances of pthread_mutex_lock() (5x in sdr.c) with pthread_mutex_timedlock() and terminate after a second. Except that in sdr_stop() the lock is only held for a single variable access, so I guess something else got stuck first.
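
A rough sketch of that pattern (not the actual sdr.c code; the names and the single-variable critical section here are made up for illustration, compile with -pthread): take the lock with a deadline instead of blocking forever, and terminate if it ever takes longer than expected.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_value;

// Copy one variable under the lock, but give up (and exit) after one second.
static int read_shared(void)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;

    int ret = pthread_mutex_timedlock(&dev_lock, &deadline);
    if (ret != 0) {
        fprintf(stderr, "mutex lock failed (%s), terminating\n", strerror(ret));
        exit(EXIT_FAILURE);
    }
    int copy = shared_value; // the lock only ever spans one variable access
    pthread_mutex_unlock(&dev_lock);
    return copy;
}

int main(void)
{
    shared_value = 42;
    printf("%d\n", read_shared());
    return 0;
}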

rct commented 1 year ago

I'm sure you took great care with the threading changes. Unfortunately getting every little case handled is very challenging.

Your proposed change to have a timeout sounds pretty reasonable, but I really don't know pthreads.

What do you do if the mutex times out? Retry once or a few times? Log it and continue? Log it and exit?

zuckschwerdt commented 1 year ago

The code normally won't hold the mutex lock for more than one single variable copy. If it takes longer than a few milliseconds I would deem that unrecoverable and terminate.

gdt commented 1 year ago

We should be careful to stick to POSIX, or at least not add use of any more non-POSIX interfaces that weren't already used. rtl_433 ought to build on any reasonable mostly-POSIX system that can run rtl-sdr. I'm not saying any particular interface mentioned is problematic, but particularly when developing/testing on Linux it's easy to add nonportable things.

gdt commented 1 year ago

The code normally won't hold the mutex lock for more than one single variable copy. If it takes longer than a few milliseconds I would deem that unrecoverable and terminate.

Do you mean exit with a fatal error? I would be very scared of papering over locking errors.

zuckschwerdt commented 1 year ago

The timeout variant of pthread_mutex_lock() is POSIX, but I've never used it. (And on Windows we don't even demand POSIX threads but map to native synchronisation: https://github.com/merbanan/rtl_433/blob/master/include/compat_pthread.h)

But I'm not even sure that those locks are the problem. We don't know which thread was waiting. The problem I observed once was the event loop getting stuck on the mg_broadcast as the mechanism (pipe) can only hold a single "in flight" message.

I'll hack up a debug version that will alert which thread got stuck and where with a clean crash, if that's something you'd like to try @rct?

@gdt you closed this by accident, right? :)

rct commented 1 year ago

It's late and it happened again. I'm not sure if I'm following the args of the futex correctly, but it looks like the 3rd arg is a pid for a thread it is waiting on.

In both cases that I just looked at, the thread that it is waiting on is blocked in a recv(4, ...).

 $ strace -p 5943
strace: Process 5943 attached
futex(0x7405f4a8, FUTEX_WAIT, 5950, NULL^Cstrace: Process 5943 detached
 <detached ...>

$ strace -p 5950
strace: Process 5950 attached
recv(4, ^Cstrace: Process 5950 detached
 <detached ...>

$ lsof -p 5943
COMMAND  PID USER   FD      TYPE DEVICE SIZE/OFF   NODE NAME
rtl_433 5943  rct  cwd       DIR  179,2     4096   1007 /home/rct
rtl_433 5943  rct  rtd       DIR  179,2     4096      2 /
rtl_433 5943  rct  txt       REG  179,2   741080 266311 /home/rct/rtl_433/merbanan/rtl_433/build/src/rtl_433
rtl_433 5943  rct  mem       CHR  189,3             134 /dev/bus/usb/001/004
rtl_433 5943  rct  mem       REG  179,2   145008   5663 /usr/lib/arm-linux-gnueabihf/libudev.so.1.7.0
rtl_433 5943  rct  mem       REG  179,2  1319784   2260 /usr/lib/arm-linux-gnueabihf/libc-2.31.so
rtl_433 5943  rct  mem       REG  179,2   386572   2262 /usr/lib/arm-linux-gnueabihf/libm-2.31.so
rtl_433 5943  rct  mem       REG  179,2    96192   5671 /usr/lib/arm-linux-gnueabihf/libusb-1.0.so.0.3.0
rtl_433 5943  rct  mem       REG  179,2    55592  50986 /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0.6.0
rtl_433 5943  rct  mem       REG  179,2   137364   2271 /usr/lib/arm-linux-gnueabihf/libpthread-2.31.so
rtl_433 5943  rct  mem       REG  179,2    17708   5335 /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so
rtl_433 5943  rct  mem       REG  179,2   146888   2256 /usr/lib/arm-linux-gnueabihf/ld-2.31.so
rtl_433 5943  rct    0r     FIFO   0,12      0t0 180424 pipe
rtl_433 5943  rct    1w     FIFO   0,12      0t0 180425 pipe
rtl_433 5943  rct    2w     FIFO   0,12      0t0 180426 pipe
rtl_433 5943  rct    3u     IPv4 179752      0t0    UDP localhost:58693->localhost:53998
rtl_433 5943  rct    4u     IPv4 179753      0t0    UDP localhost:53998->localhost:58693
rtl_433 5943  rct    6u     IPv4 179755      0t0    TCP *:8433 (LISTEN)
rtl_433 5943  rct    7u  netlink             0t0 179758 KOBJECT_UEVENT
rtl_433 5943  rct    8u  a_inode   0,13        0   7580 [eventfd]
rtl_433 5943  rct    9u  a_inode   0,13        0   7580 [eventfd]
rtl_433 5943  rct   10u  a_inode   0,13        0   7580 [timerfd]
rtl_433 5943  rct   11u      CHR  189,3      0t0    134 /dev/bus/usb/001/004
rtl_433 5943  rct   12u     sock    0,8      0t0 179759 protocol: TCP

Fd 4 looks like a UDP socket with both ends in rtl_433 according to lsof. You mentioned a pipe, but this appears to be a UDP socket, so is this different?

I'm happy to try running a debug version if that helps collect data. I also have some core dumps but haven't had time to look at them.

(EDIT I didn't realize anyone could close issues that aren't theirs. I don't see how to reopen this issue.)

rct commented 1 year ago

Possibly interesting, or misleading. I've been tracking events and frames since those are easily accessible from MQTT and easy to turn into sensors in Home Assistant.

[graph: events/frames reported by each rtl_433 instance over time; the flat segments are the hangs]

When the graphs flat-line there are no updates, so it is relatively easy to see when each instance hung.

For 03/16, both processes hung close to the same time (13:47 and 13:51), and after they were restored around 20:00, they both hung again around 21:02 and 20:58.

Those rtl_433 instances are running on two different RPIs, but they send data back to the same host over the SSH connection. So possibly something related to slowness with the network connection or the host might trigger the hang.

zuckschwerdt commented 1 year ago

My mistake, I thought it was a pipe, but mg_broadcast() https://github.com/merbanan/rtl_433/blob/master/src/mongoose.c#L3449 actually uses a socket pair like you observed. The sdr acquire thread writes to a socket and the main thread select()s on all kinds of different inputs. If blocked in recv() then a select falsely returned a read-ok or the recv only has partial data? And no mutex is involved?
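
For illustration, here is a minimal stand-alone sketch (not rtl_433 or Mongoose code, just the handshake as described above, compile with -pthread): the producer send()s its message on one end of a socket pair and then blocks in recv() waiting for a one-byte ack that only the event loop writes back. If the event loop has stopped polling, the producer sits in recv() forever -- which matches the strace output above.

#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int ctl[2]; // ctl[0]: producer (acquire thread) side, ctl[1]: event-loop side

static void *producer(void *arg)
{
    (void)arg;
    char ack;
    send(ctl[0], "frame", 6, 0); // post the event to the loop
    recv(ctl[0], &ack, 1, 0);    // block until the loop acks -- the observed hang point
    fprintf(stderr, "producer: acked\n");
    return NULL;
}

int main(void)
{
    socketpair(AF_UNIX, SOCK_STREAM, 0, ctl);
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    // The event loop: read the message and ack it. Remove these three lines to
    // simulate a loop that has already exited -- pthread_join() then never
    // returns, mirroring the hang in sdr_stop().
    char buf[16];
    recv(ctl[1], buf, sizeof(buf), 0);
    send(ctl[1], "", 1, 0);

    pthread_join(t, NULL);
    return 0;
}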

Uncomment some debug messages in acquire_callback() (the send() part) https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1406 and sdr_handler() (the recv() part) https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1353

rct commented 1 year ago

First, can someone (@gdt @zuckschwerdt) reopen this so anyone else running into this can find this issue?

So what is going on with Mongo here? I’m not using it. Should I try a build without it?

Where does the mongoose code/library come from - is there possibly an old version involved?

Curiously, there were no hangs overnight on the one rtl_433 still running the new code. I rolled the other one back to the commit prior to the SDR acquire thread change.

If blocked in recv() then a select falsely returned a read-ok or the recv only has partial data? And no mutex is involved?

It seemed like the last thing the processes did before hanging was to write out a stats message.

Maybe it's a difference in terminology, but the main process (lowest numbered PID) is the one blocked in futex() while waiting on the other thread, which is blocked in recv().

gdt commented 1 year ago

Sorry, apparently I hit 'close with comment' instead of 'comment' earlier.

zuckschwerdt commented 1 year ago

Not MongoDB but Mongoose -- it's our event loop and network code, an embedded project from Cesanta.

I'm running on a Raspi with control over SSH now too. And I'm disturbing the network with tc qdisc. No hang so far. Can you share the command line args used?

The other thread (newer PID) is the SDR acquire thread. It yielded control to rtlsdr_read_async(). It will only get control back briefly when a SDR frame arrives by callback and the only action is to post that event using mg_broadcast() to send() on the UDP socket. The blocking in recv() then must be in librtlsdr or libusb?

zuckschwerdt commented 1 year ago

Put e.g. this near the end of include/compat_pthread.h, before the two #endifs:

#define pthread_mutex_lock(mp) \
    do { \
        struct timespec abs_timeout; \
        clock_gettime(CLOCK_REALTIME, &abs_timeout); \
        abs_timeout.tv_sec += 60; \
        int lock_ret = pthread_mutex_timedlock(mp, &abs_timeout); \
        if (lock_ret != 0) { \
            fprintf(stderr, "%s: mutex lock error: %d\n", __func__, lock_ret); \
            exit(1); \
        } \
    } while (0)

It will only work on Linux but should be a good starting point to get more info about the mutex that blocked.
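
One assumption to double-check: depending on what compat_pthread.h already pulls in, the macro may additionally need these headers (pthread_mutex_timedlock() is POSIX.1-2001):

#include <pthread.h> // pthread_mutex_timedlock()
#include <stdio.h>   // fprintf()
#include <stdlib.h>  // exit()
#include <time.h>    // clock_gettime(), CLOCK_REALTIME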

rct commented 1 year ago

@zuckschwerdt - will dig into your comments/suggestions next, but wanted to provide the following. I'm not sure how much handling SIGABRT changed where things were on the stack, but here's some info from the cores I captured:

First core from 2023-03-16 19:53 (rpi2)

(gdb) thread apply all bt

Thread 3 (Thread 0x748bf440 (LWP 733)):
#0  __GI___poll (timeout=-1, nfds=2, fds=0x748bee1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  __GI___poll (fds=0x748bee1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2  0x76ecd7b4 in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0x740be440 (LWP 734)):
#0  0x76f1523c in __libc_recv (fd=4, buf=0x740bbbbc, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:26
#1  0x00022824 in mg_broadcast ()
#2  0x00040d48 in rtlsdr_read_cb ()
#3  0x76ee8bb4 in ?? () from /lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x76f66240 (LWP 728)):
#0  __pthread_clockjoin_ex (threadid=1946936384, thread_return=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:145
#1  0x76f097ac in __pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:24
#2  0x000422a0 in sdr_stop ()
#3  0x000130bc in main ()

2nd core generated 2023-03-16 22:17

Program terminated with signal SIGABRT, Aborted.
#0  __pthread_clockjoin_ex (threadid=1946547264, thread_return=0x0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:145
145     pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0x76f07240 (LWP 5943))]
(gdb) thread apply all bt

Thread 3 (Thread 0x74860440 (LWP 5948)):
#0  __GI___poll (timeout=-1, nfds=2, fds=0x7485fe1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  __GI___poll (fds=0x7485fe1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2  0x76e6e7b4 in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0x7405f440 (LWP 5950)):
#0  0x76eb623c in __libc_recv (fd=4, buf=0x7405cbbc, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:26
#1  0x00022824 in mg_broadcast ()
#2  0x00040d48 in rtlsdr_read_cb ()
#3  0x76e89bb4 in ?? () from /lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x76f07240 (LWP 5943)):
#0  __pthread_clockjoin_ex (threadid=1946547264, thread_return=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:145
#1  0x76eaa7ac in __pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:24
#2  0x000422a0 in sdr_stop ()
#3  0x000130bc in main ()

3rd core - different host, OS raspbian 10 buster, 2023-03-16 22:22

Program terminated with signal SIGABRT, Aborted.
#0  0xb6f52a3c in __GI___pthread_timedjoin_ex (threadid=3018085440, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89      pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0xb6a64040 (LWP 19808))]
(gdb) thread apply all bt

Thread 3 (Thread 0xb3e45440 (LWP 19815)):
#0  __libc_recv (flags=0, len=1, buf=0xb3e42bd4, fd=4) at ../sysdeps/unix/sysv/linux/recv.c:26
#1  __libc_recv (fd=4, buf=0xb3e42bd4, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2  0x00023bf8 in mg_broadcast ()
#3  0x000440f4 in rtlsdr_read_cb ()
#4  0xb6ca0a94 in ?? () from /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0xb4646440 (LWP 19813)):
#0  __GI___poll (timeout=-1, nfds=2, fds=0xb4645e1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  __GI___poll (fds=0xb4645e1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2  0xb6c86a3c in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xb6a64040 (LWP 19808)):
#0  0xb6f52a3c in __GI___pthread_timedjoin_ex (threadid=3018085440, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
#1  0x00045368 in sdr_stop ()
#2  0x00013a1c in main ()

rct commented 1 year ago

Can you share the command line args used?

ssh -o ServerAliveInterval=300 -o ServerAliveCountMax=2 <rpihostname> merbanan/rtl_433/build/src/rtl_433 -p 56 -F mqtt://<mqtthost,user,pass> -f 433900000 -C native -M time:usec -M level -M noise -M stats:1:300 -Y autolevel -Y magest -F json,v=4 -F log -F http -R -25 -R -53 -R -54 -R -60 -R -81 -R -82 -R -83 -R -90 -R -95 -R -153 -R -156 -R -186 -R -201 -R -207 -R -228

rct commented 1 year ago

I tried to do a debug build with cmake -DCMAKE_BUILD_TYPE=Debug ../ -- but when I ran it I got

==500==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

Unfortunately that doesn't tell me which library it is complaining about -- guessing I'm missing some debug library from something distro related like openssl?

EDIT: Sorry, it does tell me which library, I just didn't recognize it -- ASan (Address Sanitizer). Just not sure what to do about it.

I'm now running today's head nightly-7-gecb0b361 ( commit ecb0b361 ) with your pthread_mutex_lock #defined wrapper, but as a regular build, not a debug build.

I'm running on a Raspi with control over SSH now too. And I'm disturbing the network with tc qdisc. No hang so far.

Right now it is a mystery why I got several hangs yesterday and then things ran fine overnight for 13 hours with no hangs.

zuckschwerdt commented 1 year ago

Yes, that ASAN bug is strange, likely a fault of the distro. You can use LD_PRELOAD=/usr/lib/arm-linux-gnueabihf/libasan.so.6 rtl_433 ...

So it's consistently stuck in mg_broadcast() doing a recv(). I don't know why there is even a recv() in there?

But more importantly: things were already winding down. pthread_join() is only run in sdr_stop(), which is apparently called from main(). This means the program hangs in the very last lines of main, trying to exit. So something broke, rtl_433 tried to exit cleanly, but got stuck in sdr_stop() because the event loop blocked. This is similar to the case I already observed: somehow we posted to the socket pair when we were already finished with the event loop (near the last lines of main, trying to exit).

You could try -D restart to just revive the SDR and not exit. Not a fix, but it should work.

rct commented 1 year ago

But more importantly: things were already winding down. pthread_join() is only run in sdr_stop(), which is apparently called from main(). This means the program hangs in the very last lines of main, trying to exit.

I'm not sure whether I tainted the stack traces for those core dumps, either by trying to interrupt the processes (SIGQUIT doesn't work in this case) or because some of the threads were reacting to the SIGABRT on the main process before the core dump finished being taken.

zuckschwerdt commented 1 year ago

It is, however, a very likely place for a bug. If we are not running the event loop anymore, then mg_broadcast() will get stuck, which in turn blocks the pthread_join(), and that might be the futex you see. Perhaps you could change the LOG_DEBUG in sdr.c's sdr_stop() to LOG_NOTICE to see exactly when this occurs?

rct commented 1 year ago

#2432 has the code that I'm running to capture more info if this occurs again.

The problem hasn't recurred in the 2 days since I've been monitoring, so possibly a heisenbug.

zuckschwerdt commented 1 year ago

I did find a related (stats output) bug with overflow and unaligned access on 32-bit systems. It's read-only, thus it should not be an issue. Fix coming shortly.

rct commented 1 year ago

That's interesting. The last message in the logs was a stats message.

Right now I've got 3 x RPIs running current code. The two that had the hang both run 32-bit armv7l kernels (one is Raspbian 10 in need of an upgrade, the other is an RPI 2B on Debian 11, so it's stuck on 32 bits).

The RPI 3B running an aarch64 kernel hasn't had a problem so far.

EDIT: Not sure if this is a valid test, but I tried sending a SIGBUS to each of the 3 threads I saw. In each case, the process exited cleanly.

zuckschwerdt commented 1 year ago

The fix a7c1d0c is in master now. Maybe this changes something.

rct commented 1 year ago

Nearly two weeks of running the code with the extra debugging and the fix for unaligned access https://github.com/merbanan/rtl_433/commit/a7c1d0c6a61467da7fed7647ffa1acd9feec5f38, and no hangs.

I also haven't noticed any unexpected restarts.

zuckschwerdt commented 1 year ago

Thanks. It's interesting how such drastic effects (the deadlock) result from an off-by-one read (actually 4 bytes, the size difference between a double and a 32-bit pointer). A page fault should have caused a SIGSEGV; perhaps it was something about the unaligned access and ARM specifically? But the problem has been in the code for 4 years, since 313fdfe, and only showed up with threading, strange. Closing for now then.

rct commented 1 year ago

I may have spoken too soon, got another hang today.

Even with the added output, there's nothing in the log (maybe it didn't get flushed?).

{"time" : "2023-04-02 13:09:28.654626", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 1, "battery_ok" : 1, "message_type" : 37, "wind_avg_mi_h" : 19.000, "temperature_F" : 45.400, "humidity" : 37, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "079c6509c656a59a65d1", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.242, "snr" : 20.341, "noise" : -20.583}
{"time" : "2023-04-02 13:09:28.654626", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 2, "battery_ok" : 1, "message_type" : 37, "wind_avg_mi_h" : 19.000, "temperature_F" : 45.400, "humidity" : 37, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "0b9c6509c656a59a65d5", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.242, "snr" : 20.341, "noise" : -20.583}
{"time" : "2023-04-02 13:09:30.304797", "src" : "Auto Level", "lvl" : 4, "msg" : "Current noise level -20.1 dB, estimated noise -19.4 dB"}
Auto Level: Current noise level -20.1 dB, estimated noise -19.4 dB
{"time" : "2023-04-02 13:09:36.071596", "src" : "Auto Level", "lvl" : 4, "msg" : "Estimated noise level is -20.2 dB, adjusting minimum detection level to -17.2 dB"}
Auto Level: Estimated noise level is -20.2 dB, adjusting minimum detection level to -17.2 dB
{"time" : "2023-04-02 13:09:36.583359", "model" : "Ambientweather-F007TH", "id" : 166, "channel" : 8, "battery_ok" : 1, "temperature_F" : 63.400, "humidity" : 40, "mic" : "CRC", "mod" : "ASK", "freq" : 433.934, "rssi" : -0.248, "snr" : 19.633, "noise" : -19.881}
{"time" : "2023-04-02 13:09:37.120379", "enabled" : 194, "since" : "2023-04-02T13:05:31", "frames" : {"count" : 204, "fsk" : 0, "events" : 119}, "stats" : [{"device" : 20, "name" : "Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor", "events" : 214, "ok" : 31, "messages" : 31, "fail_other" : 183}, {"device" : 23, "name" : "DSC Security Contact", "events" : 121, "ok" : 4, "messages" : 4, "abort_early" : 117}, {"device" : 40, "name" : "Acurite 592TXR Temp/Humidity, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas", "events" : 205, "ok" : 86, "messages" : 256, "fail_other" : 119}]}
{"time" : "2023-04-02 13:13:21.912016", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.500, "humidity" : 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.955, "rssi" : -0.192, "snr" : 20.329, "noise" : -20.521}
{"time" : "2023-04-02 13:13:21.912016", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.500, "humidity" : 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.955, "rssi" : -0.192, "snr" : 20.329, "noise" : -20.521}
{"time" : "2023-04-02 13:13:21.912016", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.500, "humidity" : 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.955, "rssi" : -0.192, "snr" : 20.329, "noise" : -20.521}

Ran gcore before doing anything else:

$ gcore 21080
[New LWP 21085]
[New LWP 21086]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
0xb6ed9a3c in __GI___pthread_timedjoin_ex (threadid=3017589824, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89      pthread_join_common.c: No such file or directory.
warning: target file /proc/21080/cmdline contained unexpected null characters
Saved corefile core.21080
[Inferior 1 (process 21080) detached]
$ gdb ./rtl_433/rct/rtl_433/build/src/rtl_433 core.21080
GNU gdb (Raspbian 8.2.1-2) 8.2.1

Reading symbols from ./rtl_433/rct/rtl_433/build/src/rtl_433...(no debugging symbols found)...done.
[New LWP 21080]
[New LWP 21085]
[New LWP 21086]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/home/rct/rtl_433/rct/rtl_433/build/src/rtl_433'.
#0  0xb6ed9a3c in __GI___pthread_timedjoin_ex (threadid=3017589824, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89      pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0xb69eb080 (LWP 21080))]
(gdb) thread apply all bt

Thread 3 (Thread 0xb3dcc440 (LWP 21086)):
#0  __libc_recv (flags=0, len=1, buf=0xb3dc9bd4, fd=4) at ../sysdeps/unix/sysv/linux/recv.c:26
#1  __libc_recv (fd=4, buf=0xb3dc9bd4, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2  0x00023cf0 in mg_broadcast ()
#3  0x00044334 in rtlsdr_read_cb ()
#4  0xb6c27a94 in ?? () from /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0xb45cd440 (LWP 21085)):
#0  __GI___poll (timeout=-1, nfds=2, fds=0xb45cce1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  __GI___poll (fds=0xb45cce1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2  0xb6c0da3c in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xb69eb080 (LWP 21080)):
#0  0xb6ed9a3c in __GI___pthread_timedjoin_ex (threadid=3017589824, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
#1  0x0004567c in sdr_stop ()
#2  0x00013ab4 in main ()
(gdb)

Current system calls for each LWP:

$ strace -p 21080
strace: Process 21080 attached
futex(0xb3dcc4a8, FUTEX_WAIT, 21086, NULL^Cstrace: Process 21080 detached
 <detached ...>

$ strace -p 21086
strace: Process 21086 attached
recv(4, ^Cstrace: Process 21086 detached
 <detached ...>

$ strace -p 21085
strace: Process 21085 attached
restart_syscall(<... resuming interrupted restart_syscall ...>^Cstrace: Process 21085 detached
 <detached ...>

rct commented 1 year ago

@zuckschwerdt - can you reopen this?

rct commented 1 year ago

And a 2nd hang a few hours later on the same system:

Reading symbols from ./rtl_433/rct/rtl_433/build/src/rtl_433...(no debugging symbols found)...done.
[New LWP 26053]
[New LWP 26058]
[New LWP 26059]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/home/rct/rtl_433/rct/rtl_433/build/src/rtl_433'.
#0  0xb6f94a3c in __GI___pthread_timedjoin_ex (threadid=3018355776, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89      pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0xb6aa6080 (LWP 26053))]
(gdb) thread apply all bt

Thread 3 (Thread 0xb3e87440 (LWP 26059)):
#0  __libc_recv (flags=0, len=1, buf=0xb3e84bd4, fd=4) at ../sysdeps/unix/sysv/linux/recv.c:26
#1  __libc_recv (fd=4, buf=0xb3e84bd4, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2  0x00023cf0 in mg_broadcast ()
#3  0x00044334 in rtlsdr_read_cb ()
#4  0xb6ce2a94 in ?? () from /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0xb4688440 (LWP 26058)):
#0  __GI___poll (timeout=-1, nfds=2, fds=0xb4687e1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  __GI___poll (fds=0xb4687e1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2  0xb6cc8a3c in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xb6aa6080 (LWP 26053)):
#0  0xb6f94a3c in __GI___pthread_timedjoin_ex (threadid=3018355776, thread_return=0x0,
    abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
#1  0x0004567c in sdr_stop ()
#2  0x00013ab4 in main ()

For what it is worth, there don't seem to be any kernel messages or anything else that seems relevant in syslog around the time the process hung. So I have no idea why this would have been in the process of shutting down.

Unlike the others, this one didn't seem to have output a stats message very close to the hang/attempted shutdown.

"time" : "2023-04-02 18:45:56.502121", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 0, "battery_ok" : 1, "message_type" : 37,
 "wind_avg_mi_h" : 18.000, "temperature_F" : 45.800, "humidity" : 33, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "039c65090
65a219a658d", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.234, "snr" : 19.737, "noise" : -19.970}
{"time" : "2023-04-02 18:45:56.502121", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 1, "battery_ok" : 1, "message_type" : 37,
 "wind_avg_mi_h" : 18.000, "temperature_F" : 45.800, "humidity" : 33, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "079c65090
65a219a6591", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.234, "snr" : 19.737, "noise" : -19.970}
{"time" : "2023-04-02 18:45:56.502121", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 2, "battery_ok" : 1, "message_type" : 37,
 "wind_avg_mi_h" : 18.000, "temperature_F" : 45.800, "humidity" : 33, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "0b9c65090
65a219a6595", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.234, "snr" : 19.737, "noise" : -19.970}
{"time" : "2023-04-02 18:45:56.685377", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.200, "humidity" :
 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.924, "rssi" : -0.186, "snr" : 20.534, "noise" : -20.720}
zuckschwerdt commented 1 year ago

This confirms what we suspected: the main thread is in sdr_stop() (it exited the event loop), the acquire thread is blocked in mg_broadcast() via rtlsdr_read_cb(), and librtlsdr is still doing poll().

It's not clear why the program wanted to exit, but it's clear that the sequencing to exit is wrong.

rct commented 1 year ago

I added some additional print_log()s to see where sdr_stop() was being called to try to get a clue about why it is exiting and why I'm not seeing any cause for exiting in the logs.

rtl_433.c around line 1989:

    // add dummy socket to receive broadcasts
    struct mg_add_sock_opts opts = {.user_data = cfg};
    struct mg_connection *nc = mg_add_sock_opt(get_mgr(cfg), INVALID_SOCKET, timer_handler, opts);
    // Send us MG_EV_TIMER event after 2.5 seconds
    mg_set_timer(nc, mg_time() + 2.5);

    while (!cfg->exit_async) {
        mg_mgr_poll(cfg->mgr, 500);
    }
    if (cfg->verbosity >= LOG_INFO)
        print_log(LOG_INFO, "rtl_433", "stopping...");
    // final polls to drain the broadcast
    //while (cfg->exit_async < 2) {
    //    mg_mgr_poll(cfg->mgr, 100);
    //}

    print_log(LOG_WARNING, "rtl_433", "stopping 2...");
    sdr_stop(cfg->dev);

When it hung, it printed "stopping 2...". The LOG_INFO line above didn't get printed -- I guess cfg->verbosity isn't set?

Unfortunately I'm not running a debug build, so I don't have enough of a local symbol table to look into cfg.

So aside from figuring out how to exit cleanly without hanging, it would be good if I could figure out why it is trying to exit. Things were quiet for two weeks or so, and now I have an instance that seems to keep trying to exit.

zuckschwerdt commented 1 year ago

To set cfg->verbosity a -v or -vv would be needed.

It's really puzzling why the exit is triggered; it should be a clue why things get stuck. In normal conditions I would guess it to be a stall of the input -- the 2.5 second timer quoted is the watchdog for that. Maybe plaster that with prints. You might get around it with -D restart.

rct commented 1 year ago

I just had two instances (two separate RPIs, each with its own single RTL-SDR) hang at the same point. EDIT: "same point" in this case means the same point in time, and both output "stopping 2..." as their last line before hanging.

What they have in common is they both pipe STDOUT/STDERR (-F json and -F log at this point) back to the same host to be persisted on disk. They also connect to the same mosquitto instance.

So I'm guessing it is more likely the output blocked rather than the input, but I could be wrong.

zuckschwerdt commented 1 year ago

I did try to block output for a few minutes but could not reproduce this. But it is plausible: output buffers are full, the main thread gets blocked, frames from the acquire thread are not processed. Things get going again, the watchdog timer sees excessive time between processed frames and causes an exit.

It would fit the broadcast pipe being stuck. I need to find a way to force output to stall with full buffers.
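
One way to force that, sketched here as a deliberately slow reader you can pipe the JSON output into (a hypothetical helper, not part of rtl_433): it ignores stdin long enough for the 64 KiB pipe buffer to fill, so the writer blocks in write(), then drains normally.

// stall_reader.c -- build with: cc -o stall_reader stall_reader.c
// usage: rtl_433 ... -F json | ./stall_reader > log.json
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sleep(300); // ignore stdin; the writer's pipe buffer fills and it blocks
    char buf[4096];
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout); // then drain normally
    return 0;
}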

rct commented 1 year ago

This is probably what you expect, but here's additional output from running with -v:

{"time" : "2023-04-05 02:31:36.340914", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 20.700, "humidity" : 34, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.957, "rssi" : -0.218, "snr" : 20.370, "noise" : -20.589}
{"time" : "2023-04-05 02:31:36.557333", "src" : "rtl_433", "lvl" : 4, "msg" : "stopping 2..."}
rtl_433: stopping 2...
{"time" : "2023-04-05 02:31:36.557333", "src" : "sdr_stop", "lvl" : 5, "msg" : "EXITING..."}
sdr_stop: EXITING...
{"time" : "2023-04-05 02:31:36.557333", "src" : "sdr_stop", "lvl" : 5, "msg" : "JOINING..."}
sdr_stop: JOINING...

williamstlr commented 1 year ago

Just as another data point, I seem to be hitting this as well, running an rpi on commit 24e6f8823377ff8cde5a6ad96878dae4efe710cb. I'm hopping between two frequencies, and after a couple of hours the process is still live, but no output to kv, influx, or mqtt is reported; it looks like it's just locked up.

rct commented 1 year ago

Saw a hang on an instance that never had a hang before (rpi 3b listening on 345 MHz). The RPI had just been rebooted and patched a few hours before the hang. This time there was an "Async read stalled" error reported. The other instances that hang more frequently haven't given a clue as to why they are exiting.

{"time" : "2023-04-08 15:07:50.425573", "src" : "Auto Level", "lvl" : 4, "msg" : "Current signal level -6.3 dB, estimated noise -9.9 dB"}
Auto Level: Current signal level -6.3 dB, estimated noise -9.9 dB
{"time" : "2023-04-08 15:08:00.386872", "src" : "Auto Level", "lvl" : 4, "msg" : "Current signal level -5.5 dB, estimated noise -10.4 dB"}
Auto Level: Current signal level -5.5 dB, estimated noise -10.4 dB
{"time" : "2023-04-08 15:08:10.348938", "src" : "Auto Level", "lvl" : 4, "msg" : "Current signal level -5.3 dB, estimated noise -10.4 dB"}
Auto Level: Current signal level -5.3 dB, estimated noise -10.4 dB
{"time" : "2023-04-08 15:08:18.737997", "src" : "Input", "lvl" : 3, "msg" : "Async read stalled, exiting!"}
Input: Async read stalled, exiting!
{"time" : "2023-04-08 15:08:18.737997", "src" : "timer_handler", "lvl" : 4, "msg" : "Calling sdr_stop"}
timer_handler: Calling sdr_stop

zuckschwerdt commented 1 year ago

I see only one cause for an exit, the sdr acquire stall. Perhaps change that into just a warning and observe?

Either use -D restart or disable timer_handler() with a return; after the prints: https://github.com/merbanan/rtl_433/blob/93f0f30c28cfb6b82b8cc3753415b01a85bee91d/src/rtl_433.c#L1522

rburkholder commented 1 year ago

I have a similar or identical scenario with an RTL-SDR v3. It will run for a few minutes, or maybe an hour or two, before hanging with no messages. My command line:

/usr/local/bin/rtl_433 -f 433.92M -f 433.92M -f 915M -H 11 -F \ "mqtt://10.55.80.18:1883,user=xxx,pass=xxx,retain=0,events=rtl433/1" -M level -vvv -F log

There isn't any identifying error output other than the decoding state representations. I have code from master as of today.

This is running in an ssh console on a BeagleBone Green.

Prior to this, it was running nicely on a Linux laptop from a console window.

If rtl_433 could be made to crash out on this problem (someone mentioned this in a prior message), that would be better; then it could be automatically restarted.

rburkholder commented 1 year ago

Your proposed change to have a timeout sounds pretty reasonable, but I really don't know pthreads. What do you do if the mutex times out? Retry once or a few times? Log it and continue? Log it and exit?

In the interim, if you retry then log/exit, that would be helpful, then I could have a script simply restart it.

zuckschwerdt commented 1 year ago

Put a return 0; somewhere at the start of sdr_stop() and see if that terminates on error. https://github.com/merbanan/rtl_433/blob/master/src/sdr.c#L1754 Problem should be the pthread_join() as the acquire thread is stuck posting to the (terminated) event loop.

rburkholder commented 1 year ago

Put a return 0; somewhere at the start of sdr_stop() and see if that terminates on error.

As the first line of the function, it did not terminate. Maybe an exit() of some fashion?

I've put in an fprintf to see if it actually is reached.

rburkholder commented 1 year ago

It took several hours this time. There was no output from my print statement. There was nothing in kern.log, syslog, or daemon.log.

This would imply sdr_stop() is not reached?

rburkholder commented 1 year ago

Put a return 0; somewhere at the start of sdr_stop() and see if that terminates on error. https://github.com/merbanan/rtl_433/blob/master/src/sdr.c#L1754 Problem should be the pthread_join() as the acquire thread is stuck posting to the (terminated) event loop.

I should be able to see the output from:

print_log(LOG_DEBUG, __func__, "JOINING...");
int r = pthread_join(dev->thread, NULL);
if (r) {
    fprintf(stderr, "%s: error in pthread_join, rc: %d\n", __func__, r);
}

What would be the best command line for tracing this?

zuckschwerdt commented 1 year ago

print_log() goes through the whole stack -- it's better to just fprintf(stderr, "...", ...).

zuckschwerdt commented 1 year ago

Another option: if this is a broken write (through ssh) then maybe a SIGPIPE causes an exit that then gets stuck?

In sighandler() (https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1328) add to the SIGPIPE case

        write_err("Broken pipe caught, ignoring!\n");
        return;

Please report if you see the message and it still works afterwards.

rburkholder commented 1 year ago

I'm not sure if it is any of that. When I reported my problem, the config was switching between two frequencies: 433 & 915. It may have been hanging on the transition between the two.

I am running solely with 915 now for at least a day.

So maybe something in the logic which transitions between frequencies? Do you still want me to try that 'broken pipe' thing while running with two frequencies?

zuckschwerdt commented 1 year ago

What we have seen so far is rtl_433 getting stuck trying to exit. Not too sure what caused the exit. The SIGPIPE seems to be a possibility with SSH? But it could be unrelated. Next free evening I'm going to pepper the code with asserts and deadlock watchers to solve this.

rburkholder commented 1 year ago

In sighandler() (https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1328) add to the SIGPIPE case

        write_err("Broken pipe caught, ignoring!\n");
        return;

I inserted the write statement, tried my command line which switches frequencies, but had no diagnostic output. Just the hang after a period of time. I'll await your flavour with asserts, etc.

rct commented 1 year ago

What we have seen so far is rtl_433 getting stuck trying to exit. Not too sure what caused the exit. The SIGPIPE seems to be a possibility with SSH?

From my experience, writing output over SSH does seem to cause this hang where output gets blocked.

FWIW, SIGPIPE is a possibility not just with SSH, but anytime a process is writing to a file descriptor that is a pipe. Usually it is a source of bugs once someone tries to catch it but doesn't handle it correctly: the most common pitfall I've seen is STDOUT/STDERR go to a pipe and the receiving process dies. The sending process writes an error message when it catches the signal, which in turn causes another SIGPIPE and you wind up in a loop.

I haven't seen that happening here.
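
For reference, the usual defensive pattern (a generic sketch, not rtl_433 code) is to ignore SIGPIPE and handle the resulting EPIPE from write() explicitly, so a dead reader produces one deliberate error path instead of a signal loop:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    signal(SIGPIPE, SIG_IGN); // a dead reader now yields EPIPE instead of a signal

    const char line[] = "{\"hello\" : \"world\"}\n";
    if (write(STDOUT_FILENO, line, sizeof(line) - 1) < 0 && errno == EPIPE) {
        // Report somewhere other than the broken pipe itself, or the error
        // report would fail (or re-raise SIGPIPE) too.
        fprintf(stderr, "stdout reader went away (EPIPE), stopping output\n");
    }
    return 0;
}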

I had 3 RPI+rtl_433 instances that were experiencing this error. All were using SSH to write log files to a central server instead of to the RPI. Two of those RPIs have been converted to write locally to a USB-attached SSD and have rarely experienced this hang in the month(s) since I changed to writing locally.

On the 3rd RPI, a 3B with only an SD card, I'm still using SSH and I occasionally experience the hang. I believe the write output is getting blocked, which makes it look like there was an async read stall. Unwinding from that is where it usually hangs.