Closed rct closed 10 months ago
I tested and improved #1978 for a comically long time, but these edge cases were bound to happen. Thanks for the help in figuring this out.
Perhaps we should replace the instances of pthread_mutex_lock() (5x in sdr.c) with pthread_mutex_timedlock() and terminate after a second. Except in sdr_stop() the lock is only held for a single variable access. So I guess something else got stuck first.
I'm sure you took great care with the threading changes. Unfortunately getting every little case handled is very challenging.
Your proposed change to have a timeout sounds pretty reasonable, but I really don't know pthreads.
What do you do if the mutex times out? Retry once or a few times? Log it and continue? Log it and exit?
The code normally won't hold the mutex lock for more than a single variable copy. If it takes longer than a few milliseconds I would deem that unrecoverable and terminate.
We should be careful to stick to POSIX, or at least not add use of any more non-POSIX interfaces that weren't already used. rtl_433 ought to build on any reasonable mostly-POSIX system that can run rtl-sdr. I'm not saying any particular interface mentioned is problematic, but particularly when developing/testing on Linux it's easy to add nonportable things.
The code normally won't hold the mutex lock for more than a single variable copy. If it takes longer than a few milliseconds I would deem that unrecoverable and terminate.
Do you mean exit with a fatal error? I would be very scared of papering over locking errors.
The timeout variant of pthread_mutex_lock() is POSIX, but I've never used it.
(And on Windows we don't even demand POSIX threads but map to native synchronisation: https://github.com/merbanan/rtl_433/blob/master/include/compat_pthread.h)
But I'm not even sure that those locks are the problem. We don't know which thread was waiting. The problem I observed once was the event loop getting stuck on the mg_broadcast as the mechanism (pipe) can only hold a single "in flight" message.
I'll hack a debug version that will alert which thread got stuck and where, with a clean crash, if that's something you'd like to try @rct?
@gdt you closed this by accident, right? :)
It's late and it happened again. I'm not sure if I'm following the args of the futex correctly, but it looks like the 3rd arg is a pid for a thread it is waiting on.
In both cases that I just looked at, the thread it is waiting on is blocked in a recv(4, ...).
$ strace -p 5943
strace: Process 5943 attached
futex(0x7405f4a8, FUTEX_WAIT, 5950, NULL^Cstrace: Process 5943 detached
<detached ...>
$ strace -p 5950
strace: Process 5950 attached
recv(4, ^Cstrace: Process 5950 detached
<detached ...>
$ lsof -p 5943
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rtl_433 5943 rct cwd DIR 179,2 4096 1007 /home/rct
rtl_433 5943 rct rtd DIR 179,2 4096 2 /
rtl_433 5943 rct txt REG 179,2 741080 266311 /home/rct/rtl_433/merbanan/rtl_433/build/src/rtl_433
rtl_433 5943 rct mem CHR 189,3 134 /dev/bus/usb/001/004
rtl_433 5943 rct mem REG 179,2 145008 5663 /usr/lib/arm-linux-gnueabihf/libudev.so.1.7.0
rtl_433 5943 rct mem REG 179,2 1319784 2260 /usr/lib/arm-linux-gnueabihf/libc-2.31.so
rtl_433 5943 rct mem REG 179,2 386572 2262 /usr/lib/arm-linux-gnueabihf/libm-2.31.so
rtl_433 5943 rct mem REG 179,2 96192 5671 /usr/lib/arm-linux-gnueabihf/libusb-1.0.so.0.3.0
rtl_433 5943 rct mem REG 179,2 55592 50986 /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0.6.0
rtl_433 5943 rct mem REG 179,2 137364 2271 /usr/lib/arm-linux-gnueabihf/libpthread-2.31.so
rtl_433 5943 rct mem REG 179,2 17708 5335 /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so
rtl_433 5943 rct mem REG 179,2 146888 2256 /usr/lib/arm-linux-gnueabihf/ld-2.31.so
rtl_433 5943 rct 0r FIFO 0,12 0t0 180424 pipe
rtl_433 5943 rct 1w FIFO 0,12 0t0 180425 pipe
rtl_433 5943 rct 2w FIFO 0,12 0t0 180426 pipe
rtl_433 5943 rct 3u IPv4 179752 0t0 UDP localhost:58693->localhost:53998
rtl_433 5943 rct 4u IPv4 179753 0t0 UDP localhost:53998->localhost:58693
rtl_433 5943 rct 6u IPv4 179755 0t0 TCP *:8433 (LISTEN)
rtl_433 5943 rct 7u netlink 0t0 179758 KOBJECT_UEVENT
rtl_433 5943 rct 8u a_inode 0,13 0 7580 [eventfd]
rtl_433 5943 rct 9u a_inode 0,13 0 7580 [eventfd]
rtl_433 5943 rct 10u a_inode 0,13 0 7580 [timerfd]
rtl_433 5943 rct 11u CHR 189,3 0t0 134 /dev/bus/usb/001/004
rtl_433 5943 rct 12u sock 0,8 0t0 179759 protocol: TCP
Fd 4 looks like a UDP socket with both ends in rtl_433 according to lsof. You mentioned a pipe, but this appears to be a UDP socket, so is this different?
I'm happy to try running a debug version if that helps collect data. I also have some core dumps but haven't had time to look at them.
(EDIT: I didn't realize anyone could close issues that aren't theirs. I don't see how to reopen this issue.)
Possibly interesting, or misleading. I've been tracking events and frames since those are easily accessible from MQTT and easy to turn into sensors in Home Assistant.
When the graphs flatline there are no updates, so it is relatively easy to see when each instance hung.
For 03/16, both processes hung close to the same time (13:47 and 13:51), and after they were restored around 20:00, they both hung again around 21:02 and 20:58.
Those rtl_433 instances are running on two different RPis, but they send data back to the same host over an SSH connection. So possibly something related to slowness with the network connection or the host might trigger the hang.
My mistake, I thought it was a pipe, but mg_broadcast() (https://github.com/merbanan/rtl_433/blob/master/src/mongoose.c#L3449) actually uses a socket pair like you observed.
The sdr acquire thread writes to a socket and the main thread select()s on all kinds of different inputs. If it's blocked in recv(), then either a select falsely returned read-ok or the recv only got partial data? And no mutex is involved?
Uncomment some debug messages in acquire_callback() (the send() part, https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1406) and sdr_handler() (the recv() part, https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1353).
First, can someone (@gdt @zuckschwerdt) reopen this so anyone else running into this can find this issue?
So what is going on with Mongo here? I’m not using it. Should I try a build without it?
Where does the mongoose code/library come from - is there possibly an old version involved?
Curiously there were no hangs overnight on the one rtl_433 still running the new code. I rolled the other one back to the commit prior to the SDR acquire change.
If blocked in recv() then a select falsely returned a read-ok or the recv only has partial data? And no mutex is involved?
It seemed like the last thing the processes did before hanging was to write out a stats message.
Maybe a difference in terminology, but the main process (lowest-numbered PID) is the one blocked in futex() while waiting on the other thread, which is blocked in recv().
Sorry, apparently I hit 'close with comment' instead of 'comment' earlier.
Not MongoDB but Mongoose -- it's our event loop and network code, an embedded project from Cesanta.
I'm running on a Raspi with control over SSH now too. And I'm disturbing the network with tc qdisc. No hang so far. Can you share the command line args used?
The other thread (newer PID) is the SDR acquire thread. It yielded control to rtlsdr_read_async(). It will only get control back briefly when an SDR frame arrives by callback, and the only action is to post that event using mg_broadcast() to send() on the UDP socket. The blocking in recv() then must be in librtlsdr or libusb?
Put e.g. this near the end of include/compat_pthread.h, before the two #endifs:
#define pthread_mutex_lock(mp) \
    do { \
        struct timespec abs_timeout; \
        clock_gettime(CLOCK_REALTIME, &abs_timeout); \
        abs_timeout.tv_sec += 60; \
        int lock_ret = pthread_mutex_timedlock(mp, &abs_timeout); \
        if (lock_ret != 0) { \
            fprintf(stderr, "%s: mutex lock error: %d\n", __func__, lock_ret); \
            exit(1); \
        } \
    } while (0)
It will only work on Linux but should be a good starting point to get more info about the mutex that blocked.
@zuckschwerdt - will dig into your comments/suggestions next, but wanted to provide the following. I'm not sure how much handling SIGABRT changed where things were on the stack, but here's some info from the cores I captured:
First core from 2023-03-16 19:53 (rpi2)
(gdb) thread apply all bt
Thread 3 (Thread 0x748bf440 (LWP 733)):
#0 __GI___poll (timeout=-1, nfds=2, fds=0x748bee1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 __GI___poll (fds=0x748bee1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2 0x76ecd7b4 in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0x740be440 (LWP 734)):
#0 0x76f1523c in __libc_recv (fd=4, buf=0x740bbbbc, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:26
#1 0x00022824 in mg_broadcast ()
#2 0x00040d48 in rtlsdr_read_cb ()
#3 0x76ee8bb4 in ?? () from /lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0x76f66240 (LWP 728)):
#0 __pthread_clockjoin_ex (threadid=1946936384, thread_return=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:145
#1 0x76f097ac in __pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:24
#2 0x000422a0 in sdr_stop ()
#3 0x000130bc in main ()
2nd core generated 2023-03-16 22:17
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_clockjoin_ex (threadid=1946547264, thread_return=0x0, clockid=clockid@entry=0,
abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0x76f07240 (LWP 5943))]
(gdb) thread apply all bt
Thread 3 (Thread 0x74860440 (LWP 5948)):
#0 __GI___poll (timeout=-1, nfds=2, fds=0x7485fe1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 __GI___poll (fds=0x7485fe1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2 0x76e6e7b4 in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0x7405f440 (LWP 5950)):
#0 0x76eb623c in __libc_recv (fd=4, buf=0x7405cbbc, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:26
#1 0x00022824 in mg_broadcast ()
#2 0x00040d48 in rtlsdr_read_cb ()
#3 0x76e89bb4 in ?? () from /lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0x76f07240 (LWP 5943)):
#0 __pthread_clockjoin_ex (threadid=1946547264, thread_return=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:145
#1 0x76eaa7ac in __pthread_join (threadid=<optimized out>, thread_return=<optimized out>) at pthread_join.c:24
#2 0x000422a0 in sdr_stop ()
#3 0x000130bc in main ()
3rd core - different host, OS raspbian 10 buster, 2023-03-16 22:22
Program terminated with signal SIGABRT, Aborted.
#0 0xb6f52a3c in __GI___pthread_timedjoin_ex (threadid=3018085440, thread_return=0x0,
abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89 pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0xb6a64040 (LWP 19808))]
(gdb) thread apply all bt
Thread 3 (Thread 0xb3e45440 (LWP 19815)):
#0 __libc_recv (flags=0, len=1, buf=0xb3e42bd4, fd=4) at ../sysdeps/unix/sysv/linux/recv.c:26
#1 __libc_recv (fd=4, buf=0xb3e42bd4, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2 0x00023bf8 in mg_broadcast ()
#3 0x000440f4 in rtlsdr_read_cb ()
#4 0xb6ca0a94 in ?? () from /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0xb4646440 (LWP 19813)):
#0 __GI___poll (timeout=-1, nfds=2, fds=0xb4645e1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 __GI___poll (fds=0xb4645e1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2 0xb6c86a3c in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0xb6a64040 (LWP 19808)):
#0 0xb6f52a3c in __GI___pthread_timedjoin_ex (threadid=3018085440, thread_return=0x0,
abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
#1 0x00045368 in sdr_stop ()
#2 0x00013a1c in main ()
Can you share the command line args used?
ssh -o ServerAliveInterval=300 -o ServerAliveCountMax=2 <rpihostname> merbanan/rtl_433/build/src/rtl_433 -p 56 -F mqtt://<mqtthost,user,pass> -f 433900000 -C native -M time:usec -M level -M noise -M stats:1:300 -Y autolevel -Y magest -F json,v=4 -F log -F http -R -25 -R -53 -R -54 -R -60 -R -81 -R -82 -R -83 -R -90 -R -95 -R -153 -R -156 -R -186 -R -201 -R -207 -R -228
I tried to do a debug build with cmake -DCMAKE_BUILD_TYPE=Debug ../ -- but when I ran it I got:
==500==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
Unfortunately that doesn't tell me which library it is complaining about -- guessing I'm missing some debug library from something distro-related like openssl?
EDIT: Sorry, it does tell me what library, I just didn't recognize it -- ASan (Address Sanitizer). Just not sure what to do about it.
I'm now running today's head nightly-7-gecb0b361 (commit ecb0b361) with your pthread_mutex_lock #defined wrapper, but as a regular build, not a debug build.
I'm running on a Raspi with control over SSH now too. And I'm disturbing the network with tc qdisc. No hang so far.
Right now it is a mystery why I got several hangs yesterday and then things ran fine overnight for 13 hours with no hangs.
Yes, that ASan bug is strange, likely a fault of the distro. You can use LD_PRELOAD=/usr/lib/arm-linux-gnueabihf/libasan.so.6 rtl_433 ...
So it's consistently stuck in mg_broadcast() doing a recv(). I don't know why there is even a recv in there?
But more importantly: things were already winding down. pthread_join is only run in sdr_stop(), which is apparently called from main(). This means the program hangs in the very last lines of main, trying to exit.
So something broke, rtl_433 tried to exit cleanly but got stuck in sdr_stop() because the event loop blocked. This is similar to the case I already observed: somehow we posted to the socket pair when we were already finished with the event loop (near the last lines of main, trying to exit).
You could try -D restart to just revive the SDR and not exit. Not a fix, but should work.
But more importantly: things were already winding down: pthread join is only run in sdr_stop() which is apparently called, from main(). This means the program hangs in the very last lines of main, trying to exit.
I'm not sure I haven't tainted the stack traces for those core dumps by either trying to interrupt them (SIGQUIT doesn't work in this case) or whether some of the threads were reacting to the SIGABRT on the main process before the core dump was finished being taken.
It is however a very likely place for a bug. If we are not running the event loop anymore, then mg_broadcast() will get stuck. Which here then blocks the pthread_join(), and that might be the futex you see.
Perhaps you could change the LOG_DEBUG in sdr.c's sdr_stop() to LOG_NOTICE to see exactly when this occurred?
The problem hasn't reoccurred in the 2 days since I've been monitoring, so possibly a heisenbug.
I did find a related (stats output) bug with overflow and unaligned access on 32-bit systems. It's read-only, thus should not be an issue. Fix coming shortly.
That's interesting. The last message in the logs was a stats message.
Right now I've got 3 RPis running current code. The two that had the hang are both running 32-bit armv7l kernels (one is Raspbian 10 in need of an upgrade, the other is an RPi 2B/Debian 11, so it's stuck on 32 bits).
The RPI 3B running an aarch64 kernel hasn't had a problem so far.
EDIT: not sure if this is a valid test, but I tried sending a SIGBUS to each of the 3 threads I saw. In each case, the process exited cleanly.
The fix a7c1d0c is in master now. Maybe this changes something.
Nearly two weeks since running the code with the extra debugging and the fix for unaligned access https://github.com/merbanan/rtl_433/commit/a7c1d0c6a61467da7fed7647ffa1acd9feec5f38 and no hangs.
I haven't seen any hangs. I haven't noticed any unexpected restarts.
Thanks. It's interesting how such drastic effects (the deadlock) result from reading off by one (actually 4 bytes, the size difference between a double and a 32-bit pointer). A page fault should have caused a SIGSEGV; perhaps it was something about the unaligned access and ARM specifically? But the problem has been in the code for 4 years, since 313fdfe, and only showed with threading. Strange. Closing for now then.
I may have spoken too soon, got another hang today.
Even with the added output, nothing in the log (maybe it didn't get flushed?)
{"time" : "2023-04-02 13:09:28.654626", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 1, "battery_ok" : 1, "message_type" : 37, "wind_avg_mi_h" : 19.000, "temperature_F" : 45.400, "humidity" : 37, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "079c6509c656a59a65d1", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.242, "snr" : 20.341, "noise" : -20.583}
{"time" : "2023-04-02 13:09:28.654626", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 2, "battery_ok" : 1, "message_type" : 37, "wind_avg_mi_h" : 19.000, "temperature_F" : 45.400, "humidity" : 37, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "0b9c6509c656a59a65d5", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.242, "snr" : 20.341, "noise" : -20.583}
{"time" : "2023-04-02 13:09:30.304797", "src" : "Auto Level", "lvl" : 4, "msg" : "Current noise level -20.1 dB, estimated noise -19.4 dB"}
Auto Level: Current noise level -20.1 dB, estimated noise -19.4 dB
{"time" : "2023-04-02 13:09:36.071596", "src" : "Auto Level", "lvl" : 4, "msg" : "Estimated noise level is -20.2 dB, adjusting minimum detection level to -17.2 dB"}
Auto Level: Estimated noise level is -20.2 dB, adjusting minimum detection level to -17.2 dB
{"time" : "2023-04-02 13:09:36.583359", "model" : "Ambientweather-F007TH", "id" : 166, "channel" : 8, "battery_ok" : 1, "temperature_F" : 63.400, "humidity" : 40, "mic" : "CRC", "mod" : "ASK", "freq" : 433.934, "rssi" : -0.248, "snr" : 19.633, "noise" : -19.881}
{"time" : "2023-04-02 13:09:37.120379", "enabled" : 194, "since" : "2023-04-02T13:05:31", "frames" : {"count" : 204, "fsk" : 0, "events" : 119}, "stats" : [{"device" : 20, "name" : "Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor", "events" : 214, "ok" : 31, "messages" : 31, "fail_other" : 183}, {"device" : 23, "name" : "DSC Security Contact", "events" : 121, "ok" : 4, "messages" : 4, "abort_early" : 117}, {"device" : 40, "name" : "Acurite 592TXR Temp/Humidity, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas", "events" : 205, "ok" : 86, "messages" : 256, "fail_other" : 119}]}
{"time" : "2023-04-02 13:13:21.912016", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.500, "humidity" : 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.955, "rssi" : -0.192, "snr" : 20.329, "noise" : -20.521}
{"time" : "2023-04-02 13:13:21.912016", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.500, "humidity" : 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.955, "rssi" : -0.192, "snr" : 20.329, "noise" : -20.521}
{"time" : "2023-04-02 13:13:21.912016", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.500, "humidity" : 32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.955, "rssi" : -0.192, "snr" : 20.329, "noise" : -20.521}
Ran gcore before doing anything else:
$ gcore 21080
[New LWP 21085]
[New LWP 21086]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
0xb6ed9a3c in __GI___pthread_timedjoin_ex (threadid=3017589824, thread_return=0x0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89 pthread_join_common.c: No such file or directory.
warning: target file /proc/21080/cmdline contained unexpected null characters
Saved corefile core.21080
[Inferior 1 (process 21080) detached]
$ gdb ./rtl_433/rct/rtl_433/build/src/rtl_433 core.21080
GNU gdb (Raspbian 8.2.1-2) 8.2.1
Reading symbols from ./rtl_433/rct/rtl_433/build/src/rtl_433...(no debugging symbols found)...done.
[New LWP 21080]
[New LWP 21085]
[New LWP 21086]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/home/rct/rtl_433/rct/rtl_433/build/src/rtl_433'.
#0 0xb6ed9a3c in __GI___pthread_timedjoin_ex (threadid=3017589824, thread_return=0x0,
abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89 pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0xb69eb080 (LWP 21080))]
(gdb) thread apply all bt
Thread 3 (Thread 0xb3dcc440 (LWP 21086)):
#0 __libc_recv (flags=0, len=1, buf=0xb3dc9bd4, fd=4) at ../sysdeps/unix/sysv/linux/recv.c:26
#1 __libc_recv (fd=4, buf=0xb3dc9bd4, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2 0x00023cf0 in mg_broadcast ()
#3 0x00044334 in rtlsdr_read_cb ()
#4 0xb6c27a94 in ?? () from /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0xb45cd440 (LWP 21085)):
#0 __GI___poll (timeout=-1, nfds=2, fds=0xb45cce1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 __GI___poll (fds=0xb45cce1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2 0xb6c0da3c in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0xb69eb080 (LWP 21080)):
#0 0xb6ed9a3c in __GI___pthread_timedjoin_ex (threadid=3017589824, thread_return=0x0,
abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
#1 0x0004567c in sdr_stop ()
#2 0x00013ab4 in main ()
(gdb)
Current system calls for each LWP:
$ strace -p 21080
strace: Process 21080 attached
futex(0xb3dcc4a8, FUTEX_WAIT, 21086, NULL^Cstrace: Process 21080 detached
<detached ...>
$ strace -p 21086
strace: Process 21086 attached
recv(4, ^Cstrace: Process 21086 detached
<detached ...>
$ strace -p 21085
strace: Process 21085 attached
restart_syscall(<... resuming interrupted restart_syscall ...>^Cstrace: Process 21085 detached
<detached ...>
@zuckschwerdt - can you reopen this?
and a 2nd hang a few hours later on the same system
Reading symbols from ./rtl_433/rct/rtl_433/build/src/rtl_433...(no debugging symbols found)...done.
[New LWP 26053]
[New LWP 26058]
[New LWP 26059]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/home/rct/rtl_433/rct/rtl_433/build/src/rtl_433'.
#0 0xb6f94a3c in __GI___pthread_timedjoin_ex (threadid=3018355776, thread_return=0x0,
abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
89 pthread_join_common.c: No such file or directory.
[Current thread is 1 (Thread 0xb6aa6080 (LWP 26053))]
(gdb) thread apply all bt
Thread 3 (Thread 0xb3e87440 (LWP 26059)):
#0 __libc_recv (flags=0, len=1, buf=0xb3e84bd4, fd=4) at ../sysdeps/unix/sysv/linux/recv.c:26
#1 __libc_recv (fd=4, buf=0xb3e84bd4, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2 0x00023cf0 in mg_broadcast ()
#3 0x00044334 in rtlsdr_read_cb ()
#4 0xb6ce2a94 in ?? () from /usr/lib/arm-linux-gnueabihf/librtlsdr.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0xb4688440 (LWP 26058)):
#0 __GI___poll (timeout=-1, nfds=2, fds=0xb4687e1c) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 __GI___poll (fds=0xb4687e1c, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:26
#2 0xb6cc8a3c in ?? () from /lib/arm-linux-gnueabihf/libusb-1.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0xb6aa6080 (LWP 26053)):
#0 0xb6f94a3c in __GI___pthread_timedjoin_ex (threadid=3018355776, thread_return=0x0,
abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:89
#1 0x0004567c in sdr_stop ()
#2 0x00013ab4 in main ()
For what it is worth, there don't seem to be any kernel messages or anything else that seems relevant in syslog around the time the process hung. So I have no idea why this would have been in the process of shutting down.
Unlike the others, this one didn't seem to have output a stats message very close to the hang/attempted shutdown.
"time" : "2023-04-02 18:45:56.502121", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 0, "battery_ok" : 1, "message_type" : 37,
"wind_avg_mi_h" : 18.000, "temperature_F" : 45.800, "humidity" : 33, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "039c65090
65a219a658d", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.234, "snr" : 19.737, "noise" : -19.970}
{"time" : "2023-04-02 18:45:56.502121", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 1, "battery_ok" : 1, "message_type" : 37,
"wind_avg_mi_h" : 18.000, "temperature_F" : 45.800, "humidity" : 33, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "079c65090
65a219a6591", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.234, "snr" : 19.737, "noise" : -19.970}
{"time" : "2023-04-02 18:45:56.502121", "model" : "Acurite-Atlas", "id" : 924, "channel" : "C", "sequence_num" : 2, "battery_ok" : 1, "message_type" : 37,
"wind_avg_mi_h" : 18.000, "temperature_F" : 45.800, "humidity" : 33, "strike_count" : 107, "strike_distance" : 5, "exception" : 0, "raw_msg" : "0b9c65090
65a219a6595", "mod" : "ASK", "freq" : 433.940, "rssi" : -0.234, "snr" : 19.737, "noise" : -19.970}
{"time" : "2023-04-02 18:45:56.685377", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 19.200, "humidity" :
32, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.924, "rssi" : -0.186, "snr" : 20.534, "noise" : -20.720}
This confirms what we suspected: the main thread is in sdr_stop() (exited the event loop), the acquire thread is blocked in mg_broadcast() via rtlsdr_read_cb(), and librtlsdr is still doing poll().
It's not clear why the program wanted to exit, but it is clear that the exit sequencing is wrong.
I added some additional print_log()s to see where sdr_stop() was being called, to try to get a clue about why it is exiting and why I'm not seeing any cause for exiting in the logs.
rtl_433.c around line 1989:
// add dummy socket to receive broadcasts
struct mg_add_sock_opts opts = {.user_data = cfg};
struct mg_connection *nc = mg_add_sock_opt(get_mgr(cfg), INVALID_SOCKET, timer_handler, opts);
// Send us MG_EV_TIMER event after 2.5 seconds
mg_set_timer(nc, mg_time() + 2.5);
while (!cfg->exit_async) {
mg_mgr_poll(cfg->mgr, 500);
}
if (cfg->verbosity >= LOG_INFO)
print_log(LOG_INFO, "rtl_433", "stopping...");
// final polls to drain the broadcast
//while (cfg->exit_async < 2) {
// mg_mgr_poll(cfg->mgr, 100);
//}
print_log(LOG_WARNING, "rtl_433", "stopping 2...");
sdr_stop(cfg->dev);
When it hung, it printed "stopping 2...". The above LOG_INFO didn't get printed -- guess cfg->verbosity isn't set?
Unfortunately I'm not running a debug build, so I don't have enough of a local symbol table to look into cfg.
So aside from figuring out how to exit cleanly without hanging, it would be good if I can figure out why it is trying to exit. Things were quiet for two weeks or so, and now I have an instance that seems to keep trying to exit.
To set cfg->verbosity a -v or -vv would be needed.
It's really puzzling why the exit is triggered. It should be a clue why things get stuck. In normal conditions I would guess it to be a stall of the input -- the 2.5-second timer quoted is the watchdog for that. Maybe plaster that with prints. You might get around that with -D restart.
I just had two instances (two separate RPis, each with their own single RTL-SDR) hang at the same point. EDIT: same point in this case means the same point in time, and both output stopping 2... as their last line before hanging.
What they have in common is they both pipe STDOUT/STDERR (-F json and -F log at this point) back to the same host to be persisted on disk. They also connect to the same mosquitto instance.
So I'm guessing it is more likely the output blocked rather than the input, but I could be wrong.
I did try to block output for a few minutes but could not reproduce this. But it is plausible: output buffers are full, the main thread gets blocked, frames from the acquire thread are not processed. Things get going again, the watchdog timer sees excessive time between processed frames and causes an exit.
It would fit the broadcast pipe being stuck. I need to find a way to force output to stall with full buffers.
This is probably what you expect, but additional output from running with -v:
{"time" : "2023-04-05 02:31:36.340914", "model" : "Acurite-Tower", "id" : 12053, "channel" : "C", "battery_ok" : 0, "temperature_C" : 20.700, "humidity" : 34, "mic" : "CHECKSUM", "mod" : "ASK", "freq" : 433.957, "rssi" : -0.218, "snr" : 20.370, "noise" : -20.589}
{"time" : "2023-04-05 02:31:36.557333", "src" : "rtl_433", "lvl" : 4, "msg" : "stopping 2..."}
rtl_433: stopping 2...
{"time" : "2023-04-05 02:31:36.557333", "src" : "sdr_stop", "lvl" : 5, "msg" : "EXITING..."}
sdr_stop: EXITING...
{"time" : "2023-04-05 02:31:36.557333", "src" : "sdr_stop", "lvl" : 5, "msg" : "JOINING..."}
sdr_stop: JOINING...
Just as another data point, I seem to be hitting this as well, running an RPi on commit 24e6f8823377ff8cde5a6ad96878dae4efe710cb. I'm hopping between two frequencies, and after a couple hours the process is still live, but no output to kv, influx, or mqtt is reported; it looks like it's just locked up.
Saw a hang on an instance that never had a hang before (RPi 3B listening on 345 MHz). The RPi was just rebooted and patched a few hours before the hang. This time there was an Async read stalled reported. The other instances that hang more frequently haven't given a clue as to why they are exiting.
{"time" : "2023-04-08 15:07:50.425573", "src" : "Auto Level", "lvl" : 4, "msg" : "Current signal level -6.3 dB, estimated noise -9.9 dB"}
Auto Level: Current signal level -6.3 dB, estimated noise -9.9 dB
{"time" : "2023-04-08 15:08:00.386872", "src" : "Auto Level", "lvl" : 4, "msg" : "Current signal level -5.5 dB, estimated noise -10.4 dB"}
Auto Level: Current signal level -5.5 dB, estimated noise -10.4 dB
{"time" : "2023-04-08 15:08:10.348938", "src" : "Auto Level", "lvl" : 4, "msg" : "Current signal level -5.3 dB, estimated noise -10.4 dB"}
Auto Level: Current signal level -5.3 dB, estimated noise -10.4 dB
{"time" : "2023-04-08 15:08:18.737997", "src" : "Input", "lvl" : 3, "msg" : "Async read stalled, exiting!"}
Input: Async read stalled, exiting!
{"time" : "2023-04-08 15:08:18.737997", "src" : "timer_handler", "lvl" : 4, "msg" : "Calling sdr_stop"}
timer_handler: Calling sdr_stop
I see only one cause for an exit: the sdr acquire stall. Perhaps change that into just a warning and observe?
Either use -D restart or disable timer_handler() with a return; after the prints: https://github.com/merbanan/rtl_433/blob/93f0f30c28cfb6b82b8cc3753415b01a85bee91d/src/rtl_433.c#L1522
I have a similar or identical scenario with an RTL-SDR v3. It will run for a few minutes, or maybe an hour or two, before hanging with no messages. My command line:
/usr/local/bin/rtl_433 -f 433.92M -f 433.92M -f 915M -H 11 -F \ "mqtt://10.55.80.18:1883,user=xxx,pass=xxx,retain=0,events=rtl433/1" -M level -vvv -F log
There isn't any identifying error output other than the decoding state representations. I have code from master as of today.
This is running in an ssh console on a BeagleBone Green.
Prior to this, it was running nicely on a Linux laptop running from a console window.
If rtl_433 could be made to crash out on this problem (someone mentioned this in a prior message), that would be better, then it could be automatically restarted.
> Your proposed change to have a timeout sounds pretty reasonable, but I really don't know pthreads. What do you do if the mutex times out? Retry once or a few times? Log it and continue? Log it and exit?
In the interim, if you retry then log/exit, that would be helpful, then I could have a script simply restart it.
Put a `return 0;` somewhere at the start of `sdr_stop()` and see if that terminates on error. https://github.com/merbanan/rtl_433/blob/master/src/sdr.c#L1754
The problem should be the `pthread_join()`, as the acquire thread is stuck posting to the (terminated) event loop.
> Put a `return 0;` somewhere at the start of `sdr_stop()` and see if that terminates on error.
As the first line of the function, it did not terminate. Maybe an `exit()` of some fashion? I've put in an `fprintf` to see if it actually is reached.
It took several hours this time. There was no output from my print statement. There was nothing in kern.log, syslog, or daemon.log.
This would imply sdr_stop() is not reached?
> Put a `return 0;` somewhere at the start of `sdr_stop()` and see if that terminates on error. https://github.com/merbanan/rtl_433/blob/master/src/sdr.c#L1754 The problem should be the `pthread_join()` as the acquire thread is stuck posting to the (terminated) event loop.
I should be able to see the output from:
```c
print_log(LOG_DEBUG, __func__, "JOINING...");
int r = pthread_join(dev->thread, NULL);
if (r) {
    fprintf(stderr, "%s: error in pthread_join, rc: %d\n", __func__, r);
}
```
What would be the best command line for tracing this?
`print_log()` goes through the whole stack -- it's better to just `fprintf(stderr, "...", ...)`.
Another option: if this is a broken write (through SSH) then maybe a SIGPIPE causes an exit that then gets stuck? In `sighandler()` (https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1328) add to the `SIGPIPE` case:

```c
write_err("Broken pipe caught, ignoring!\n");
return;
```

Please report if you see the message and it still works afterwards.
I'm not sure if it is any of that. When I reported my problem, the config was switching between two frequencies: 433 & 915. It may have been hanging on the transition between frequencies.
I am running solely with 915 now for at least a day.
So maybe something in the logic which transitions between frequencies? Do you still want me to try that 'broken pipe' thing while running with two frequencies?
What we have seen so far is rtl_433 getting stuck trying to exit. Not too sure what caused the exit. The SIGPIPE seems to be a possibility with SSH? But it could be unrelated. Next free evening I'm going to pepper the code with asserts and dead-lock watchers to solve this.
> In `sighandler()` (https://github.com/merbanan/rtl_433/blob/master/src/rtl_433.c#L1328) add to the `SIGPIPE` case `write_err("Broken pipe caught, ignoring!\n"); return;`
I inserted the write statement, tried my command line which switches frequencies, but had no diagnostic output. Just the hang after a period of time. I'll await your flavour with asserts, etc.
> What we have seen so far is rtl_433 getting stuck trying to exit. Not too sure what caused the exit. The SIGPIPE seems to be a possibility with SSH?

From my experience, writing output over SSH does seem to cause this hang where output gets blocked.
FWIW, SIGPIPE is a possibility not just with SSH, but anytime a process is writing to a file descriptor that is a pipe. Usually it is a source of bugs once someone tries to catch it but doesn't handle it correctly: the most common pitfall I've seen is STDOUT/STDERR go to a pipe and the receiving process dies. The sending process writes an error message when it catches the signal, which in turn causes another SIGPIPE and you wind up in a loop.
I haven't seen that happening here.
I had 3 RPi + rtl_433 instances that were experiencing this error. All were using SSH to write log files to a central server instead of to the RPi. Two of those RPis have been converted to write locally to a USB-attached SSD and have rarely experienced this hang in the month(s) since I changed to write locally.
On the 3rd RPi, a 3B with only an SD card, I'm still using SSH and I occasionally experience the hang. I believe the write output is getting blocked, which makes it look like there was an async read stall. Unwinding from that is where it usually hangs.
In the past 24 hours, I've had rtl_433 hang twice on one host and once on a different host. Unfortunately, I've yet to capture a core file -- SIGABRT isn't generating one. Will add more details when I get them. Posting this now in case anyone is seeing anything similar.
Strace shows the main process is blocked in `futex()`. In all 3 cases, the last thing logged by rtl_433 was the stats message:
Code is:
rtl_433 version nightly-3-g376f1b02 branch master at 202303101654 inputs file rtl_tcp RTL-SDR
OS is Raspbian GNU/Linux 11 (bullseye) on armv7l, kernel is 5.15.76-v7+ #1597. (edit: there are no kernel messages logged before or after the hang)
One thing to note about these two rtl_433 instances is that they are started by ssh, and their STDOUT/STDERR goes through that ssh, redirected to a file on the controlling host. (This is the configuration I've been running with RPis for a number of years, to avoid writing to the RPi's SD card.)
I mention the above because it is possible the process might temporarily block writing to STDOUT/STDERR. Maybe this could cause some deadlock with logging? Unfortunately, I haven't yet caught any info about any threads but the main process.
Let me know if there is anything I should be doing if I see another hang, besides trying to capture a core dump.