Random segfaults on logouts/playback starts/ends (Rpi 3B+) [Fixed?]

priitohlo commented 6 years ago

shairport-sync tends to randomly segfault when logging in/out via SSH, starting or ending playback. This behaviour is observable both when starting it via systemd or directly on terminal, however is not reproducible all the time.

Is there a way to compile the binary to produce a little more error messages beyond "Segmentation fault", as this is all that is printed on crash?

Environment in use: Raspberry Pi Model 3B+ HiFiBerry Digi+ (output via TOSLINK) 3.2RC11-mbedTLS-Avahi-ALSA-soxr-metadata Latest stable Raspbian

mikebrady commented 6 years ago

Thanks for the post. This is very unexpected behaviour. Offhand I can’t think of a way to catch it, and I’m not sure it gdb can be used in this situation.

A possible help would be to set log verbosity to 2 and after it has crashed see if there is anything consistent about what is written in the log.

Is anyone else seeing this?

priitohlo commented 6 years ago

I'm seeing this in the end of the log with -vv:

prsm
waiting for a mutex, maximum expected time of 20000 microseconds exceeded "player.c:633".
debug_mutex_lock at "player.c:633" expected max wait: 0.020000000, actual wait: 0.020514457 sec.
waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802".
debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.031411682 sec.
waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802".
debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.030332886 sec.
waiting for a mutex, maximum expected time of 20000 microseconds exceeded "player.c:633".
debug_mutex_lock at "player.c:633" expected max wait: 0.020000000, actual wait: 0.020542113 sec.
waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802".
debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.031989698 sec.
waiting for a mutex, maximum expected time of 20000 microseconds exceeded "player.c:633".
debug_mutex_lock at "player.c:633" expected max wait: 0.020000000, actual wait: 0.022577518 sec.
waiting for a mutex, maximum expected time of 20000 microseconds exceeded "player.c:633".
debug_mutex_lock at "player.c:633" expected max wait: 0.020000000, actual wait: 0.021945438 sec.
waiting for a mutex, maximum expected time of 20000 microseconds exceeded "player.c:633".
debug_mutex_lock at "player.c:633" expected max wait: 0.020000000, actual wait: 0.021029922 sec.
waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802".
debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.030318564 sec.
waiting for a mutex, maximum expected time of 30000 microseconds exceeded "player.c:802".
debug_mutex_lock at "player.c:802" expected max wait: 0.030000000, actual wait: 0.031000278 sec.
Segmentation fault

This seems to have been caused by simply disconnecting from ssh session.

mikebrady commented 6 years ago

Thanks for this. Using gdb, the built-in debugger, might be helpful, but it could be tedious. But, if you're willing, here is what you could do instead of just using $ shairport-sync -vv:

$ gdb --args shairport-sync -vv
... lots of stuff
(gdb) run
... lots more stuff

Play stuff to the service, and when you get a segmentation fault, enter:

(gdb) bt

bt stands for backtrace, and it will try to list the call stack, which might help track down the problem.

priitohlo commented 6 years ago

Here's the output:

Thread 11 "shairport-sync" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x728ff320 (LWP 12479)]
0x76977298 in ?? () from /lib/arm-linux-gnueabihf/libdbus-1.so.3
(gdb) bt
#0  0x76977298 in ?? () from /lib/arm-linux-gnueabihf/libdbus-1.so.3
#1  0x769775c8 in ?? () from /lib/arm-linux-gnueabihf/libdbus-1.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

mikebrady commented 6 years ago

That's weird. The configuration of Shairport Sync is:

3.2RC11-mbedTLS-Avahi-ALSA-soxr-metadata

from above, right? That means that it should have no dealings withlibdbus. I'm going to have to double check that, as D-Bus is used if one optionally includes the native D-Bus or MPRIS interfaces. If this is indeed the case -- i.e. if, as I believe, there is no D-Bus involvement in your build of Shairport Sync (there shouldn't be) -- then it's a system-wide problem. First let me check. Also, if you can capture another seg fault that reproduces this, that would be interesting indeed.

priitohlo commented 6 years ago

No, I haven't built it with dbus. The configure args are as such: --with-alsa --with-avahi --with-ssl=mbedtls --with-soxr --with-metadata --with-apple-alac --with-systemd

I tried this again for a couple of times now and the gdb backtrace always points to the dbus libraries.

mikebrady commented 6 years ago

Just continuing, the Avahi subsystem does use D-Bus alright, but not Shairport Sync itself, in this configuration:

3.2RC11-mbedTLS-Avahi-ALSA-soxr-metadata-sysconfdir:/etc

mikebrady commented 6 years ago

Have you any unusual settings in the configuration file?

priitohlo commented 6 years ago

Here are my uncommented lines in the configuration

interpolation = "soxr"; ignore_volume_control = "yes"; alac_decoder = "apple"; allow_session_interruption = "yes"; log_verbosity = 1;

mikebrady commented 6 years ago

Okay, thanks. I'll reproduce them here, just not (yet) on a 3B+.

mikebrady commented 6 years ago

Running this configuration on a Pi Zero W, and it gets easily overwhelmed when soxr-based stuffing is needed, but it otherwise behaving itself. Hmm.

priitohlo commented 6 years ago

I'm also keeping the firmware and kernel up-to-date fairly frequently (currently at 4.14.49-v7+), perhaps there are some incompatibilities in there?

mikebrady commented 6 years ago

Are you using Raspbian or Raspbian Lite? Are you using the Pi's GUI? (All my experimentation is based on Raspbian Lite, no GUI.)

priitohlo commented 6 years ago

Raspbian Stretch Lite (no X).

mikebrady commented 6 years ago

Dunno about the possible incompatibilities. I'll get a 3B+ in the next few days. Meanwhile I'll update the Pi Zero W -- currently at 4.9.44+.

priitohlo commented 6 years ago

As a sidenote, I compiled Shairport Sync to use tinysvcmdns instead of Avahi and everything seems a lot more stable now.

mikebrady commented 6 years ago

Very interesting. Is it possible that your system -- your installation -- is just borked?

priitohlo commented 6 years ago

Doubtful -- fresh installation, only ever used for Shairport Sync. I have disabled the TV service, Bluetooth and the LEDs, however. That is the extent of deviation from default setup.

mikebrady commented 6 years ago

Fair enough. Just up to 4.14.49+ on the Pi Zero W BTW. Let's see what happens.

mikebrady commented 6 years ago

How hard would it be to make it completely default -- it's just a thought, as Bluetooth uses D-Bus?

priitohlo commented 6 years ago

Will take a few ticks, will try ASAP.

mikebrady commented 6 years ago

When I get a chance, I'll check the Avahi code in Shairport Sync. Most of it is there a long time, and I'm not sure how much error checking is going on -- it might take a little while though, and hasn't been a problem before...

priitohlo commented 6 years ago

I have now completely reinstalled everything from scratch with everything up to date and at their defaults (except for sound configuration to use the HiFiBerry card, local users, partition size, wireless networking) but the behaviour is exactly the same right down to the gdb backtrace.

mikebrady commented 6 years ago

Most peculiar!

mikebrady commented 6 years ago

Sorry for the rather obvious question, but have you used older versions of Shairport Sync? And if so, have they had this issue?

priitohlo commented 6 years ago

I have used versions since 3.2RC10, which had the same issue. I just got around submitting this only now.

mikebrady commented 6 years ago

Thanks again. I'll be digging in over the next few days; I might have a development version with enhanced debug messages for you to try.

mikebrady commented 6 years ago

Are you getting any shairport sync log entries relating to avahi or DACP?

priitohlo commented 6 years ago

I am, but mostly things that indicate success of setup and connection, for example:

Jun 19 17:24:03 muse shairport-sync[9686]: avahi: service 'E3FF49F13A4E@Muse' group is not yet committed.
Jun 19 17:24:03 muse shairport-sync[9686]: avahi: request to add "_raop._tcp" service without metadata
Jun 19 17:24:03 muse shairport-sync[9686]: avahi: service 'E3FF49F13A4E@Muse' group is registering.
Jun 19 17:24:03 muse shairport-sync[9686]: avahi: service 'E3FF49F13A4E@Muse' successfully added.

Jun 19 17:25:27 muse shairport-sync[9686]: DACP monitor successfully started

No errors anywhere.

mikebrady commented 6 years ago

Taking a careful look through the avahi code in Shairport Sync yesterday, I could only find two places where a return code wasn't being checked, and those places are almost never called. I'll push the code with the extra checks later today.

I expect to get a 3B+ in the next few days, but I don't have a HiFiBerry Digi+. So, my next question is this: if you run the audio into the built-in DAC, does the problem persist? If it does, then I guess there might be a chance I could reproduce the problem here...

priitohlo commented 6 years ago

Same behaviour with hifiberry disabled and built-in audio interface enabled.

mikebrady commented 6 years ago

Okay, so with a 3B+ driving the built-in DAC on the very latest development version (with MQTT and with a slight improvement in the timed mutex lock code that pretty much removes those debug_mutex_lock at ... messages) I'm afraid I'm not getting any misbehaviour.

mikebrady commented 6 years ago

Would you be able to run it with statistics enabled please? Just in case it shows up any particular network issues, like high levels of packet loss.

priitohlo commented 6 years ago

Jul 06 00:16:29 muse shairport-sync[3136]:       -0.9,       0.0,       0.0,        1003,      0,      0,      0,      0,   6293,  263,  264
Jul 06 00:16:37 muse shairport-sync[3136]:       -1.9,     192.6,     192.6,        2006,      0,      0,      0,      0,   6246,  263,  264
Jul 06 00:16:39 muse shairport-sync[3136]: Packet reception interval stats: mean, standard deviation and max for the last 2,500 packets in microseconds:     7978.7,     1242.7,    36650.0.
Jul 06 00:16:45 muse shairport-sync[3136]:       -2.0,     218.1,     218.1,        3009,      0,      0,      0,      0,   6059,  259,  264
Jul 06 00:16:53 muse shairport-sync[3136]:       -2.0,     271.9,     271.9,        4012,      0,      0,      0,      0,   6062,  259,  264
Jul 06 00:16:59 muse shairport-sync[3136]: Packet reception interval stats: mean, standard deviation and max for the last 2,500 packets in microseconds:     7981.7,     2123.5,    52222.0.
Jul 06 00:17:01 muse shairport-sync[3136]:       -1.9,     181.3,     181.3,        5015,      0,      0,      0,      0,   6267,  263,  264
Jul 06 00:17:09 muse shairport-sync[3136]:       -2.0,     223.8,     223.8,        6018,      0,      0,      0,      0,   6162,  263,  264
Jul 06 00:17:17 muse shairport-sync[3136]:       -2.0,     246.4,     246.4,        7021,      0,      0,      0,      0,   5990,  261,  265
Jul 06 00:17:19 muse shairport-sync[3136]: Packet reception interval stats: mean, standard deviation and max for the last 2,500 packets in microseconds:     7982.4,     1991.1,    59701.0.
Jul 06 00:17:25 muse shairport-sync[3136]:       -2.0,     215.3,     215.3,        8024,      0,      0,      0,      0,   6125,  254,  264
Jul 06 00:17:33 muse shairport-sync[3136]:       -2.0,     195.4,     195.4,        9027,      0,      0,      0,      0,   6094,  261,  264
Jul 06 00:17:39 muse shairport-sync[3136]: Packet reception interval stats: mean, standard deviation and max for the last 2,500 packets in microseconds:     7981.7,     1223.6,    26680.0.
Jul 06 00:17:41 muse shairport-sync[3136]:       -2.0,     271.9,     271.9,       10030,      0,      0,      0,      0,   6059,  258,  265
Jul 06 00:17:42 muse systemd[1]: shairport-sync.service: Main process exited, code=killed, status=11/SEGV
Jul 06 00:17:42 muse systemd[1]: shairport-sync.service: Unit entered failed state.
Jul 06 00:17:42 muse systemd[1]: shairport-sync.service: Failed with result 'signal'.

Doesn't seem to be much of anything interesting here. The behaviour is still the same -- as soon as I log out any user, the service segfaults, as seen here.

mikebrady commented 6 years ago

Thanks for the update. When you say “log out a user”, what do you mean?

priitohlo commented 6 years ago

As I noted in my initial report, the behaviour is observable when a user were to log out of the system (e.g. ending su session, terminating SSH connection, etc.). Very rarely will the segfault occur when playback were to be stopped.

mikebrady commented 6 years ago

Thanks again for the clarification. It is such a weird behaviour. I'll keep trying to provoke a crash, but no luck so far...

mikebrady commented 6 years ago

I only ever got this to happen once, but have recently fixed a memory-corruption bug -- see #722 -- that might have a bearing on this. Would you be kind enough to try the latest development build -- 3.2.d67 or later?

tuomashamalainen commented 6 years ago

I was seeing the same issue with 3.2RC12, particularly in the form of crashing when play session ends. Just installed 3.2.d67 and will try it out actively over the next few days.

Setup: Raspberry Pi 3 (non-B+), Raspbian Stretch, BeoCreate 4-Channel Amplifier as DAC.

mikebrady commented 6 years ago

Many thanks.

priitohlo commented 6 years ago

Seems to be a lot more stable now. I'll monitor for a few days and close if it stays that way.

tuomashamalainen commented 6 years ago

I also have observed no crashes of any kind since installing 3.2.d67. It's been rock-solid.

priitohlo commented 6 years ago

Same here. Closing.

mikebrady / shairport-sync

Random segfaults on logouts/playback starts/ends (Rpi 3B+) [Fixed?] #714