mikebrady / shairport-sync

AirPlay and AirPlay 2 audio player
Other
7.29k stars 574 forks source link

[Problem]: shairport-sync segfault #1571

Closed dasl- closed 1 year ago

dasl- commented 2 years ago

What happened?

shairport-sync got a segfault. At the time I was playing multiroom audio through 2 SPS receivers, and I was cycling through pressing pause + resume "in a loop" (manually) to stress test audio synchronization bugs. The client was an iphone using the "radio swiss classic" app.

logs

Here are the last 5000 log lines of the SPS receiver that got the segfault: https://gist.github.com/dasl-/3dba2f7557de5282cbcf083101fd3324

As a comparison, here are the last 5000 log lines of the second SPS receiver I was connected to at the time, which did not get a segfault: https://gist.github.com/dasl-/456ff4d11bbb7a9ddde891aef8e02a5d

coredump

Here is the gdb backtrace: https://gist.github.com/dasl-/7b750f40317eca040ec65c2b5c63fecd

Let me know if you want me to run any other GDB commands.

Relevant log output

See above

Configuration Information.

Here is the configuration of the SPS receiver that had a segfault. Note that I removed my long audio_backend_buffer_desired_length_in_seconds configuration setting that I had in https://github.com/mikebrady/shairport-sync/issues/1565

% shairport-sync --displayConfig
         0.001000129 "shairport.c:1420" default metadata_pipename is "/tmp/shairport-sync-metadata".
         0.000160945 "shairport.c:2057" >> Display Config Start.
         0.012261648 "shairport.c:2057"
         0.000091555 "shairport.c:2057" From "uname -a":
         0.000019426 "shairport.c:2057"  Linux piwall10 5.10.103-v7l+ #1529 SMP Tue Mar 8 12:24:00 GMT 2022 armv7l GNU/Linux
         0.029955463 "shairport.c:2057"
         0.000075500 "shairport.c:2057" From /etc/os-release:
         0.000019722 "shairport.c:2057"  Raspbian GNU/Linux 10 (buster)
         0.016658871 "shairport.c:2057"
         0.000134000 "shairport.c:2057" From /sys/firmware/devicetree/base/model:
         0.000071592 "shairport.c:2057"  Raspberry Pi 4 Model B Rev 1.2
         0.000100611 "shairport.c:2057"
         0.000023630 "shairport.c:2057" Shairport Sync Version String:
         0.000015463 "shairport.c:2057"  4.1-rc0-164-ga4118ee9-AirPlay2-OpenSSL-Avahi-ALSA-soxr-metadata-dbus-sysconfdir:/etc
         0.000017537 "shairport.c:2057"
         0.000013667 "shairport.c:2057" Command Line:
         0.000014074 "shairport.c:2057"  shairport-sync --displayConfig
         0.000160333 "shairport.c:2057"
         0.000029722 "shairport.c:2057" Configuration File:
         0.000015389 "shairport.c:2057"  /etc/shairport-sync.conf
         0.000014297 "shairport.c:2057"
         0.000255074 "shairport.c:2057" Configuration File Settings:
         0.000039685 "shairport.c:2057"  general :
         0.000015185 "shairport.c:2057"  {
         0.000014352 "shairport.c:2057"    volume_range_db = 40;
         0.000014981 "shairport.c:2057"    volume_max_db = 0.0;
         0.000014815 "shairport.c:2057"    name = "piwall";
         0.000016926 "shairport.c:2057"  };
         0.000016611 "shairport.c:2057"  alsa :
         0.000014259 "shairport.c:2057"  {
         0.000014352 "shairport.c:2057"    output_device = "hw:Headphones";
         0.000014797 "shairport.c:2057"    mixer_control_name = "Headphone";
         0.000036426 "shairport.c:2057"  };
         0.000020037 "shairport.c:2057"  diagnostics :
         0.000015185 "shairport.c:2057"  {
         0.000016500 "shairport.c:2057"    log_verbosity = 2;
         0.000014759 "shairport.c:2057"    statistics = "yes";
         0.000057389 "shairport.c:2057"  };
         0.000014796 "shairport.c:2057"
         0.000013611 "shairport.c:2057" >> Display Config End.
         0.000014741 "shairport.c:2059" >> Goodbye!
         0.000041130 "shairport.c:1534" exit function called...
         0.000016740 "shairport.c:1552" Stopping D-Bus service
         0.000071056 "dbus-service.c:1131" stopping dbus service
         0.000027852 "shairport.c:1554" Stopping D-Bus service done
         0.000015315 "shairport.c:1570" Stopping DACP Monitor
         0.000065018 "shairport.c:1572" Stopping DACP Monitor Done
         0.000022463 "shairport.c:1576" Stopping metadata hub
         0.000014241 "shairport.c:1578" Stopping metadata done
         0.000013592 "shairport.c:1582" Stopping metadata
         0.000040834 "shairport.c:1584" Stopping metadata done
         0.000050907 "shairport.c:1586" Stopping the activity monitor.
         0.000018704 "shairport.c:1588" Stopping the activity monitor done.
         0.000070444 "shairport.c:1670" normal exit

How did you install Shairport Sync?

Built from source

Check previous issues

mikebrady commented 2 years ago

Super, thanks. I'll have a dig through...

mikebrady commented 2 years ago

I had a look at the logs and the very interesting gdb backtrace, and on the face of it, it looks as if the crash was inside the Apple ALAC decoder. So, let me suggest that you select the hammerton decoder in the configuration settings and try again.

    alac_decoder = "hammerton"; // This can be "hammerton" or "apple". This advanced setting allows you to choose

Interestingly, the ALAC decoder in FFmpeg is by the same guy, Dave Hammerton, and is probably maintained better there. It would take a little while to incorporate in SPS.

Of course, this doesn't mean that the decoder is where the problem is, but it is suggestive. OTOH, the value of the outbuffer looks odd, so maybe there is a memory allocation issue somewhere. I'll keep digging.

dasl- commented 2 years ago

So, let me suggest that you select the hammerton decoder in the configuration settings and try again.

I can certainly do that, but I'm a little confused. My understanding was that I am already using the hammerton decoder by default. The logs say "player.c:169" Hammerton Decoder used on encrypted audio.

The sample config suggests hammerton will be chosen by default too:

% cat /etc/shairport-sync.conf.sample | grep -A2 alac_decoder
//  alac_decoder = "hammerton"; // This can be "hammerton" or "apple". This advanced setting allows you to choose
//      the original Shairport decoder by David Hammerton or the Apple Lossless Audio Codec (ALAC) decoder written by Apple.
//      If you build Shairport Sync with the flag --with-apple-alac, the Apple ALAC decoder will be chosen by default.

I do not build SPS with the --with-apple-alac flag. I might be misunderstanding though.

mikebrady commented 2 years ago

Apologies -- the confusion is mine 😕. You are right -- the crash is in the Hammerton decoder. Unfortunately, we know that it can crash on illegal input, so maybe I'll start looking at the FFmpeg version, in case there are fixes.

But it's a bit uncool, not to say improper, to blame someone else's code without evidence. With that in mind, let me suggest a way to wring more information out of a backtrace: turn off compiler optimisations. This can be done by preceding the ./configure step with a few definitions, like this:

$ CFLAGS="-O0 -g" CXXFLAGS="-O0 -g" ../configure

Then do a $ make clean followed by a $ make.

It would be most interesting to see the contents of the buffer when the crash occurred (it's at 0xa85f72f8 in the backtrace) and its length, which is 32 in this case. The buffer contains the packet to be decoded, and It's just possible that it is causing the crash, i.e. we have a smoking gun. Unfortunately it's been optimised out, so maybe turning off optimisation would help.

Meanwhile, I'll try some other diagnostics here. I've used valgrind before, but not for an extended period; perhaps it will reveal something. That outbuffer value does look a bit odd.

dasl- commented 2 years ago

Apologies -- the confusion is mine 😕

No worries!

With that in mind, let me suggest a way to wring more information out of a backtrace: turn off compiler optimisations. This can be done by preceding the ./configure step with a few definitions, like this:

It looks like I am already turning off compiler optimizations. This is how I've been building SPS.

However, I had not been adding make clean in my installation script. I have just added that step and will rebuild, hoping to eventually capture another segfault coredump.

github-actions[bot] commented 1 year ago

This issue has been inactive for 45 days so will be closed 7 days from now. To prevent this, please remove the "stale" label or post a comment.

dasl- commented 1 year ago

Updating this thread just note an interesting problem I had with using GDB on a recent coredump. I managed to capture a coredump of another shairport-sync segfault, but I was unable to show much of a backtrace in GDB: https://gist.github.com/dasl-/685be88dade7c208c139061c227f0eb7

GDB was complaining about stuff related to libthread_db. As suggested in the GDB output, I tried adding a line to this file: /root/.gdbinit. I created that file and added the following lines:

set auto-load safe-path /
set debug libthread-db 1

I tried running GDB again after making these changes, but I still was unable to get anything useful: https://gist.github.com/dasl-/594cc029028440f73708617f7742add4

It was still complaining about: warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.. I'm not sure how to resolve that.

I've added automatic restarting to my systemd configuration to work around the very rare segfaults :)

https://github.com/dasl-/pitools/commit/0022671fdfec49a3ab90549e193f82dbc45b01f0

nickolay commented 9 months ago

I've just got 3 crashes in a few days after running 4.2-1-g12ad72c4-AirPlay2-smi9-OpenSSL-Avahi-ALSA-pa-pipe-soxr-metadata-mqtt-sysconfdir:/etc on RPI4 (Raspbian 11) for a better part of the year without an issue. Curiously first two crashes happened on the first tracks of act 2 & 3 of tchaikovsky's swan lake (Apple music on ipadOS 17.2), but I haven't been able to reproduce. The one I've got the coredump for happened when I was not around.

My backtrace is almost identical to the one here, but I barely found this report because the stack traces are on a separate gist. I'll paste mine here, so that it can be found via search.

If the crashes continue to happen regularly, I'll try the obvious steps, like upgrading and tweaking the optimization flags.

It would be most interesting to see the contents of the buffer when the crash occurred (it's at 0xa85f72f8 in the backtrace) and its length, which is 32 in this case. The buffer contains the packet to be decoded, and It's just possible that it is causing the crash, i.e. we have a smoking gun. Unfortunately it's been optimised out, so maybe turning off optimisation would help.

Not sure why you said it had been optimized out, but I tried to dump it anyway, see below.

Note that the crash is due to an invalid value of dest=0x1369650.

$ journalctl -xu shairport-sync.service | grep SEGV
Feb 04 19:09:45 argon systemd[1]: shairport-sync.service: Main process exited, code=killed, status=11/SEGV
Feb 04 21:12:49 argon systemd[1]: shairport-sync.service: Main process exited, code=killed, status=11/SEGV
Feb 07 12:37:27 argon systemd[1]: shairport-sync.service: Main process exited, code=dumped, status=11/SEGV

$ sudo gdb /usr/local/bin/shairport-sync /tmp/core
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0003acb8 in deinterlace_16 (interlacing_leftweight=<optimized out>, interlacing_shift=<optimized out>, numsamples=352, numchannels=<optimized out>, buffer_out=0x9,
    buffer_b=<optimized out>, buffer_a=<optimized out>) at alac.c:561
561     left = buffer_a[i];
[Current thread is 1 (Thread 0xe9afbc80 (LWP 29475))]

(gdb) bt
#0  0x0003acb8 in deinterlace_16
    (interlacing_leftweight=<optimized out>, interlacing_shift=<optimized out>, numsamples=352, numchannels=<optimized out>, buffer_out=0x9, buffer_b=<optimized out>, buffer_a=<optimized out>) at alac.c:561
#1  alac_decode_frame (alac=<optimized out>, inbuffer=inbuffer@entry=0xe9af9540 " ", outbuffer=outbuffer@entry=0x1369650, outputsize=outputsize@entry=0xe9af8c2c) at alac.c:974
#2  0x00034f6c in unencrypted_packet_decode
    (packet=packet@entry=0xe9af9540 " ", length=length@entry=32, dest=dest@entry=0x1369650, outsize=outsize@entry=0xe9af8c2c, size_limit=size_limit@entry=1408, conn=conn@entry=0x1170370)
    at player.c:172
#3  0x00035014 in audio_packet_decode (dest=0x1369650, destlen=0xe9af94cc, destlen@entry=0xe9af94c4, buf=0xe9af9540 " ", buf@entry=0x0, len=len@entry=32, conn=0x1170370,
    conn@entry=0xd3d62ace) at player.c:227
#4  0x00035b68 in player_put_packet (original_format=original_format@entry=1, seqno=<optimized out>, seqno@entry=14040, actual_timestamp=0, actual_timestamp@entry=3554028238, data=0x0,
    data@entry=0xe9af9538 "\365\303\004", len=32, conn=conn@entry=0x1170370) at player.c:501
#5  0x0002f1dc in decipher_player_put_packet (ciphered_audio_alt=ciphered_audio_alt@entry=0xe9afa66a "p\356", nread=nread@entry=66, conn=0x1170370,
    conn@entry=0x2f96c <rtp_realtime_audio_receiver+156>) at rtp.c:1707
#6  0x0002f96c in rtp_realtime_audio_receiver (arg=0x2f96c <rtp_realtime_audio_receiver+156>) at rtp.c:1935
#7  0xf5e4b310 in start_thread (arg=0xe9afbc80) at pthread_create.c:477
#8  0xf5d18da8 in  () at ../sysdeps/unix/sysv/linux/arm/clone.S:73

(gdb) info registers
r0             0x1369652           20354642
r1             0xeb3306a8          3945989800
r2             0x1                 1
r3             0xeb330120          3945988384
r4             0x4                 4
r5             0x0                 0
r6             0x0                 0
r7             0x0                 0
r8             0x4                 4
r9             0x160               352
r10            0x0                 0
r11            0x9                 9
r12            0x0                 0
sp             0xe9af8b18          0xe9af8b18
lr             0x0                 0
pc             0x3acb8             0x3acb8 <alac_decode_frame+3904>
cpsr           0x20800010          545259536
fpscr          0x60000010          1610612752

(gdb) x/i $pc
=> 0x3acb8 <alac_decode_frame+3904>:    strh    lr, [r0, #-2]

(gdb) x 0x1369650
0x1369650:  Cannot access memory at address 0x1369650

(gdb) x/32b 0xe9af9540
0xe9af9540: 0x20    0x00    0x00    0x04    0x00    0x13    0x08    0x0c
0xe9af9548: 0x2b    0xee    0x8e    0x0e    0x75    0xf8    0xf0    0x13
0xe9af9550: 0x08    0x09    0x1b    0xf4    0x98    0x09    0x0f    0xfc
0xe9af9558: 0x34    0xff    0x80    0xaf    0xbf    0xe0    0x2b    0xfc
mikebrady commented 9 months ago

Thanks for the post, and for the core dump. It appears that the crash occurred in the code that decodes an Apple Lossless Audio Codec (ALAC) packet. It's not impossible that some bad packet was sent, but it seems more likely to me that the data or the memory allocations were somehow corrupted. I think it's more likely because in 4.3.2 we have fixed quite a few bugs in the PulseAudio (pa) backend that might have caused memory allocation and usage errors. So, definitely worth updating to 4.3.2

BTW there is also now a pretty robust PipeWire (pw) backend which is recommended if you have a (recent) PipeWire system.