karlheyes / icecast-kh

KH branch of icecast
GNU General Public License v2.0
298 stars 106 forks source link

kh12 segfault #263

Closed DuncanFairley closed 4 years ago

DuncanFairley commented 5 years ago

I'm on the release version of 2.4.0-kh12 and received two segfaults since updating a week ago: Jun 25 05:59:42 srv1 kernel: [62456356.991999] icecast[16253]: segfault at 2dd ip 0000000000430220 sp 00007f7cafe22be8 error 4 in icecast[400000+55000] Jun 22 08:15:50 srv1 kernel: [62205324.731029] icecast[24053]: segfault at 0 ip (null) sp 00007f6ff5394d98 error 14 in icecast[400000+55000] I haven't got a coredump yet, but am just wondering if these seem related to what was fixed in https://github.com/karlheyes/icecast-kh/commit/96a500ea466c09429c4db5d662a9fb6f7dae69e2 ?

als2002 commented 4 years ago

Switched to new icecast server form 2.4.0 kh3 ( on new hardware , Debian 10 x64 ).

I've got segfault within 24 hour uptime with 2.4.0 kh12 with commit 96a500e applied. Configuration has quite a lot of relays defined ( about 60 from 108 sources, both local and external ).

Sep 1 17:41:29 stream kernel: [182496.651456] icecast2[13249]: segfault at 0 ip 0000000000000000 sp 00007efc7e9ec978 error 14 in icecast[ 559141ad7000+6000] Sep 1 17:41:29 stream kernel: [182496.651469] Code: Bad RIP value.

( Hardware is HP server with ECC memory, should not be an issue )

DuncanFairley commented 4 years ago

I did end up applying 96a500e but still get the segfaults every 1-3 days.

Aug 25 15:29:50 srv1 kernel: [67760965.090029] icecast[26389]: segfault at 31 ip 0000000000415481 sp 00007f2de8f73d90 error 4 in icecast[400000+55000]

I don't have any relays though. Just ~50 mountpoints.

als2002 commented 4 years ago

I ended up having segfaults every 10 minutes... Installed old version I was running previously. Will see how it goes. Do You use fallbacks?

DuncanFairley commented 4 years ago

I don't use any fallbacks. I do some introaudio though. But if you're able to reproduce it in only 10 minutes, you could try getting a trace so Karl can see exactly where it's faulting. See here

djamps commented 4 years ago

Downgraded from 12 to 11 now we're getting segfaults on KH11 as well. Trying KH10 now....

DuncanFairley commented 4 years ago

I can't seem to get a debug build working. In a fresh icecast copy, I did:

./configure
make debug
make install

When it crashed, it outputted this:

*** Error in `/usr/local/bin/icecast': malloc(): smallbin double linked list corrupted: 0x00007fbfcc99c280 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fbfdf64e7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x82651)[0x7fbfdf659651]
/lib/x86_64-linux-gnu/libc.so.6(__libc_calloc+0xba)[0x7fbfdf65bdca]
/usr/local/bin/icecast[0x41dd37]
/usr/local/bin/icecast[0x4297ea]
/usr/local/bin/icecast[0x42b689]
/usr/local/bin/icecast[0x416f09]
/usr/local/bin/icecast[0x4180dc]
/usr/local/bin/icecast[0x41f135]
/usr/local/bin/icecast[0x43adeb]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fbfdf9a86ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fbfdf6de41d]

And gdb:

root@srv1:/tmp# gdb /usr/local/bin/icecast --core core-icecast.9892.srv1.1572983181
...
Reading symbols from /usr/local/bin/icecast...done.
[New LWP 9896]
[New LWP 8773]
[New LWP 9892]
[New LWP 9895]
[New LWP 9894]
Core was generated by `/usr/local/bin/icecast -c /home/icecast/currentconfig/icecast.xml'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fbfdf60c428 in ?? ()
[Current thread is 1 (LWP 9896)]
(gdb) bt
#0  0x00007fbfdf60c428 in ?? ()
#1  0x00007fbfdf60e02a in ?? ()
#2  0x0000000000000020 in ?? ()
#3  0x0000000000000000 in ?? ()
DuncanFairley commented 4 years ago

I finally got a proper trace, I hope:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/icecast -c /home/icecast/currentconfig/icecast.xml'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f9bac43e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f9ba47a7700 (LWP 13549))]
(gdb) bt
#0  0x00007f9bac43e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f9bac44002a in __GI_abort () at abort.c:89
#2  0x0000000000419fa6 in source_queue_advance (client=0x7f9b90002f90) at source.c:883
#3  0x000000000041b487 in send_listener (source=0x7f9b981b0f90, client=0x7f9b90002f90) at source.c:1383
#4  0x000000000041ae5f in send_to_listener (client=0x7f9b90002f90) at source.c:1232
#5  0x0000000000425fc6 in worker (arg=0x157b580) at client.c:735
#6  0x000000000044ba1b in _start_routine (arg=0x157b370) at thread.c:727
#7  0x00007f9bac7da6ba in start_thread (arg=0x7f9ba47a7700) at pthread_create.c:333
#8  0x00007f9bac51041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) bt full
#0  0x00007f9bac43e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pid = 13545
        selftid = 13549
#1  0x00007f9bac44002a in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0xb19be18fcb503e00, sa_sigaction = 0xb19be18fcb503e00}, sa_mask = {__val = {
              11952406549267936580, 7632343802151250176, 14024600428130476874, 16319518678535185408, 6825180314748858317, 1254,
              140306249466032, 140306338231856, 140306344117392, 0, 140306314601760, 0, 22320800, 140306249040096, 1635678190688592640,
              140306456145296}}, sa_flags = 4414744, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x0000000000419fa6 in source_queue_advance (client=0x7f9b90002f90) at source.c:883
        ret = 1254
        offset = 130 '\202'
        written = 1254
        source = 0x7f9b981b0f90
        refbuf = 0x7f9b9c5295d0
        lag = 1254
        __func__ = "source_queue_advance"
        loop = 47
#3  0x000000000041b487 in send_listener (source=0x7f9b981b0f90, client=0x7f9b90002f90) at source.c:1383
        bytes = 0
        loop = 40
        total_written = 0
        limiter = 8762
        ret = 0
        lag = 1254
        worker = 0x157b580
        now = 1573089315
        __func__ = "send_listener"
#4  0x000000000041ae5f in send_to_listener (client=0x7f9b90002f90) at source.c:1232
        source = 0x7f9b981b0f90
        ret = 0
#5  0x0000000000425fc6 in worker (arg=0x157b580) at client.c:735
        ret = 0
        nx = 0x7f9b90017970
        process = 1
        client = 0x7f9b90002f90
        sched_ms = 1573089315529
        worker = 0x157b580
        prev_count = 1120
        prevp = 0x7f9b90004ea8
        c = 7
        __func__ = "worker"

Edit: I'm running latest icecast commit eb66e8f. Ubuntu 16.04.

  <queue-size>524288</queue-size>
  <client-timeout>30</client-timeout>
  <header-timeout>15</header-timeout>
  <source-timeout>6</source-timeout>
  <burst-size>231072</burst-size>

(gdb) p source->mount
$1 = 0x7f9b981ba730 "/CFFM"
#This is a MP3 stream

(gdb) p client->connection
$6 = {id = 555, con_time = 1572983193, discon = {time = 0, offset = 0}, sent_bytes = 1273725733, sock = 179, chunk_pos = 0, error = 0 '\000',
  readchk = 108 'l', ssl = 0x0, ip = 0x7f9b900182a0 "redacted IP"}

The person with that IP is currently listening too, it seems. Is that con_time in milliseconds?

It looks like their audio player is very aggressive in reconnecting when the stream is down:

[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:25 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:26 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
[06/Nov/2019:17:15:27 -0800] "GET /CFFM HTTP/1.0" 404 135 "-" "mpg123/1.25.7" 0
oriolj commented 4 years ago

I'm also facing this issue. Is there anything I can do to help debug this?

DuncanFairley commented 4 years ago

I got a new segfault:

(gdb) core-file core-icecast.9463.srv1.1573479075
[New LWP 9467]
[New LWP 4364]
[New LWP 9463]
[New LWP 9466]
[New LWP 9465]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/icecast -c /home/icecast/currentconfig/icecast.xml'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ffb2507c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7ffb1d3e5700 (LWP 9467))]
(gdb) bt
#0  0x00007ffb2507c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffb2507e02a in __GI_abort () at abort.c:89
#2  0x00007ffb250be7ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffb251d7ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffb250c4814 in malloc_printerr (action=<optimized out>, str=0x7ffb251d4c75 "corrupted size vs. prev_size", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5006
#4  0x00007ffb250c9a03 in _int_malloc (av=av@entry=0x7ffb14000020, bytes=bytes@entry=1400) at malloc.c:3622
#5  0x00007ffb250cb184 in __GI___libc_malloc (bytes=1400) at malloc.c:2913
#6  0x000000000042423a in refbuf_new (size=1400) at refbuf.c:75
#7  0x0000000000436577 in validate_mpeg (source=0x7ffb10050810, refbuf=0x7ffb1406ea40) at format_mp3.c:946
#8  0x0000000000436687 in mp3_get_no_meta (source=0x7ffb10050810) at format_mp3.c:971
#9  0x0000000000418eaa in source_read (source=0x7ffb10050810) at source.c:575
#10 0x00000000004196d7 in source_client_read (client=0x7ffb08011640) at source.c:734
#11 0x0000000000425fc6 in worker (arg=0xaa9580) at client.c:735
#12 0x000000000044ba1b in _start_routine (arg=0xaa8c80) at thread.c:727
#13 0x00007ffb254186ba in start_thread (arg=0x7ffb1d3e5700) at pthread_create.c:333
#14 0x00007ffb2514e41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I'm happy to send the corefile+binary if that'd help diagnose these @karlheyes

karlheyes commented 4 years ago

appologies for lack of responses, been travelling. The corruption cases that some have reported should be resolved now. A race between queue being pruned and a listener disconnecting was causing possible corruption leading to random effects. I've released kh13 with the fix.

karl