Closed gilou closed 6 years ago
I removed all outputs except the one streaming the piped stream, as I suspect it's the culprit. It froze again after 1h40; smart_crossfade managed to squeeze out one more line:
2018/08/22 12:06:31 [smart_crossfade:3] new >= old + margin, old <= medium and new <= high.
This time I had a socket connected before, and it's responding...
uptime
0d 01h 53m 51s
restart
restart called
but it's not doing anything. The external process is still running, hasn't crashed, and doesn't complain on stderr. If I attach strace to liq, I get:
sudo strace -fp 29444
strace: Process 29444 attached with 7 threads
[pid 29454] futex(0x56309c3428a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 29449] futex(0x7f93d8000b48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 29448] futex(0x7f93dc000b48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 29447] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=21809} <unfinished ...>
[pid 29445] futex(0x56309c3428a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 29444] futex(0x56309c3428a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 29446] <... select resumed> ) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
If I kill the external process, liq uses 100% of CPU. I'll try to monitor the external process next time. I added merge_tracks and drop_metadata to try making stereotool happier, but I don't think that's the issue.
If I attach strace to stereotool, it seems to be alive and waiting for data...
[pid 30093] read(0,
And interestingly (or not), liq honored USR1 in that state, because it truncated the log file upon receiving it!
OK.. I got tired, so I switched my install to OCaml 4.07.0, recompiled everything, and started the hunt again..
aaaaaand it froze again, after 5 hours.. Same symptoms: liq is somehow frozen, still responds on its socket but doesn't actually handle anything, yet it responds to SIGUSR1 (truncates the logfile) and closes on other signals (ABRT, HUP..).
@gilou, thanks for reporting this. You're on the latest code, right? It seems that the streaming thread is blocked but the other threads are still responding. One thing you can try is to get a gdb backtrace of the running process:
% gdb
> attach <pid>
> thread apply all bt
I'm back on sox for testing... If I only use output.external, it lasted for more than 4 hours, which was a first... So I added pipe_metadata from https://gist.github.com/gilou/43133d2619202791e153211471aff6a1#file-functions-liq-L115, but it failed after 10+ hours...
Code was:
s = smart_crossfade(playlist("test.m3u"))
s = mksafe(amplify(1.0, s))
pipe_file = "test.pipe"
pipe_input = pipe_metadata("test.pipe", s)
ps = output.external(id='pipeout', %wav, "sox -t raw -b 16 -c 2 -r 44.1k -e unsigned - -t wav #{pipe_file} flanger 2>/tmp/stderr", fst(pipe_input))
output.icecast(%mp3.vbr(quality=0), mount="test_pipe", snd(pipe_input))
Backtrace once attached is here: gdb.txt
Will restart testing on latest code, namely with e10e047
Great idea. Your stack trace was helpful too and seems to point in a different direction, but this is worth trying. Thanks!
10 hours uptime mark! I'll let it run a bit more before opening the champagne, and add back a bit of the complexity I removed while debugging this. I'll avoid adding cue_cut or other extreme gapless playout ideas I have in mind before the release (and I'll see how I can help with docs as well)...
And.. it failed again after 20 hours, and it was the whole thing, piping to stereotool. This is the backtrace once attached: gdb.txt. Same symptoms: server still responding, though quite braindead:
uptime
0d 21h 00m 40s
END
st.skip
Done
END
request.alive
5 9 10 16 0
END
restart
restart called
END
Last lines of the log are those:
2018/08/28 11:33:47 [smart_crossfade:3] new >= old + margin, old <= medium and new <= high.
2018/08/28 11:33:47 [smart_crossfade:3] Do not fade if it's already very low.
2018/08/28 11:33:47 [smart_crossfade:3] Transition: crossed, no fade.
2018/08/28 11:33:47 [source:4] Source add_441221 gets up.
2018/08/28 11:33:47 [add_441221:4] Content kind is {audio=2;video=0;midi=0}.
2018/08/28 11:33:47 [source:4] Source smart_cross_9081_after gets up.
2018/08/28 11:33:47 [smart_cross_9081_after:4] Content kind is {audio=2;video=0;midi=0}.
2018/08/28 11:33:47 [smart_cross_9081_after:4] Activations changed: static=[add_441221:smart_cross_9081], dynamic=[].
2018/08/28 11:33:47 [source:4] Source smart_cross_9081_before gets up.
2018/08/28 11:33:47 [smart_cross_9081_before:4] Content kind is {audio=2;video=0;midi=0}.
2018/08/28 11:33:47 [smart_cross_9081_before:4] Activations changed: static=[add_441221:smart_cross_9081], dynamic=[].
And liq answers the USR1 signal by truncating its log, but doesn't respond to ^C.
Browsing through the log, I get a weird feeling.. I have things like:
2018/08/28 11:00:08 [smart_cross_9081:3] Analysis: -nandB / -24.523810dB (3.04s / 3.04s)
So here are a few more lines from just before the freeze:
2018/08/28 11:33:46 [decoder:4] Decoding "/home/ubuntu/demo/chill/08 On the Couch.mp3" ended: Mad.End_of_stream.
2018/08/28 11:33:47 [chill(dot)m3u:3] Finished with "/home/ubuntu/demo/chill/08 On the Couch.mp3".
2018/08/28 11:33:47 [amplify_9063:3] End of the current overriding.
2018/08/28 11:33:47 [chill(dot)m3u:4] Remaining: 0.0s, queued: 30.0s, taking: 30.0s
2018/08/28 11:33:47 [chill(dot)m3u:3] Prepared "/home/ubuntu/demo/chill/08 Summertime Blues.mp3" (RID 16).
2018/08/28 11:33:47 [rotate_9070:3] Switch to amplify_9066 with forgetful transition.
2018/08/28 11:33:47 [amplify_9063:4] Activations changed: static=[], dynamic=[rotate_9074:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9072:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9070:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9068:switch_9076:fallback_9077:on_track_9079:smart_cross_9081].
2018/08/28 11:33:47 [amplify_9063:4] Disabling caching mode.
2018/08/28 11:33:47 [amplify_9066:4] Activations changed: static=[rotate_9070:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9074:switch_9076:fallback_9077:on_track_9079:smart_cross_9081], dynamic=[rotate_9074:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9072:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9070:switch_9076:fallback_9077:on_track_9079:smart_cross_9081].
2018/08/28 11:33:47 [amplify_9066:3] Overriding amplification: 0.331131.
2018/08/28 11:33:47 [lang:3] Playing (1535448827.18): annotate:liq_amplify="-9.60 dB",liq_cue_in="1.021678":/home/ubuntu/demo/party/01 Mano Negra.flac
2018/08/28 11:33:47 [lang:4] I have 42 tracks
2018/08/28 11:33:47 [smart_cross_9081:3] Analysis: -nandB / -29.007891dB (3.01s / 3.03s)
However, this file played many times before without issues during that run.
Here's what I've seen twice now from your backtrace:
Thread 8 (Thread 0x7f80f7fff700 (LWP 10561)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f8123031023 in __GI___pthread_mutex_lock (mutex=0x55a73a86d8a0) at ../nptl/pthread_mutex_lock.c:78
#2 0x000055a738528e3d in caml_mutex_lock ()
#3 0x000055a73848fd47 in camlDtools__mutexify_2152 () at dtools.ml:769
#4 0x000055a73854a3b4 in caml_start_program ()
#5 0x000055a738547da0 in caml_final_do_calls ()
#6 0x000055a738534eea in caml_gc_dispatch ()
#7 0x000055a738536136 in caml_alloc_string ()
#8 0x000055a7384abfa1 in camlBytes__sub_1234 () at bytes.ml:68
#9 0x000055a7384d0c3c in camlPrintf__k$27_1310 () at bytes.ml:73
#10 0x000055a73848fe64 in camlDtools__print_2156 () at dtools.ml:779
#11 0x000055a73848fd9c in camlDtools__mutexify_2152 () at dtools.ml:770
#12 0x000055a7383ab373 in camlSource__fun_2626 () at source.ml:354
#13 0x000055a738321760 in camlSmartcross__fun_2748 () at operators/smartcross.ml:356
#14 0x000055a73832224d in camlSmartcross__fun_2627 () at operators/smartcross.ml:191
It's a deadlock waiting on a mutex that is triggered during a Gc memory collection. It happens during a smartcross transition too, which makes it look like the bug is coming from smartcross, but that's only incidental. smartcross is understandably pretty memory hungry, so it's no surprise that a lot of Gc cleanup happens during its execution. Now, I just have to look at finalise functions involving Tutils.mutexify calls...
Do you want me to test without smart_crossfade? I'll probably have to rewrite it to have it deal more aggressively with silences at some point, so..
Nah, can you actually try with the deadlock-investigation branch that I just pushed?
Here we go. Side note: maybe having the rev/branch included in the version might help knowing where we are (yes, I'll try to do a PR for configure.ac)
Should be:
2018/08/28 08:03:43 [main:3] Liquidsoap 1.3.3+scm (git@github.com:savonet/liquidsoap.git@f9970afe6a89a38924084b1439c65685b6a65062:20180826:143248)
;-)
That was fast.. the log froze at smart_crossfade as well, in ~5 minutes; backtrace included: gdb.txt
I hope you are suggesting the format rather than stating what should currently be displayed.. I have:
2018/08/28 15:27:49 [main:3] Liquidsoap 1.3.3+scm
2018/08/28 15:27:49 [main:3] Using: bytes=[distributed with OCaml 4.02 or above] pcre=7.3.4 dtools=0.4.0 duppy=0.7.1 duppy.syntax=0.7.1 cry=0.6.2 mm=0.4.0 xmlplaylist=0.1.4 ogg=0.5.2 vorbis=0.7.0 opus=0.1.2 mad=0.4.5 flac=0.1.3 flac.ogg=0.1.3 dynlink=[distributed with Ocaml] lame=0.3.3 fdkaac=0.2.1 samplerate=0.1.4 taglib=0.3.3 camomile=1.0.1 inotify=2.3 yojson=1.4.1 faad=0.4.0 soundtouch=0.1.8 ladspa=0.1.5
Hmm, weird. It should be displayed.
@gilou, can we double check that you're indeed on the branch that I pushed? Just want to make sure, that fix really seems legit..
There is something fishy with how SVN.ml (sic) isn't working properly here...
Ah, OK.. opam syncs the repo to a build directory... so there's no .git dir to get the info from...
So I was running the proper branch; I'll compile it myself to keep track of that..
Thank you. Let me know.
2018/08/28 16:43:30 >>> LOG START
2018/08/28 16:43:30 [main:3] Liquidsoap 1.3.3+scm (https://github.com/savonet/liquidsoap.git@fd1cebd292824ca9f43d932e17c236c9aa674dc7:20180828:164057)
Will follow up..
uptime
0d 20h 57m 43s
Still streaming.. Maybe I can try going back to pipe() ;)
Told you that fix was legit ;-)
You're killing me ;) Thanks, I'll let it run a bit longer, and as I'm such a daredevil, I added a cue_cut() on my lists and I'm testing that along.
I'm also running into some CPU craziness (I'll open an issue for it should I spot it more precisely) when using replaygain extraction; does it use the same logic? I worked around it in the end, feeding the info using annotate, so I'm less concerned about it...
uptime
0d 23h 37m 23s
Let's call this a proper fix, and another victory for M. Beauxis!
Hi. So, I'm running most components from master (cry, dtools, duppy, faad, mad, mm). Thing is, the script is quite complex, but if I remove parts of it, the bug becomes really hard to reproduce... All parts have been tested standalone, though.
The script combines:
At the moment I had it freeze, it was outputting 3 streams to icecast:
This is really frustrating, and I'm now back to investigating this again, trying to strip out all the parts... I'm dumping this here as a "reference" until I can narrow it down. The "freeze" means stdout is stuck and the server doesn't answer on the socket. The process is alive but seems stuck polling for something, and doesn't respond to QUIT/^C.
Script is run using liquidsoap -v full.liq, and is here with supporting functions.liq: https://gist.github.com/gilou/43133d2619202791e153211471aff6a1 liq.log contains the end of the log, timestamp is hours before I killed it.
bt.txt is the backtrace from the core I got using kill -ABRT on it.
It's running on a cheap VPS with Ubuntu 18.04 on KVM. On this version it crashes in less than an hour.