savonet / liquidsoap

Liquidsoap is a statically typed scripting general-purpose language with dedicated operators and backend for all thing media, streaming, file generation, automation, HTTP backend and more.
http://liquidsoap.info
GNU General Public License v2.0
1.4k stars 130 forks source link

Liquidsoap freezes on a complex script #609

Closed gilou closed 6 years ago

gilou commented 6 years ago

Hi, So running most components from master (cry, dtools, duppy, faad, mad, mm). Thing is, that script is quite complex, but if I remove each part, the bug is really hard to reproduce... All parts have been tested stand alone though.

The script combines:

at the moment I had it freeze, it was outputting 3 streams to icecast:

This is really frustrating, and I'm now back to investigating this again, trying to strip all parts... I'm dumping here as a "reference" until I can narrow it down. The "freeze" means stdout is stuck, the server doesn't answer the socket. The process is alive, but seems stuck polling for something, doesn't respond to QUIT/^C.

Script is run using liquidsoap -v full.liq, and is here with supporting functions.liq: https://gist.github.com/gilou/43133d2619202791e153211471aff6a1 liq.log contains the end of the log, timestamp is hours before I killed it.

bt.txt is the backtrace from the core I got using kill -ABORT on it

It's running on a cheap VPS, using Ubuntu 18.04 on KVM. On this version it crashes in less than an hour.

gilou commented 6 years ago

I removed all outputs but the one streaming the piped stream, as I suspect it's the culprit, it froze again after 1h40, smart_crossfade managed to squeeze one more line: 2018/08/22 12:06:31 [smart_crossfade:3] new >= old + margin, old <= medium and new <= high. This time I had a socket connected before, and it's responding...

uptime
0d 01h 53m 51s
restart
restart called

but it's not doing anything the external process is still running, hasn't crashed, doesn't complain on stderr If I attach strace to liq, I get:

sudo strace -fp 29444
strace: Process 29444 attached with 7 threads
[pid 29454] futex(0x56309c3428a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>  
[pid 29449] futex(0x7f93d8000b48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>  
[pid 29448] futex(0x7f93dc000b48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>  
[pid 29447] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=21809} <unfinished ...>
[pid 29445] futex(0x56309c3428a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 29444] futex(0x56309c3428a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 29446] <... select resumed> )      = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)
[pid 29446] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)

If I kill the external process, liq uses 100% of CPU. I'll try and monitor the external process next time. I added merge_tracks and drop_metadata to try making stereotool happier, but I don't think that's the issue ☹

gilou commented 6 years ago

if I attach strace to stereotool, it seems to be alive and waiting for data... [pid 30093] read(0,

gilou commented 6 years ago

and interestingly (or not), liq honored USR1 in that state, because it truncated the log file upon receiving!

gilou commented 6 years ago

OK.. I got tired, I switched my install to ocaml 4.07.0, recompiled everything, and starting the hunt again..

gilou commented 6 years ago

aaaaaand it froze again, after 5 hours.. Same symptoms, liq is somehow frozen, stills responds on its socket but doesn't actually handle anything, yet it responds to SIGUSR1 (truncates the logfile) and closes on other signals (ABRT, HUP..).

toots commented 6 years ago

@gilou, thanks for reporting this. You're on the latest code right? It seems that the streaming thread is blocked but the other treads are still responding. One thing you can try is to get a gdb backtrace of the running process:

% gdb
> attach <pid>
> thread apply all bt
gilou commented 6 years ago

I'm back on sox for testing... If I only use output.external, it lasted for more than 4 hours which was a first... so I added pipe_metadata from https://gist.github.com/gilou/43133d2619202791e153211471aff6a1#file-functions-liq-L115 but it failed after 10+ hours...

Code was:

s = smart_crossfade(playlist("test.m3u"))
s = mksafe(amplify(1.0, s))

pipe_file = "test.pipe"

pipe_input = pipe_metadata("test.pipe", s)

ps = output.external(id='pipeout', %wav, "sox -t raw -b 16 -c 2 -r 44.1k -e unsigned - -t wav #{pipe_file} flanger 2>/tmp/stderr", fst(pipe_input))

output.icecast(%mp3.vbr(quality=0), mount="test_pipe", snd(pipe_input))

Backtrace once attached is here: gdb.txt

gilou commented 6 years ago

Will restart testing on latest code, namely with e10e047

toots commented 6 years ago

Great idea. Your stack trace was helpful too and seems to point to a different direction but this is worth trying. Thks!

gilou commented 6 years ago

πŸ’ͺ 10 hours uptime mark! I'll let it run a bit more before opening the champagne, and add a bit more complexity I removed to debunk this. I'll avoid adding cue_cut or other extreme gapless playout ideas I have in mind before the release (and I'll see how I can help with docs as well)...

toots commented 6 years ago

giphy 1

gilou commented 6 years ago

And.. it failed again 😭 after 20 hours, and it was the whole thing, piping to stereotool This is what the backtrace is once attached: gdb.txt Same symptoms: server still responding though quite braindead:

uptime
0d 21h 00m 40s
END
st.skip
Done
END
request.alive
5 9 10 16 0
END
restart
restart called
END

Last lines of the log are those:


2018/08/28 11:33:47 [smart_crossfade:3] new >= old + margin, old <= medium and new <= high.
2018/08/28 11:33:47 [smart_crossfade:3] Do not fade if it's already very low.
2018/08/28 11:33:47 [smart_crossfade:3] Transition: crossed, no fade.
2018/08/28 11:33:47 [source:4] Source add_441221 gets up.
2018/08/28 11:33:47 [add_441221:4] Content kind is {audio=2;video=0;midi=0}.
2018/08/28 11:33:47 [source:4] Source smart_cross_9081_after gets up.
2018/08/28 11:33:47 [smart_cross_9081_after:4] Content kind is {audio=2;video=0;midi=0}.
2018/08/28 11:33:47 [smart_cross_9081_after:4] Activations changed: static=[add_441221:smart_cross_9081], dynamic=[].
2018/08/28 11:33:47 [source:4] Source smart_cross_9081_before gets up.
2018/08/28 11:33:47 [smart_cross_9081_before:4] Content kind is {audio=2;video=0;midi=0}.
2018/08/28 11:33:47 [smart_cross_9081_before:4] Activations changed: static=[add_441221:smart_cross_9081], dynamic=[].

And liq answers the USR1 signal by truncating its log, but doesn't respond to ^C.

gilou commented 6 years ago

Browsing through the log, I get a weird feeling.. I have things like: 2018/08/28 11:00:08 [smart_cross_9081:3] Analysis: -nandB / -24.523810dB (3.04s / 3.04s)

So I'm adding a few lines before the freeze:

2018/08/28 11:33:46 [decoder:4] Decoding "/home/ubuntu/demo/chill/08 On the Couch.mp3" ended: Mad.End_of_stream.
2018/08/28 11:33:47 [chill(dot)m3u:3] Finished with "/home/ubuntu/demo/chill/08 On the Couch.mp3".
2018/08/28 11:33:47 [amplify_9063:3] End of the current overriding.
2018/08/28 11:33:47 [chill(dot)m3u:4] Remaining: 0.0s, queued: 30.0s, taking: 30.0s
2018/08/28 11:33:47 [chill(dot)m3u:3] Prepared "/home/ubuntu/demo/chill/08 Summertime Blues.mp3" (RID 16).
2018/08/28 11:33:47 [rotate_9070:3] Switch to amplify_9066 with forgetful transition.
2018/08/28 11:33:47 [amplify_9063:4] Activations changed: static=[], dynamic=[rotate_9074:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9072:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9070:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9068:switch_9076:fallback_9077:on_track_9079:smart_cross_9081].
2018/08/28 11:33:47 [amplify_9063:4] Disabling caching mode.
2018/08/28 11:33:47 [amplify_9066:4] Activations changed: static=[rotate_9070:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9074:switch_9076:fallback_9077:on_track_9079:smart_cross_9081], dynamic=[rotate_9074:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9072:switch_9076:fallback_9077:on_track_9079:smart_cross_9081, rotate_9070:switch_9076:fallback_9077:on_track_9079:smart_cross_9081].
2018/08/28 11:33:47 [amplify_9066:3] Overriding amplification: 0.331131.
2018/08/28 11:33:47 [lang:3] Playing (1535448827.18): annotate:liq_amplify="-9.60 dB",liq_cue_in="1.021678":/home/ubuntu/demo/party/01 Mano Negra.flac
2018/08/28 11:33:47 [lang:4] I have 42 tracks
2018/08/28 11:33:47 [smart_cross_9081:3] Analysis: -nandB / -29.007891dB (3.01s / 3.03s)

However, this file played many times before without issues during that run.

toots commented 6 years ago

giphy 2

toots commented 6 years ago

Here's what I've seen twice now from your backtrace:

Thread 8 (Thread 0x7f80f7fff700 (LWP 10561)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f8123031023 in __GI___pthread_mutex_lock (mutex=0x55a73a86d8a0) at ../nptl/pthread_mutex_lock.c:78
#2  0x000055a738528e3d in caml_mutex_lock ()
#3  0x000055a73848fd47 in camlDtools__mutexify_2152 () at dtools.ml:769
#4  0x000055a73854a3b4 in caml_start_program ()
#5  0x000055a738547da0 in caml_final_do_calls ()
#6  0x000055a738534eea in caml_gc_dispatch ()
#7  0x000055a738536136 in caml_alloc_string ()
#8  0x000055a7384abfa1 in camlBytes__sub_1234 () at bytes.ml:68
#9  0x000055a7384d0c3c in camlPrintf__k$27_1310 () at bytes.ml:73
#10 0x000055a73848fe64 in camlDtools__print_2156 () at dtools.ml:779
#11 0x000055a73848fd9c in camlDtools__mutexify_2152 () at dtools.ml:770
#12 0x000055a7383ab373 in camlSource__fun_2626 () at source.ml:354
#13 0x000055a738321760 in camlSmartcross__fun_2748 () at operators/smartcross.ml:356
#14 0x000055a73832224d in camlSmartcross__fun_2627 () at operators/smartcross.ml:191

It's a deadlock waiting on a mutex that is triggered during a Gc memory collection. It happens during a smartcross transition to that makes it look like the bug is coming from smartcross but it's only incidental. smartcross is understandably pretty memory hungry so it's not a surprise a lot of Gc cleanup will happen during its execution.

Now, I just have to look at finalise functions involving Tutils.mutexify calls...

gilou commented 6 years ago

Do you want me to test without smart_crossfade? I'll probably have to rewrite it to have it deal more agressively with silences at some point so .. 😜

toots commented 6 years ago

Nah, can you actually try with the branch deadlock-investigation that I just pushed?

gilou commented 6 years ago

Here we go. Side note, maybe having the rev/branch included in the version might help knowing where we are ☺ (yes, I'll try to do a PR for configure.ac πŸ˜‰)

toots commented 6 years ago

Should be:

2018/08/28 08:03:43 [main:3] Liquidsoap 1.3.3+scm (git@github.com:savonet/liquidsoap.git@f9970afe6a89a38924084b1439c65685b6a65062:20180826:143248)

;-)

gilou commented 6 years ago

that was fast.. log froze at smart_crossfade as well in ~5 minutes, bt included.. gdb.txt

gilou commented 6 years ago

I hope you are suggesting the format rather than stating what should be displayed currently.. I have:

2018/08/28 15:27:49 [main:3] Liquidsoap 1.3.3+scm
2018/08/28 15:27:49 [main:3] Using: bytes=[distributed with OCaml 4.02 or above] pcre=7.3.4 dtools=0.4.0 duppy=0.7.1 duppy.syntax=0.7.1 cry=0.6.2 mm=0.4.0 xmlplaylist=0.1.4 ogg=0.5.2 vorbis=0.7.0 opus=0.1.2 mad=0.4.5 flac=0.1.3 flac.ogg=0.1.3 dynlink=[distributed with Ocaml] lame=0.3.3 fdkaac=0.2.1 samplerate=0.1.4 taglib=0.3.3 camomile=1.0.1 inotify=2.3 yojson=1.4.1 faad=0.4.0 soundtouch=0.1.8 ladspa=0.1.5
toots commented 6 years ago

Hmm weird. I should be displayed.

toots commented 6 years ago

@gilou, can we double check that you're indeed on the branch that I pushed? Just want to make sure, that fix really seems legit..

gilou commented 6 years ago

there is something fishy with how SVN.ml (sic) isn't working properly here...

gilou commented 6 years ago

ah OK.. opam sync the rep on a build directory... so no .git dir to get the info...

gilou commented 6 years ago

so I was running the proper branch, I'll compile it myself to keep track of that..

toots commented 6 years ago

Thank you. Let me know.

gilou commented 6 years ago
2018/08/28 16:43:30 >>> LOG START           
2018/08/28 16:43:30 [main:3] Liquidsoap 1.3.3+scm (https://github.com/savonet/liquidsoap.git@fd1cebd292824ca9f43d932e17c236c9aa674dc7:20180828:164057)

Will follow up..

gilou commented 6 years ago
uptime
0d 20h 57m 43s

Still streaming.. Maybe I can try going back to pipe() ;)

toots commented 6 years ago

Told you that fix was legit ;-)

gilou commented 6 years ago

you're killing me ;) Thanks, I'll let it run a bit longer and as I'm such a daredevil, I added a cue_cut() on my lists and testing that along.

I'm running on a CPU craziness (will make an issue for it should I spot it more precisely) as well using replaygain extraction, does it use the same logic? I worked around it in the end, feeding the info using annotate, so I'm less concerned about it in the end...

gilou commented 6 years ago
uptime
0d 23h 37m 23s

Let's call this a proper fix, and another victory for M. Beauxis 🎺🎺 πŸ”₯πŸ”₯

toots commented 6 years ago

giphy