Unusually high CPU usage by Liquidsoap process

SpeedyCharly commented 2 years ago

Describe the bug I've been operating a radio station from a server running Azuracast for about 2 years now without any issues. In the last few weeks, Liquidsoap has been causing the CPU to redline, ultimately causing buffer overruns and killing the media stream altogether. I run 1 main playlist with 2 more on a 'play per x' basis. In addition, I have a few 'Jingles' playlists (90 secs. clips or less) which I call upon on a 'play per x' basis also to run Radio IDs, dj ads and the like. It seems that Liquidsoap redlines when handling selections from Jingles playlists and the buffer overruns end up killing the stream after about 1 hour of operation. The only way I can get the radio station to run without interruptions is to disable all Jingles which returns the CPU to normal loads. Not such CPU overloads occur when streaming live. I have raised the issue with Azuracast, but it seem that the issue lies with Liquidsoap itself, which is why I am now raising it here. You can view what transpired at Azuracast here ; [ https://github.com/AzuraCast/AzuraCast/issues/4783] which will give you acccess to all the info and the Liquidsoap logs ! have generated while trying to identify the source of the problem.

To Reproduce Pls see Azuracast issue thread above

Version details

OS: Ubuntu
Version 20.04

Install method Docker (Azuracast)

toots commented 2 years ago

Hi,

Thanks for the report. I will have a look at it as part of the 2.0.2 release cycle.

SpeedyCharly commented 2 years ago

For the fist time, yesterday I caught an episode during live stream... Here is an edited log of the occurrence; Edited Log.txt

SpeedyCharly commented 2 years ago

Here is an edited portion of LiquidSoap log showing unusual activity around items from jingle playlists; Raw Response.txt

Could that be related to/trigger of CPU overloads?

toots commented 2 years ago

Hi @SpeedyCharly. Thanks for your report. Can I used the script here: https://github.com/AzuraCast/AzuraCast/issues/4783#issuecomment-966149114 as reference?

My first suggestion would be to instrument the code around the URI query before log("AzuraCast Raw Response: #{uri}"). If this call is hanging for any reason, this will cause issues with the main streaming thread and, potentially, cause CPU to spike as liquidsoap is trying to catchup with the delay.

Would you mind adding log lines around the call for instance:

def autodj_next_song() =
    log("Starting call to AzuraCast API")
    uri = list.hd(process.read.lines(env=[("API_AUTH", !azuracast_api_auth)], 'curl -s --request POST --url http://web/api/internal/1/nextsong --form api_auth="$API_AUTH"'), default="")
    log("AzuraCast Raw Response: #{uri}")

Regarding your encoding issue, the character \233 that is coming through such as in "Radio ID - En route vers l'ouest - V\233ro en ondes" is actually \uFFFD, which is the UTF8 replacement character, which means that the character/string is getting corrupted before hitting us, for instance being consumed by a previous system that thinks it is UTF8 while it actually is another encoder.

SpeedyCharly commented 2 years ago

@toots Yes of course, you can reference the Azura issue here; I've raised the issue on both githubs to put the 2 development teams in synch on this. Not Linux savvy here, not sure what you need me to do with your suggested piece of script...

The logs I have provided are generated via the AzuraCast console and the query cannot be edited there... Not sure how to proceed on Putty or other client... Or, is that script to be added to LS config?

SpeedyCharly commented 2 years ago

@toots [Regarding your encoding issue, the character \233 that is coming through such as in "Radio ID - En route vers l'ouest - V\233ro en ondes" is actually \uFFFD, which is the UTF8 replacement character, which means that the character/string is getting corrupted before hitting us, for instance being consumed by a previous system that thinks it is UTF8 while it actually is another encoder.] Thanks for that info, but with close to 200,000 media files in my collection, with about 1/2 in French... Not about to start editing out all French special characters... LOL.

In mIRC, we get around the problem by disabling UTF8, can same be done in LS?

toots commented 2 years ago

@toots [Regarding your encoding issue, the character \233 that is coming through such as in "Radio ID - En route vers l'ouest - V\233ro en ondes" is actually \uFFFD, which is the UTF8 replacement character, which means that the character/string is getting corrupted before hitting us, for instance being consumed by a previous system that thinks it is UTF8 while it actually is another encoder.] Thanks for that info, but with close to 200,000 media files in my collection, with about 1/2 in French... Not about to start editing out all French special characters... LOL.

In mIRC, we get around the problem by disabling UTF8, can same be done in LS?

I'm sorry if that wasn't clear. What I'm saying is something in your processing chain is already corrupting the tags before they come to us. You can definitely disable UTF8 or tag recording but you will still have tags with the weird � in them.

toots commented 2 years ago

@toots Yes of course, you can reference the Azura issue here; I've raised the issue on both githubs to put the 2 development teams in synch on this. Not Linux savvy here, not sure what you need me to do with your suggested piece of script...

The logs I have provided are generated via the AzuraCast console and the query cannot be edited there... Not sure how to proceed on Putty or other client... Or, is that script to be added to LS config?

What I was trying to say is, do you have access to the liquidsoap script that your AzuraCast instance is running? Can we add more logging there to help debug the issue?

SpeedyCharly commented 2 years ago

@toots I have access to the LS config file which looks like this at the moment; LS Config.txt

toots commented 2 years ago

Great so, could you change the autodj_next_song to look like this:

# AutoDJ Next Song Script
def autodj_next_song() =
    log("Starting call to AzuraCast API")
    uri = list.hd(process.read.lines(env=[("API_AUTH", !azuracast_api_auth)], 'curl -s --request POST --url http://web/api/internal/1/nextsong --form api_auth="$API_AUTH"'), default="")
    log("AzuraCast Raw Response: #{uri}")

And report back the next time you see high CPU usage?

SpeedyCharly commented 2 years ago

ok a big 10-4 on that LOL

Edit; It seems that part of the config cannot be edited by lowly users... Queried the grand poobah of AzuraCast on this... Will keep you informed :P

BusterNeece commented 2 years ago

@SpeedyCharly I've updated the Rolling Release to add the relevant logging information requested by @toots above; if you don't mind updating your installation, that can help us with the debugging process.

SpeedyCharly commented 2 years ago

@SpeedyCharly I've updated the Rolling Release to add the relevant logging information requested by @toots above; if you don't mind updating your installation, that can help us with the debugging process.

Update done and put all jingle playlists full on with Smart crossover also on.... will report on how it goes ....

SpeedyCharly commented 2 years ago

Woke up to a totally dead stream this morning. Restart wouldn't even bring it back online. Killed all the JINGLE playlists save 1 then went to OVH console to reboot the server from there. Stream came back on just in time for the start of live programming.
Here is a log showing the last part of that story; Editted log.txt

toots commented 2 years ago

Hi. Sorry to hear. I think we'd need the part of the logs before it went dead, that would give us some hints about what caused it.

What I see about your restart is that the telnet port was still used. Most likely, a process was still hanging and using it.

SpeedyCharly commented 2 years ago

[What I see about your restart is that the telnet port was still used. Most likely, a process was still hanging and using it.] This is what I assumed also, which is why I did a hard reboot...

The log didn't contain anything save what you see prior to the restart. Guess it was pushed out the top by all these error lines...

All it showed was +16,000 lines of this; .... Called from file "src/dtools_impl.ml", line 458, characters 10-35 Re-raised at file "src/dtools_impl.ml", line 466, characters 8-15 Called from file "list.ml", line 110, characters 12-15 Called from file "src/dtools_impl.ml", line 458, characters 10-35 Re-raised at file "src/dtools_impl.ml", line 466, characters 8-15 Called from file "main.ml" (inlined), line 566, characters 15-51 Called from file "runner.ml", line 25, characters 9-22 Fatal error: exception Error while trying to bind server/telnet socket: Address already in use in bind() Raised at file "tools/server.ml", line 463, characters 16-59 Called from file "tools/server.ml", line 477, characters 19-34 Called from file "tools/lifecycle.ml", line 33, characters 8-14 .....

Vaalyn commented 2 years ago

@SpeedyCharly If the logs got rotated due to the huge amount of errors you can still get the previous versions of the log file.

The rotated log files will be named liquidsoap.log.1, liquidsoap.log.2, liquidsoap.log.3 etc... The higher the number the older the log file which goes up to 10.

Here is the command to copy the log files from the stations container to the server:

docker cp azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1

You will need to replace yourstationname in that command with the path from your station which can be found in the station profile in the administration tab. (Added an image to better show what part I mean)

This will copy the respective log file (example liquidsoap.log.1) to the /var/azuracast directory of the server. Depending on when it crashed it might be that you'll need to get more than one old log file.

SpeedyCharly commented 2 years ago

LOL... having a lovely Holiday season thus far... You?
Permission denied :

Vaalyn commented 2 years ago

Oh, if you are not using the user root you'll need to add sudo to the command like this:

sudo docker cp azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1

Or switch to the root user via sudo su and then run the command.

SpeedyCharly commented 2 years ago

I was sure my login was root... after sudo su I get another error...

root@vps-cff8157d:/home/ubuntu# docker cp azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1 Error: No such container:path: azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1

This because my Azuracast instance isn't a standard install which I inherited from someone else, and I have no idea where things are .... took me forever to find the docker location to be able to do updates...

Vaalyn commented 2 years ago

root@vps-cff8157d:/home/ubuntu# docker cp azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1

You need to change yourstationname to radio_des_festivals in the command like in your screenshot:

docker cp azuracast_stations:/var/azuracast/stations/radio_des_festivals/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1

SpeedyCharly commented 2 years ago

Same difference....

And, there is no response when I input var/azuracast/ which is an empty folder in my install

EDIT.... figured it out... scanning logs now

SpeedyCharly commented 2 years ago

Here is the entire log for the night of 18th to 19th; hopefully you will find a clue in there somewhere Night of 18th to 19th.txt

Vaalyn commented 2 years ago

I think we'd need the part of the logs before it went dead, that would give us some hints about what caused it.

@toots this is most likely the part of the logs that you are looking for:

2021/12/19 03:30:55 [next_song:3] Prepared "/var/azuracast/stations/radio_des_festivals/media/Vedettes/Vicky Chagnon  Cowboys sweetheart.mp3" (RID 2).
2021/12/19 03:30:55 [cue_next_song:3] Cueing in...
2021/12/19 03:30:55 [lang:3] autodj_next_song: Sending AzuraCast API Call...
2021/12/19 03:30:55 [crossfade_0:3] Analysis: -17.829847dB / -infdB (1.36s / 1.36s)
2021/12/19 03:30:55 [crossfade_0:3] Simple transition: crossed, fade-in, fade-out.
2021/12/19 03:30:55 [lang:3] AzuraCast Feedback Response: OK
2021/12/19 03:30:55 [lang:3] autodj_next_song: AzuraCast API Response: annotate:title="Une poupÃ©e pour NoÃ«l",artist="Joane Bluteau",duration="136.00",song_id="aad27972f0ae3ce729a9f504c2e3560c",media_id="2843",liq_amplify="0.00dB",playlist_id="12":/var/azuracast/stations/radio_des_festivals/media/Noel Country Fr/11 Une poupee pour Noel.mp3
[mp3 @ 0x7efca0045300] Estimating duration from bitrate, this may be inaccurate
[mp3 @ 0x7efca0048200] Estimating duration from bitrate, this may be inaccurate
2021/12/19 03:30:55 [decoder.id3v2:2] Error while decoding file tags: (Invalid_argument "String.sub / Bytes.sub")
2021/12/19 03:33:26 [decoder:2] Decoding "/var/azuracast/stations/radio_des_festivals/media/Vedettes/Vicky Chagnon  Cowboys sweetheart.mp3" ended: Ffmpeg_decoder.End_of_file.
[mp3 @ 0x7efc9c4298c0] Estimating duration from bitrate, this may be inaccurate
[mp3 @ 0x7efc9c4298c0] Estimating duration from bitrate, this may be inaccurate
2021/12/19 03:33:27 [next_song:3] Prepared "/var/azuracast/stations/radio_des_festivals/media/Noel Country Fr/11 Une poupee pour Noel.mp3" (RID 1).
2021/12/19 03:33:27 [cue_next_song:3] Cueing in...
2021/12/19 03:33:27 [lang:3] autodj_next_song: Sending AzuraCast API Call...
2021/12/19 03:33:27 [crossfade_0:3] Analysis: -infdB / -21.413634dB (1.48s / 1.48s)
2021/12/19 03:33:27 [crossfade_0:3] Simple transition: crossed, fade-in, fade-out.
2021/12/19 03:33:27 [lang:3] AzuraCast Feedback Response: OK
2021/12/19 03:33:27 [lang:3] autodj_next_song: AzuraCast API Response: annotate:title="L'hiver a chasser l Hirondelle",artist="Georges Hamel",duration="198.00",song_id="d3d7d344481807b9c16923b21b804193",media_id="2805",liq_amplify="0.00dB",playlist_id="12":/var/azuracast/stations/radio_des_festivals/media/Noel Country Fr/08 Lhiver a chasser l Hirondelle.mp3
[mp3 @ 0x7efca0048200] Estimating duration from bitrate, this may be inaccurate
[mp3 @ 0x7efca0048200] Estimating duration from bitrate, this may be inaccurate
Fatal error: exception Error while trying to bind server/telnet socket: Address already in use in bind()
Raised at file "tools/server.ml", line 463, characters 16-59
Called from file "tools/server.ml", line 477, characters 19-34
Called from file "tools/lifecycle.ml", line 33, characters 8-14
Called from file "tools/lifecycle.ml", line 33, characters 8-14
Called from file "tools/lifecycle.ml", line 33, characters 8-14
Called from file "tools/lifecycle.ml", line 33, characters 8-14
Called from file "tools/lifecycle.ml", line 33, characters 8-14
Called from file "src/dtools_impl.ml", line 455, characters 10-16
Re-raised at file "src/dtools_impl.ml", line 466, characters 8-15
Called from file "list.ml", line 110, characters 12-15
Called from file "src/dtools_impl.ml", line 458, characters 10-35
Re-raised at file "src/dtools_impl.ml", line 466, characters 8-15
Called from file "list.ml", line 110, characters 12-15
Called from file "src/dtools_impl.ml", line 458, characters 10-35
Re-raised at file "src/dtools_impl.ml", line 466, characters 8-15
Called from file "main.ml" (inlined), line 566, characters 15-51
Called from file "runner.ml", line 25, characters 9-22

The only direct error I can see in there seems to be this line (not sure how/if it's related to the telnet server though):

[decoder.id3v2:2] Error while decoding file tags: (Invalid_argument "String.sub / Bytes.sub")

toots commented 2 years ago

Yes, I don't see anything obvious. What seemed to have happened here is a crash of liquidsoap, followed by an automatic restart that kept failing b/c the telnet port was still in use.

I just looked back at the code and all the sockets involved in the telnet stack are create with the close_on_exec flag. We used to have issues with external processed sometimes inheriting opened sockets, thus leaving them in use after liquidsoap crashed but this shouldn't be the case.

I'm afraid to ask but we would need to elevate the log levels in this case:

log.level.set(4)

At the beginning of the script.

Also, if the crash is a deadlock, we'd have to get ourselves familiar with gdb but let's get to it when we really need it and start with increasing the log level.

SpeedyCharly commented 2 years ago

[I'm afraid to ask but we would need to elevate the log levels in this case: log.level.set(4)]

That one was easy... done will report later - Can only report at night as stream is live during daytime hours...

No issues seen last night as I had disabled all Jingle playlists

toots commented 2 years ago

Thks!

ItsMitchh commented 2 years ago

I'm currently encountering this issue right now, despite trying to forcefully restart the station via CLI, the stations just unable to respond and continue to spin out the same errors are Speedy.

As soon as one station runs into the issue, all of my others run into the same issue and a chain crash occurs when I'm unable to recover from it. The only recovery option is to restart the entire process (azuracast) which isn't ideal.

Something from Glances:

2021-12-20 22:01:16 (ongoing) - LOAD (Min:1.0 Mean:1.4 Max:1.7)
 2021-12-20 22:00:12 (ongoing) - CPU_USER (Min:76.8 Mean:88.7 Max:95.8): liquidsoap, liquidsoap, liquidsoap
2021-12-20 21:59:40 (0:00:23) - CRITICAL on CPU_USER (92.4): supervisord, liquidsoap, liquidsoap

SpeedyCharly commented 2 years ago

@toots Not sure what is going on here but it seems my limited access to the server is blocking me. Issuing the command [sudo docker cp azuracast_stations:/var/azuracast/stations/radio_des_festivals/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1] via putty but the logs already present in /var/azuracast/ don't get overwritten and accessing the server with Filezilla is no help as the delete command is denied - seems I only have reading rights...

The guy I inherited this server from had assured me I had full sudo/sftp access, but it seem not. Not sure what to do from here.... sorry !

ItsMitchh commented 2 years ago

@toots these logs may be useful for you as Log level 4. Caught this almost immediately https://gist.github.com/SC2Mitch/c2e796ed9a63733ad7ce6df3d6922d52

SpeedyCharly commented 2 years ago

@SC2Mitch Thanks... looks very similar to what I'm seeing on my install...

toots commented 2 years ago

Thanks @SC2Mitch that's very useful. Are you able to share more of what happens before?

ItsMitchh commented 2 years ago

Yep, I'm unsure on what exactly your looking for but here's some extended logs https://gist.githubusercontent.com/SC2Mitch/c2e796ed9a63733ad7ce6df3d6922d52/raw/6e3d400c19a52f3d5e1923e79b0bbf1d88e17516/Extended%2520logs

and here's the entire raw log file (9000+ lines) https://gist.githubusercontent.com/SC2Mitch/c2e796ed9a63733ad7ce6df3d6922d52/raw/6e3d400c19a52f3d5e1923e79b0bbf1d88e17516/Entire%2520log%2520file

toots commented 2 years ago

Thanks! I'm suspecting that something is blocking either the main streaming thread or the background processing queues.

I see this in the logs that looks suspicious:

2021/12/21 16:24:45 [lang:3] dj_auth: Sending AzuraCast API DJ Auth command for user: source
2021/12/21 16:24:48 [harbor:4] New client on port 8196: 167.248.133.44
2021/12/21 16:24:49 [lang:3] dj_auth: Sending AzuraCast API DJ Auth command for user: source
2021/12/21 16:24:49 [lang:3] dj_auth: AzuraCast API Response: false
2021/12/21 16:24:49 [harbor:4] ICY error: invalid password
2021/12/21 16:24:52 [harbor:4] New client on port 8196: 167.248.133.44
2021/12/21 16:24:52 [harbor:4] Connection reset by peer in read()
2021/12/21 16:24:52 [harbor:4] New client on port 8196: 162.142.125.58
2021/12/21 16:24:53 [harbor:4] Connection reset by peer in read()
2021/12/21 16:24:53 [harbor:4] New client on port 8196: 162.142.125.58
2021/12/21 16:24:53 [lang:3] dj_auth: Sending AzuraCast API DJ Auth command for user: îêüþkdtúå$tç¯GCÓ£HìxJÇNõ;A0 ·Q«S&^uªiú*wA¾ÃVz8Ív²&Ì¨Ì©À/À0À+À

What's the code around this log: Sending AzuraCast API DJ Auth command for user: ?

ItsMitchh commented 2 years ago

Snip, Silver replied with the code.

BusterNeece commented 2 years ago

dj_auth: Sending AzuraCast API DJ Auth command for user: ��î�ê��üþkdtúå$tç¯GCÓ£��H�ì�xJÇNõ��;A0 ·Q��«�S&^uªi�ú*�wA��¾��ÃVz8Ív�²&Ì¨Ì©À/À0À+À

Well, that's exciting...

Here's the relevant code for DJ auth:

def dj_auth(login) =
    user = ref("")
    password = ref("")

    if (login.user == "source" or login.user == "") and (string.match(pattern="(:|,)+", login.password)) then
        auth_string = string.split(separator="(:|,)", login.password)

        user := list.nth(default="", auth_string, 0)
        password := list.nth(default="", auth_string, 2)
    else
        user := login.user
        password := login.password
    end

    log("dj_auth: Sending AzuraCast API DJ Auth command for user: #{!user}")
    ret = list.hd(process.read.lines(env=[("DJ_USER", !user), ("DJ_PASSWORD", !password), ("API_AUTH", !azuracast_api_auth)], 'curl -s --request POST --url http://web/api/internal/3/auth --form dj-user="$DJ_USER" --form dj-password="$DJ_PASSWORD" --form api_auth="$API_AUTH"'), default="")
    log("dj_auth: AzuraCast API Response: #{ret}")

    authed = bool_of_string(ret)
    if (authed) then
        last_authenticated_dj := !user
    end

    authed
end

SpeedyCharly commented 2 years ago

Finally was able to retrieve the logs from yesterday's (21) crash ; 21-12 Crash.txt Here is the log of the hard reboot @ around 8:30 same morning; 21-12 Reboot.txt

Same story repeated itself overnight (21>22) Seems to always crash about 4 hrs after end of live broadcast... Something must be incrementally eating at CPU resources, forcing the crash after 4 hrs of AutoDJ operation...

BusterNeece commented 2 years ago

@SpeedyCharly We did do some updates yesterday afternoon to establish a timeout for some of the API calls; make sure you're on at least version 82a8d1c.

SpeedyCharly commented 2 years ago

[We did do some updates yesterday afternoon to establish a timeout for some of the API calls; make sure you're on at least version 82a8d1c.]

Will do at end of live streaming tonite - thanks SlvrEagle

Just a little while ago had a little episode during live streaming - not sure if related to the current issue - looks like potential hacker activity to me.... Here is the relevant log snippet; Weird episode.txt Seems to me SC2Mitch had something similar yesterday...

SpeedyCharly commented 2 years ago

Was unable to proceed with update as the process crashed on fatal error...

Please advise.... thanks

BusterNeece commented 2 years ago

@SpeedyCharly This has been fixed as of the latest update.

SpeedyCharly commented 2 years ago

@SlvrEagle23 Dankesheun mein Herr; update done

SpeedyCharly commented 2 years ago

@SlvrEagle23 @toots Happy to report that my station ran fine all of last night and wasn't down this morning, for the first time in a while. Thanks to the both of you for seeing this through!

ItsMitchh commented 2 years ago

I believe the commit silver applied has resolved the issue. Had no downtime for the past 2 days and system resources are very healthy.

toots commented 2 years ago

Glad to hear. Marking this as fixed.

savonet / liquidsoap

Unusually high CPU usage by Liquidsoap process #2098

I was sure my login was root... after sudo su I get another error...

root@vps-cff8157d:/home/ubuntu# docker cp azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1 /var/azuracast/liquidsoap.log.1 Error: No such container:path: azuracast_stations:/var/azuracast/stations/yourstationname/config/liquidsoap.log.1