karlheyes / icecast-kh

KH branch of icecast
GNU General Public License v2.0

100% CPU after network outage #310

Open DuncanFairley opened 4 years ago

DuncanFairley commented 4 years ago

Hi,

There were two network outages with my ISP today that, after they were resolved, caused icecast to stop responding to traffic. It started using 100% cpu on one of the cores, and didn't log anything for the duration. I had to kill -9 it both times.

I'm currently on 2.4.0-kh13 1592e0e on ubuntu 18.04. Is there anything I can run next time I find it in this state to get useful debug data? Alternatively, is it likely that this issue might already be fixed in kh14?

karlheyes commented 4 years ago

You can run strace -o output -tt -ff -p $(pidof icecast), and if you have a debug build, do a gcore $(pidof icecast).
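Spelled out as a sketch (the output prefix and the pidof lookup are just examples):

# attach to the running process, timestamp each call and write one trace file per traced thread
strace -o output -tt -ff -p $(pidof icecast)
# let it run for a few seconds while the CPU is pegged, then Ctrl-C to detach
# with a debug build, also capture a core image of the process
gcore $(pidof icecast)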

I'm not aware of any fix since then that definitely resolves a busy loop. There are cases of possible random effects which may show as a busy loop, although usually they show as a crash. kh14 is more stable, but the resolved instabilities tend to only occur if certain features are enabled and certain race conditions happen. Sometimes it is dependent on compiler optimization and/or hardware. The last entries of the error log may help.

karl.

Swiftgti commented 4 years ago

Karl, I am having the same issue in kh14, but it occurs when the ISP of the incoming mount point fails while we have some thousands of users connected. Is it possible that using AAC+ audio compression is the problem, and that's why you don't see the issue? Also, you asked the other day how many SSL connections we have... it's only about 10% of all the traffic that goes through SSL.

karlheyes commented 4 years ago

For a busy loop to occur on AAC+ only, it would need to be in the frame-alignment processing, and that would be hard to see tripping just on a timeout case. Obviously in a busy loop the strace will probably not show much, but a gcore would identify where it is.

While I don't ever rule things out completely, I don't yet see a case of AAC+ and a timeout causing a busy loop, but it could be down to something like listener details not getting cleared out. Again, the more detail the better.

karl.

Swiftgti commented 4 years ago

Karl, if I forward the traffic with a fallback mountpoint to a file, is it possible that could mitigate this issue? Because I don't know if this happens when listeners request the mountpoint without a source, or instantly when the source gets disconnected.

karlheyes commented 4 years ago

If there is no fallback present, any new listeners will either get a 404, or if the mount is active at the time of joining then they will get headers, then no content, and a disconnection. There is a point where listeners are moved/dropped, which is where there could be a busy-loop case, but only the core file will tell me that for sure. Even the logs will indicate something in that situation.
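For illustration, a quick way to check which of those two behaviours new listeners are actually getting while the source is down (host, port and mount below are placeholders):

# show the response headers; expect either a 404, or a 200 with headers and then an early disconnect
curl -v http://stream.example.com:8000/live -o /dev/null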

karl.

Swiftgti commented 4 years ago

Karl, today we noticed that our servers have 4 cores, but when icecast crashes, only 2 icecast processes go to 100%+.

Does this give you any idea? Or is it normal? Because we also noticed that only 2 processes ever have real CPU usage! The other icecast processes remain at 0%.

karlheyes commented 4 years ago

Constant 100% is not something wanted and does indicate an unwanted loop. I suspect you mean the threads of the process taking most of the time, with a group that matches the workers setting, e.g. 2. There are other threads which tend to be blocking, so they don't tend to be high CPU.

The workers, though, deal with the normal processing of clients, and those can be busy, although 100% is maxing out. I'd like to know what state those threads are in when it gets into that condition.
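For reference, a minimal way to capture what state those threads are in when it happens (assumes a single icecast process found via pidof; the exact fields are just a suggestion):

# live per-thread CPU view of the icecast process
top -H -p $(pidof icecast)
# one-shot snapshot of thread IDs, CPU share, scheduler state and kernel wait channel
ps -L -o tid,pcpu,state,wchan:20,comm -p "$(pidof icecast)"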

karl.

Swiftgti commented 4 years ago

In peak audience hours, the 2 processes that consume CPU are in a range of 30 to 50% usage.

They reach 100% only when it crashes. And again, it doesn't necessarily crash during peak audience hours.

Do you think that switching the compression from AAC+ to MP3 could prevent this issue?

karlheyes commented 4 years ago

The AAC/MP3 difference will be minor, as there is not much to loop over really; you either find a frame or you don't. As there are no errors/warnings reported, it is unlikely to be a factor. There may be some indirect causal effects, like the fact that AAC tends to be low bitrate, so the usage pattern changes and might be exposing a race condition elsewhere, but the number of listeners may also play a part.

karl.

Swiftgti commented 4 years ago

Karl, we have 4 cores on the servers. Do you recommend setting the workers limit to 4, or to 3 for safety, leaving CPU for other tasks? Currently we were not setting this value at all.

I appreciate your recommendations. Thanks, and have a good weekend!

karlheyes commented 4 years ago

This depends on what you are running. If the box is purely (or mostly) icecast related, then workers set to 4 is fine. You may have httpd or a DB that is reasonably busy, so you may prefer to use fewer. Icecast itself tries to balance the load between the workers but tries to keep certain clients together, so if there is just 1 stream then 1 worker will tend to get busier than the others; if you have several then they can be dispersed, so up to 4 is fine in most cases. One key element is that icecast is mainly IO bound, so it should not really use much CPU.

karl

Swiftgti commented 4 years ago

Karl, as you said, MP3 or AAC is the same for this issue. BUT we have new information: the 100% CPU problem is not happening when the source disconnects, it's happening when the source reconnects to the server and you have thousands of users retrying to reconnect. A few seconds after the source reconnects, the icecast process goes to 100% and stays there forever. We restart icecast and again, instantly after the source reconnects, it goes to 100% CPU, and so on until fewer users are trying to reconnect and icecast can handle it.

Here is the data and some strange facts: 1) This happens with only 3,000 simultaneous users, and our servers are big 4-core machines. 2) When icecast is running fine, regardless of how many workers we define, we always have one icecast process that consumes the same CPU as the sum of the workers. Example: we have set 3 workers; we will see 4 icecast processes with real CPU usage. Let's say these 3 workers each sit at 20% (it's an example); the 4th process I am talking about will show 60% CPU usage. So regardless of how many cores we have, we are limited to one process running on one core, which obviously brings icecast down when it reaches 100% CPU. So now the question is: is this a crash bug, or is it a bug or misconfiguration that, regardless of how many cores and workers we set, sends a big part of icecast's load to one process, limiting how many concurrent users we can handle when listeners reconnect?

We are 100% sure now that the 100% CPU crash of icecast is not when the source disconnects, it is when the source reconnects and users try to reconnect to the stream.

Thanks in advance

karlheyes commented 4 years ago

The behaviour of the workers is as expected. Most listeners on a stream are kept together, which helps the locking aspect, but there is always a tipping point where other threads get used. Other streams should migrate to those lesser-used workers. At 100%, though, you should only see a slowing down of the content coming through, not an actual crash/core case.

In the description, though, some clarity is needed. Do you have a fallback configured and running so that listeners stay connected on source disconnect, and therefore the reconnect triggers an override? Or are the listeners dropped and they all reconnect?

karl

Swiftgti commented 4 years ago

Hi, no, we are not using a fallback, so all listener connections are dropped and they all need to reconnect.

karlheyes commented 4 years ago

Can you describe what happens based on the logs in such a case? On disconnect, the listeners should all get disconnected (no fallback is set). They all try to reconnect. At this point there is a question of whether the source has reconnected; if not, then a lot of them should get a 404-type response, and the access log should report a lot of those. Are they then reconnecting on a 404 return, like a try-again repeat loop? If the source has already reconnected then they should get content (any intros/auth configured?), but the moving of the listener between workers may get delayed. I suspect the busy thread is the incoming worker.
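As a rough check of the access log during that reconnect window (the path is the global log file shown in the excerpt later in this thread; the field layout depends on the configured log format, so treat this as a sketch):

# count how many 404 responses the access log recorded
grep -c ' 404 ' /usr/local/etc/log/access.log
# and look at the most recent ones to see if the same clients keep retrying
grep ' 404 ' /usr/local/etc/log/access.log | tail -n 20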

karl

Swiftgti commented 4 years ago

Karl, we have two issues: 1) when all the users want to reconnect — OK, we have now added a fallback to prevent that; 2) we still have the issue that the CPU goes to 100% on source disconnect, regardless of whether there is a fallback mount or not. This is from today, on a server with a fallback, after you asked about fallbacks. Here is the log; as you will see, icecast goes abruptly to 100% CPU without generating any log entries, and you will see in the log that after about 30 seconds we restarted icecast. This server had a fallback mount point configured to another source that was working fine at that moment. If you want, we could use an MP3 file as the fallback; just tell us and we will do it.

[2020-06-16 13:07:54] INFO source/source_read listener count on /XXX now 3059
[2020-06-16 13:07:54] INFO source/source_read listener count on /XXX now 3058
[2020-06-16 13:07:54] INFO source/source_add_listener max on /XXX is 20000 (cur 3057)
[2020-06-16 13:07:54] INFO source/source_read listener count on /XXX now 3056
[2020-06-16 13:07:54] INFO source/source_add_listener max on /XXX is 20000 (cur 3056)
[2020-06-16 13:08:18] INFO thread/ lock abort set to 0
[2020-06-16 13:08:18] INFO logging/recheck_log_file Using global log file /usr/local/etc/log/access.log
[2020-06-16 13:08:18] INFO main/server_proc_init Icecast 2.4.0-kh14 server reading configuration from /usr/local/etc/icecast.xml

Thanks!!!

karlheyes commented 4 years ago

Still, a core file from a debug build will be helpful to be clear on the busy-loop case. I've been testing the case you outlined and did notice a behavioural issue with load distribution. I've been tightening up the code around balancing between workers, and previous post-kh14 commits on cleaning up the locking code might provide a result. I have not been able to reproduce a 100% busy-loop situation, so it is hard to confirm this is resolved, but certain hardware setups have been known to expose cases that are hard to trigger elsewhere. If you can, build the master tree code (it just has fixes since kh14) with "make debug" and run with that. If there is a hang (i.e. where a restart is required), then run a gcore on it first.
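A sketch of those steps, assuming a typical autotools checkout (the autogen.sh bootstrap is an assumption, not part of Karl's instructions):

# grab the master tree, which only has fixes on top of kh14
git clone https://github.com/karlheyes/icecast-kh.git
cd icecast-kh
./autogen.sh && ./configure     # assumed bootstrap for a git checkout
make debug                      # the debug build mentioned above
# run it with the existing config, and if it hangs again, capture a core before restarting
gcore $(pidof icecast)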

karl

TheRealMattLear commented 3 years ago

Hi @karlheyes, we have a customer who is experiencing a similar problem on kh15 with a small handful of Icecast KH services. I can see multiple threads, with one at 100%; however, the service is completely unresponsive. I've run strace while the issue is occurring and there is no output. It seems to affect only a small number of services; from what I can tell, the affected services were running with over 500 listeners. They have also noted it appears to occur after disconnection of the source encoder.

I will be working with the customer to recompile icecast and capture a core dump when the issue occurs. Please let me know if there is anything further I can do/try.

karlheyes commented 3 years ago

Thanks, any info to track these issues down will be good. Obviously a source disconnect can trigger a number of events depending on the configuration, typically with many listeners and if there is a fallback involved. A source client cleanup does involve a reschedule for its listeners to clean up. Assuming no memory corruption has occurred, an empty strace of the busy thread would indicate a busy loop in one of the processing clients, maybe a counter having a silly number. A stack trace from a core file will help pinpoint such things.

karl.

TheRealMattLear commented 3 years ago

Hi Karl,

(Deleted previous post, as I actually think the process might have locked up while I was trying to shut it down with kill -TERM XXXX, so the gcore dump I took was not of the process running the debug build.)

The core dump collected here is from a process that hit the issue within seconds of starting up: 100% CPU, and access to the icecast page just times out. The core dump is gzipped here: https://we.tl/t-Qi0MS6MvDS

Also, the customer requested to use icecast kh10, since their previous system used it without issues. I've compiled icecast kh10 and restarted their services a few times now and have been unable to replicate the problem. Yet the second I restart onto icecast kh15, the CPU is running at 100% on 1 or more processes (they have 145 icecast-kh processes running).

gcore & GDB output as follows; I hope this helps, because I have no idea what I'm doing here :)

[root@ ~]# gcore 27450
[New LWP 27461]
[New LWP 27460]
[New LWP 27459]
[New LWP 27458]
[New LWP 27457]
[New LWP 27456]
[New LWP 27455]
[New LWP 27454]
[New LWP 27453]
[New LWP 27452]
[New LWP 27451]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fa85db5de9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
warning: target file /proc/27450/cmdline contained unexpected null characters
Saved corefile core.27450
[Inferior 1 (process 27450) detached]
[root@ ~]# gdb /usr/local/mediacp/icecastkh/bin/icecast core.27450
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/mediacp/icecastkh15/bin/icecast...done.
[New LWP 27451]
[New LWP 27452]
[New LWP 27453]
[New LWP 27454]
[New LWP 27455]
[New LWP 27456]
[New LWP 27457]
[New LWP 27458]
[New LWP 27459]
[New LWP 27460]
[New LWP 27461]
[New LWP 27450]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/mediacp/icecastkh/bin/icecast'.
#0  0x00007fa85d874c3d in poll () at ../sysdeps/unix/syscall-template.S:81
81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Maybe of interest, but I do notice the problem seems to hit this specific service quite frequently. So far, every time we've restarted to debug on kh15, this server has crashed, among a couple of others.

Configuration file as below, with passwords blanked out with asterisks; sorry it's a bit messy, it's a generated file. I did also note that the customer was leaving the relay password empty; not sure if that could cause a problem anywhere.

<icecast>
    <limits>
        <clients>99999</clients>
        <sources>1</sources>
        <queue-size>524288</queue-size>
        <client-timeout>30</client-timeout>
        <header-timeout>15</header-timeout>
        <source-timeout>10</source-timeout>
                <burst-on-connect>0</burst-on-connect>
        <burst-size>65535</burst-size>
        <workers>8</workers>
    </limits>

    <authentication>
        <source-password>****</source-password>
        <relay-password></relay-password>
        <admin-user>admin</admin-user>
        <admin-password>****</admin-password>
    </authentication>

        <hostname>****</hostname>

    <listen-socket>
        <port>8006</port>
    </listen-socket>

    <http-headers>
        <header name="Access-Control-Allow-Origin" value="*" />
    </http-headers>

    <directory>
            <yp-url-timeout>15</yp-url-timeout>
            <yp-url>http://dir.xiph.org/cgi-bin/yp-cgi</yp-url>
        </directory>
        <public></public>
<mount>
  <mount-name>/stream</mount-name>
  <charset>UTF-8</charset>
</mount>
        <fileserve>1</fileserve>
    <paths>
        <logdir>/usr/local/mediacp/content/user_5/icecast_kh/8006_10/logs</logdir>
        <webroot>/usr/local/mediacp/content/user_5/icecast_kh/8006_10/web</webroot>
        <adminroot>/usr/local/mediacp/content/user_5/icecast_kh/8006_10/admin</adminroot>
                <pidfile>/usr/local/mediacp/content/user_5/icecast_kh/8006_10/icecast.pid</pidfile>
        <alias source="/" dest="/index.html"/>
        <ssl-certificate>/usr/local/mediacp/icecast2/icecast.cert</ssl-certificate>
        <ssl-allowed-ciphers>ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS</ssl-allowed-ciphers>
        <x-forwarded-for>127.0.0.1</x-forwarded-for>
    </paths>

    <logging>
        <accesslog>access.w3c</accesslog>
        <errorlog>error.log</errorlog>
        <playlistlog>playlist.log</playlistlog>
        <loglevel>2</loglevel>
                <logsize>10000</logsize>
    </logging>
</icecast>

Some errors that might be relevant:


[2020-10-06  21:31:57] EROR yp/send_to_yp connection to http://dir.xiph.org/cgi-bin/yp-cgi failed on /radiocaravana with "Failed connect to dir.xiph.org:80; Connection refused"
[2020-10-06  21:31:57] WARN yp/yp_process_server error detected, backing off on http://dir.xiph.org/cgi-bin/yp-cgi
[2020-10-06  21:32:22] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:27] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:32] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:37] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:42] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:47] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:52] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:32:57] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:02] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:07] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:12] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:17] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:22] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:27] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:32] EROR client/log_commit_thread logger received code 0
[2020-10-06  21:33:37] EROR client/log_commit_thread logger received code 0

TheRealMattLear commented 3 years ago

Please let me know if you require anything further

karlheyes commented 3 years ago

I was dealing with something else, so I missed this. The log messages look to be a timeout case, but I will check the code path for that case. With the core file, can you load it back into gdb again (gdb /path/to/icecast15 core.xxxx), run the command "thread apply all bt" and report the output (it could be a lot, depending on the status at the time)? While providing a core file can be helpful, it really requires the exact platform to check, as there are shared libs involved.
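For reference, a sketch of that session with the output captured to a file (binary and core file names follow the transcript above; the "set logging" syntax matches older gdb releases like the 7.6 shown there):

gdb /usr/local/mediacp/icecastkh15/bin/icecast core.27450
(gdb) set logging file backtraces.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off
(gdb) quit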

karl.