ICE thread sometimes peg CPU

gfodor commented 4 years ago

Hi all, I am an engineer working on the same project as @mqp was when this issue was opened: https://github.com/meetecho/janus-gateway/issues/1260

At the time, that issue was closed due to the fact the CPU pegging behavior we had seen seemed to discontinue, but in fact it hadn't, and we never followed up. In the interim, we decided to wait and see how things would fare when we upgraded janus and its various dependencies.

Unfortunately, after the upgrade, we still see the behavior of pegged threads. Some relevant info:

We have upgraded to janus 0.7.6 - we cannot fully upgrade to the latest version due to plugin incompatibilities that need to be addressed. I realize it's frowned upon to open bugs for older versions, but at least would like to know if there is reason to believe this problem is fixed in a newer version so we can prioritize the plugin migration process.
We have upgraded the various dependencies, notably, we have upgraded libnice to master HEAD as of a few days ago.

The behavior we see is occasionally an hloop thread gets pegged, reporting ~50% CPU utilization in ps and one of the CPUs becomes pegged at 100%. Attached is a perf trace (perf.zip, showing that the time is spent polling in g_poll. These events seem relatively rare, and are hard to reproduce. We managed to reproduce it once under load testing but in general there's no systematic way to do it - it seems to be a very low probability event, though it does seem to happen more frequently against production traffic vs artificial load testing (our load test uses headless Chrome browsers.)

On one of our production nodes, we managed to grab all the relevant metadata from the admin API re: handles. The only fishy thing was one session had a relatively large number of handles (25 or so) compared to others and all of them but one had false flags around offer negotiation. Other sessions had similar non-negotiated handles but only this one 'degenerate' session had so many handles, and only a single successfully negotiated handle. Hard to say if this is useful info or not. Here is a dump of the handles: all_handles.zip I did a bunch of ad hoc analysis to see if there were any telltale signs that a session was 'corrupted' based upon its metadata - the best I could come up with is that there is a session with only a single handle successfully negotiated with a lot of other non-negotiated ones - the other sessions all had at least two successfully negotiated handles.

Some things we tried/discovered:

Forcibly detaching the handles and/or destroying the sessions from the admin API failed to resolve the CPU being pegged.
The CPU pegging does not result in any other negative side effects other than capacity reduction - of course, once all the CPUs become pegged, the server becomes unavailable, but we are mitigating this problem for now by running servers with many CPUs and doing nightly restarts.
The problem manifests if a thread is spawned for each handle, or if we enable the new fixed thread feature (we are running 128 threads in production.)
Running lsof shows that each thread has the same number of open file handles, and the non-eventd ones are identical (files, etc.)
Digging through older issues, with the caveat that I realize these may be long resolved (could not determine it), I found a few relevant things:
- This issue with libnice: https://gitlab.freedesktop.org/libnice/libnice/issues/14 - there did not seem to be a way to determine if/when this degenerate case occurs, though our stack traces from perf seem consistent. We are not running this under call grind so there's not a clear way to see non-sampled call counts.
- This issue from a year ago seems consistent as well, though again I can't be 100% certain: https://gitlab.freedesktop.org/libnice/libnice/issues/72

Looking at the janus code, the thing I was wondering if it is possible that the libnice issue mentioned above occurs if a call to nice_agent_attach_recv with NULL is made before a callback is registered. looking at janus, that seems possible here, since the attachment is skipped if candidate gathering fails (but detachment always occurs later) https://github.com/meetecho/janus-gateway/blob/master/ice.c#L3469. Again this is mostly a guess, I am super unfamiliar with all of this, but wanted to mention it!

Thanks for any assist!

gfodor commented 4 years ago

This is another relatively ancient chain I found, given that everything around this is consistent with what we're seeing it's interesting, but I also assume the underlying problems here were fixed: https://gitlab.freedesktop.org/libnice/libnice/issues/18

atoppi commented 4 years ago

Hi @gfodor, unfortunately we can not inspect other machines perf report but you should be able to get a call tree to understand where that g_poll call comes from. If that is not the case compile janus again with -O1 -ggdb3 -fno-omit-frame-pointer

As of your questions on libnice, I have no idea on what a call to nice_agent_attach_recv with a NULL cb, followed by a call with a defined cb, would trigger. Maybe @lminiero knows something more.

gfodor commented 4 years ago

Ah sorry, attached the txt report from perf:

perf.report.txt

its missing a few symbols but overall might be enough

gfodor commented 4 years ago

also the flow i suggested above was:

ICE candidates fail, so no callback is registered
user disconnects, so nice_agent_attach_recv is called with NULL

i agree tho inverting the order as well may be an issue, but if its spinning my guess is that the terminal state is one where there is no callback registered.

atoppi commented 4 years ago

Calling nice_agent_attach_recv with a null function should be the normal way to unset a callback

gfodor commented 4 years ago

Right, my question was is what happens if that is called before a callback has been set. Tracing through the code it seemed benign, but the fact that Janus does it even when a callback was not registered when ice candidates fail was something I wanted to raise given the libnice issue I linked.

On Mon, Mar 23, 2020 at 2:35 PM Alessandro Toppi notifications@github.com wrote:

Calling nice_agent_attach_recv with a null function should be the normal way to unset a callback

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meetecho/janus-gateway/issues/2015#issuecomment-602869315, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABVW5GEJH64LZKV7RFZ6I3RI7I37ANCNFSM4LSBEQFQ .

atoppi commented 4 years ago

I see, in that case it should be easy to reproduce. As of the perf report, how did you generate it?

atoppi commented 4 years ago

also the flow i suggested above was:

ICE candidates fail, so no callback is registered

user disconnects, so nice_agent_attach_recv is called with NULL

I have simulated this flow, but did not observe any cpu spike.

lminiero commented 4 years ago

Right, my question was is what happens if that is called before a callback has been set.

AFAIK, it simply means callbacks are not invoked, and events are not triggered.

lminiero commented 4 years ago

@gfodor as a side note, are you using ICE-TCP? I seem to remember issues with TCP sockets in libnice that did cause poll crazyness, mostly when trying to get rid of them.

gfodor commented 4 years ago

Nope, we aren’t using tcp I’m afraid. However we were considering turning that on at some point, is that ill advised?

On Fri, Mar 27, 2020 at 9:52 AM Lorenzo Miniero notifications@github.com wrote:

@gfodor https://github.com/gfodor as a side note, are you using ICE-TCP? I seem to remember issues with TCP sockets in libnice that did cause poll crazyness, mostly when trying to get rid of them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meetecho/janus-gateway/issues/2015#issuecomment-605107175, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABVW5FQPA7PRNJYXENBESTRJTKWDANCNFSM4LSBEQFQ .

lminiero commented 4 years ago

Nope, we aren’t using tcp I’m afraid. However we were considering turning that on at some point, is that ill advised?

Not sure, as we don't use them. Thinking about it, though, I think at the time the issue was file descriptors leaking with TCP, rather than CPU spikes.

lminiero commented 4 years ago

also the flow i suggested above was:

ICE candidates fail, so no callback is registered

user disconnects, so nice_agent_attach_recv is called with NULL

i agree tho inverting the order as well may be an issue, but if its spinning my guess is that the terminal state is one where there is no callback registered.

IIRC nice_agent_gather_candidates can only fail if libnice ran out of ports to use (which can happen more easily if you have a short RTP range) on any of the available interfaces, since that's only blocking for the host candidates but not the other ones. Did you see that method failing in your logs? I'll try replicating this with a very short RTP range and a lot of attempted connections: I tried a couple and nothing weird happened.

That said, looking at the issue you mentioned, then making sure we have a callback registered before that nice_agent_gather_candidates fails should help if the analysis Philip made there is correct. Have you already tried switching the two calls to see if it helps? It shouldn't have any harmful impact at all, in Janus itself, as it's cleaned up anyway when we get to janus_ice_webrtc_hangup.

gfodor commented 4 years ago

Ah. We did not see the candidate failure message in the logs, but this has happened so rarely and our log truncation is short we may have missed it. Our RTP range is very wide tho, and we are not under a lot of load, so if that's the only scenario, we can probably rule out my theory completely.

We haven't tried patching janus to mitigate this. The theory I proposed is very weak, but didn't want to fail to mention it - my guess is that it's much more likely something else is going on. I wish I had more theories or data to share tho :(

lminiero commented 4 years ago

At this point I think this would be better addressed in the libnice repo, then, as unfortunately we lack the knowledge to understand its internals.

gfodor commented 4 years ago

Opened here: https://gitlab.freedesktop.org/libnice/libnice/issues/107

sjkummer commented 4 years ago

I am also experiencing situations where Janus is using 100% CPU after a failed ICE connection and never recovers even if the call was ended - only restarting Janus helps.

I wonder: Shouldn't Janus cleanup / release all libnice stuff after a failed call ended?

As an additional info (sorry if this is too general / not analysed in detail): I have never observed CPU issues when using Kurento - which is also using libnice internally - even though we use Kurento in a large number of environments.

// Addendum: In my logs, I see that "WebRTC resources freed" is missing sometimes. Since janus_ice_webrtc_free is not called directly on hangup or session-destroy, but waits for the loop to be finished (as far as I can see). This might be a problem in relation with the endless-loop-problem https://gitlab.freedesktop.org/libnice/libnice/issues/14

Log example including WebRTC resources freed:

[1824004789464592] Telling the plugin about the handle detach (JANUS Streaming plugin)
Removing streaming session...
[1824004789464592] Sending event to transport...; 0x7ff577d565f0
[1824004789464592] Finalizing loop source
[1824004789464592] Handle thread ended! 0x7ff577d565f0
[1824004789464592] Handle and related resources freed; 0x7ff577d565f0 0x7ff577e034d0
[4773164771538039] WebRTC resources freed; 0x7ff579302300 0x7ff579000690

Log example where WebRTC resources freed is never called:

[7873379679739615] Telling the plugin about the hangup (JANUS Streaming plugin)
[janus.plugin.streaming-0x7ff577f026a0] No WebRTC media anymore
[7873379679739615] Telling the plugin about the handle detach (JANUS Streaming plugin)
Removing streaming session...
[7873379679739615] Sending event to transport...; 0x7ff577f032b0
[7873379679739615] Finalizing loop source
[7873379679739615] Handle thread ended! 0x7ff577f032b0
[7873379679739615] Handle and related resources freed; 0x7ff577f032b0 0x7ff579200cf0

sjkummer commented 4 years ago

@lminiero What about never calling nice_agent_attach_recv with a NULL callback and instead just don't touch it until it's freed? (having the libnice bug https://gitlab.freedesktop.org/libnice/libnice/issues/14 in mind)

lminiero commented 4 years ago

Shouldn't Janus cleanup / release all libnice stuff after a failed call ended?

That's exactly what we do already.

I have never observed CPU issues when using Kurento

I'm not going to study the Kurento code, if that's what you're asking.

What about never calling nice_agent_attach_recv with a NULL callback and instead just don't touch it until it's freed?

Feel free to experiment with that, if you can replicate the issue, and report the results. A PR in case would help. I won't have time to work on that myself.

sjkummer commented 4 years ago

Okay, I'll run that change for some days in my affected environment and create a PR if it works.

atoppi commented 4 years ago

@sjkummer in those logs WebRTC resources freed refers to a different handle (4773164771538039).

atoppi commented 4 years ago

Moreover, when the line Handle thread ended appears, that loop should be ended (at least from a glib API perspective), since g_main_loop_run has returned. I'm not sure how this translates in libnice internals though.

Finally note that janus_ice_webrtc_free might return early, skipping Webrtc resources freed:

static void janus_ice_webrtc_free(janus_ice_handle *handle) {
    if(handle == NULL)
        return;
    janus_mutex_lock(&handle->mutex);
    if(!handle->agent_created) {
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_NEW_DATACHAN_SDP);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_READY);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_CLEANING);
        janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_HAS_AGENT);
        janus_mutex_unlock(&handle->mutex);
        return;
    }

sjkummer commented 4 years ago

@atoppi Oups, yes that log section wasn't complete. Adding the complete file here: janus_stripped.txt

It shows that No WebRTC media anymore and Hanging up PeerConnection both were logged 22 times, but WebRTC resources freed appears only once.

Sorry, this log isn't from the newest Janus version. I once ran tests with several Janus versions but didn't keep all the logs.

lminiero commented 4 years ago

No WebRTC media anymore is printed in all cases, whether a PeerConnection was established or not. If you don't see a WebRTC media is now available before it (which is what most if not all plugins write on the setup_media callback), then there was no PeerConnection for that handle.

atoppi commented 4 years ago

@sjkummer have you managed to replicate with the changes you suggested ?

sjkummer commented 4 years ago

What I see so far is encouraging, but I am not completed yet.

Completed:

a minimal change to never call nice_agent_attach_recv with a NULL callback: https://github.com/sjkummer/janus-gateway/commit/e87d534dcdfd5d4eeb83bc44de2a1f68397a2003
run it in an environment that was affected (got 100% CPU with 0.8.2 at least after the 2nd call) but CPU never did go up anymore after deploying the patched janus version - even after multiple calls ✅
run the patched janus it in a 2nd environment which was not affected but requires STUN/TURN on janus side (just to make sure, the libnice handling still works) ✅ libnice 0.1.16 was used in both cases
Asked Olivier Crête if the is any drawback if we just never "unregister" / never call nice_agent_attach_recv with NULL -> https://gitlab.freedesktop.org/libnice/libnice/issues/14
- Seems like a safe choice ✅

Still in progress:

cleanup code and submit PR
building a lab environment where this issue can be reproduced deterministically
- hard too achieve.
- hypothesis: in the affected environment I see an initial RTP/UDP flow which becomes blocked / unidirectional after a short time. Might be some weird firewall / router behavior

lminiero commented 4 years ago

Thanks for the update! I guess you meant 0.9.2 and not 0.8.2 though?

gfodor commented 4 years ago

that's encouraging! thank you for testing that theory @sjkummer

sjkummer commented 4 years ago

@gfodor If you would like to test if that change helps in your case as well, you can find an updated DIFF here: https://github.com/meetecho/janus-gateway/pull/2101/files - this probably can be ported back to 0.7.6 without conflicts

lminiero commented 4 years ago

I'll close this issue as @sjkummer's PR (thanks again for that!) should have fixed it. If it still happens @gfodor let us know and we'll reopen.

meetecho / janus-gateway

ICE thread sometimes peg CPU #2015