Closed gfodor closed 4 years ago
This is another relatively ancient chain I found. Given that everything around this is consistent with what we're seeing, it's interesting, but I also assume the underlying problems here were fixed: https://gitlab.freedesktop.org/libnice/libnice/issues/18
Hi @gfodor,
unfortunately we can not inspect another machine's perf report, but you should be able to get a call tree to understand where that g_poll
call comes from. If that is not the case, compile Janus again with
-O1 -ggdb3 -fno-omit-frame-pointer
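For reference, a typical way to capture such a call tree with perf, once frame pointers are available per the flags above (the pid and sampling window here are placeholders, adjust to taste):

```shell
# Sample the busy process with call-graph recording for 30 seconds.
# <pid> is the Janus process id (e.g. from `pidof janus`).
perf record -g -p <pid> -- sleep 30

# Render a text report with the call tree; the callers leading into
# g_poll should now be visible.
perf report --stdio > perf_report.txt
```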
As for your questions on libnice, I have no idea what a call to nice_agent_attach_recv
with a NULL cb, followed by a call with a defined cb, would trigger. Maybe @lminiero knows something more.
Ah sorry, attached the txt report from perf:
it's missing a few symbols but overall might be enough
also the flow i suggested above was:
- ICE candidates fail, so no callback is registered
- user disconnects, so nice_agent_attach_recv is called with NULL
i agree tho inverting the order as well may be an issue, but if it's spinning my guess is that the terminal state is one where there is no callback registered.
Calling nice_agent_attach_recv with a null function should be the normal way to unset a callback
Right, my question was what happens if that is called before a callback has been set. Tracing through the code it seemed benign, but the fact that Janus does it even when no callback was registered (when ICE candidates fail) was something I wanted to raise, given the libnice issue I linked.
I see, in that case it should be easy to reproduce. As for the perf report, how did you generate it?
also the flow i suggested above was:
- ICE candidates fail, so no callback is registered
- user disconnects, so nice_agent_attach_recv is called with NULL
I have simulated this flow, but did not observe any cpu spike.
Right, my question was what happens if that is called before a callback has been set.
AFAIK, it simply means callbacks are not invoked, and events are not triggered.
@gfodor as a side note, are you using ICE-TCP? I seem to remember issues with TCP sockets in libnice that did cause poll craziness, mostly when trying to get rid of them.
Nope, we aren’t using tcp I’m afraid. However we were considering turning that on at some point, is that ill advised?
Nope, we aren’t using tcp I’m afraid. However we were considering turning that on at some point, is that ill advised?
Not sure, as we don't use them. Thinking about it, though, I think at the time the issue was file descriptors leaking with TCP, rather than CPU spikes.
also the flow i suggested above was:
- ICE candidates fail, so no callback is registered
- user disconnects, so nice_agent_attach_recv is called with NULL
i agree tho inverting the order as well may be an issue, but if it's spinning my guess is that the terminal state is one where there is no callback registered.
IIRC nice_agent_gather_candidates can only fail if libnice ran out of ports to use (which can happen more easily if you have a short RTP range) on any of the available interfaces, since that's only blocking for the host candidates but not the other ones. Did you see that method failing in your logs? I'll try replicating this with a very short RTP range and a lot of attempted connections: I tried a couple and nothing weird happened.
That said, looking at the issue you mentioned, making sure we register a callback before nice_agent_gather_candidates can fail should help, if the analysis Philip made there is correct. Have you already tried switching the two calls to see if it helps? It shouldn't have any harmful impact at all in Janus itself, as it's cleaned up anyway when we get to janus_ice_webrtc_hangup.
Ah. We did not see the candidate failure message in the logs, but this has happened so rarely, and our log retention is so short, that we may have missed it. Our RTP range is very wide tho, and we are not under a lot of load, so if that's the only scenario, we can probably rule out my theory completely.
We haven't tried patching janus to mitigate this. The theory I proposed is very weak, but didn't want to fail to mention it - my guess is that it's much more likely something else is going on. I wish I had more theories or data to share tho :(
At this point I think this would be better addressed in the libnice repo, then, as unfortunately we lack the knowledge to understand its internals.
I am also experiencing situations where Janus is using 100% CPU after a failed ICE connection and never recovers even if the call was ended - only restarting Janus helps.
I wonder: Shouldn't Janus cleanup / release all libnice stuff after a failed call ended?
As additional info (sorry if this is too general / not analysed in detail): I have never observed CPU issues when using Kurento - which also uses libnice internally - even though we use Kurento in a large number of environments.
// Addendum:
In my logs, I see that "WebRTC resources freed" is missing sometimes. janus_ice_webrtc_free is not called directly on hangup or session-destroy, but waits for the loop to be finished (as far as I can see). This might be a problem in relation with the endless-loop problem https://gitlab.freedesktop.org/libnice/libnice/issues/14
Log example including "WebRTC resources freed":
[1824004789464592] Telling the plugin about the handle detach (JANUS Streaming plugin)
Removing streaming session...
[1824004789464592] Sending event to transport...; 0x7ff577d565f0
[1824004789464592] Finalizing loop source
[1824004789464592] Handle thread ended! 0x7ff577d565f0
[1824004789464592] Handle and related resources freed; 0x7ff577d565f0 0x7ff577e034d0
[4773164771538039] WebRTC resources freed; 0x7ff579302300 0x7ff579000690
Log example where "WebRTC resources freed" is never logged:
[7873379679739615] Telling the plugin about the hangup (JANUS Streaming plugin)
[janus.plugin.streaming-0x7ff577f026a0] No WebRTC media anymore
[7873379679739615] Telling the plugin about the handle detach (JANUS Streaming plugin)
Removing streaming session...
[7873379679739615] Sending event to transport...; 0x7ff577f032b0
[7873379679739615] Finalizing loop source
[7873379679739615] Handle thread ended! 0x7ff577f032b0
[7873379679739615] Handle and related resources freed; 0x7ff577f032b0 0x7ff579200cf0
@lminiero What about never calling nice_agent_attach_recv with a NULL callback, and instead just not touching it until it's freed? (having the libnice bug https://gitlab.freedesktop.org/libnice/libnice/issues/14 in mind)
Shouldn't Janus cleanup / release all libnice stuff after a failed call ended?
That's exactly what we do already.
I have never observed CPU issues when using Kurento
I'm not going to study the Kurento code, if that's what you're asking.
What about never calling nice_agent_attach_recv with a NULL callback, and instead just not touching it until it's freed?
Feel free to experiment with that, if you can replicate the issue, and report the results. A PR in case would help. I won't have time to work on that myself.
Okay, I'll run that change for some days in my affected environment and create a PR if it works.
@sjkummer in those logs "WebRTC resources freed" refers to a different handle (4773164771538039).
Moreover, when the line "Handle thread ended" appears, that loop should be ended (at least from a glib API perspective), since g_main_loop_run has returned. I'm not sure how this translates in libnice internals though.
Finally, note that janus_ice_webrtc_free might return early, skipping "WebRTC resources freed":
```c
static void janus_ice_webrtc_free(janus_ice_handle *handle) {
	if(handle == NULL)
		return;
	janus_mutex_lock(&handle->mutex);
	if(!handle->agent_created) {
		janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_NEW_DATACHAN_SDP);
		janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_READY);
		janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_CLEANING);
		janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_HAS_AGENT);
		janus_mutex_unlock(&handle->mutex);
		return;
	}
```
@atoppi Oops, yes that log section wasn't complete. Adding the complete file here: janus_stripped.txt
It shows that "No WebRTC media anymore" and "Hanging up PeerConnection" both were logged 22 times, but "WebRTC resources freed" appears only once.
Sorry, this log isn't from the newest Janus version. I once ran tests with several Janus versions but didn't keep all the logs.
"No WebRTC media anymore" is printed in all cases, whether a PeerConnection was established or not. If you don't see a "WebRTC media is now available" before it (which is what most if not all plugins write on the setup_media callback), then there was no PeerConnection for that handle.
@sjkummer have you managed to replicate with the changes you suggested ?
What I see so far is encouraging, but I am not done yet.
Completed:
- Never calling nice_agent_attach_recv with a NULL callback: https://github.com/sjkummer/janus-gateway/commit/e87d534dcdfd5d4eeb83bc44de2a1f68397a2003 (re: nice_agent_attach_recv with NULL -> https://gitlab.freedesktop.org/libnice/libnice/issues/14)
Still in progress:
Thanks for the update! I guess you meant 0.9.2 and not 0.8.2 though?
that's encouraging! thank you for testing that theory @sjkummer
@gfodor If you would like to test whether that change helps in your case as well, you can find an updated diff here: https://github.com/meetecho/janus-gateway/pull/2101/files - this can probably be ported back to 0.7.6 without conflicts
I'll close this issue as @sjkummer's PR (thanks again for that!) should have fixed it. If it still happens @gfodor let us know and we'll reopen.
Hi all, I am an engineer working on the same project as @mqp was when this issue was opened: https://github.com/meetecho/janus-gateway/issues/1260
At the time, that issue was closed because the CPU-pegging behavior we had seen seemed to have stopped, but in fact it hadn't, and we never followed up. In the interim, we decided to wait and see how things would fare once we upgraded janus and its various dependencies.
Unfortunately, after the upgrade, we still see the behavior of pegged threads. Some relevant info:
We have upgraded to janus 0.7.6 - we cannot fully upgrade to the latest version due to plugin incompatibilities that need to be addressed. I realize it's frowned upon to open bugs for older versions, but I would at least like to know if there is reason to believe this problem is fixed in a newer version, so we can prioritize the plugin migration process.
We have upgraded the various dependencies, notably, we have upgraded libnice to master HEAD as of a few days ago.
The behavior we see is that occasionally an hloop thread gets pegged, reporting ~50% CPU utilization in ps, and one of the CPUs becomes pegged at 100%. Attached is a perf trace (perf.zip) showing that the time is spent polling in g_poll. These events seem relatively rare and are hard to reproduce. We managed to reproduce it once under load testing, but in general there's no systematic way to do it - it seems to be a very low probability event, though it does seem to happen more frequently against production traffic vs artificial load testing (our load test uses headless Chrome browsers.)
On one of our production nodes, we managed to grab all the relevant metadata from the admin API re: handles. The only fishy thing was that one session had a relatively large number of handles (25 or so) compared to others, and all of them but one had false flags around offer negotiation. Other sessions had similar non-negotiated handles, but only this one 'degenerate' session had so many handles and only a single successfully negotiated handle. Hard to say if this is useful info or not. Here is a dump of the handles: all_handles.zip
I did a bunch of ad hoc analysis to see if there were any telltale signs that a session was 'corrupted' based upon its metadata - the best I could come up with is that there is a session with only a single handle successfully negotiated and a lot of other non-negotiated ones; the other sessions all had at least two successfully negotiated handles.
Some things we tried/discovered:
Forcibly detaching the handles and/or destroying the sessions from the admin API failed to resolve the CPU being pegged.
The CPU pegging does not result in any other negative side effects other than capacity reduction - of course, once all the CPUs become pegged, the server becomes unavailable, but we are mitigating this problem for now by running servers with many CPUs and doing nightly restarts.
The problem manifests whether a thread is spawned for each handle or we enable the new fixed-thread feature (we are running 128 threads in production.)
Running lsof shows that each thread has the same number of open file handles, and the non-eventfd ones are identical (files, etc.)
Digging through older issues, with the caveat that I realize these may be long resolved (I could not determine it), I found a few relevant things:
Looking at the janus code, I was wondering whether it is possible that the libnice issue mentioned above occurs if a call to nice_agent_attach_recv with NULL is made before a callback is registered. Looking at janus, that seems possible here, since the attachment is skipped if candidate gathering fails (but detachment always occurs later): https://github.com/meetecho/janus-gateway/blob/master/ice.c#L3469. Again this is mostly a guess, I am super unfamiliar with all of this, but wanted to mention it!
Thanks for any assist!