thehunmonkgroup closed this issue 6 years ago
I don't see anything in there that may break our app. That said, we do wait for the selected-pair
callback before we start the handshake, so if that's not happening (e.g., because the stack internally waits for whatever else to happen first) then you won't get connectivity in Janus. Try checking the Admin API info for one of the affected handles, to see what it has to say with respect to the ICE state.
Also adding to this issue: I receive a "nice_agent_set_port_range unavailable, port range disabled" warning in the Janus logs during startup, although I've been told that newer versions of libnice should support this feature -- perhaps there was a breaking API change in libnice around this feature that Janus needs to track?
Attached are a raw dump of events (minus media) for two users connecting to both audiobridge and videoroom (each user is publishing two video feeds).
The thing I notice immediately is that in the failure case, there are no ICE events after the 'connecting' events, so it seems ICE gets stuck there. It looks to me like all the plugin related communication and offer/answer stuff is happening ok, although I'm hardly an expert there :)
Any thoughts?
Please use gist/pastebin for pasting logs next time: I'd rather not save files on my laptop for temporary stuff.
That said, yeah, I don't see any connected/ready event for ICE (type:16), so ICE is never going beyond the "connecting" state for some reason. Not sure what that valid flag they added does, and why it prevents it from going through.
Ok, gist it is next time!
So would the next step be a post to the libnice mailing list?
If so, then I can also ask about the absence of availability of 'nice_agent_set_port_range' -- could you point me to the code in Janus that makes that determination?
Looks like nice_agent_set_port_range is still there:
https://github.com/libnice/libnice/blob/master/agent/agent.h#L502
We check it here:
https://github.com/meetecho/janus-gateway/blob/master/configure.ac#L139
which means we may be checking the agent.h in the default path, and not the one where you installed it. If you were using an older version of libnice before, you may still have an old agent.h in there that causes our check to break.
hm, doesn't add up on my end:
$~: find / -name agent.h
/usr/local/janus-gateway-deps/include/nice/agent.h
/usr/local/janus-gateway-deps/build/libnice/agent/agent.h
/var/local/git/licode/build/libdeps/build/include/nice/agent.h
/var/local/git/licode/build/libdeps/libnice/agent/agent.h
Even if it somehow used the libnice in the Licode dir (highly unlikely b/c I never set PKG_CONFIG_PATH
for that), it should still find it, because that's basically the same version of libnice I'm running for Janus.
Are we sure that check at https://github.com/meetecho/janus-gateway/blob/master/configure.ac#L146 is actually working properly?
It works for me. Try grepping your agent.h for nice_agent_set_port_range:
$ grep nice_agent_set_port_range /usr/local/janus-gateway-deps/include/nice/agent.h
484: * nice_agent_set_port_range:
502:nice_agent_set_port_range (
552: * <para>See also: nice_agent_set_port_range()</para>
weird, right?
Yeah, but the check we have looks for agent.h in the default folder(s); it's not affected by pkg-config and doesn't use it. I'll have to see if there's a way to do that.
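To illustrate why a grep-style header check against a fixed location can disagree with the library you actually link against, here's a self-contained sketch. All paths under /tmp/demo are made up for illustration; the real check lives in Janus's configure.ac.

```shell
# Simulate a stale header in the "default" path and a current one in a
# custom prefix: a check pinned to the default path misses the symbol.
mkdir -p /tmp/demo/default /tmp/demo/custom
: > /tmp/demo/default/agent.h                                   # old header, symbol absent
echo 'void nice_agent_set_port_range(void);' > /tmp/demo/custom/agent.h

for h in /tmp/demo/default/agent.h /tmp/demo/custom/agent.h; do
  if grep -q nice_agent_set_port_range "$h"; then
    echo "$h: port range supported"
  else
    echo "$h: port range disabled"
  fi
done
```

With a layout like this, the custom-prefix header reports the symbol while the default-path header does not, which matches the "port range disabled" warning at the top of this thread.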
What about adding a --use-libnice=/path/to/libnice configure parameter?
We might even be able to smartly set PKG_CONFIG_PATH based on that setting during the build.
Or we can get rid of pkg-config for checking libnice, and use plain autoconf stuff for it instead: this way, variables like CFLAGS and LDFLAGS can be used, and the check will be done properly because the header would be looked for in the right place.
This is so not my area of expertise... ;)
I poked around just a bit on Google, and it doesn't seem very straightforward to integrate autotools and pkg-config in this case.
My one concern with your suggestion is that I'm not sure if ditching pkg-config for libnice would result in more complexity in the configure step for finding libnice in 'standard' locations on different distros.
Yeah, mine neither, but I didn't mean you'd have to do that, of course, it was more thinking out loud... right now, anyway, you should still be able to achieve the desired result with your pkg-config flag AND by prepending something like CFLAGS=-I/usr/local/janus-gateway-deps/include/ to the configure invocation.
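For concreteness, a build invocation along those lines might look like this. The deps prefix is taken from the find output earlier in this thread; the lib and lib/pkgconfig subdirectories are assumptions about how that prefix was populated.

```shell
# Sketch: point configure at a libnice installed under a custom prefix.
# Prefix from this thread; the lib/pkgconfig layout is an assumption.
export PKG_CONFIG_PATH=/usr/local/janus-gateway-deps/lib/pkgconfig
export CFLAGS=-I/usr/local/janus-gateway-deps/include/
export LDFLAGS=-L/usr/local/janus-gateway-deps/lib
./configure
```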
Out of curiosity, why did you add support for nice_agent_set_port_range? Is there a specific use case you envisioned it for?
FYI I reached out to the libnice mailing list re: the breaking commit referenced in this issue. Will report back when I hear from them.
@lminiero FYI, for discovering nice_agent_set_port_range in a custom location, it's not CFLAGS that needs to be set, but LDFLAGS:
export LDFLAGS=-L/path/to/libnice/lib
You think that's worth adding to the doc?
Sorry for the late reply, I missed this message apparently. Not sure that's needed, as that's generic configure-related configuration. If you do a ./configure -h, for instance, you'll get, among other info:
Some influential environment variables:
CC C compiler command
CFLAGS C compiler flags
LDFLAGS linker flags, e.g. -L<lib dir> if you have libraries in a
nonstandard directory <lib dir>
I'd rather not write this for libnice, as otherwise I'd have to do the same for the tons of other libraries we depend on too.
Sure, understood. I think we can close that particular loop in this issue.
That still leaves libnice master not working with Janus master. No helpful word back from my post to the libnice mailing list. I'm actually surprised that libnice is used by so many other projects -- it doesn't seem to be stably maintained given the critical nature it holds in the workflow. Have you examined any other ICE connectivity libraries?
Yes, but there aren't many. There's pjnath (the one Asterisk uses) but it seems overkill for what we need. There's re/libre too (which I'm studying for a new SIP plugin) which also implements STUN and TURN, but it has a quite steep learning curve, at least for me. I've even considered writing my own ICE stack, but as you can guess that would take a lot of time and resources we currently don't have.
So, I've gotten nothing from the libnice maintainers on this issue. I did get a helpful message from a libnice user, saying this:
I went back to an even older version a2e25cf49b24c2012e8e9058ecc7e62f1e027cb3 after a lot of search back and forth and found it working.
So this is now the commit I'm using for libnice, hopefully it will be stable.
I do think this issue should stay open, though, as the interop issue remains with master/master
i am watching re/libre, it looks promising.
Any update? I have the same problem.
With libnice master, ICE fails when the client is behind NAT. Using the patch D735.diff fixed my problem.
Do you mean this? https://phabricator.freedesktop.org/D735 They seem to be referring to another issue there ( https://phabricator.freedesktop.org/D742 ), although it looks like that has already been merged, and so may not fix our problem.
Now that libnice 0.1.14 has been shipped, I guess we'll have to figure out what's causing it to fail with Janus.
@thehunmonkgroup are you behind a symmetric NAT by any chance? @atoppi just made a couple of tests with libnice 0.1.14 and, while it did indeed fail when trying to contact a remote Janus server he deployed on AWS, it succeeded when trying to contact a local Janus instance. Using the D735 patch @jing3018 mentioned fixed the remote Janus scenario for him as well.
Since that patch fixes a prflx scenario, it means the only candidate that would work for him was a prflx one, which is to be expected as we're indeed behind a symmetric NAT ourselves. When talking to Janus directly the prflx candidate is not needed and so everything still works. Couldn't check if it would work with a remote Janus and a user behind a regular NAT, that is a user for which a normal srflx candidate should work, as we'd need to try a different network than the one we have at the office. If it's a symmetric NAT for both you and @jing3018, this might explain why it doesn't work in those specific scenarios for you.
I'm writing a mail in response to the post you wrote on the libnice ML, hopefully it will help us shed some light on this.
I've looked at several recent dumps of Chrome WebRTC connections between Janus and various users of my system, and they all seem to be using srflx. I do recall seeing prflx candidates in the list from time to time in other older dumps, though.
I'm not a networking expert, so still don't have full command of when the various candidate types are used. My Janus installation is not behind symmetric NAT, it's just sitting on a regular ol' public IP, and almost all of my users are connecting from standard home routers, occasionally a cafe wifi or library.
We made another test with a publicly reachable Janus instance, and it worked for us, whether the user is behind a regular NAT (one of my colleagues at home) or a more complex one (us at the office). As such, the problem might happen when Janus itself is NATted, as that's how the AWS test we made yesterday was configured: is that how your servers are deployed?
No, my servers are on a public IP. I'll make sure to test whatever patch they come up with, in the meantime, my hacked version seems to be working ok.
Turns out that the issue happens for me when talking between two vagrant machines on my development laptop, which do indeed use prflx (or local) candidates. I tested with the latest master branch and confirmed that it was indeed prflx candidates that were failing.
I probably never actually tested with servers on a public IP after they were failing in my dev environment.
I had this problem - also referenced here:
https://groups.google.com/forum/#!topic/meetecho-janus/6tUeHPIHhNY https://groups.google.com/forum/#!topic/meetecho-janus/UBlvbWiwgMk
Latest Janus, server in a docker container using stun and forwarded ports. The message that eventually led me to the solution was: "Still waiting for the DTLS stack for component 1 in stream 1"
I would be interested to know which is the most recent libnice that does not have this problem as OP had this with 0.1.13 - and we both reverted to 0.1.4.1 which is very old. If I get time to experiment I will post here - but if safe version is known, please post.
Also, might be worth documenting to avoid 0.1.13 / 0.1.14 as this is a world of pain to diagnose!
I've never used an official libnice release that didn't have some kind of issue. Their current master branch, dated Jun 22, 2017, is my best recommendation.
Also, these kinds of failures are not only related to libnice, see here for the solution to another edge case that plagued me for about a year: https://groups.google.com/forum/#!topic/meetecho-janus/ebSZF8B2KxM
many thanks @thehunmonkgroup - current master of libnice is working for me - using:
git checkout dbaf8f5ccd76089e340883887c7e08e6c04de80a
Closing as it looks like it's not an issue anymore. Please let me know if that's not the case and I'll reopen.
So what's the most recommended version of libnice now?
I have some strange issues with 0.1.13 where CPU goes to 100% load in the 'iceloop' thread; going to figure it out. Would be great to know.
I've never found any official release that's stable, and use my own fork.
Probably would recommend you try their latest master from https://github.com/libnice/libnice
Still have 100% CPU load issues with the latest 'libnice' master.
Here is a detailed report about my issue and the 'libnice' version: https://github.com/meetecho/janus-gateway/issues/698#issuecomment-356928458
@soulfly - have you tried the version I suggest above: git checkout dbaf8f5ccd76089e340883887c7e08e6c04de80a
@richp10 no, going to try the following things now:
Hi there,
I can confirm that with dbaf8f5ccd76089e340883887c7e08e6c04de80a commit as suggested by @richp10 I got rid of CPU load issues.
I'm still playing with it to understand whether it's completely workable or not.
On one installation I'm periodically getting an assert failure at this line: https://github.com/libnice/libnice/blob/dbaf8f5ccd76089e340883887c7e08e6c04de80a/agent/agent.c#L2320
because sometimes a NICE transition from GATHERING to CONNECTED happens.
I don't know which part it relates to.
Played even more with dbaf8f5ccd76089e340883887c7e08e6c04de80a and can now report the final results.
Unfortunately this commit has another issue for me.
It fixes the 100% CPU load issue, which is good, but the new issue I mentioned above (the assert where a NICE transition from GATHERING to CONNECTED happens) sometimes occurs.
So I'm going to continue my journey here and find another commit that doesn't have any issues.
Any chance you can check the latest git master of libnice and see if you can reproduce the gathering->connected transition problem? If you can, please add any details on the upstream bug at https://gitlab.freedesktop.org/libnice/libnice/issues/43
@ocrete I can confirm it works great with the latest libnice master, and this is the best version of libnice for now.
@ocrete I am still getting the failed assertion with the current master of libnice (67771ba2a30ae7247262c252df052468809711eb):
libnice:ERROR:agent.c:2342:agent_signal_component_state_change: assertion failed: (TRANSITION (DISCONNECTED, FAILED) || TRANSITION (GATHERING, FAILED) || TRANSITION (CONNECTING, FAILED) || TRANSITION (CONNECTED, FAILED) || TRANSITION (READY, FAILED) || TRANSITION (DISCONNECTED, GATHERING) || TRANSITION (GATHERING, CONNECTING) || TRANSITION (CONNECTING, CONNECTED) || TRANSITION (CONNECTED, READY) || TRANSITION (READY, CONNECTED) || TRANSITION (FAILED, CONNECTING) || TRANSITION (FAILED, GATHERING) || TRANSITION (DISCONNECTED, CONNECTING))
@syntheticore Can you get a stack trace? And ideally a libnice log?
@ocrete Could you please outline the steps I would need to take to provide you with this information?
@ocrete Some additional info: I had the error using Janus v0.4.4 and libnice 67771ba2a30ae7247262c252df052468809711eb. When testing with Janus 0.4.5 and libnice 5496500b1535d9343fdac2a3408864643fe65d7e I am not able to reproduce it anymore.
@syntheticore I don't see the working libnice commit upstream?
Run janus in gdb with G_MESSAGES_DEBUG=all set in the environment, and when you get the assertion, do "backtrace full".
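Spelled out, such a session might look like the following; the janus binary path is hypothetical, so use wherever your build installed it.

```shell
# Hypothetical session; adjust the binary path to your install prefix.
G_MESSAGES_DEBUG=all gdb /opt/janus/bin/janus
# (gdb) run
#   ... reproduce until the assertion fires ...
# (gdb) backtrace full
```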
I assume Janus doesn't capture the GLib debug messages anywhere? (Janus devs may confirm)
Not sure if this is anything you want to look into now, but I thought I'd file it in case. Current master of libnice (1778f79008 as of this writing) doesn't seem to work properly with Janus. I am currently using the latest libnice with a Licode installation I have, so I don't think the issue is only with libnice.
Some git bisect work got me to the problem commit:
1ab9d7c104978ea1904aaaad708c1c8c23c77592 is the first bad commit
commit 1ab9d7c104978ea1904aaaad708c1c8c23c77592
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Thu May 26 16:05:36 2016 -0400
After reverting that commit (took a small amount of conflict resolution, but not too bad), I was able to get libnice master working again with Janus.
Normally I wouldn't expect you to chase the master commits, but libnice is a unique beast -- the last official release was almost two years ago, and it has issues that have long since been fixed in master. I thought it might be worth looking at this problem commit that I found, to see if there's some reasonable adjustment you can make in Janus to accommodate it.
Here's a direct link to the diff from that commit: https://github.com/libnice/libnice/commit/1ab9d7c104978ea1904aaaad708c1c8c23c77592