meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.15k stars 2.47k forks source link

ICE fails on Janus with libnice master #788

Closed thehunmonkgroup closed 6 years ago

thehunmonkgroup commented 7 years ago

Not sure if this is anything you want to look into now, but I thought I'd file it in case. Current master of libnice (1778f79008 as of this writing) doesn't seem to work properly with Janus. I am currently using the latest libnice with a Licode installation I have, so I don't think the issue is only with libnice.

Some git bisect work got me to the problem commit:

1ab9d7c104978ea1904aaaad708c1c8c23c77592 is the first bad commit commit 1ab9d7c104978ea1904aaaad708c1c8c23c77592 Author: Olivier Crête olivier.crete@collabora.com Date: Thu May 26 16:05:36 2016 -0400

conncheck: Separate valid and succeded states

RFC 5245 specifies that when a mapped-address differs from the address
from the request was sent, the mapped-address is used to select the
valid pair, but the source address of the check is used to select the
pair that succeeded, so they are not the same.

After reverting that commit (took a small amount of conflict resolution, but not too bad), I was able to get libnice master working again with Janus.

Normally I wouldn't expect you to chase the master commits, but libnice is a unique beast -- the last official release was almost two years ago, and it has issues that have long since been fixed in master. I thought it might be worth looking at this problem commit that I found to see if there's some reasonable adjustment you can make in Janus to accommodate for it.

Here's a direct link to the diff from that commit: https://github.com/libnice/libnice/commit/1ab9d7c104978ea1904aaaad708c1c8c23c77592

lminiero commented 7 years ago

I don't see anything in there that may break our app. That said, we do wait for the selected-pair callback before we start the handshake, so if that's not happening (e.g., because the stack internally waits for whatever else to happen first) then you won't get connectivity in Janus. Try checking the Admin API info for one of the affected handles, to see what it has to say with respect to the ICE state.

thehunmonkgroup commented 7 years ago

Also adding to this issue, that I receive a “nice_agent_set_port_range unavailable, port range disabled" warning in the Janus logs during startup, although I've been told that newer versions of libnice should support this feature -- perhaps there was a breaking API change in libnice around this feature that Janus needs to track?

thehunmonkgroup commented 7 years ago

Attached are a raw dump of events (minus media) for two users connecting to both audiobridge and videoroom (each user is publishing two video feeds).

The thing I notice immediately is that in the failure case, there are no ICE events after the 'connecting' events, so it seems ICE gets stuck there. It looks to me like all the plugin related communication and offer/answer stuff is happening ok, although I'm hardly an expert there :)

Any thoughts?

lminiero commented 7 years ago

Please use gist/pastebin for pasting logs next time: I'd rather not save files on my laptop for temporary stuff.

That said, yeah, I don't see any connected/ready event for ICE (type:16), so ICE is never going beyond the "connecting" state for some reason. Not sure what that valid flag they added does and why it prevents it to go through.

thehunmonkgroup commented 7 years ago

Ok, gist it is next time!

So would the next step be a post to the libnice mailing list?

If so, then I can also ask about the absence of availability of 'nice_agent_set_port_range' -- could you point me to the code in Janus that makes that determination?

lminiero commented 7 years ago

Looks like nice_agent_set_port_range is still there

https://github.com/libnice/libnice/blob/master/agent/agent.h#L502

We check it here:

https://github.com/meetecho/janus-gateway/blob/master/configure.ac#L139

which means we may be checking the agent.h in the default path, and not the one where you installed it. If you were using an older version of libnice before, you may still have an old agent.h in there that causes our check to break.

thehunmonkgroup commented 7 years ago

hm, doesn't add up on my end:

$~: find / -name agent.h
/usr/local/janus-gateway-deps/include/nice/agent.h
/usr/local/janus-gateway-deps/build/libnice/agent/agent.h
/var/local/git/licode/build/libdeps/build/include/nice/agent.h
/var/local/git/licode/build/libdeps/libnice/agent/agent.h

Even if it somehow used the libnice in the Licode dir (highly unlikely b/c I never set PKG_CONFIG_PATH for that), it should still find it, because that's basically the same version of libnice I'm running for Janus.

Are we sure that check at https://github.com/meetecho/janus-gateway/blob/master/configure.ac#L146 is actually working properly?

lminiero commented 7 years ago

It works for me. Try grepping your agent.h for nice_agent_set_port_range

thehunmonkgroup commented 7 years ago
$ grep nice_agent_set_port_range /usr/local/janus-gateway-deps/include/nice/agent.h
484: * nice_agent_set_port_range:
502:nice_agent_set_port_range (
552: * <para>See also: nice_agent_set_port_range()</para>

weird, right?

lminiero commented 7 years ago

Yeah but the check we have will look for agent.h in the default folder(s), it's not affected by pkg-config and doesn't use it. I'll have to see if there's a way to do that.

thehunmonkgroup commented 7 years ago

What about adding a --use-libnice=/path/to/libnice configure parameter?

We might even be able to smartly set PKG_CONFIG_PATH based on that setting during the build.

lminiero commented 7 years ago

Or we can get rid of pkg-config for checking libnice, and use plain autoconf stuff for it instead: this way, variables like CFLAGS and LDFLAGS can be used, and the check will be done properly because the header would be looked for in the right place.

thehunmonkgroup commented 7 years ago

This is so not my area of expertise... ;)

I poked around just a bit on Google, and it doesn't seem very straightforward to integrate autotools and pkg-config in this case.

My one concern with your suggestion is that I'm not sure if ditching pkg-config for libnice would result in more complexity in the configure step for finding libnice in 'standard' locations on different distros.

lminiero commented 7 years ago

Yeah, mine neither, but I didn't mean you'd have to do that of course, it was more thinking out loud... right now, anyway, you should still be able to achieve the desired result with your pkg-config flag AND by prepending something like a CFLAG=-I/usr/local/janus-gateway-deps/include/ to the configure

thehunmonkgroup commented 7 years ago

Out of curiosity, why did you add support for nice_agent_set_port_range? Is there a specific use case you envisioned it for?

thehunmonkgroup commented 7 years ago

FYI I reached out to the libnice mailing list re: the breaking commit referenced in this issue. Will report back when I hear from them.

thehunmonkgroup commented 7 years ago

@lminiero FYI, for discovering nice_agent_set_port_range in a custom location, it's not CFLAGS that needs to be set, but LDFLAGS:

export LDFLAGS=-L/path/to/libnice/lib

You think that's worth adding to the doc?

lminiero commented 7 years ago

Sorry for the late reply, I missed this message apparently. Not sure that's needed, as that's generic configure-related configuration. If you do a ./configure -h, for instance, you'll get, among the other info:

Some influential environment variables:
  CC          C compiler command
  CFLAGS      C compiler flags
  LDFLAGS     linker flags, e.g. -L<lib dir> if you have libraries in a
              nonstandard directory <lib dir>

I'd rather not write this for libnice, as otherwise I'd have to do the same for the tons of other libraries we depend on too.

thehunmonkgroup commented 7 years ago

Sure, understood. I think we can close that particular loop in this issue.

That still leaves libnice master not working with Janus master. No helpful word back from my post to the libnice mailing list. I'm actually surprised that libnice is used by so many other projects -- it doesn't seem to be stably maintained given the critical nature it holds in the workflow. Have you examined any other ICE connectivity libraries?

lminiero commented 7 years ago

Yes, but there aren't many. There's pjnath (the one Asterisk uses) but it seems overkill for what we need. There's re/libre too (which I'm studying for a new SIP plugin) which also implements STUN and TURN, but it has a quite steep learning curve, at least for me. I've even considered writing my own ICE stack, but as you can guess that would take a lot of time and resources we currently don't have.

thehunmonkgroup commented 7 years ago

So, I've gotten nothing from the libnice maintainers on this issue. I did get a helpful message from a libnice user, saying this:

I went back to an even older version a2e25cf49b24c2012e8e9058ecc7e62f1e027cb3 after a lot of search back and forth and found it working.

So this is now the commit I'm using for libnice, hopefully it will be stable.

I do think this issue should stay open, though, as the interop issue remains with master/master

notedit commented 7 years ago

i am watching re/libre, it looks promising.

jing3018 commented 7 years ago

Any update? I have the same problem.

jing3018 commented 7 years ago

with libnice master, ice fails when client in NAT。

jing3018 commented 7 years ago

use patch D735.diff fix my problem

lminiero commented 7 years ago

Do you mean this? https://phabricator.freedesktop.org/D735 They seem to be referring to another issue there ( https://phabricator.freedesktop.org/D742 ), although it looks like that has merged already and so may not be fixing our problem.

lminiero commented 7 years ago

Now that libnice 0.1.14 has been shipped, I guess we'll have to figure out what's causing it to fail with Janus.

lminiero commented 7 years ago

@thehunmonkgroup are you behind a symmetric NAT by any chance? @atoppi just made a couple of tests with libnice 0.1.14 and, while it did indeed fail when trying to contact a remote Janus server he deployed on AWS, it succeeded when trying to contact a local Janus instance. Using the D735 patch @jing3018 mentioned fixed the remote Janus scenario for him as well.

Since that patch fixes a prflx scenario, it means the only candidate that would work for him was a prflx one, which is to be expected as we're indeed behind a symmetric NAT ourselves. When talking to Janus directly the prflx candidate is not needed and so everything still works. Couldn't check if it would work with a remote Janus and a user behind a regular NAT, that is a user for which a normal srflx candidate should work, as we'd need to try a different network than the one we have at the office. If it's a symmetric NAT for both you and @jing3018, this might explain why it doesn't work in those specific scenarios for you.

I'm writing a mail in response to the post you wrote on the libnice ML, hopefully it will help us shed some light on this.

thehunmonkgroup commented 7 years ago

I’ve looked at several recent dumps of Chrome WebRTC connections between Janus and various users of my system, and they all seem to be using srvflx. I do recall seeing prflx candidates in the list from time to time in other older dumps, though.

I'm not a networking expert, so still don't have full command of when the various candidate types are used. My Janus installation is not behind symmetric NAT, it's just sitting on a regular ol' public IP, and almost all of my users are connecting from standard home routers, occasionally a cafe wifi or library.

lminiero commented 7 years ago

We made another test with a publicly reachable Janus instance, and it worked for us, whether the user is behind a regular NAT (one of my colleagues at home) or a more complex one (us at the office). As such, the problem might happen when Janus itself is natted, as that's what the AWS test we made yesterday was configured: is that how your servers are deployed?

thehunmonkgroup commented 7 years ago

No, my servers are on a public IP. I'll make sure to test whatever patch they come up with, in the meantime, my hacked version seems to be working ok.

thehunmonkgroup commented 7 years ago

Turns out that the issue happens for me when talking between two vagrant machines on my development laptop, which do indeed use prflx (or local) candidates. I tested with the latest master branch and confirmed that it was indeed prflx candidates that were failing.

I probably never actually tested with servers on a public IP after they were failing in my dev environment.

richp10 commented 7 years ago

I had this problem - also referenced here:

https://groups.google.com/forum/#!topic/meetecho-janus/6tUeHPIHhNY https://groups.google.com/forum/#!topic/meetecho-janus/UBlvbWiwgMk

Latest Janus, server in a docker container using stun and forwarded ports. The message that eventually led me to the solution was: "Still waiting for the DTLS stack for component 1 in stream 1"

I would be interested to know which is the most recent libnice that does not have this problem as OP had this with 0.1.13 - and we both reverted to 0.1.4.1 which is very old. If I get time to experiment I will post here - but if safe version is known, please post.

Also, might be worth documenting to avoid 0.1.13 / 0.1.14 as this is a world of pain to diagnose!

thehunmonkgroup commented 7 years ago

I've never used an official libnice release that didnt' have some kind of issue. Their current master branch, dated Jun 22, 2017, is my best recommendation.

Also, these kinds of failures are not only related to libnice, see here for the solution to another edge case that plagued me for about a year: https://groups.google.com/forum/#!topic/meetecho-janus/ebSZF8B2KxM

richp10 commented 7 years ago

many thanks @thehunmonkgroup - current master of libnice is working for me - using:

git checkout dbaf8f5ccd76089e340883887c7e08e6c04de80a

lminiero commented 6 years ago

Closing as it looks like it's not an issue anymore. Please let me know if that's not the case and I'll reopen.

soulfly commented 6 years ago

Also, might be worth documenting to avoid 0.1.13 / 0.1.14 as this is a world of pain to diagnose!

So what's the most recommended version of libnice now?
I have some strange issues with 0.1.13 when CPU goes 100% load with 'iceloop' thread, going to figure it out. Would be great to know

thehunmonkgroup commented 6 years ago

I've never found any official release that's stable, and use my own fork.

Probably would recommend you try their latest master from https://github.com/libnice/libnice

soulfly commented 6 years ago

Still have CPU 100% load issues with latest 'libnice' master

here is some detailed report about my issue and 'libnice' version https://github.com/meetecho/janus-gateway/issues/698#issuecomment-356928458

richp10 commented 6 years ago

@soulfly - have you tried the version I suggest above: git checkout dbaf8f5ccd76089e340883887c7e08e6c04de80a

soulfly commented 6 years ago

@richp10 no, going to try the following things now:

soulfly commented 6 years ago

Hi there,

I can confirm that with dbaf8f5ccd76089e340883887c7e08e6c04de80a commit as suggested by @richp10 I got rid of CPU load issues.

I'm still playing with it to understand is it completely workable or not.

On one installation I periodically getting assert failed in this line https://github.com/libnice/libnice/blob/dbaf8f5ccd76089e340883887c7e08e6c04de80a/agent/agent.c#L2320

because sometimes NICE transition from GATHERING to CONNECTED happened.

Do not know to what part it relates..

soulfly commented 6 years ago

Played even more with dbaf8f5ccd76089e340883887c7e08e6c04de80a and now can conclude the final results.

Unfortunately this commit has other issue for me.

It fixes an issue with CPU 100% load which is good, but yes, there is a new issue I mentioned above (with assert where NICE transition from GATHERING to CONNECTED happened) sometimes occures.

So I'm going to continue my journey here and find other commit which will not have any issues

ocrete commented 6 years ago

Any chance you can chewck the latest git master of libnice and see if you can reproduce the gathering->connected transition problem? If you can, please add any details on the upstream bug at https://gitlab.freedesktop.org/libnice/libnice/issues/43

soulfly commented 6 years ago

@ocrete I can confirm It works great with latest libnice master and this is the best version of libnice for now

syntheticore commented 5 years ago

@ocrete I am still getting the failed assertion with the current master of libnice (67771ba2a30ae7247262c252df052468809711eb):

libnice:ERROR:agent.c:2342:agent_signal_component_state_change: assertion failed: (TRANSITION (DISCONNECTED, FAILED) || TRANSITION (GATHERING, FAILED) || TRANSITION (CONNECTING, FAILED) || TRANSITION (CONNECTED, FAILED) || TRANSITION (READY, FAILED) || TRANSITION (DISCONNECTED, GATHERING) || TRANSITION (GATHERING, CONNECTING) || TRANSITION (CONNECTING, CONNECTED) || TRANSITION (CONNECTED, READY) || TRANSITION (READY, CONNECTED) || TRANSITION (FAILED, CONNECTING) || TRANSITION (FAILED, GATHERING) || TRANSITION (DISCONNECTED, CONNECTING))

ocrete commented 5 years ago

@syntheticore Can you get a stack trace? And ideally a libnice log?

syntheticore commented 5 years ago

@ocrete Could you please outline the steps I would need to take to provide you with this information.

syntheticore commented 5 years ago

@ocrete Some additional info: I had the error using Janus v0.4.4 and libnice 67771ba2a30ae7247262c252df052468809711eb. When testing with Janus 0.4.5 and libnice 5496500b1535d9343fdac2a3408864643fe65d7e I am not able to reproduce it anymore.

ocrete commented 5 years ago

@syntheticore I don't see the working libnice commit upstream?

run janus in gdb, in the environemnet set G_MESSAGES_DEBUG=all .. and when you get the assertion, do "backtrace full"

I assume Janus doesn't capture the GLib debug messages anywhere? (Janus devs may confirm)