meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

Janus receiving messages over web-socket but not sending #1263

Closed g-garg-dev closed 5 years ago

g-garg-dev commented 6 years ago

Hi team, for communication with Janus I am using the WebSocket transport. The WebSocket works fine for some time, but after a while it stops sending messages over the WebSocket. I verified this from the log file. I am attaching logs for the timestamps when it was sending messages here and when it was not here.

lminiero commented 6 years ago

I don't see any "Sending event to transport" line in your second log, so I don't see where the problem is. It looks like the core is not passing anything to the transport.

g-garg-dev commented 6 years ago

Hi Lorenzo, I went through the source code of Janus, and from the logging in janus.c it looks like it sent the message to the transport, but the transport didn't receive it. code

lminiero commented 6 years ago

Can't replicate this, might be an issue specific to your setup (which, by the way, you told us nothing about). If you have ways to consistently replicate this, please let us know.

g-garg-dev commented 6 years ago

Setup: we are running Janus on an AWS machine. The application that makes the WebSocket connection with Janus is also running on AWS. For calls we are using the VideoRoom plugin. All of our video calls are one to one: one client sends video and at most one receives it.

How a video call happens: the browser client (who is sending the video feed) sends a request to the application server. The application server sends messages over the WebSocket to create a session, then create a plugin handle, then create a room. After the create-room response, the application server tells the client that the room has been created. After this, the client sends its SDP offer and ICE candidates to the application server, which forwards them to Janus.

The browser client that wants to consume this feed sends a request to the application server. The application server sends messages over the WebSocket to create a session, then create a plugin handle, then send a join request with the feed id. In response Janus sends an SDP, which we forward to the client; the client processes it and sends back an SDP answer and ICE candidates. Here https://drive.google.com/file/d/1vTIQ0U3AysGHrIGgtkKNyZ9LWplhW7st/view?usp=sharing is a detailed log file. Messages stop coming to the application server after 13:54:58. We are facing this issue in our production environment; however, I will try to reproduce it locally. Maybe from the logs you can find what is causing this issue. Also, if you have any questions about our setup, please ask.


lminiero commented 6 years ago

Setup: we are running Janus on an AWS machine

This doesn't give any info on your setup, like distro, Janus version, libwebsockets version, etc.

lminiero commented 6 years ago

Here https://drive.google.com/file/d/1vTIQ0U3AysGHrIGgtkKNyZ9LWplhW7st/view?usp=sharing is a detailed log file

Sorry, but there's nothing we can do with this log file. It looks like you enabled debugging level 7, which means there's a ton of unrelated stuff. You should only enable libwebsockets debugging instead, which you can do in the related configuration file. The log level there is actually an integer that reflects a bitmask, documented here: https://libwebsockets.org/lws-api-doc-master/html/group__log.html I think that if you set 31 as the log level, you get:

LLL_ERR = 1 << 0
LLL_WARN = 1 << 1
LLL_NOTICE = 1 << 2
LLL_INFO = 1 << 3
LLL_DEBUG = 1 << 4
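
Those are the lws_log_levels flags; as a minimal sketch (not Janus code, just the arithmetic, assuming the libwebsockets development headers are installed), 31 is simply those five flags OR'ed together:

#include <libwebsockets.h>
#include <stdio.h>

int main(void) {
    /* 31 is simply all five of the lowest flags combined */
    int mask = LLL_ERR | LLL_WARN | LLL_NOTICE | LLL_INFO | LLL_DEBUG;
    printf("mask = %d\n", mask);   /* prints: mask = 31 */
    /* Janus hands a value like this to libwebsockets at startup;
     * NULL keeps the library's default log emitter */
    lws_set_log_level(mask, NULL);
    return 0;
}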

Anyway, considering we can't replicate this, you may want to open an issue on the libwebsockets GitHub page as well, in case they can help you figure out what's going wrong with the library in your case.

peffis commented 6 years ago

I'm not sure if the issue I'm seeing currently is related to this websocket issue.

I'm on the latest Janus master with libwebsockets v2.4-stable built with LWS_MAX_SMP=1. After some usage time the WebSocket interface gets locked and all request attempts time out. After enabling lock_logging (thanks for introducing that as a config option, btw) I see that there is one lock still taken, and it's the one at transports/janus_websockets.c:janus_websockets_send_message:804:lock (https://github.com/meetecho/janus-gateway/blob/02a105a60cda41c1feea6ed1517f22778cb4cc51/transports/janus_websockets.c#L804)

I do not understand what could make it keep holding that mutex in the following lines, as I am not familiar with how, for instance, lws_callback_on_writable works. My use case is one where users come and go and might not close or leave the rooms in a clean way, so one possible reason is that there could still be references to participants whose WebSocket connection no longer has a remote side, which would make the attempt to send an event fail (or block?). But I don't know; I'm still early in the analysis of this one. I did not see this when running on older Janus commits, though (such as from March this year).
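
For reference, the code around that line has roughly this shape (a simplified, self-contained sketch with invented struct and field names, not the actual janus_websockets.c): the per-client mutex is taken, the payload is queued, lws_callback_on_writable only schedules the write, and the mutex is released, so nothing here should block unless another code path never releases that same mutex.

#include <glib.h>
#include <libwebsockets.h>

/* Invented, simplified client structure for illustration only */
typedef struct ws_client {
    struct lws *wsi;        /* libwebsockets connection handle */
    GAsyncQueue *outgoing;  /* messages waiting to be written */
    GMutex mutex;           /* guards this client */
    gboolean destroyed;
} ws_client;

static int ws_send_message_sketch(ws_client *client, char *payload) {
    if (client == NULL || payload == NULL)
        return -1;
    g_mutex_lock(&client->mutex);            /* the lock reported as still held */
    if (client->destroyed || client->wsi == NULL) {
        g_mutex_unlock(&client->mutex);
        return -1;
    }
    g_async_queue_push(client->outgoing, payload);
    lws_callback_on_writable(client->wsi);   /* only schedules the actual write */
    g_mutex_unlock(&client->mutex);
    return 0;
}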

lminiero commented 6 years ago

@peffis did you mean to open a new issue? I don't see how your comment is related to this issue here. Or is it actually the same problem?

peffis commented 6 years ago

Is it not related? I am observing a potential locking situation around websockets so I thought perhaps this issue is related. But I can file a new issue for it if you prefer.

I see your point, this issue could still receive messages. Sorry about that.

lminiero commented 6 years ago

Actually it's not clear if this issue is an issue at all, let alone if the cause is a lock :smile: No need to open a new issue, then, in case these two are indeed related: for your locking analysis, you may want to have a look at this dumb script I wrote some time ago when investigating a similar problem, https://github.com/meetecho/janus-gateway/issues/732#issuecomment-297767502 Basically, you feed it the Janus log, and it tells you which lock seems to be still held, and where it was taken, so it might help you find out where the problem happens for you. You'll have to change the "LOCK" and "UNLOCK" strings to ":lock" and ":unlock", though, as we changed the way they're formatted in the meanwhile.
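
The real tool is the node.js script linked above; purely to illustrate the idea it implements, a small self-contained C sketch that does the same pairing (scan the log for ":lock]" / ":unlock]" lines, match them by mutex address, print whatever was locked but never unlocked):

#include <glib.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s janus.log\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    /* Map: mutex address -> the log line that last locked it */
    GHashTable *held = g_hash_table_new_full(g_str_hash, g_str_equal, g_free, g_free);
    char line[2048];
    while (fgets(line, sizeof(line), f) != NULL) {
        gboolean is_lock = (strstr(line, ":lock]") != NULL);
        gboolean is_unlock = (strstr(line, ":unlock]") != NULL);
        if (!is_lock && !is_unlock)
            continue;
        /* The mutex address is the last token on the line */
        char *addr = strrchr(line, ' ');
        if (addr == NULL)
            continue;
        g_strchomp(++addr);
        if (is_lock)
            g_hash_table_replace(held, g_strdup(addr), g_strdup(line));
        else
            g_hash_table_remove(held, addr);
    }
    fclose(f);
    printf("Locks still taken: %u\n", g_hash_table_size(held));
    GHashTableIter iter;
    gpointer key, value;
    g_hash_table_iter_init(&iter, held);
    while (g_hash_table_iter_next(&iter, &key, &value))
        printf("  %s", (char *)value);
    g_hash_table_destroy(held);
    return 0;
}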

peffis commented 6 years ago

Yes, that is how I came to the conclusion that there is a lock issue on the line I mentioned above. The output of the script (modified due to the case change) is shown below:

$ node ../locks.js janus2.log
Parsing log file janus2.log for lock/unlock evaluation...
Locks still taken:
{ undefined:
   { locks: 1,
     line: '[transports/janus_websockets.c:janus_websockets_send_message:804:lock] 0x60700000a7c8' } }

peffis commented 6 years ago

But in another run I just did it showed that a different lock was being held when this websocket issue occurs:

$ node ../locks.js janus4.log
Parsing log file janus4.log for lock/unlock evaluation...
Locks still taken:
{ undefined:
   { locks: 1,
     line: '[plugins/janus_videoroom.c:janus_videoroom_hangup_media:4203:lock] 0x7feefbb28120' } }

So I'm not sure what is happening. What I see is that it is no longer possible to send messages to the websocket interface when this state is reached.

lminiero commented 6 years ago

@garg-geek can you let us know if it's still an issue in master?

g-garg-dev commented 6 years ago

Hi Lorenzo, I increased the logging level in the main file lwsws/main.c to 31. I also rebuilt libwebsockets with -DCMAKE_BUILD_TYPE=DEBUG, but still no WebSocket-specific messages appear in the Janus log file. I went through the logging description of libwebsockets and maybe I missed something. Can you help me with it?


peffis commented 6 years ago

You should set the WebSocket log level in the WebSocket transport config in Janus. Janus will configure the log level in libwebsockets with a function call when that file is read. But even after having enabled logging there, the WebSocket logging is a bit cryptic and easily missed: basically lines with the word "WSS" and some code.
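
Concretely, that means something like this in the WebSockets transport configuration (a sketch from memory of the INI-style janus.transport.websockets.cfg of that era; option names may differ between versions, so check the sample file shipped with yours):

[general]
ws = yes                ; enable plain WebSockets
ws_port = 8188
; libwebsockets logging: an integer bitmask (see the lws_log_levels link above),
; e.g. 31 = ERR | WARN | NOTICE | INFO | DEBUG
ws_logging = 31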

lminiero commented 6 years ago

You should set the WebSocket log level in the WebSocket transport config in Janus

Which reminds me I should improve the way you set the debugging level for WebSockets... at the moment we ask for a number, but the libwebsockets debugging is actually flag-based, so putting 7 instead of 4 doesn't mean you'll get more debugging at all, you're just setting different flags in a bitmask. I'll have to change that to something like a comma-separated list of things you want to enable debug for, so that we can update the mask ourselves and make it easier for users.

lminiero commented 6 years ago

Implemented here: https://github.com/meetecho/janus-gateway/commit/3ba537a006f04d1e180ede45c56599e04620e36c

Cibifang commented 5 years ago

I have met a similar problem and it appears very frequently when I test stability. I use WebSockets with the VideoRoom plugin, and I test it like this: three people join the same room and quit frequently. And then the Janus API can't respond anymore, on both WebSocket and HTTP (I use the HTTP API for admin).

I'm on the latest Janus master with the latest libwebsockets built with LWS_MAX_SMP=1. The output of the (case-adjusted) script is shown below:

$ node lock-parser.js janus.last.log
Parsing log file janus.last.log for LOCK/UNLOCK evaluation...
Locks still taken:
{ undefined: { locks: 1, line: '[plugins/janus_videoroom.c:janus_videoroom_hangup_media:4217:lock] 0x7f9d6558b560' } }

And here is what pstack printed.

I have met it many times. But unlike @peffis, for me it's always in janus_videoroom_hangup_media. (Maybe because I test "close" by refreshing the browser directly.)

lminiero commented 5 years ago

Have you updated the lock-parser.js script? The original one looked for LOCK and UNLOCK, but the way we render those has changed in the meanwhile: https://github.com/meetecho/janus-gateway/blob/master/mutex.h#L31 https://github.com/meetecho/janus-gateway/blob/master/mutex.h#L37
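
For context, the format those lines use is visible in the reports above; a sketch of the kind of instrumentation that produces them (illustrative only, not copied from mutex.h): every lock/unlock prints file, function, line, ":lock"/":unlock" and the mutex address, which is exactly the pair of markers the parser script matches up.

#include <pthread.h>
#include <stdio.h>

#define janus_mutex_lock_dbg(m) do { \
        printf("[%s:%s:%d:lock] %p\n", __FILE__, __func__, __LINE__, (void *)(m)); \
        pthread_mutex_lock(m); \
    } while (0)

#define janus_mutex_unlock_dbg(m) do { \
        printf("[%s:%s:%d:unlock] %p\n", __FILE__, __func__, __LINE__, (void *)(m)); \
        pthread_mutex_unlock(m); \
    } while (0)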

Cibifang commented 5 years ago

Yes, I use ":lock" and ":unlock" now. I have read the other related issues ^.^

Update: when the Janus API doesn't respond, I tried using tcpdump to capture packets, and found:

  1. When I try to connect to the WebSocket port, it responds with a TCP ACK but nothing else.
  2. Janus keeps trying to reconnect the WebRTC connection even though the remote responds with "Port unreachable".

vahidov commented 5 years ago

I have the same (I think the same) situation as @Cibifang, but only sometimes in production (can't reproduce it in the test environment). The last time, I caught these locks; each of them appeared twice (two locks without an unlock):

[transports/janus_websockets.c:janus_websockets_destroy_client:794:lock] 0x72d3d4039a88
[janus.c:janus_session_handles_remove:613:lock] 0x72d3e042cf28
[janus.c:janus_transport_gone:2517:lock] 0xd07202ab2e0 

After these locks, Janus froze and no longer answered on either WebSocket or HTTP.

Maybe this helps

vahidov commented 5 years ago

Next

[janus.c:janus_session_handles_clear:621:lock] 0x774cec285098
[janus.c:janus_process_incoming_request:1388:lock] 0x774cec265ca0
[janus.c:janus_transport_gone:2517:lock] 0xdc63f3e02e0
[transports/janus_websockets.c:janus_websockets_destroy_client:794:lock] 0x774ce848cbb8

lminiero commented 5 years ago

@Cibifang can you let me know if you see the same lock still held in master? The lines don't match with what I have here. If you can share the log you feed to the lock parser, that would help too, so that I can look into it myself.

lminiero commented 5 years ago

@vahidov lines don't match for your report either, so please try master as well (and again, a log where the issue definitely happens to feed to the lock parser would help).

lminiero commented 5 years ago

Just guessing at what might be happening here: it looks like the mutex that's still held may be the transport one.

Specifically, when janus_websockets_destroy_client is called (WS connection closed), we lock the transport mutex and invoke transport_gone on the core. After that method returns, we free the rest of the transport stuff and release the mutex. Anyway, transport_gone is implemented by janus_transport_gone, which in turn calls janus_session_destroy for all the sessions created by that transport; janus_session_destroy calls janus_session_handles_clear to get rid of all the handles attached to that session, via one call to janus_ice_handle_destroy per handle; janus_ice_handle_destroy will in turn result in calls to hangup_media and destroy_session on the plugin the handle was attached to.

The problem here is that some plugins (including the VideoRoom) will try to send events to other participants in either a hangup_media or destroy_session implementation, e.g., to tell other participants that a publisher just disconnected and they should remove their subscription. In case any of those participants were sharing the session with the handle that just left (e.g., you're sending both your webcam and a screen share, so two publishers in a room), then the push_event will eventually reach the same transport that just went away: for WebSockets, that's the janus_websockets_send_message method, which will try to acquire the transport mutex again. Since the lock is held by the transport itself, which originated the series of events that led to this, we have a deadlock.
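
Boiled down, the suspected shape is the classic self-deadlock below (a contrived, self-contained illustration with invented names, not Janus code; with a default non-recursive mutex, running it hangs on the second lock):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t transport_mutex = PTHREAD_MUTEX_INITIALIZER;

static void send_message(const char *event) {
    pthread_mutex_lock(&transport_mutex);    /* second lock on the same mutex: blocks forever */
    printf("sending: %s\n", event);
    pthread_mutex_unlock(&transport_mutex);
}

static void hangup_media(void) {
    /* the plugin notifies other participants on the same transport */
    send_message("publisher left, please unsubscribe");
}

static void transport_gone(void) {
    /* the core synchronously tears down sessions/handles, reaching the plugin */
    hangup_media();
}

static void destroy_client(void) {
    pthread_mutex_lock(&transport_mutex);    /* first lock, taken when the WS connection closes */
    transport_gone();                        /* ...which loops back into send_message() */
    pthread_mutex_unlock(&transport_mutex);  /* never reached */
}

int main(void) {
    destroy_client();
    return 0;
}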

I haven't analyzed in full detail if this is the definite cause of your deadlocks, but it seems like a very likely candidate, especially if I look at the scenario @vahidov described (hangup media being the common cause for him). Not sure if the double publisher scenario is one that you're using, but if you can share some more details on how you're using the plugins it might help us better figure out if I'm on the right path or not.

At any rate, I'll start thinking of ways to avoid that lock. Something in that chain of events will need to become asynchronous, rather than synchronous: e.g., the notify_event from plugins might be a good option, as we already made close_pc and end_session asynchronous for similar reasons in the past (the core getting back to plugins or vice versa causing deadlocks). Feedback is welcome of course, especially if you have better ideas.

lminiero commented 5 years ago

Mh, looks like my steps-to-deadlock hypothesis was wrong: I just tried to replicate it with the VideoRoom demo, and I couldn't deadlock it. What I missed was that, before calling hangup_media and destroy_session, the session is marked as destroyed, so any attempt to hit push_event actually finds the session already destroyed and the message never reaches the transport. As such, the deadlock may actually be caused by a different, yet similar, race condition, but one I have yet to identify.

I think I'll make transport_gone asynchronous for now, and give you something to test. If that seems to fix your issue, we'll know this was indeed the root cause. I'll prepare the fix shortly.

lminiero commented 5 years ago

@garg-geek @peffis @Cibifang @vahidov can you test again by pulling the fix above? As anticipated, this will make transport_gone not be within the lock calls anymore, and so should avoid lock contentions with send_message. Looking forward to your feedback.
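
Reusing the toy names from the sketch a few comments above, the change described here amounts to something like the following (an illustration of the idea only, not the actual commit):

static void destroy_client_fixed(void) {
    pthread_mutex_lock(&transport_mutex);
    /* ...invalidate and free the per-client resources while holding the lock... */
    pthread_mutex_unlock(&transport_mutex);
    /* Notify the core only after the lock has been released, so any event that
     * flows back into send_message() can take the mutex without deadlocking */
    transport_gone();
}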

vahidov commented 5 years ago

Hello. We're using only the VideoRoom plugin. One publisher with only one stream (no simulcasting, no screen sharing). The others join the room as subscribers.

Steps for the publisher client (browser): 1) connect to Janus, 2) attach the VideoRoom plugin, 3) send a request to our server API with credentials (session and handle), 4) our API performs various checks and then calls Janus via REST with the client's session and handle.

Steps for the listener client (browser): 1) connect to Janus, 2) attach the VideoRoom plugin, 3) send a request to our server API with credentials (session and handle), 4) our API performs various checks and then calls Janus via REST with the client's session and handle.

If the client gets an error at any step (from our API, Janus error callbacks, mediaState off, webrtcState off, the destroyed callback), we try to call hangup() and detach() on the handle from the client.

On one page a listener can see more than one publisher (max = 12). For each room the client makes a separate VideoRoom attach (a separate handle). All handles work over one janusConnection.

A client can create its own publisher room and at the same time watch other users (on other handles, as a listener).

We clean up unused rooms (rooms without active publishers) once every 24 hours.

Sometimes, for old browsers, we do RTP forwarding and ffmpeg conversion to RTMP. On request (not for all publishers) we make REST API calls to Janus: 1) listforwarders, 2) if a forwarder doesn't exist, rtp_forward.

Params used to create a room and to join as publisher and listener.

P.S. We set reclaim_session_timeout = 3 in janus.cfg 4 days ago and so far we haven't caught a transport freeze.

lminiero commented 5 years ago

But if you're using REST, what does it have to do with WebSockets?

vahidov commented 5 years ago

@lminiero clients (browsers) connect over WebSockets. Only our API uses REST.

lminiero commented 5 years ago

Then please try the fix I made and let me know if it's still an issue.

lminiero commented 5 years ago

Ping...

Cibifang commented 5 years ago

In order to temporarily work around this problem, I wrote a script that checks the status of Janus and restarts it directly when the problem is detected.

I will stop this script and pull the fix to see if this issue happens again.

lminiero commented 5 years ago

Got no feedback in a month, so I'll assume this is fixed. Feel free to let me know if that's not the case.

g-garg-dev commented 5 years ago

Hi Lorenzo, I am still facing this issue. I enabled the WebSocket logging to all, but no logs were printed.

Titan-p commented 4 years ago

janus_transport_gone will eventually call destroy_session, which is implemented by each plugin. I found that janus_transport_gone locks sessions_mutex, and janus_streaming_destroy_session in the Streaming plugin locks sessions_mutex too. But sessions_mutex is initialized with PTHREAD_MUTEX_INITIALIZER, which is not a recursive mutex. Will this cause any deadlock problems? @lminiero
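
Whether that bites depends on whether both code paths really take the same mutex object, but the non-recursive part of the question is easy to demonstrate: a default pthread mutex locked twice by the same thread is undefined behavior and in practice usually hangs, as in this small test (compile with -pthread; it never prints the second line).

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t sessions_mutex = PTHREAD_MUTEX_INITIALIZER;

int main(void) {
    pthread_mutex_lock(&sessions_mutex);
    printf("first lock acquired\n");
    /* Default-type mutexes are not recursive: relocking from the same
     * thread is undefined behavior and typically blocks forever. */
    pthread_mutex_lock(&sessions_mutex);
    printf("second lock acquired\n");   /* never reached */
    pthread_mutex_unlock(&sessions_mutex);
    pthread_mutex_unlock(&sessions_mutex);
    return 0;
}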