meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.23k stars 2.48k forks source link

Segmentation fault when wss and admin wss are enabled #913

Closed tgabi333 closed 7 years ago

tgabi333 commented 7 years ago

When i try to connect to swss it hangs up, janus process has quit with segmentation fault.

Config:

Actions:

After admin_wss is turned off, it works. Using libwebsockets 2.2.1

Logs: there is no relevant log event with Stack trace: https://pastebin.com/aHMfj8mY

emustafa commented 7 years ago

I see. I will consider that.

One last question I have is how do I stop a server? To be more precise, how do I stop vhost? I checked the header file and didn't find anything like lws_vhost_destroy.

Is it similar to how you close a client connection? Like faking an event and returning -1 or something?

Thanks.

lws-team commented 7 years ago

It was introduced in v2.3

LWS_VISIBLE LWS_EXTERN void
lws_vhost_destroy(struct lws_vhost *vh);

You can also close connections cleanly outside of their own service by setting their timeout to -1.

emustafa commented 7 years ago

@lws-team

I am using v2.1. Would there be any workaround for that in this version?

By setting timeout to -1, do you mean this function?

LWS_VISIBLE LWS_EXTERN void
lws_set_timeout(struct lws *wsi, enum pending_timeout reason, int secs);

Is it like I just call

lws_set_timeout(wsi, 'ANY_REASON', -1)

Thanks.

lws-team commented 7 years ago

I am using v2.1. Would there be any workaround for that in this version?

Sure... create the vhosts once as I have been suggesting all along and let them be destroyed when the context is destroyed at the end.

lws_set_timeout(wsi, 'ANY_REASON', -1)

Yes but ANY_REASON is a number. It has no meaning except to diagnose what it was waiting on when it timed out... you can make your own number.

In 2.3 rather than -1 you can use one of these constants

#define LWS_TO_KILL_ASYNC -1
/**< If LWS_TO_KILL_ASYNC is given as the timeout sec in a lws_set_timeout()
 * call, then the connection is marked to be killed at the next timeout
 * check.  This is how you should force-close the wsi being serviced if
 * you are doing it outside the callback (where you should close by nonzero
 * return).
 */
#define LWS_TO_KILL_SYNC -2
/**< If LWS_TO_KILL_SYNC is given as the timeout sec in a lws_set_timeout()
 * call, then the connection is closed before returning (which may delete
 * the wsi).  This should only be used where the wsi being closed is not the
 * wsi currently being serviced.
 */
emustafa commented 7 years ago

Got it. Thanks!

lminiero commented 7 years ago

Just merged the fix, so I'm closing this. Thanks to you all for the feedback, especially to @lws-team for the help on how to best use the library!

tgabi333 commented 7 years ago

Thank you for all :+1:

lminiero commented 7 years ago

@lws-team apologies for pinging you again on this, but I found out a weird behaviour that the new vhost approach introduced, and so wanted to know a bit more about what's happening internally in the library.

Specifically, we got "Interrupted system call" EINTR errors for polls on some UDP server sockets we create in another plugin (see #1004), and I verified they only happened with the new change in the WebSockets transport plugin. I fixed those by ignoring EINTR (we have other checks to break the loop), but it's still weird that we only started getting those after this change.

Does the usage of vhosts imply something different than the multiple contexts we were doing before, e.g., SIGHUP signals being triggered or similar things that might explain what was happening? Thanks!

lws-team commented 7 years ago

Hmmm "interrupted system call" is a response to a signal. lws has no idea about the existence of your sockets outside lws and it doesn't randomly generate signals.

In lws we swallow SIGHUP and find out about it by other means. It sounds like at process level, you are not dealing with SIGHUP (which is generated by default when a socket peer closes).

lminiero commented 7 years ago

git bisect seems to blame that commit, and if I disable the WebSockets transport those errors go away, so that's why I suspected something with vhosts (it doesn't happen with the "old" plugin). Anyway, it may be something I'm doing wrong myself using the code. I'll try to dig a bit deeper in the next days to see if I can find anything about the source of that EINTR, then. Thanks!