meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.02k stars 2.45k forks source link

Videoroom - subscriber unable to switch feeds after publisher's peer connection goes down, even though close_pc is set to false on the subscriber #1761

Closed jflasch closed 4 years ago

jflasch commented 4 years ago

NOTE: I posted about this issue in the google group, but I believe there is a bug so I am posting again here with more specifics as to what is happening.

Overview

My team and I are using the videoroom plugin to create a basic video conferencing demo. Each participant has a publisher handle and multiple subscriber handles subscribed to feeds that should be displayed on their screen. Every subscriber has close_pc = false when they start subscribing. When we swap the feed we want to show on someone's screen, we do a switch request to the new feed. Everything works fine with normal use cases and we are able to switch feeds properly.

Bug

The problem arises when a publisher's peer connection gets destroyed and then we try to do a switch request from one of the destroyed publisher's subscribers to another active feed. When we make this request after the publisher is gone, we get an error back from janus saying "No such room". I have confirmed with "list" requests and "listparticipants" requests to the plugin that the room does exist and that the publisher I am trying to switch to also exists, so the "No such room" error is incorrect. Also, I have confirmed that my subscriber peer connection is still around after the publisher goes down, which is expected with close_pc = false. I believe there is a bug here since I should still be able to switch my subscriber feed to an active one since I have specified close_pc = false. There may be something that I am doing wrong or misunderstanding about switching and the close_pc field, but at the very least the "No such room" error is not correct in this scenario.

lminiero commented 4 years ago

Very likely this is the cause: https://github.com/meetecho/janus-gateway/blob/master/plugins/janus_videoroom.c#L4753

You can try commenting out those two lines to see if it works as expected for you. I'm not sure if that will introduce leaks, or other issues, so please let me know.

jflasch commented 4 years ago

That fix worked for us, but we began experiencing rare janus crashes. I don't have a lot of information yet as to what is causing the crash or how to replicate it, but the change certainly introduced a new problem. If we discover anything more specific about it I'll let you know.

UPDATE: The Janus crashes happening from this change are not rare contrary to what I said above. While using the code with the recommended fix we experienced very frequent crashes after we did some more testing. We are unable to produce an exact reason as to what is causing the crash since it seems to happen randomly. Can you help us resolve this problem in any way? We would be happy to try out any fixes that can be made. Also, should we expect this behavior to be any different on the unified plan branch with the 'switch' or is it a problem on that branch as well?

lminiero commented 4 years ago

I won't be able to help until you provide libasan dumps or gdb stacktraces.

lminiero commented 4 years ago

should we expect this behavior to be any different on the unified plan branch with the 'switch' or is it a problem on that branch as well?

In the unified-plan branch, renegotiations are much more frequent, and in control of the application.

jflasch commented 4 years ago

Here's a link to a tgz containing stacktraces, core files of crashes as well as our modified videoroom.c file (containing the minor code change from above) and our Janus binary. We used the most recent Janus master from today.

https://infics-my.sharepoint.com/:u:/g/personal/tcumaranatunge_infinite_com/EXxDuvkUc9VMo4OF4ynnAOUBYq_CpCYyfjpV4JMkWy9jIQ?e=1u1Lhp

We captured a total of 5 crashes, 2 of which contain the same stacktraces. In the tgz there is a crash.number file for each crash and inside of it there is a stacktrace and a core file name reference (core.). This is the stacktrace that we have two of:

Thread 20 "janus" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f1ff5ffb700 (LWP 22428)]
0x00007f201efb3b15 in g_mutex_lock () from target:/lib/x86_64-linux-gnu/libglib-2.0.so.0
(gdb) bt
#0  0x00007f201efb3b15 in g_mutex_lock () from target:/lib/x86_64-linux-gnu/libglib-2.0.so.0
#1  0x00007f200016fa77 in janus_videoroom_handler (data=<optimized out>) at plugins/janus_videoroom.c:5914
#2  0x00007f201ef95c55 in ?? () from target:/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007f201d7f66ba in start_thread (arg=0x7f1ff5ffb700) at pthread_create.c:333
#4  0x00007f201d52c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

core.22409

Please let us know what you find, thank you for the help.

lminiero commented 4 years ago

Please don't forget to put code snippets in quotes, or the gdb trace links will be misread as references to existing github issues. A libasan dump would help as well (probably more, if it is a memory issue).

jflasch commented 4 years ago

Sorry about that, thanks for fixing it. We used libasan and caused 3 more crashes, below you can find the new stacktraces.

https://pastebin.com/dzM1utWm https://pastebin.com/17SuEE8r https://pastebin.com/aTe4hEyQ

lminiero commented 4 years ago

This seems to confirm it's a memory/reference counter issue. The plugin is trying to access a subscriber after it's been freed because the count of references dropped to 0. Next step would be uncommenting REFCOUNT_DEBUG in refcount.h and recompile, and try again, and possibly share the full log: this will allow us to track the references as they go up and down, to try and figure out why/where a specific reference dropped too soon.

lminiero commented 4 years ago

That said, please do it on master and not on a modified/old version.

jflasch commented 4 years ago

We uncommented REFCOUNT_DEBUG and got 2 new crash logs using libasan. Attached is a zip file containing the 2 full logs where each crash occurred. The crash is at the end of each file.

asnLogs.zip

lminiero commented 4 years ago

This is not master, lines don't match. What commit do the logs refer to?

lminiero commented 4 years ago

Anyway, you probably want to remove/comment the janus_refcount_decrease call that, in your version, is at line 4775. Pretty sure that's the subscriber reference we add for having the subscriber in the room, so since you're not removing the subscriber from the room in that point anymore, the reference shouldn't decreased either.

That should solve that specific crash, but you may want to be sure there aren't any leaks now: keep the refcount-debugging active, and if it's not crashing anymore, wait until all users have left to cleanly close Janus. This will tell you if there are resources that haven't been properly deallocated. If they have, we can be confident that this fix can land in master.

jflasch commented 4 years ago

Just to confirm, this is the code that we have changed in janus_videoroom.c:

        if(s->feed)
                g_clear_pointer(&s->feed, janus_videoroom_publisher_dereference_by_subscriber);
/*      if(s->room)
                g_clear_pointer(&s->room, janus_videoroom_room_dereference); */
        if(s->session && s->close_pc)
                gateway->close_pc(s->session->handle);
        /* Done with the subscriber and free its reference */
/*      janus_refcount_decrease(&s->ref); */

We commented the room dereference and the refcount decrease. This change fixes the "No such room" error and also keeps Janus from crashing, so it seems to be the correct fix.

What do you mean by "cleanly close Janus"? How do we shut down Janus in a way that makes it print objects that are still left? Is there a way to match up the refcount increments and decrements to determine if there are still objects leaking?

lminiero commented 4 years ago

What do you mean by "cleanly close Janus"? How do we shut down Janus in a way that makes it print objects that are still left?

CTRL+C or a SIGINT

Is there a way to match up the refcount increments and decrements to determine if there are still objects leaking?

If refcount debugging is enabled, a summary will be printed on the console/logs before the application exits.

jflasch commented 4 years ago

The commit that this scenario refers to is 7df23ef9bcdc62e8e736682c31c502e37cace561

There are definitely some memory leaks caused by this fix. We verified this by first running Janus with the original janus_videoroom.c code from the above commit, in which the only remaining objects in memory were valid as they referenced the transport, which was still being used. Then we used the modified janus_videoroom.c code with the commented lines above; there were no crashes, but there were 56 objects leaked when we closed Janus.

Attached is a zip containing the Janus log that shows the leaks at the very end of the file as well as the janus_videoroom.c file that we are running. janus_leak.zip

What are the best steps for troubleshooting this further? Note that we are detaching all subscriber handles before we destroy the room. Is there a way to clean up subscriber references during the detach? Since we removed the refcount decrement, the subscribers always stick around.

jflasch commented 4 years ago

We have come up with a solution for this, the updated janus_videoroom.c file can be seen in the pastebin below:

https://pastebin.com/Dv5QJHen

The changes were made in janus_videoroom_hangup_subscriber starting on line 4775, as well as in janus_videoroom_destroy_session lines 2442 to 2446.

This change stops the memory leaks reported in the posts above. in janus_videoroom_hangup_subscriber, we are now only dereferencing the room and decreasing the refcount if close_pc is true. Otherwise, we are leaving the reference to the room so that we can still switch the subscriber to a new publisher even if the original publisher gets destroyed. Once janus_videoroom_destroy_session gets called, we will then properly cleanup the lingering subscriber references if close_pc is false. Please let us know if this is an appropriate fix for this problem

lminiero commented 4 years ago

The 2442-2446 lines seem overkill to me. There already is a check for s->room a few lines below, that removes a reference to the room if there: I think you can just add a janus_refcount_decrease(&s->ref); in that if for the same result. Besides, it would ensure references are removed no matter whether close_pc is set or not, but only depending on whether the subscriber is still in the room.

jflasch commented 4 years ago

You are correct, those lines were overkill. We tried it with your suggestion and it still worked properly, no crashes and no leaks. We tested with subscribers having close_pc set to true and false to make sure it works properly for both cases.

Below are the two functions we have modified in janus_videoroom.c. If these look good to you, could you please commit them to master?

static void janus_videoroom_hangup_subscriber(janus_videoroom_subscriber * s) {
        /* Already hung up */
        if (!s->feed) {
                JANUS_LOG(LOG_INFO, "videroom_hangup_subscriber NO FEED %p \n", &s->ref);
                return;
        } else {
        JANUS_LOG(LOG_INFO, "videroom_hangup_subscriber %p \n", &s->ref);
        }
        /* Check if the owner needs to be cleaned up */
        if(s->pvt_id > 0 && s->room != NULL) {
                janus_mutex_lock(&s->room->mutex);
                janus_videoroom_publisher *owner = g_hash_table_lookup(s->room->private_ids, GUINT_TO_POINTER(s->pvt_id));
                if(owner != NULL) {
                        janus_mutex_lock(&owner->subscribers_mutex);
                        /* Note: we should refcount these subscription-publisher mappings as well */
                        owner->subscriptions = g_slist_remove(owner->subscriptions, s);
                        janus_mutex_unlock(&owner->subscribers_mutex);
                }
                janus_mutex_unlock(&s->room->mutex);
        }
        /* TODO: are we sure this is okay as other handlers use feed directly without synchronization */
        if(s->feed)
                g_clear_pointer(&s->feed, janus_videoroom_publisher_dereference_by_subscriber);
        if(s->close_pc) {
                if(s->room)
                        g_clear_pointer(&s->room, janus_videoroom_room_dereference);
                if(s->session)
                        gateway->close_pc(s->session->handle);
                /* Done with the subscriber and free its reference */
                janus_refcount_decrease(&s->ref);
        }
}
void janus_videoroom_destroy_session(janus_plugin_session *handle, int *error) {
        if(g_atomic_int_get(&stopping) || !g_atomic_int_get(&initialized)) {
                *error = -1;
                return;
        }
        janus_mutex_lock(&sessions_mutex);
        janus_videoroom_session *session = janus_videoroom_lookup_session(handle);
        if(!session) {
                janus_mutex_unlock(&sessions_mutex);
                JANUS_LOG(LOG_ERR, "No VideoRoom session associated with this handle...\n");
                *error = -2;
                return;
        }
        if(g_atomic_int_get(&session->destroyed)) {
                janus_mutex_unlock(&sessions_mutex);
                JANUS_LOG(LOG_WARN, "VideoRoom session already marked as destroyed...\n");
                return;
        }
        /* Cleaning up and removing the session is done in a lazy way */
        if(!g_atomic_int_get(&session->destroyed)) {
                /* Any related WebRTC PeerConnection is not available anymore either */
                janus_videoroom_hangup_media_internal(handle);
                if(session->participant_type == janus_videoroom_p_type_publisher) {
                        /* Get rid of publisher */
                        janus_mutex_lock(&session->mutex);
                        janus_videoroom_publisher *p = (janus_videoroom_publisher *)session->participant;
                        if(p)
                                janus_refcount_increase(&p->ref);
                        session->participant = NULL;
                        janus_mutex_unlock(&session->mutex);
                        if(p && p->room) {
                                janus_videoroom_leave_or_unpublish(p, TRUE, FALSE);
                                /* Don't clear p->room.  Another thread calls janus_videoroom_leave_or_unpublish,
                                         too, and there is no mutex to protect this change. */
                                g_clear_pointer(&p->room, janus_videoroom_room_dereference);
                        }
                        janus_videoroom_publisher_destroy(p);
                        if(p)
                                janus_refcount_decrease(&p->ref);
                } else if(session->participant_type == janus_videoroom_p_type_subscriber) {
                        janus_videoroom_subscriber *s = (janus_videoroom_subscriber *)session->participant;
                        session->participant = NULL;
                        if(s->room) {
                                janus_refcount_decrease(&s->room->ref);
                                janus_refcount_decrease(&s->ref);
                        }
                        janus_videoroom_subscriber_destroy(s);
                }
                g_hash_table_remove(sessions, handle);
        }
        janus_mutex_unlock(&sessions_mutex);
        return;
}
lminiero commented 4 years ago

Any chance you can prepare a diff patch, or a pull request? Would make it easier to identify the changes... Thanks!

jflasch commented 4 years ago

Here's a patch file:

https://pastebin.com/Sf874YR7

This patch file is relative to the most recent commit 38d29d0f02db14c8dd07e1ded0c4d250f502ca1c Please merge to master. Thanks!

lminiero commented 4 years ago

Done, thanks!