pjsip / pjproject

PJSIP project
http://www.pjsip.org
GNU General Public License v2.0

If you hang up immediately after the call is connected, the hang up may not be executed immediately because there are too many events to be sent in the event queue, causing the hang up api to be blocked for a period of time. #4131

Closed · zhuangguangkang0013 closed this issue 1 day ago

zhuangguangkang0013 commented 1 week ago

Describe the bug

If you hang up immediately after the call is connected, the hang up may not be executed immediately because there are too many events to be sent in the event queue, causing the hang up api to be blocked for a period of time.

I traced the source code and found that pjmedia_event_unsubscribe must acquire mgr->cb_mutex, but it apparently has to wait until event_mgr_distribute_events has finished processing all events in the queue before the lock becomes available. If events are still being pushed into the queue (and failed pushes cause them to be re-queued, or an event handler blocks, potentially leading to an endless loop in which the hangup never completes), the call cannot be terminated immediately. The hangup only goes through once the other party hangs up first or the lock is finally acquired. In my view, the correct behavior would be: when I initiate a hangup, the queued events for that call should be cleared immediately, no new events should be enqueued, the subscription should be removed, and the hangup should take the highest priority, so that I can tear down the call at once.

https://github.com/pjsip/pjproject/blob/master/pjmedia/src/pjmedia/event.c

pjsua_call_hangup -> pjsua_media_channel_deinit -> stop_media_session -> stop_media_stream -> pjsua_aud_stop_stream/pjsua_vid_stop_stream -> pjmedia_event_unsubscribe
{
    ....
    /* Spin-blocks here trying to acquire the cb_mutex lock: */
    while (1) {
        pj_mutex_lock(mgr->mutex);
        if (pj_mutex_trylock(mgr->cb_mutex) == PJ_SUCCESS)
            break;
        /* The callback is currently being called, wait for a while. */
        pj_mutex_unlock(mgr->mutex);
        pj_thread_sleep(10);
    }
    ....
}

event_worker_thread / pjmedia_event_publish -> event_mgr_distribute_events
{
    while (sub != &mgr->esub_list) {
        *next_sub = sub->next;

        /* Check if the subscriber is interested in 
         * receiving the event from the publisher.
         */
        if (sub->epub == ev->epub || !sub->epub) {
            pjmedia_event_cb *cb = sub->cb;
            void *user_data = sub->user_data;
            pj_status_t status;

            if (rls_lock) {
                /* To make sure that event unsubscription waits
                 * until the callback completes.
                 */
                pj_mutex_lock(mgr->cb_mutex);
                pj_mutex_unlock(mgr->mutex);
            }

            TRACE_((THIS_FILE, "Distributing event %d sub->epub %p ev->epub %p "
                   "user data %p", i, sub->epub, ev->epub, sub->user_data));

            status = (*cb)(ev, user_data);
            if (status != PJ_SUCCESS && err == PJ_SUCCESS)
                err = status;

            if (rls_lock) {
                pj_mutex_unlock(mgr->cb_mutex);
                pj_mutex_lock(mgr->mutex);
            }
        }
        sub = *next_sub;
#ifdef EVENT_TRACE
        i++;
#endif
    }
}

Steps to reproduce

IP-to-IP call. After the call is established, many events are published, or events keep being re-sent after send failures (for example, SIP INFO messages). Calling the hang-up API about 1-2 s after answering reproduces the problem with high probability.

PJSIP version

2.12

Context

Android OS versions: 8.1, 6, 11. PJSIP version: 2.12. I checked the latest source code; it is consistent with version 2.12, so the problem should also exist in the latest version.

Log, call stack, etc

https://github.com/pjsip/pjproject/blob/master/pjmedia/src/pjmedia/event.c

zhuangguangkang0013 commented 1 week ago

@Sauwming @Trengginas Hello, could you please take a look at this issue?

zhuangguangkang0013 commented 6 days ago

I added logging code and ran it:

pjmedia_event_unsubscribe {

    ....

    PJ_LOG(4,(THIS_FILE, "Stopping stream..pjmedia_event_unsubscribe from hangup!"));
    while (1) {
        pj_mutex_lock(mgr->mutex);
        PJ_LOG(4,(THIS_FILE, "Stopping stream..pjmedia_event_unsubscribe from hangup! try to get mgr->cb_mutex lock!"));
        if (pj_mutex_trylock(mgr->cb_mutex) == PJ_SUCCESS) {
            PJ_LOG(4,(THIS_FILE, "Stopping stream..pjmedia_event_unsubscribe from hangup! got mgr->cb_mutex lock! now break while"));
            break;
        }
        /* The callback is currently being called, wait for a while. */
        pj_mutex_unlock(mgr->mutex);
        pj_thread_sleep(10);
    }

    ....

}

event_mgr_distribute_events {

    ...

    while (sub != &mgr->esub_list) {
        *next_sub = sub->next;
        PJ_LOG(4,(THIS_FILE, "Stopping stream..event_mgr_distribute_events running"));

        /* Check if the subscriber is interested in
         * receiving the event from the publisher.
         */
        if (sub->epub == ev->epub || !sub->epub) {
            pjmedia_event_cb *cb = sub->cb;
            void *user_data = sub->user_data;
            pj_status_t status;

            if (rls_lock) {
                /* To make sure that event unsubscription waits
                 * until the callback completes.
                 */
                PJ_LOG(4,(THIS_FILE, "Stopping stream..event_mgr_distribute_events locking mgr->cb_mutex!"));
                pj_mutex_lock(mgr->cb_mutex);
                pj_mutex_unlock(mgr->mutex);
            }

            TRACE_((THIS_FILE, "Distributing event %d sub->epub %p ev->epub %p "
                   "user data %p", i, sub->epub, ev->epub, sub->user_data));

            status = (*cb)(ev, user_data);
            if (status != PJ_SUCCESS && err == PJ_SUCCESS)
                err = status;

            if (rls_lock) {
                PJ_LOG(4,(THIS_FILE, "Stopping stream..event_mgr_distribute_events unlocking mgr->cb_mutex!"));
                pj_mutex_unlock(mgr->cb_mutex);
                pj_mutex_lock(mgr->mutex);
            }
        }
        sub = *next_sub;
        i++;
    }

    ...

}

2024-11-06 10:02:21.627 20310-20339 pjsip-ev_thread D 10:02:21.626 pjsua_call.c !.Timed-out trying to acquire dialog mutex (possibly system has deadlocked) in pjsua_call_send_request
2024-11-06 10:02:21.629 20310-20339 pjsip-ev_thread D 10:02:21.628 pjsua_media.c !Failed requesting keyframe via SIP INFO: Operation timed out (PJ_ETIMEDOUT)

The call was blocked for a while before the hang-up was actually sent. The log string hit its maximum length, so the attached file contains the complete call log from INVITE to BYE: Complete call log from invite to bye.txt

sauwming commented 1 day ago

I believe this is caused by too rapid keyframe requests. Please check: https://github.com/pjsip/pjproject/pull/2786

The events in the queue cannot be cleared because they are shared (the events don't necessarily belong to that particular call only).

zhuangguangkang0013 commented 16 hours ago

> I believe this is caused by too rapid keyframe requests. Please check: #2786
>
> The events in the queue cannot be cleared because they are shared (the events don't necessarily belong to that particular call only).

Yes. Reducing the frequency of keyframe requests also helps, since it reduces the number of events, but it has a downside: the first video frame may appear more slowly. My modification takes a different approach: when I hang up, I prevent new events for that call from entering the queue and remove its events that are already queued, so the hangup only has to wait for the callback that is currently in progress. As you say, the event manager is shared globally, so this could lose some events, but in the hangup scenario I already want to hang up and no longer need that call's remaining events. I tried my modification; it runs well in testing and has shown no problems so far. @sauwming