signalwire / freeswitch

FreeSWITCH is a Software Defined Telecom Stack enabling the digital transformation from proprietary telecom switches to a versatile software implementation that runs on any commodity hardware. From a Raspberry PI to a multi-core server, FreeSWITCH can unlock the telecommunications potential of any device.
https://freeswitch.com/#getting-started
Other
3.5k stars 1.4k forks source link

Hangup a call when starting recording may cause zombie channels #2067

Open hateeyan opened 1 year ago

hateeyan commented 1 year ago

Describe the bug Hangup a call when starting recording may cause zombie channels. The coredump shows that the channel is waiting a condition signal in recording_thread.

To Reproduce Hard to reproduce

Expected behavior Channel cleared

Package version or git hash

Trace logs

backtrace from core file If applicable, provide the full backtrace from the core file.

(gdb) t 54
[Switching to thread 54 (LWP 556260)]
#0  0x00007f63f02460df in pthread_rwlock_wrlock () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libpthread-2.28.so
(gdb) bt full
#0  0x00007f63f02460df in pthread_rwlock_wrlock () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libpthread-2.28.so
No symbol table info available.
#1  0x00007f63f0680a19 in apr_thread_rwlock_wrlock (rwlock=<optimized out>) at locks/unix/thread_rwlock.c:97
        stat = <optimized out>
#2  0x00007f63f0309dd5 in switch_thread_rwlock_wrlock (rwlock=<optimized out>) at src/switch_apr.c:253
No locals.
#3  0x00007f63f0348b2c in switch_core_session_write_lock (session=<optimized out>) at src/switch_core_rwlock.c:154
No locals.
#4  0x00007f63f033d73f in switch_core_session_thread (thread=<optimized out>, obj=<optimized out>) at src/switch_core_session.c:1755
        session = 0x7f63a453d308
        event = 0x7f6362eb2bc0
        event_str = 0x0
        val = <optimized out>
        __func__ = <error reading variable __func__ (Cannot access memory at address 0x7f63f06a1650)>
        __PRETTY_FUNCTION__ = <error reading variable __PRETTY_FUNCTION__ (Cannot access memory at address 0x7f63f06a1630)>
#5  0x00007f63f0338e2b in switch_core_session_thread_pool_worker (thread=0x7f638829b1c0, obj=<optimized out>) at src/switch_core_session.c:1800
        td = 0x7f63a454d138
        pop = 0x7f63a454d138
        check_status = <optimized out>
        node = <optimized out>
        pool = 0x7f638829af68
        __func__ = <error reading variable __func__ (Cannot access memory at address 0x7f63f06a16e0)>
#6  0x00007f63f0686b6c in dummy_worker (opaque=0x7f638829b1c0) at threadproc/unix/thread.c:151
        thread = 0x7f638829b1c0
#7  0x00007f63f0240fa3 in start_thread () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libpthread-2.28.so
No symbol table info available.
#8  0x00007f63efdf306f in clone () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libc-2.28.so

(gdb) t 55
[Switching to thread 55 (LWP 557436)]
#0  0x00007f63f024700c in pthread_cond_wait@@GLIBC_2.3.2 () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libpthread-2.28.so
(gdb) bt full
#0  0x00007f63f024700c in pthread_cond_wait@@GLIBC_2.3.2 () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libpthread-2.28.so
No symbol table info available.
#1  0x00007f63f03e73e1 in recording_thread (thread=<optimized out>, obj=<optimized out>) at src/switch_ivr_async.c:1329
        bug = <optimized out>
        session = 0x7f63a453d308
        channel = 0x7f63a453b3e0
        rh = 0x555782e9df00
        bsize = 8192
        samples = 0
        inuse = <optimized out>
        data = 0x7f635c14d028 ""
        channels = 2
        read_impl = {codec_type = SWITCH_CODEC_TYPE_AUDIO, ianacode = 0 '\000', iananame = 0x0, fmtp = 0x0, samples_per_second = 0, actual_samples_per_second = 0, bits_per_second = 0, 
          microseconds_per_packet = 0, samples_per_packet = 0, decoded_bytes_per_packet = 0, encoded_bytes_per_packet = 0, number_of_channels = 0 '\000', codec_frames_per_packet = 0, 
          init = 0x0, encode = 0x0, decode = 0x0, encode_video = 0x0, decode_video = 0x0, codec_control = 0x0, destroy = 0x0, codec_id = 0, impl_id = 0, modname = 0x0, next = 0x0}
        __func__ = <error reading variable __func__ (Cannot access memory at address 0x7f63f06bd590)>
        __PRETTY_FUNCTION__ = <error reading variable __PRETTY_FUNCTION__ (Cannot access memory at address 0x7f63f06bd570)>
#2  0x00007f63f0686b6c in dummy_worker (opaque=0x555782e9e970) at threadproc/unix/thread.c:151
        thread = 0x555782e9e970
#3  0x00007f63f0240fa3 in start_thread () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libpthread-2.28.so
No symbol table info available.
#4  0x00007f63efdf306f in clone () from /mnt/volume1/debug/usr/lib/x86_64-linux-gnu/libc-2.28.so
hateeyan commented 1 year ago

When the recording media_bug is initing, if hang up the call now, the problem will be reproduced.

To Reproduce

  1. patch switch_ivr_async.c Make sure we have enough time to hang up the call before the media bug init finishes.
--- switch_ivr_async.c  2023-08-24 13:46:50.935017745 +0800
+++ switch_ivr_async.c.patched  2023-08-24 13:55:23.135628464 +0800
@@ -1509,6 +1509,8 @@ static switch_bool_t record_callback(swi
                    switch_core_media_gen_key_frame(session);
                }
            }
+           /* Make sure we have enough time to hang up the call before the media bug init finishes */
+           switch_yield(20000000);
        }
        break;
    case SWITCH_ABC_TYPE_TAP_NATIVE_READ:
  1. make a call
  2. bgapi uuid_record
  3. hupall
andywolk commented 1 year ago

@hateeyan Thanks for the details. Can you please describe here a little bit more information about this PR and how it changes the behavior.

hateeyan commented 1 year ago

Hi, @andywolk

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_core_session.c#L1717-L1747

After hanging up the call, session_thread removed all media bugs (line 1728).

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_core_media_bug.c#L955-L1000

Media bug will be appended to session->bugs after the initialization finished (SWITCH_ABC_TYPE_INIT). In this case, the media bug will never be colsed (SWITCH_ABC_TYPE_CLOSE).

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_ivr_async.c#L1611-L1621

In record_callback, if the media bug is not closed, the thread condition will not signal.

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_ivr_async.c#L1276-L1278

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_ivr_async.c#L1327-L1333

In recording_thread, lock the session and wait for the condition. A deadlock happens which can explain the backtrace.

if we lock the bug_rwlock, it will wait for the initialization to complete.

andywolk commented 1 year ago

if we lock the bug_rwlock, it will wait for the initialization to complete.

That's the part I was more interested in. Can you please add a little bit more information about how this changes the behavior.

hateeyan commented 1 year ago

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_core_media_bug.c#L1237-L1244

Because it will also lock when removing media bugs. If a media bug is initing, it will block until the bug_rwlock unlocked.

andywolk commented 1 year ago

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_core_media_bug.c#L1237-L1244

Because it will also lock when removing media bugs. If a media bug is initing, it will block until the bug_rwlock unlocked.

Is it possible that switch_core_media_bug_add is called after switch_core_media_bug_remove_all_function?

hateeyan commented 1 year ago

https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_core_media_bug.c#L1237-L1244

Because it will also lock when removing media bugs. If a media bug is initing, it will block until the bug_rwlock unlocked.

Is it possible that switch_core_media_bug_add is called after switch_core_media_bug_remove_all_function?

I just tested this case and it doesn't cause problems.

  1. switch_core_media_bug_remove_all_function will be called again in switch_core_session_destroy. https://github.com/signalwire/freeswitch/blob/b74245d48a1f65a05e853f24e973f9b9ff35f8f5/src/switch_core_session.c#L1558

  2. Before first calling switch_core_media_bug_remove_all_function, the channel has been set to CS_DESTROY, new api calls(uuid_record) will fail.

hateeyan commented 1 year ago

Found a related issue #1673

hateeyan commented 9 months ago

It works fine in our production environment for almost 3 months with the patch #2218.

mishehu commented 6 months ago

We're almost a year out with this patch. Are there any status updates? I'm seeing this issue in the wild.