signalwire / freeswitch

FreeSWITCH is a Software Defined Telecom Stack enabling the digital transformation from proprietary telecom switches to a versatile software implementation that runs on any commodity hardware. From a Raspberry PI to a multi-core server, FreeSWITCH can unlock the telecommunications potential of any device.
https://freeswitch.com/#getting-started

stuck channels on FreeSWITCH 1.10.10 (possibly transcoding is the cause) #2264

Open gabada opened 1 year ago

gabada commented 1 year ago

Describe the bug
FreeSWITCH channels are getting stuck. When you restart FreeSWITCH (or run fsctl crash), the CDR that gets written says INCOMPATIBLE_DESTINATION.

When I set my ITSP to only offer PCMU everything works as expected.

When you do a uuid_kill on the channel it says OK but doesn't actually kill it (remove it from the database). You need to restart FreeSWITCH to get it removed.
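As a hedged illustration of the PCMU-only workaround mentioned above (the profile file and exact placement are assumptions; adjust to your own Sofia profile), limiting the offered codecs in a sip_profiles XML file could look like:

```xml
<!-- Sketch: in a Sofia profile (e.g. conf/sip_profiles/external.xml),
     restrict codec negotiation to PCMU so no transcoding is needed. -->
<param name="inbound-codec-prefs" value="PCMU"/>
<param name="outbound-codec-prefs" value="PCMU"/>
```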

To Reproduce
Steps to reproduce the behavior:

  1. Make a bunch of inbound calls to FreeSWITCH 1.10.10, offering codecs other than PCMU.
  2. Not every call is affected, but eventually you will start seeing calls that get stuck.

Expected behavior
FreeSWITCH should handle the transcoding, and the calls should release normally when they are completed.

Package version or git hash

Trace logs
The attached freeswitch.log is from a call that is still stuck. freeswitch.log

backtrace from core file

backtrace.log

markjcrane commented 1 year ago

I've seen this on multiple servers as well. Using PCMU only seems to prevent calls from getting stuck.

andywolk commented 1 year ago

Not all debugging symbols were loaded. Please re-do the backtrace.

kaaelhaa commented 11 months ago

We are also seeing this. We cannot replicate it in our test environment, but it happens in production.

It does not seem to be related to call volume; calls just randomly get "stuck", and it looks like the thread processing the call in FreeSWITCH is hanging.

kaaelhaa commented 11 months ago

Downgrading to FreeSWITCH 1.10.8 fixes the issue for us, although it is not ideal.

fetristan commented 11 months ago

Hi, we have the same problem here: FreeSWITCH receives the BYE but doesn't answer it, and the line gets stuck at random. We are forced to uuid_kill the line. [screenshot]

andywolk commented 11 months ago

@kaaelhaa FreeSWITCH 1.10.8 because 1.10.9 is also affected?

kaaelhaa commented 11 months ago

> @kaaelhaa FreeSWITCH 1.10.8 because 1.10.9 is also affected?

@andywolk I don't know if 1.10.9 is affected. We skipped 1.10.9 in our production environment so haven't tested.

andywolk commented 11 months ago

@kaaelhaa was libsofia downgraded as well or just FreeSWITCH?

kaaelhaa commented 11 months ago

> @kaaelhaa was libsofia downgraded as well or just FreeSWITCH?

@andywolk Downgraded by installing the Debian packages for 1.10.8, so I assume that also means libsofia was downgraded?

andywolk commented 11 months ago

@kaaelhaa What is libsofia version on the system with FS 1.10.8 right now?

gabada commented 11 months ago

I added the symbols yesterday. Waiting for some stuck calls.

andywolk commented 11 months ago

@gabada If the dbgsym files are installed and you still have the core dump file, you can regenerate the backtrace.

gabada commented 11 months ago

@andywolk Here is the newly generated backtrace; it still has some ?? in it, not sure why. I installed freeswitch-all-dbg and libfreeswitch1-dbg. So far no stuck calls, but I don't have many calls going through this server right now. backtrace.log

Also, BKW asked me to run deadlock.py on it, and there are no deadlocks.

andywolk commented 11 months ago

@gabada Because there are no libsofia dbg symbols: apt-get install libsofia-sip-ua0-dbgsym

Also when doing a backtrace please do

thread apply all bt
info threads
thread apply all bt full
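The three gdb commands above can also be run non-interactively so the full backtrace lands in a log file; a sketch (the binary and core-dump paths are assumptions, adjust to your install):

```shell
# Write the gdb commands requested above into a command file.
cat > /tmp/fs_bt.gdb <<'EOF'
thread apply all bt
info threads
thread apply all bt full
EOF
# Then run gdb in batch mode against the FreeSWITCH binary and core dump
# (paths are illustrative; point them at your own binary and core file):
# gdb -batch -x /tmp/fs_bt.gdb /usr/bin/freeswitch /var/core/core.freeswitch > backtrace.log 2>&1
```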
gabada commented 11 months ago

I installed libsofia. Still seeing this after bt:

#0  0x00007fdfab91d7bc in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffe52debcb0) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x00007fdfabdb03e2 in fspr_sleep (t=<optimized out>) at time/unix/time.c:246
#2  0x00007fdfaba9f3f2 in switch_core_runtime_loop (bg=<optimized out>) at src/switch_core.c:1201
#3  0x0000558c99a7bf09 in ?? ()
#4  0x00007fdfab8461ca in __libc_start_call_main (main=main@entry=0x558c99a7b490, argc=argc@entry=7, argv=argv@entry=0x7ffe52df0a18) at ../sysdeps/nptl/libc_start_call_main.h:58
#5  0x00007fdfab846285 in __libc_start_main_impl (main=0x558c99a7b490, argc=7, argv=0x7ffe52df0a18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe52df0a08) at ../csu/libc-start.c:360
#6  0x0000558c99a7d161 in ?? ()
andywolk commented 11 months ago

Oh, never mind the dbg for sofia, it is there. The other ?? are at least from lua; we don't currently need those.

Please re-generate the backtrace using the gdb commands I mentioned.

gabada commented 11 months ago

Here is a new backtrace.

backtrace.log

andywolk commented 11 months ago

Now, with your help, let's check if there are deadlocks. You will need to download a Python script from GitHub, configure gdb, open the dump, and execute the deadlock command. Here are the steps (libc6-dbg is a requirement for the script):

apt-get install libc6-dbg wget
wget -O /deadlock.py https://raw.githubusercontent.com/facebook/folly/master/folly/experimental/gdb/deadlock.py
echo source -v /deadlock.py > /root/.gdbinit

/root/.gdbinit (the user root may be different in your case).

Then open gdb and do

deadlock

And see if it finds anything
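Put together, the setup above amounts to the following (the .gdbinit path follows the comment's root-user example; the gdb invocation is illustrative and commented out):

```shell
# Point gdb at the deadlock detector script downloaded above.
echo 'source -v /deadlock.py' > "${HOME}/.gdbinit"
# Then open the core dump in gdb and run the detector:
# gdb /usr/bin/freeswitch /path/to/core    # at the (gdb) prompt, type: deadlock
```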

gabada commented 11 months ago

No deadlock detected. Do you have debug symbols installed?

kaaelhaa commented 11 months ago

> @kaaelhaa What is libsofia version on the system with FS 1.10.8 right now?

@andywolk this is the Sofia version on 1.10.8:

$ dpkg -l | grep sofia
ii  freeswitch-mod-sofia              1.10.8~release~20~3510866140~buster-1~buster+1 amd64        mod_sofia for FreeSWITCH
ii  libsofia-sip-ua0                  1.13.16-125~dfc7095f4c~buster                  amd64        Sofia-SIP library runtime

And 1.10.10 for comparison:

$ dpkg -l | grep sofia
ii  freeswitch-mod-sofia              1.10.10~release~24~4cb05e7f4a~buster-1~buster+1 amd64        mod_sofia for FreeSWITCH
ii  libsofia-sip-ua0                  1.13.16-125~dfc7095f4c~buster                   amd64        Sofia-SIP library runtime

So the same version of libsofia across the two systems.

andywolk commented 11 months ago

apt-cache policy libsofia-sip-ua0
gabada commented 11 months ago

@andywolk I currently have a stuck channel. I did gcore $(pidof freeswitch) I just ran the deadlock script on it but there is no deadlock. I am going to make a backtrace now and upload it shortly.

gabada commented 11 months ago

@andywolk Here is the new backtrace. The instance is still up with the stuck call so if you need anything else or even want to do a screenshare session please let me know. backtrace.log

kaaelhaa commented 11 months ago
> apt-cache policy libsofia-sip-ua0

On the 1.10.8 host:

$ apt-cache policy libsofia-sip-ua0
libsofia-sip-ua0:
  Installed: 1.13.16-125~dfc7095f4c~buster
  Candidate: 1.13.16-125~dfc7095f4c~buster
  Version table:
 *** 1.13.16-125~dfc7095f4c~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
        100 /var/lib/dpkg/status
     1.13.16-124~dfc7095f4c~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
     1.13.15-123~2366f9cf40~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
     1.13.15-122~a3b83039d4~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
     1.12.11+20110422.1-2.1+deb10u4 500
        500 http://security.debian.org/debian-security buster/updates/main amd64 Packages
     1.12.11+20110422.1-2.1+b1 500
        500 http://cdn-aws.deb.debian.org/debian buster/main amd64 Packages

1.10.10 for comparison:

$ apt-cache policy libsofia-sip-ua0
libsofia-sip-ua0:
  Installed: 1.13.16-125~dfc7095f4c~buster
  Candidate: 1.13.16-125~dfc7095f4c~buster
  Version table:
 *** 1.13.16-125~dfc7095f4c~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
        100 /var/lib/dpkg/status
     1.13.16-124~dfc7095f4c~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
     1.13.15-123~2366f9cf40~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
     1.13.15-122~a3b83039d4~buster 500
        500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
     1.12.11+20110422.1-2.1+deb10u4 500
        500 http://security.debian.org/debian-security buster/updates/main amd64 Packages
     1.12.11+20110422.1-2.1+b1 500
        500 http://cdn-aws.deb.debian.org/debian buster/main amd64 Packages
andywolk commented 11 months ago

Thank you. We see a pattern in both backtraces. Will analyze further.

prashan-abey commented 11 months ago

Hi @andywolk, is there any progress on the investigation?

We are experiencing the same issue after FreeSWITCH 1.10.10 upgrade.

flaviogrossi commented 11 months ago

@andywolk if it helps, I think I might be experiencing the same issue, and it seems that for the relevant calls, this log is printed in the logs, but not this one.

Meaning that the session lock can never be acquired or released.

We're currently investigating the issue, will post updates in case any further details come up.

andywolk commented 11 months ago

We are working on a solution. There is another issue filed where we can see similar things https://github.com/signalwire/freeswitch/issues/2290

xadhoom commented 11 months ago

For those who need the latest FreeSWITCH (like us, because of OpenSSL 3.0.x support), reverting the commit cited in #2290 temporarily fixes the issue (we have it in production with the revert, and everything is running fine).

robertoscarpone commented 10 months ago

We are encountering this problem too, more and more frequently. Besides reverting the commit mentioned in #2290, is there any news on fixing this issue?

gabada commented 9 months ago

Was this resolved in FreeSWITCH 1.10.11? Thanks!

greenbea commented 8 months ago

No, 1.10.11 still has this bug.

gabada commented 8 months ago

> No, 1.10.11 still has this bug.

Thanks for letting me know.

shaunjstokes commented 6 months ago

We also have this problem on 1.10.11; it is not an issue on 1.10.9.

pinc444 commented 5 months ago

I also have this issue. I created a script that does uuid_kill on old channels; here is the code:

#!/bin/bash
now1=$(/usr/local/freeswitch/bin/fs_cli -x "strepoch")
# here you set how old a call must be before it is killed
hour=3
minute=0
seconds=0

# here we calculate the total number of seconds from hours+minutes+seconds
allSec=$(( hour*60*60 + minute*60 + seconds ))

# here we set the epoch; everything created before it will be killed
timeLine=$(( now1 - allSec ))

# here we collect from the database the uuids and epochs of all old calls
uuids=($(sqlite3 /usr/local/freeswitch/db/core.db "select uuid from channels where created_epoch < $timeLine;"))
time1=($(sqlite3 /usr/local/freeswitch/db/core.db "select created_epoch from channels where created_epoch < $timeLine;"))
time2=($(sqlite3 /usr/local/freeswitch/db/core.db "select created from channels where created_epoch < $timeLine;"))

# here we loop over all old calls and uuid_kill each one
# (the "created" column contains a space, so each row splits into two array entries)
tmi=0
for ((i=0; i<${#uuids[@]}; i++))
do
    echo "the uuid #:$i is:${uuids[$i]} with epoch: ${time1[$i]} ${time2[$tmi]} ${time2[$tmi+1]}"
    rsp=$(/usr/local/freeswitch/bin/fs_cli -x "uuid_kill ${uuids[$i]}")
    echo "the response is:$rsp"
    tmi=$((tmi+2))
done
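The cutoff arithmetic the script uses can be sketched in isolation (here date +%s stands in for fs_cli -x "strepoch"):

```shell
# Compute the "kill everything created before this" epoch, as the script does.
hour=3; minute=0; seconds=0
allSec=$(( hour*60*60 + minute*60 + seconds ))
now1=$(date +%s)                 # the script gets this from fs_cli -x "strepoch"
timeLine=$(( now1 - allSec ))
echo "$allSec"                   # 10800 seconds for a 3-hour cutoff
```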
dcw67 commented 1 week ago

@andywolk Is there any progress on this bug and the related tickets #2290 and #2389? I was hoping it would be fixed in the most recent release. We have a large fleet of FS 1.10.10 servers, and we now have so many stuck channels that it impacts channel limiting behaviour and capacity monitoring. At a minimum we need to restart FS on all of them, which is a huge job, and we are considering whether to roll back to 1.10.9 to avoid this issue.