gabada opened this issue 1 year ago
I've seen this on multiple servers as well. Using PCMU only seems to prevent calls from getting stuck.
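For anyone who wants to test the PCMU-only workaround from the FreeSWITCH side rather than the ITSP side, a minimal sketch using the absolute_codec_string channel variable (the gateway name and destination number here are placeholders):

/usr/local/freeswitch/bin/fs_cli -x "originate {absolute_codec_string=PCMU}sofia/gateway/example_itsp/15551230000 &echo"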
Not all debugging symbols were loaded. Please re-do the backtrace.
We are also seeing this. We cannot replicate in our test environment, but it happens in production.
It does not seem to be related to call volume; calls just randomly get "stuck", and it looks like the thread processing the call in FreeSWITCH is hanging.
Downgrading to FreeSWITCH 1.10.8 fixes the issue for us, although that is not ideal.
Hi, we have the same problem here: FreeSWITCH receives the BYE but doesn't respond to it, and the line gets stuck randomly. We are forced to uuid_kill the line.
@kaaelhaa FreeSWITCH 1.10.8 because 1.10.9 is also affected?
@andywolk I don't know if 1.10.9 is affected. We skipped 1.10.9 in our production environment so haven't tested.
@kaaelhaa was libsofia downgraded as well or just FreeSWITCH?
@andywolk Downgraded by installing the Debian packages for 1.10.8, so I assume that also means libsofia was downgraded?
@kaaelhaa What is libsofia version on the system with FS 1.10.8 right now?
I added the symbols yesterday. Waiting for some stuck calls.
@gabada If the dbgsym files are installed and you still have the core dump file, you can re-generate the backtrace.
@andywolk Here is the newly generated backtrace; it still has some ?? entries in it, not sure why.
I installed freeswitch-all-dbg and libfreeswitch1-dbg.
So far no stuck calls, but I don't have many calls going through this server right now.
backtrace.log
Also BKW asked me to run deadlock.py on it and there are no deadlocks
@gabada Because there are no libsofia dbg symbols: apt-get install libsofia-sip-ua0-dbgsym
Also, when doing a backtrace, please run all three of the following (a non-interactive sketch follows below):
thread apply all bt
info threads
thread apply all bt full
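For reference, a non-interactive way to capture all three outputs into one file, assuming the packaged binary at /usr/bin/freeswitch and a core dump path that will differ on your system:

gdb /usr/bin/freeswitch /path/to/core \
    -batch \
    -ex "set pagination off" \
    -ex "thread apply all bt" \
    -ex "info threads" \
    -ex "thread apply all bt full" > backtrace.log 2>&1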
I installed the libsofia symbols. Still seeing this after bt:
#0 0x00007fdfab91d7bc in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffe52debcb0) at ../sysdeps/unix/sysv/linux/select.c:69
#1 0x00007fdfabdb03e2 in fspr_sleep (t=<optimized out>) at time/unix/time.c:246
#2 0x00007fdfaba9f3f2 in switch_core_runtime_loop (bg=<optimized out>) at src/switch_core.c:1201
#3 0x0000558c99a7bf09 in ?? ()
#4 0x00007fdfab8461ca in __libc_start_call_main (main=main@entry=0x558c99a7b490, argc=argc@entry=7, argv=argv@entry=0x7ffe52df0a18) at ../sysdeps/nptl/libc_start_call_main.h:58
#5 0x00007fdfab846285 in __libc_start_main_impl (main=0x558c99a7b490, argc=7, argv=0x7ffe52df0a18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe52df0a08) at ../csu/libc-start.c:360
#6 0x0000558c99a7d161 in ?? ()
Oh, never mind the dbg symbols for sofia, they are there. The remaining ?? entries are at least from Lua; we don't currently need those.
Please re-generate using the gdb commands I mentioned.
Here is a new backtrace.
Now with your help let's check if there are deadlocks.
You will need to download a Python script from GitHub, configure gdb, open the dump, and execute the deadlock command.
Here are the steps:
libc6-dbg is a requirement for the script:
apt-get install libc6-dbg wget
wget -O /deadlock.py https://raw.githubusercontent.com/facebook/folly/master/folly/experimental/gdb/deadlock.py
echo source -v /deadlock.py > /root/.gdbinit
(the /root/.gdbinit path assumes user root and may be different in your case)
Then open gdb and do
deadlock
And see if it finds anything.
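For reference, opening the dump and running the check looks roughly like this (the binary and core file paths are examples; .gdbinit sources deadlock.py automatically when gdb starts):

gdb /usr/bin/freeswitch /path/to/core
(gdb) deadlock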
No deadlock detected. Do you have debug symbols installed?
@kaaelhaa What is libsofia version on the system with FS 1.10.8 right now?
@andywolk this is the Sofia version on 1.10.8:
$ dpkg -l | grep sofia
ii freeswitch-mod-sofia 1.10.8~release~20~3510866140~buster-1~buster+1 amd64 mod_sofia for FreeSWITCH
ii libsofia-sip-ua0 1.13.16-125~dfc7095f4c~buster amd64 Sofia-SIP library runtime
And 1.10.10 for comparison:
$ dpkg -l | grep sofia
ii freeswitch-mod-sofia 1.10.10~release~24~4cb05e7f4a~buster-1~buster+1 amd64 mod_sofia for FreeSWITCH
ii libsofia-sip-ua0 1.13.16-125~dfc7095f4c~buster amd64 Sofia-SIP library runtime
So the same version of libsofia across the two systems.
apt-cache policy libsofia-sip-ua0
@andywolk I currently have a stuck channel. I did gcore $(pidof freeswitch).
I just ran the deadlock script on it but there is no deadlock.
I am going to make a backtrace now and upload it shortly.
@andywolk Here is the new backtrace. The instance is still up with the stuck call so if you need anything else or even want to do a screenshare session please let me know. backtrace.log
apt-cache policy libsofia-sip-ua0
On the 1.10.8 host:
$ apt-cache policy libsofia-sip-ua0
libsofia-sip-ua0:
Installed: 1.13.16-125~dfc7095f4c~buster
Candidate: 1.13.16-125~dfc7095f4c~buster
Version table:
*** 1.13.16-125~dfc7095f4c~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
100 /var/lib/dpkg/status
1.13.16-124~dfc7095f4c~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
1.13.15-123~2366f9cf40~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
1.13.15-122~a3b83039d4~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
1.12.11+20110422.1-2.1+deb10u4 500
500 http://security.debian.org/debian-security buster/updates/main amd64 Packages
1.12.11+20110422.1-2.1+b1 500
500 http://cdn-aws.deb.debian.org/debian buster/main amd64 Packages
And 1.10.10 for comparison:
$ apt-cache policy libsofia-sip-ua0
libsofia-sip-ua0:
Installed: 1.13.16-125~dfc7095f4c~buster
Candidate: 1.13.16-125~dfc7095f4c~buster
Version table:
*** 1.13.16-125~dfc7095f4c~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
100 /var/lib/dpkg/status
1.13.16-124~dfc7095f4c~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
1.13.15-123~2366f9cf40~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
1.13.15-122~a3b83039d4~buster 500
500 https://freeswitch.signalwire.com/repo/deb/debian-release buster/main amd64 Packages
1.12.11+20110422.1-2.1+deb10u4 500
500 http://security.debian.org/debian-security buster/updates/main amd64 Packages
1.12.11+20110422.1-2.1+b1 500
500 http://cdn-aws.deb.debian.org/debian buster/main amd64 Packages
Thank you. We see a pattern in both backtraces. Will analyze further.
Hi @andywolk, is there any progress on the investigation?
We are experiencing the same issue after FreeSWITCH 1.10.10 upgrade.
@andywolk if it helps, I think I might be experiencing the same issue, and it seems that for the relevant calls, this log is printed in the logs, but not this one.
Meaning that the session lock can never be acquired or released.
We're currently investigating the issue, will post updates in case any further details come up.
We are working on a solution. There is another issue filed where we can see similar things https://github.com/signalwire/freeswitch/issues/2290
For those who need the latest FreeSWITCH (like us, because of OpenSSL 3.0.x support), reverting the commit cited in #2290 temporarily fixes the issue (we have it in production with the revert and everything is running fine).
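For anyone building from source, the workflow is roughly the following sketch (the version tag is an example and the commit hash is a placeholder; see #2290 for the actual commit):

git clone https://github.com/signalwire/freeswitch.git
cd freeswitch
git checkout v1.10.10
git revert <commit-hash-from-#2290>
# then rebuild as usual, for example:
./bootstrap.sh -j && ./configure && make && make install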
We are encountering this problem too, more and more frequently. Besides reverting the commit mentioned in #2290, is there any news on fixing this issue?
Was this resolved in FreeSWITCH 1.10.11? Thanks!
No, 1.10.11 still has this bug.
Thanks for letting me know.
We also have this problem on 1.10.11; it is not an issue on 1.10.9.
I also have this issue. I created a script that does uuid_kill on old channels; here is the code (a usage note follows after the script):
#!/bin/bash
now1=$(/usr/local/freeswitch/bin/fs_cli -x "strepoch")
# here you set how old calls must be before they are killed
hour=3
minute=0
seconds=0
# here we calculate the total number of seconds from hours+minutes+seconds
allSec=$(($hour*60*60))
allSec=$(($allSec+$minute*60))
allSec=$(($allSec+$seconds))
# here we set the epoch; everything created before it will be killed
timeLine=$(($now1-$allSec))
# here we collect the uuids and epochs of all old calls from the database
uuids=($(sqlite3 /usr/local/freeswitch/db/core.db "select uuid from channels where created_epoch < $timeLine;"))
time1=($(sqlite3 /usr/local/freeswitch/db/core.db "select created_epoch from channels where created_epoch < $timeLine;"))
# the 'created' column contains a space, so each row fills two array slots; hence the tmi index below
time2=($(sqlite3 /usr/local/freeswitch/db/core.db "select created from channels where created_epoch < $timeLine;"))
# here we loop over all old calls and uuid_kill each of them
tmi=0
for ((i=0;i<${#uuids[@]};i++))
do
    echo "the uuid #:$i is:${uuids[$i]} with epoch: ${time1[$i]} ${time2[$tmi]} ${time2[$tmi+1]}"
    rsp=$(/usr/local/freeswitch/bin/fs_cli -x "uuid_kill ${uuids[$i]}")
    echo "the response is:$rsp"
    tmi=$((tmi+2))
done
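If you go this route, the script can be scheduled with cron; a sketch, assuming it is saved as /usr/local/bin/kill_stuck_channels.sh (the path is an example):

*/15 * * * * /usr/local/bin/kill_stuck_channels.sh >> /var/log/kill_stuck_channels.log 2>&1

Note the caveat from the issue description below: uuid_kill reportedly returns OK on these stuck channels without actually removing them, so this may not work in every case.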
@andywolk Is there any progress on this bug and related tickets #2290 and #2389? I was hoping it would be fixed in the most recent release. We have a large fleet of FS 1.10.10 servers, and we have so many stuck channels now that it impacts channel limiting behaviour and capacity monitoring. At a minimum we need to restart FS on all of them, which is a huge job, and we are considering whether to roll back to 1.10.9 to avoid this issue.
Describe the bug
FreeSWITCH channels are getting stuck. When you restart or fsctl crash FreeSWITCH, the CDR that's written says INCOMPATIBLE_DESTINATION. When I set my ITSP to only offer PCMU, everything works as expected.
When you do a uuid_kill on the channel it says OK but doesn't actually kill it (remove it from the database). You need to restart FreeSWITCH to get it removed.
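A quick way to confirm this behaviour from the command line (the UUID is an example value):

fs_cli -x "uuid_kill 01234567-89ab-cdef-0123-456789abcdef"
# reports +OK, but then:
fs_cli -x "show channels"
# the channel is still listed until FreeSWITCH is restarted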
To Reproduce
Steps to reproduce the behavior:
Expected behavior
FreeSWITCH should handle the transcoding, and the calls should release normally when they are completed.
Package version or git hash
Trace logs
freeswitch.log is from a call that is still stuck. freeswitch.log
backtrace from core file
backtrace.log