kamailio / kamailio

Kamailio - The Open Source SIP Server for large VoIP and real-time communication platforms -
https://www.kamailio.org

No CDR is written on failover #3254

Closed Earn330 closed 10 months ago

Earn330 commented 1 year ago

Hi,

I think there is an issue with writing CDRs in failover scenarios. I have two servers running Kamailio, both on Debian 11 (bullseye) with the latest Kamailio 5.6.1 from git. I'm using DMQ to sync dialogs and htable between these servers, so both servers have the same knowledge of dialog state. Both Kamailio instances use the nobind option, so I can switch a VIP from one server to the other. This is managed with keepalived. Switching the VIP from one server to the other works fine, and I can switch it while a call is running. I can see that the BYE message is handled OK after the switch. I can see that acc is triggered, but acc_cdr is missing in this case. I'm using htable to fill all the variables needed to complete the CDR, and I can see that all values are synced correctly.

The acc_cdr is created fine if there is no failover, so I think it can't be a configuration issue. Both servers hold the same Kamailio configuration except for the IP.

version: kamailio 5.6.1 (x86_64/linux) bfc5c2-dirty
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: bfc5c2 -dirty
compiled with gcc 10.2.1

root@voip-lab-proxy01:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 11 (bullseye)
Release: 11
Codename: bullseye

root@voip-lab-proxy01:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 38 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Stepping: 10
CPU MHz: 2500.088
BogoMIPS: 5000.17
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 16 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion; VMX EPT disabled
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 pdcm sse4_1 x2apic tsc_deadline_timer xsave hypervisor lahf_lm cpuid_fault pti tpr_shadow vnmi flexpriority vpid tsc_adjust arat arch_capabilities

If you need more information, please let me know. As I'm running this on a test system I can reproduce the issue at any time.

miconda commented 1 year ago

I am not using dialog CDRs, but I would suggest trying to identify (maybe by looking at the C code) what is missing when the dialog data is replicated. Are there flags or other params that have to be set in order to get the acc callbacks executed on dialog CDR generation?

You should run with debug=3 and watch the debug messages printed by kamailio and compare between the two instances, when it happens and when it doesn't.
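One way to do that comparison is sketched below in Python. It is only an illustration: it assumes the debug lines look roughly like `... DEBUG: acc [acc_cdr.c:123]: some_function(): message` (the syslog prefix differs per setup), and the helper names `call_sites`, `missing_in_failover` and `compare_files` are made up for this example. It lists the module/file/function combinations that were logged in the run without failover but never in the run with failover, which may hint at the callback that is not being executed.

```python
import re

# Assumed shape of a Kamailio debug line (the syslog prefix varies by setup):
#   ... kamailio[1234]: DEBUG: acc [acc_cdr.c:123]: some_function(): message
LINE_RE = re.compile(
    r"(?P<level>DEBUG|INFO|NOTICE|WARNING|ERROR|CRITICAL): "
    r"(?P<module>\S+) \[(?P<file>[^:\]]+):\d+\]: (?P<func>\w+)\(\)"
)


def call_sites(lines):
    """Return the set of (module, file, function) triples found in the lines."""
    sites = set()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            sites.add((match["module"], match["file"], match["func"]))
    return sites


def missing_in_failover(normal_lines, failover_lines):
    """Call sites logged in the normal run but never in the failover run."""
    return sorted(call_sites(normal_lines) - call_sites(failover_lines))


def compare_files(normal_path, failover_path):
    """Convenience wrapper (hypothetical) that reads two log files from disk."""
    with open(normal_path, errors="replace") as normal_file:
        normal_lines = normal_file.readlines()
    with open(failover_path, errors="replace") as failover_file:
        failover_lines = failover_file.readlines()
    return missing_in_failover(normal_lines, failover_lines)
```

For example, `compare_files("normal.log", "failover.log")` (placeholder file names) would return the call sites that appear only in the first log. Timestamps, PIDs and source line numbers are ignored on purpose, so the two instances can be compared even though those values differ.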

Earn330 commented 1 year ago

Hi Daniel,

I'm not very familiar with coding. As far as I know, you need to set a flag for accounting. This is happening, as I see an acc request in the failover scenario. I think that for a CDR request the dialog needs to be set, but I think this should be done by the DMQ replication. Maybe that is the part that is missing here. I generated two debug log files. The first one is a normal call without failover. At the end of the call a CDR is generated. The second one was captured on the standby node, which becomes active before I terminate the call. After terminating the call, the BYE message is handled correctly. I also see an ACC request but no ACC CDR. I hope you can find something, as I think it would be more than nice if a CDR were generated after one machine failed.

BR, Björn

debug_with_failover.txt debug_without_failover.txt

Earn330 commented 1 year ago

Hi Daniel,

I hope you might have some time to look at my debug files. If you need additional data please let me know.

BR, Björn

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 6 weeks with no activity. Remove stale label or comment or this will be closed in 2 weeks.

LookedPath commented 2 months ago

@miconda we are experiencing the same problem on Kamailio 5.6.6. Is there anything else we could provide to help debug this issue?

henningw commented 2 months ago

It's probably worth checking whether the problem is also present in one of the currently maintained Kamailio versions, e.g. 5.7.6. The 5.6 branch is now end of life, and it may also be missing some improvements or bug fixes that could no longer be applied to it because of code extensions.

LookedPath commented 2 months ago

@henningw just tried on Kamailio 5.8.2 and the issue persists.

Earn330 commented 2 months ago

I think it happens because some features are missing in the syncing of dialogs via DMQ. I worked around it by using hash tables and writing the CDRs manually, comparing the current server ID when a BYE or any other call-terminating event is received.
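The logic of such a workaround can be sketched roughly as follows. This is a Python model of the idea only, not the actual Kamailio configuration: the class and field names (`CallState`, `CdrFallback`, `owner_server_id`) are hypothetical, and a plain dict stands in for the replicated hash table. The assumption is that each call entry records which server handled the start of the call, and that a CDR is written manually only when a different server sees the end of the call.

```python
from dataclasses import dataclass


@dataclass
class CallState:
    """Per-call data kept in the shared table (hypothetical layout)."""
    call_id: str
    caller: str
    callee: str
    start_ts: float
    owner_server_id: int  # server that handled the start of the call


class CdrFallback:
    """Writes a CDR manually when a call ends on a server that did not start it."""

    def __init__(self, server_id, shared_table, cdr_sink):
        self.server_id = server_id
        self.table = shared_table  # stand-in for the replicated hash table
        self.sink = cdr_sink       # stand-in for the CDR backend (a list here)

    def on_call_start(self, call_id, caller, callee, start_ts):
        self.table[call_id] = CallState(
            call_id, caller, callee, start_ts, self.server_id
        )

    def on_call_end(self, call_id, end_ts):
        """Handle a BYE or another call-terminating event."""
        state = self.table.pop(call_id, None)
        if state is None:
            return None  # unknown call, nothing to write
        if state.owner_server_id == self.server_id:
            return None  # no failover happened, the regular CDR is expected
        cdr = {
            "call_id": state.call_id,
            "caller": state.caller,
            "callee": state.callee,
            "start_ts": state.start_ts,
            "end_ts": end_ts,
            "duration": end_ts - state.start_ts,
        }
        self.sink.append(cdr)
        return cdr
```

In this model, two `CdrFallback` objects with different `server_id` values that share the same dict play the role of the two nodes: a call started on one and ended on the other produces a manual CDR, while a call that starts and ends on the same node does not.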