sipwise / rtpengine

The Sipwise media proxy for Kamailio
GNU General Public License v3.0
763 stars 360 forks source link

kzalloc() return 0. #1824

Open YuriyFateev opened 2 months ago

YuriyFateev commented 2 months ago

rtpengine version the issue has been seen with

11.5.1.3+0~mr11.5.1.3 git-HEAD-04b31c0

Used distribution and its version

Debian 11

Linux kernel version used

5.10.0-28-cloud-amd64

CPU architecture issue was seen on (see uname -m)

x86_64

Expected behaviour you didn't see

No response

Unexpected behaviour you saw

No response

Steps to reproduce the problem

No response

Additional program output to the terminal or logs illustrating the issue

Hello

We are using RTPEngine with enabled in kernel forwarding for RTP packages.
From time to time we faced a lot of messages in the syslog like:

Apr 29 12:47:32 221100001373 kernel: [207802.726149] kernel stats: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Apr 29 12:47:32 221100001373 kernel: [207802.727253] CPU: 1 PID: 953942 Comm: kernel stats Tainted: G           OE     5.10.0-28-cloud-arm64 #1 Debian 5.10.209-2
Apr 29 12:47:32 221100001373 kernel: [207802.727254] Hardware name: Amazon EC2 t4g.nano/, BIOS 1.0 11/1/2018
Apr 29 12:47:32 221100001373 kernel: [207802.727255] Call trace:
Apr 29 12:47:32 221100001373 kernel: [207802.727260]  dump_backtrace+0x0/0x200
Apr 29 12:47:32 221100001373 kernel: [207802.727261]  show_stack+0x20/0x30
Apr 29 12:47:32 221100001373 kernel: [207802.727264]  dump_stack+0xe8/0x124
Apr 29 12:47:32 221100001373 kernel: [207802.727267]  warn_alloc+0x120/0x194
Apr 29 12:47:32 221100001373 kernel: [207802.727268]  __alloc_pages_nodemask+0xe68/0xeac
Apr 29 12:47:32 221100001373 kernel: [207802.727270]  alloc_pages_current+0x90/0x150
Apr 29 12:47:32 221100001373 kernel: [207802.727272]  kmalloc_order+0x38/0x130
Apr 29 12:47:32 221100001373 kernel: [207802.727273]  kmalloc_order_trace+0x34/0x120
Apr 29 12:47:32 221100001373 kernel: [207802.727278]  proc_blist_read+0xbc/0x4a0 [xt_RTPENGINE]
Apr 29 12:47:32 221100001373 kernel: [207802.727280]  proc_reg_read+0xb0/0x104
Apr 29 12:47:32 221100001373 kernel: [207802.727282]  vfs_read+0xb4/0x1e0
Apr 29 12:47:32 221100001373 kernel: [207802.727283]  ksys_read+0x70/0x100
Apr 29 12:47:32 221100001373 kernel: [207802.727284]  __arm64_sys_read+0x24/0x30
Apr 29 12:47:32 221100001373 kernel: [207802.727286]  el0_svc_common.constprop.0+0x80/0x1d0
Apr 29 12:47:32 221100001373 kernel: [207802.727287]  do_el0_svc+0x2c/0xa4
Apr 29 12:47:32 221100001373 kernel: [207802.727289]  el0_svc+0x20/0x30
Apr 29 12:47:32 221100001373 kernel: [207802.727291]  el0_sync_handler+0xb0/0xb4
Apr 29 12:47:32 221100001373 kernel: [207802.727292]  el0_sync+0x180/0x1c0
Apr 29 12:47:32 221100001373 kernel: [207802.727293] Mem-Info:
Apr 29 12:47:32 221100001373 kernel: [207802.727297] active_anon:17840 inactive_anon:37151 isolated_anon:0
Apr 29 12:47:32 221100001373 kernel: [207802.727297]  active_file:17914 inactive_file:10529 isolated_file:0
Apr 29 12:47:32 221100001373 kernel: [207802.727297]  unevictable:0 dirty:335 writeback:0
Apr 29 12:47:32 221100001373 kernel: [207802.727297]  slab_reclaimable:4725 slab_unreclaimable:8080
Apr 29 12:47:32 221100001373 kernel: [207802.727297]  mapped:20295 shmem:5144 pagetables:2758 bounce:0
Apr 29 12:47:32 221100001373 kernel: [207802.727297]  free:3415 free_pcp:86 free_cma:489
Apr 29 12:47:32 221100001373 kernel: [207802.727300] Node 0 active_anon:71360kB inactive_anon:148604kB active_file:71656kB inactive_file:42116kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:81180kB dirty:1340kB writeback:0kB shmem:20576kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:5840kB all_unreclaimable? no
Apr 29 12:47:32 221100001373 kernel: [207802.727300] Node 0 DMA free:13660kB min:10568kB low:11672kB high:12776kB reserved_highatomic:2048KB active_anon:71360kB inactive_anon:148656kB active_file:71784kB inactive_file:42584kB unevictable:0kB writepending:1340kB present:485376kB managed:442652kB mlocked:0kB pagetables:11032kB bounce:0kB free_pcp:344kB local_pcp:0kB free_cma:1956kB
Apr 29 12:47:32 221100001373 kernel: [207802.727304] lowmem_reserve[]: 0 0 0 0
Apr 29 12:47:32 221100001373 kernel: [207802.727306] Node 0 DMA: 14894kB (UMEHC) 5068kB (UMEHC) 13916kB (UMEH) 5432kB (UMEH) 464kB (MH) 0128kB 0256kB 0512kB 01024kB 02048kB 0*4096kB = 14212kB
Apr 29 12:47:32 221100001373 kernel: [207802.727316] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 29 12:47:32 221100001373 kernel: [207802.727317] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
Apr 29 12:47:32 221100001373 kernel: [207802.727318] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 29 12:47:32 221100001373 kernel: [207802.727319] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
Apr 29 12:47:32 221100001373 kernel: [207802.727319] 36331 total pagecache pages
Apr 29 12:47:32 221100001373 kernel: [207802.727321] 2955 pages in swap cache
Apr 29 12:47:32 221100001373 kernel: [207802.727322] Swap cache stats: add 187767, delete 184809, find 1418270/1456515
Apr 29 12:47:32 221100001373 kernel: [207802.727322] Free swap  = 285436kB
Apr 29 12:47:32 221100001373 kernel: [207802.727323] Total swap = 524284kB
Apr 29 12:47:32 221100001373 kernel: [207802.727324] 121344 pages RAM
Apr 29 12:47:32 221100001373 kernel: [207802.727324] 0 pages HighMem/MovableOnly
Apr 29 12:47:32 221100001373 kernel: [207802.727325] 10681 pages reserved
Apr 29 12:47:32 221100001373 kernel: [207802.727326] 16384 pages cma reserved
Apr 29 12:47:32 221100001373 kernel: [207802.727326] 0 pages hwpoisoned

It happens then RTPENGINE throws an error in kernel mode - error allocate memory.

This happens in the proc_blist_read function, which outputs information about RTP statistics to user space.
After this error, RTP statistics are not generated.

It looks like we don't push to the user space the statistic about kernel mode and it could leads to the issue with call drops by RTP timeout and wrong MOS.

Anything else?

No response

dfxdj commented 2 months ago

Looks like you're just out of memory. There was an skb leak fixed in 11.5.1.19. You should update.

YuriyFateev commented 1 month ago

Hello. After this fix https://github.com/sipwise/rtpengine/commit/6cc7ceb1e24eb1b4be8994b51471b92443e19a16 The problem remains, but appears much less frequently.

We also tried this

opp = kzalloc(sizeof(*opp), GFP_KERNEL);
    if (!opp) {
    set_current_state(TASK_UNINTERRUPTIBLE);
    schedule_timeout(msecs_to_jiffies(5));
    opp = kzalloc(sizeof(*opp), GFP_KERNEL);
        if (!opp) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule_timeout(msecs_to_jiffies(5));
        opp = kzalloc(sizeof(*opp), GFP_KERNEL);
            if (!opp) {
            printk(KERN_ERR "xt_RTPENGINE [%s:%i] Error allocate memory !!!!!", __FUNCTION__, __LINE__);
            goto err;
        }
    }
}

the problem did not recur. Without knowing the logic of the work, we are not entirely sure that this will be correct. Can you confirm or deny this? Thank you.

rfuchs commented 1 month ago

Hello. After this fix 6cc7ceb The problem remains, but appears much less frequently.

We also tried this

opp = kzalloc(sizeof(*opp), GFP_KERNEL);
    if (!opp) {
  set_current_state(TASK_UNINTERRUPTIBLE);
  schedule_timeout(msecs_to_jiffies(5));
  opp = kzalloc(sizeof(*opp), GFP_KERNEL);
      if (!opp) {
      set_current_state(TASK_UNINTERRUPTIBLE);
      schedule_timeout(msecs_to_jiffies(5));
      opp = kzalloc(sizeof(*opp), GFP_KERNEL);
              if (!opp) {
          printk(KERN_ERR "xt_RTPENGINE [%s:%i] Error allocate memory !!!!!", __FUNCTION__, __LINE__);
          goto err;
      }
  }
}

the problem did not recur. Without knowing the logic of the work, we are not entirely sure that this will be correct. Can you confirm or deny this? Thank you.

I would rather offload the retry logic to user space. Return -ENOMEM when allocation fails and then handle it in the daemon.

YuriyFateev commented 1 month ago

Hello. Transferred processing to the daemon.


#define  REQUEST_ATTEMPTS   1   

GList *kernel_list() {
    char s[64];
    int fd;
    struct rtpengine_list_entry *buf;
    GList *li = NULL;
    ssize_t ret;

    if (!kernel.is_open)
        return NULL;
    sprintf(s, PREFIX "/%u/blist", kernel.table);
    fd = open(s, O_RDONLY);
    if (fd == -1)
        return NULL;
    for (;;) {
        buf = g_slice_alloc(sizeof(*buf));
        ret = read(fd, buf, sizeof(*buf));
        int count = 0;
        while ( ret == -1 && errno == ENOMEM ) {    
            if ( count >= REQUEST_ATTEMPTS ) {
                ilog(LOG_ERROR, "RTPENGINE daemon [%s:%i] Error allocate memory in kernel mode!!! Request attempts: %d;", __FUNCTION__, __LINE__, count);
                break;
            }
            count++;
            usleep(5000);
            ret = read(fd, buf, sizeof(*buf));
        }
        if (ret != sizeof(*buf))
            break;
        li = g_list_prepend(li, buf);
    }
    g_slice_free1(sizeof(*buf), buf);
    close(fd);
    return li;
}

The value (REQUEST_ATTEMPTS) will be determined based on the results of load tests.
Can you confirm or deny this? 
Thank you.
rfuchs commented 1 month ago

Yes, this looks good to me. I would even be ok with completely aborting the function on failure and then just trying again on the next loop iteration of the main timer. But chances of success can vary, so feel free to experiment and determine the best approach. And then feel free to open a pull request if you want.

YuriyFateev commented 1 month ago

@rfuchs Thank you very much.