oracle / oracle-linux

Scripts, examples, and tutorials to get started with Oracle Linux
Universal Permissive License v1.0

Crash in glibc when jemalloc library is preloaded (latest Oracle Linux 7 only) #90

Closed: antonsmyk closed this issue 1 year ago

antonsmyk commented 1 year ago

We have noticed a problem with the new glibc package version 2.17-326.0.5.el7_9 recently published in the Oracle Linux Yum repository (the package timestamp on the https://yum.oracle.com/repo/OracleLinux/OL7/latest/x86_64/index.html page is June 13, 2023).

When the jemalloc.so library is preloaded, the process crashes in glibc's get_nprocs function. This is reproducible with jemalloc version 3.6.0-1.el7 from the ol7_developer_EPEL repository, as well as with the custom build of jemalloc version 4.2.0 we use in our environment.

This is easy to reproduce with a basic Oracle Linux 7 container image.

First, prepare a recent Oracle Linux 7 container image with the latest updates and jemalloc installed:

$ cat Dockerfile
FROM container-registry.oracle.com/os/oraclelinux:7
RUN yum -y clean all && yum -y upgrade
RUN yum -y install oracle-epel-release-el7 && yum -y install jemalloc

Then run a shell in the container and execute any command with the jemalloc library preloaded; it crashes immediately:

# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /usr/bin/true
Segmentation fault (core dumped)

Stack trace of the crash:

(gdb) bt
#0  0x00007f954ed6e0a4 in get_nprocs () from /lib64/libc.so.6
#1  0x00007f954ed38d1c in sysconf () from /lib64/libc.so.6
#2  0x00007f954f043970 in malloc_init_hard () from /usr/lib64/libjemalloc.so.1
#3  0x00007f954f0450bd in malloc () from /usr/lib64/libjemalloc.so.1
#4  0x00007f954ecfdb8a in strdup () from /lib64/libc.so.6
#5  0x00007f954ea5bec1 in __nptl_tunables_init () from /lib64/libpthread.so.0
#6  0x00007f954ea5bd77 in __pthread_initialize_minimal_internal () from /lib64/libpthread.so.0
#7  0x00007f954ea5a4f1 in _init () from /lib64/libpthread.so.0
#8  0x0000000000000000 in ?? ()

When the glibc package is downgraded to the latest known working version, 2.17-326.0.3.el7_9, the crash is no longer reproducible.

We believe this is related to the recent update that introduced a call to __nptl_tunables_init in the nptl/nptl-init.c source file. The change is mentioned in the glibc.spec file from the source RPM provided by Oracle at https://oss.oracle.com/ol7/SRPMS-updates/ :

* Fri Apr 21 2023 Cupertino Miranda <cupertino.miranda@oracle.com> - 2.17-326.0.5
- OraBug 35318841 Glibc tunable to disable huge pages on pthread_create stacks
  Reviewed-by: Jose E. Marchesi <jose.marchesi@oracle.com>

There is no such problem on Red Hat Enterprise Linux 7 or CentOS 7. Oracle Linux 8 does not seem to be affected either.

We believe that preloading jemalloc should not crash the glibc library, and we do not want to stop using jemalloc.

It also seems likely that other Oracle Linux 7 users will hit the same problem as soon as they apply the latest updates.

totalamateurhour commented 1 year ago

Thanks for reporting. Investigating...

totalamateurhour commented 1 year ago

Bug filed internally.

Dirimsa commented 1 year ago

This bug also seems to affect Splunk Universal Forwarder installations, since Splunk internally uses the same mechanism (it bundles its own libjemalloc.so.2 under /opt/splunkforwarder/lib). It also prevents Splunk UF from starting. A partial stack trace from Splunk UF version 9.0.5 is attached below, together with the OEL kernel and glibc versions.

glibc Version

glibc-2.17-326.0.5.el7_9.x86_64
glibc-common-2.17-326.0.5.el7_9.x86_64

Kernel Version

kernel-3.10.0-1160.92.1.0.1.el7.x86_64

Stack trace

[pid  9461] open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
 > /usr/lib64/libc-2.17.so(open+0x10) [0xef900]
 > /usr/lib64/libc-2.17.so(get_nprocs+0xbb) [0xfd01b]
 > /usr/lib64/libc-2.17.so(__sysconf+0x53b) [0xc7d1b]
 > /opt/splunkforwarder/lib/libjemalloc.so.2() [0x6299]
 > /opt/splunkforwarder/lib/libjemalloc.so.2(malloc+0x4b4) [0x8f64]
 > /usr/lib64/libc-2.17.so(__strdup+0x19) [0x8cb89]
 > /usr/lib64/libpthread-2.17.so(__nptl_tunables_init+0x40) [0x6ec0]
 > /usr/lib64/libpthread-2.17.so(__pthread_initialize_minimal+0x316) [0x6d76]
 > /usr/lib64/libpthread-2.17.so(_init+0x8) [0x54f0]
 > No DWARF information found
[pid  9461] read(3, "0-3\n", 8192)      = 4
 > /usr/lib64/libc-2.17.so(__read+0x10) [0xefb40]
 > /usr/lib64/libc-2.17.so(next_line+0xaa) [0xfcd9a]
 > /usr/lib64/libc-2.17.so(get_nprocs+0xe5) [0xfd045]
 > /usr/lib64/libc-2.17.so(__sysconf+0x53b) [0xc7d1b]
 > /opt/splunkforwarder/lib/libjemalloc.so.2() [0x6299]
 > /opt/splunkforwarder/lib/libjemalloc.so.2(malloc+0x4b4) [0x8f64]
 > /usr/lib64/libc-2.17.so(__strdup+0x19) [0x8cb89]
 > /usr/lib64/libpthread-2.17.so(__nptl_tunables_init+0x40) [0x6ec0]
 > /usr/lib64/libpthread-2.17.so(__pthread_initialize_minimal+0x316) [0x6d76]
 > /usr/lib64/libpthread-2.17.so(_init+0x8) [0x54f0]
 > No DWARF information found
[pid  9461] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x15} ---
[pid  9461] +++ killed by SIGSEGV +++
<... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0, NULL) = 9461

LD_PRELOAD

-bash-4.2$ LD_PRELOAD=/opt/splunkforwarder/lib/libjemalloc.so /usr/bin/true
Segmentation fault

This seems to be related to what antonsmyk reported: the stack trace shows the same function calls, and the crash also occurs when preloading the splunkforwarder libjemalloc.so library.

totalamateurhour commented 1 year ago

The fix is going through the release process. Shouldn't be long now.

totalamateurhour commented 1 year ago

https://yum.oracle.com/repo/OracleLinux/OL7/latest/x86_64/getPackage/glibc-2.17-326.0.7.el7_9.x86_64.rpm

https://linux.oracle.com/errata/ELBA-2023-12516.html

antonsmyk commented 1 year ago

Hi again, @totalamateurhour! Thanks for promptly reviewing my report!

I noticed today that the newer glibc-2.17-326.0.7.el7_9 is available in the public Oracle Yum repository, and it appears to fix the reported problem with preloading jemalloc in this simple case:

# LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /usr/bin/true

However, I see a problem in the source code: at line 91 there is a free() call on a pointer allocated on the stack with strdupa(): ol7-glibc-__nptl_tunables_init

Indeed, this blows up when I have jemalloc preloaded and GLIBC_TUNABLES set:

# GLIBC_TUNABLES=glibc.malloc.trim_threshold=128:glibc.malloc.check=3 LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /usr/bin/true
Segmentation fault (core dumped)

and the stack trace:

(gdb) bt
#0  0x00007f1546362c6c in je_extent_tree_ad_search () from /usr/lib64/libjemalloc.so.1
#1  0x00007f1546363f28 in je_huge_salloc () from /usr/lib64/libjemalloc.so.1
#2  0x00007f154634a5f5 in free () from /usr/lib64/libjemalloc.so.1
#3  0x00007f1545d62f6d in __nptl_tunables_init () from /lib64/libpthread.so.0
#4  0x00007f1545d62d27 in __pthread_initialize_minimal_internal () from /lib64/libpthread.so.0
#5  0x00007f1545d614b9 in _init () from /lib64/libpthread.so.0
#6  0x0000000000000000 in ?? ()

Modern versions of GCC warn about this free() call, and ASan also catches the misuse:

# cat hello.c
// set _GNU_SOURCE for strdupa() availability
#define _GNU_SOURCE
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void hello(const char *str)
{
    char *p = strdupa(str);
    puts(p);
    free(p);
}

int main(void)
{
    hello("Hello\n");
}

# /opt/rh/devtoolset-12/root/usr/bin/gcc -fsanitize=address hello.c -o hello
hello.c: In function 'hello':
hello.c:12:5: warning: 'free' called on pointer to an unallocated object [-Wfree-nonheap-object]
   12 |     free(p);
      |     ^~~~~~~
cc1: note: returned from '__builtin_alloca_with_align'

# ./hello
Hello

=================================================================
==135==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x7ffc0ddae5e0 in thread T0
    #0 0x7fc7cc5c40a0 in __interceptor_free.part.0 (/lib64/libasan.so.8+0xbe0a0)
    #1 0x401285 in hello (/hello+0x401285)
    #2 0x4012a6 in main (/hello+0x4012a6)
    #3 0x7fc7cc15a554 in __libc_start_main (/lib64/libc.so.6+0x22554)
    #4 0x401118  (/hello+0x401118)

Address 0x7ffc0ddae5e0 is located in stack of thread T0
SUMMARY: AddressSanitizer: bad-free (/lib64/libasan.so.8+0xbe0a0) in __interceptor_free.part.0
==135==ABORTING

totalamateurhour commented 1 year ago

Thanks for the follow-up, and apologies. This will be addressed shortly.

totalamateurhour commented 1 year ago

https://linux.oracle.com/errata/ELBA-2023-12526.html

https://yum.oracle.com/repo/OracleLinux/OL7/latest/x86_64/getPackage/glibc-2.17-326.0.9.el7_9.x86_64.rpm

antonsmyk commented 1 year ago

Thanks, @totalamateurhour! The new version seems fine. Closing the case.