libnxz / power-gzip

POWER NX zlib compliant library
23 stars 18 forks source link

zlib may erroneously call libnxz functions #187

Open mscastanho opened 2 years ago

mscastanho commented 2 years ago

Recently I hit the following scenario: libnxz's deflateInit2_ calls zlib's init function sw_deflateInit2 (which maps to deflateInit2_ from libz.so), which internally calls deflateReset. Since libnxz has been loaded first, this maps to libnxz's deflateReset, instead of zlib's.

$ cat ./bug.c 
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main()
{
        Byte src[128], dst[128];
        z_stream strm;

        strm.zalloc = Z_NULL;
        strm.zfree = Z_NULL;
        strm.opaque = (voidpf)0;
        assert(deflateInit(&strm, Z_DEFAULT_COMPRESSION) == Z_OK);

        return 0;
}

$ gcc -Og -g -o bug bug.c -lz
$ LD_PRELOAD=./lib/.libs/libnxz.so gdb ./bug
> break deflateReset
> run

Call stack:
[0] from 0x00007ffff7eee3ac in deflateReset+8 at nx_deflate.c:2312   <-------- called libnxz (not good!)
[1] from 0x00007ffff7e69c58 in deflateInit2_+632 at deflate.c:362    <-------- was on zlib
[2] from 0x00007ffff7f05fd8 in sw_deflateInit2_+48 at sw_zlib.c:102
[3] from 0x00007ffff7eedfd8 in deflateInit2_+276 at nx_deflate.c:2269
[4] from 0x00007ffff7eee384 in deflateInit_+48 at nx_deflate.c:2251
[5] from 0x00000000100102a8 in main+60 at bug.c:14

I tried passing RTLD_DEEPBIND to dlopen to force the dlopen'd zlib to only call it's own symbols:

diff --git a/lib/sw_zlib.c b/lib/sw_zlib.c
index e97b477..09c0173 100644
--- a/lib/sw_zlib.c
+++ b/lib/sw_zlib.c
@@ -285,7 +285,7 @@ int sw_zlib_init(void)
 {
        char *error;

-       sw_handler = dlopen(ZLIB_PATH, RTLD_LAZY);
+       sw_handler = dlopen(ZLIB_PATH, RTLD_LAZY | RTLD_DEEPBIND);
        if(sw_handler == NULL) {
                prt_err(" %s\n", dlerror());
                return Z_ERRNO;

But the result was the same. There's some loader magic going on that I couldn't figure out yet. Turns out that when zlib's deflateInit2_ calls deflateReset through the PLT, the symbol is already resolved to libnxz's deflateReset.

Now if I link the program against libnxz instead, things work as desired:

$ gcc -o bug -g -Og ./bug.c -L./lib/.libs  -lnxz
$ LD_PRELOAD=./lib/.libs/libnxz.so gdb ./bug
> break deflateReset
> run

Call stack
[0] from 0x00007ffff7bb7908 in deflateReset+8 at deflate.c:526    <------------ called zlib's deflateReset. ok!!
[1] from 0x00007ffff7bb9c58 in deflateInit2_+632 at deflate.c:362
[2] from 0x00007ffff7f05fd8 in sw_deflateInit2_+48 at sw_zlib.c:102
[3] from 0x00007ffff7eedfd8 in deflateInit2_+276 at nx_deflate.c:2269
[4] from 0x00007ffff7eee384 in deflateInit_+48 at nx_deflate.c:2251
[5] from 0x00000000100102a8 in main+60 at ./bug.c:14

This second example becomes just like the first one if I remove RTLD_DEEPBIND flag from dlopen.

I suspect that in the first case RTLD_DEEPBIND doesn't have effect because libz.so has already been loaded when we call dlopen, because it was in the program's dependency list.

$ gcc -Og -g -o bug bug.c -lz
$ LD_PRELOAD=./lib/.libs/libnxz.so  LD_DEBUG=libs ./bug
   1679000:     find library=libz.so.1 [0]; searching
   1679000:      search cache=/etc/ld.so.cache
   1679000:       trying file=/lib64/libz.so.1
   1679000:
   1679000:     find library=libc.so.6 [0]; searching
   1679000:      search cache=/etc/ld.so.cache
   1679000:       trying file=/lib64/libc.so.6
   1679000:
   1679000:
   1679000:     calling init: /lib64/ld64.so.2
   1679000:
   1679000:
   1679000:     calling init: /lib64/libc.so.6
   1679000:
   1679000:
   1679000:     calling init: /lib64/libz.so.1       <------- libz.so loaded before libnxz.so
   1679000:
   1679000:
   1679000:     calling init: ./lib/.libs/libnxz.so
   1679000:
   1679000:     find library=libz.so [0]; searching  <------- triggered by dlopen from sw_zlib.c:sw_zlib_init
   1679000:      search cache=/etc/ld.so.cache
   1679000:       trying file=/lib64/libz.so
   1679000:                                          <------- libz.so not loaded again
   1679000:
   1679000:     initialize program: ./bug
   1679000:
   1679000:
   1679000:     transferring control: ./bug
   1679000:
   1679000:
   1679000:     calling fini: ./bug [0]
   1679000:
   1679000:
   1679000:     calling fini: ./lib/.libs/libnxz.so [0]
   1679000:
   1679000:
   1679000:     calling fini: /lib64/libz.so.1 [0]
   1679000:

When program is linked against libnxz instead:

$ gcc -o bug -g -Og ./bug.c -L./lib/.libs  -lnxz
$ LD_PRELOAD=./lib/.libs/libnxz.so  LD_DEBUG=libs ./bug
   1681332:     find library=libc.so.6 [0]; searching
   1681332:      search cache=/etc/ld.so.cache
   1681332:       trying file=/lib64/libc.so.6
   1681332:
   1681332:
   1681332:     calling init: /lib64/ld64.so.2
   1681332:
   1681332:
   1681332:     calling init: /lib64/libc.so.6
   1681332:
   1681332:                                          <------- libz.so not loaded because not in app's dependency list
   1681332:     calling init: ./lib/.libs/libnxz.so
   1681332:
   1681332:     find library=libz.so [0]; searching  <------- triggered by dlopen from sw_zlib.c:sw_zlib_init 
   1681332:      search cache=/etc/ld.so.cache
   1681332:       trying file=/lib64/libz.so
   1681332:
   1681332:
   1681332:     calling init: /lib64/libz.so         <------- libz.so loaded
   1681332:
   1681332:
   1681332:     initialize program: ./bug
   1681332:
   1681332:
   1681332:     transferring control: ./bug
   1681332:
   1681332:
   1681332:     calling fini: ./bug [0]
   1681332:
   1681332:
   1681332:     calling fini: ./lib/.libs/libnxz.so [0]
   1681332:
   1681332:
   1681332:     calling fini: /lib64/libz.so [0]
   1681332:

The current behavior is not desirable because when we call into zlib with sw_* we truly want pure zlib functionality. Having zlib jump back to libnxz while executing one of those calls may lead to unexpected behavior, and most likely hard-to-tackle bugs. I actually hit this while improving deflateReset to always reset both sw and nx state.

mscastanho commented 2 years ago

I may have found a solution: using dlmopen to load zlib in a new linker namespace. This new namespace won't have any references to libnxz, so any calls from zlib will always resolve to itself.

diff --git a/lib/sw_zlib.c b/lib/sw_zlib.c
index e97b477..09790f9 100644
--- a/lib/sw_zlib.c
+++ b/lib/sw_zlib.c
@@ -37,6 +37,7 @@
  */

+#define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -285,7 +286,7 @@ int sw_zlib_init(void)
 {
        char *error;

-       sw_handler = dlopen(ZLIB_PATH, RTLD_LAZY);
+       sw_handler = dlmopen(LM_ID_NEWLM, ZLIB_PATH, RTLD_LAZY);
        if(sw_handler == NULL) {
                prt_err(" %s\n", dlerror());
                return Z_ERRNO;

The downsides I identified so far:

mscastanho commented 2 years ago

I included the fix above in #188 but some tests started to fail on older glibcs. Please see more info here: https://github.com/libnxz/power-gzip/pull/188#issuecomment-1194409345.

I did some more digging to try to understand what exactly was happening on the older glibcs. The commit that "fixed" the issue doesn't seem very much related, so I suspect there may be still a bug on glibc somewhere, specially because __get_nprocs is called in other situations and doesn't fail. It only fails when called indirectly by zlib's deflateInit2_ in the new link namespace.

I compared the call to __get_nprocs before and after the commit that uses dlmopen(LM_ID_NEWLM,.. to load zlib. The execution is pretty much the same. Except for 1 thing. This is the instruction that segfaults:

            while (l < re && isspace (*l))
  1301c4:       2c 00 40 42     bdz     1301f0 <get_nprocs+0x170>
  1301c8:       00 00 3e 89     lbz     r9,0(r30)
  1301cc:       a4 0f 29 79     rldicr  r9,r9,1,62
  1301d0:       2e 4a 2a 7d     lhzx    r9,r10,r9     <-------- at this point r10 is 0, r9 is 20, so try load from address 20 => segfault.

The value of r10 is read from somewhere in TLS (?):

[...]
  130158:       70 8c e2 ea     ld      r23,-29584(r2)   <--- reading something from the TOC?
  13015c:       14 6a f7 7e     add     r23,r23,r13       <--- r23 has value thread pointer + ???
[...]
  1301b4:       00 00 57 e9     ld      r10,0(r23)

This last load is OK without dlmopen, but loads value 0 to r10 when dlmopen is used.

After inspecting the .o from a glibc 2.31 build (which still has the issue) it's easier to see what these instructions are trying to access: the address of __libc_tsd_CTYPE_B from the GOT:

$ objdump -Sr --disassemble=__get_nprocs /home/mscastanho/build/glibc/misc/getsysstats.o
[...]
            while (l < re && isspace (*l))
 334:   00 00 e2 3e     addis   r23,r2,0
                        334: R_PPC64_GOT_TPREL16_HA     __libc_tsd_CTYPE_B
 338:   00 00 f7 ea     ld      r23,0(r23)
                        338: R_PPC64_GOT_TPREL16_LO_DS  __libc_tsd_CTYPE_B
 33c:   14 6a f7 7e     add     r23,r23,r13
                        33c: R_PPC64_TLS        __libc_tsd_CTYPE_B
[...]

This is most likely related to the call to isspace. And look what got removed by the glibc commit b5eeca8cfd9d0fd92b5633a88901d9ff27f2b496 that "fixed" the issue:

diff --git a/sysdeps/unix/sysv/linux/getsysstats.c b/sysdeps/unix/sysv/linux/getsysstats.c
index 1ef077f9af..a9069b7056 100644
--- a/sysdeps/unix/sysv/linux/getsysstats.c
+++ b/sysdeps/unix/sysv/linux/getsysstats.c
@@ -143,6 +143,7 @@ __get_nprocs (void)
   char *re = buffer_end;

   const int flags = O_RDONLY | O_CLOEXEC;
+  /* This file contains comma-separated ranges.  */
   int fd = __open_nocancel ("/sys/devices/system/cpu/online", flags);
   char *l;
   int result = 0;
@@ -175,10 +176,10 @@ __get_nprocs (void)
            result += m - n + 1;

            l = endp;
-           while (l < re && isspace (*l))   <---------
+           if (l < re && *l == ',')         <----------------
              ++l;
          }
-       while (l < re);
+       while (l < re && *l != '\n');

       __close_nocancel_nostatus (fd);

That exact isspace call! So I suspect the commit is just avoiding the real issue, not fixing it. I'm afraid it could be a bug in glibc.

Of course, fixing glibc is outside the scope of #188, so the question that matters is: should we really use dlmopen? Looks like there are some rough edges. If not, how else can we make calls to zlib while guaranteeing it won't call into libnxz?