Open mscastanho opened 2 years ago
I may have found a solution: using dlmopen
to load zlib in a new linker namespace. This new namespace won't have any references to libnxz, so any calls from zlib will always resolve to itself.
diff --git a/lib/sw_zlib.c b/lib/sw_zlib.c
index e97b477..09790f9 100644
--- a/lib/sw_zlib.c
+++ b/lib/sw_zlib.c
@@ -37,6 +37,7 @@
*/
+#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
@@ -285,7 +286,7 @@ int sw_zlib_init(void)
{
char *error;
- sw_handler = dlopen(ZLIB_PATH, RTLD_LAZY);
+ sw_handler = dlmopen(LM_ID_NEWLM, ZLIB_PATH, RTLD_LAZY);
if(sw_handler == NULL) {
prt_err(" %s\n", dlerror());
return Z_ERRNO;
The downsides I identified so far:
I included the fix above in #188 but some tests started to fail on older glibcs. Please see more info here: https://github.com/libnxz/power-gzip/pull/188#issuecomment-1194409345.
I did some more digging to try to understand what exactly was happening on the older glibcs. The commit that "fixed" the issue doesn't seem very much related, so I suspect there may be still a bug on glibc somewhere, specially because __get_nprocs
is called in other situations and doesn't fail. It only fails when called indirectly by zlib's deflateInit2_
in the new link namespace.
I compared the call to __get_nprocs
before and after the commit that uses dlmopen(LM_ID_NEWLM,..
to load zlib. The execution is pretty much the same. Except for 1 thing. This is the instruction that segfaults:
while (l < re && isspace (*l))
1301c4: 2c 00 40 42 bdz 1301f0 <get_nprocs+0x170>
1301c8: 00 00 3e 89 lbz r9,0(r30)
1301cc: a4 0f 29 79 rldicr r9,r9,1,62
1301d0: 2e 4a 2a 7d lhzx r9,r10,r9 <-------- at this point r10 is 0, r9 is 20, so try load from address 20 => segfault.
The value of r10 is read from somewhere in TLS (?):
[...]
130158: 70 8c e2 ea ld r23,-29584(r2) <--- reading something from the TOC?
13015c: 14 6a f7 7e add r23,r23,r13 <--- r23 has value thread pointer + ???
[...]
1301b4: 00 00 57 e9 ld r10,0(r23)
This last load is OK without dlmopen, but loads value 0 to r10 when dlmopen is used.
After inspecting the .o from a glibc 2.31 build (which still has the issue) it's easier to see what these instructions are trying to access: the address of __libc_tsd_CTYPE_B
from the GOT:
$ objdump -Sr --disassemble=__get_nprocs /home/mscastanho/build/glibc/misc/getsysstats.o
[...]
while (l < re && isspace (*l))
334: 00 00 e2 3e addis r23,r2,0
334: R_PPC64_GOT_TPREL16_HA __libc_tsd_CTYPE_B
338: 00 00 f7 ea ld r23,0(r23)
338: R_PPC64_GOT_TPREL16_LO_DS __libc_tsd_CTYPE_B
33c: 14 6a f7 7e add r23,r23,r13
33c: R_PPC64_TLS __libc_tsd_CTYPE_B
[...]
This is most likely related to the call to isspace
. And look what got removed by the glibc commit b5eeca8cfd9d0fd92b5633a88901d9ff27f2b496
that "fixed" the issue:
diff --git a/sysdeps/unix/sysv/linux/getsysstats.c b/sysdeps/unix/sysv/linux/getsysstats.c
index 1ef077f9af..a9069b7056 100644
--- a/sysdeps/unix/sysv/linux/getsysstats.c
+++ b/sysdeps/unix/sysv/linux/getsysstats.c
@@ -143,6 +143,7 @@ __get_nprocs (void)
char *re = buffer_end;
const int flags = O_RDONLY | O_CLOEXEC;
+ /* This file contains comma-separated ranges. */
int fd = __open_nocancel ("/sys/devices/system/cpu/online", flags);
char *l;
int result = 0;
@@ -175,10 +176,10 @@ __get_nprocs (void)
result += m - n + 1;
l = endp;
- while (l < re && isspace (*l)) <---------
+ if (l < re && *l == ',') <----------------
++l;
}
- while (l < re);
+ while (l < re && *l != '\n');
__close_nocancel_nostatus (fd);
That exact isspace
call! So I suspect the commit is just avoiding the real issue, not fixing it. I'm afraid it could be a bug in glibc.
Of course, fixing glibc is outside the scope of #188, so the question that matters is: should we really use dlmopen
? Looks like there are some rough edges. If not, how else can we make calls to zlib while guaranteeing it won't call into libnxz?
Recently I hit the following scenario: libnxz's
deflateInit2_
calls zlib's init functionsw_deflateInit2
(which maps todeflateInit2_
from libz.so), which internally callsdeflateReset
. Since libnxz has been loaded first, this maps to libnxz'sdeflateReset
, instead of zlib's.I tried passing
RTLD_DEEPBIND
to dlopen to force the dlopen'd zlib to only call it's own symbols:But the result was the same. There's some loader magic going on that I couldn't figure out yet. Turns out that when zlib's
deflateInit2_
callsdeflateReset
through the PLT, the symbol is already resolved to libnxz's deflateReset.Now if I link the program against libnxz instead, things work as desired:
This second example becomes just like the first one if I remove
RTLD_DEEPBIND
flag fromdlopen
.I suspect that in the first case
RTLD_DEEPBIND
doesn't have effect becauselibz.so
has already been loaded when we calldlopen
, because it was in the program's dependency list.When program is linked against libnxz instead:
The current behavior is not desirable because when we call into zlib with
sw_*
we truly want pure zlib functionality. Having zlib jump back to libnxz while executing one of those calls may lead to unexpected behavior, and most likely hard-to-tackle bugs. I actually hit this while improvingdeflateReset
to always reset both sw and nx state.