openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/
Other
824 stars 72 forks source link

zdb needs largeblock support #322

Closed lundman closed 9 years ago

lundman commented 9 years ago

Initially, the stack on OSX is too large with largeblock support, so we have to change to using kmem_alloc()

d516622b0bc0e26301a2f7a34033e4691e44264e

But after that, it can coredump, for example:

# zdb -ddddddddd BOOM
[snip lots of output]
0
0
0
0
0
0
 49625 Segmentation fault: 11  zdb   -dddddddd BOOM

This is caused in:

for (i = 0; i <= last_nonzero; i++) {
    (void) printf("\t%llu\n", (longlong_t)subobjs[i]);

because last_nonzero can be -1 so the for loop will never terminate. Changing it to

for (i = 0; (last_nonzero >= 0) && (i <= last_nonzero); i++) {

will at least avoid this core dump, but we still fail, with the assert being triggered:

large_blocks feature refcount mismatch: expected 0 != actual 140652718343632

So it would seem zdb, and/or, the userland compile of the largeblock support is missing. This seems to be the case on ZOL as well.

Passing the buck upstream in case they can solve it faster than me, but I will take a deep look tomorrow.

ahrens commented 9 years ago

Looks like a bug specific to ZoL; not present in illumos, where i and last_nonzero are declared as signed numbers. I would guess that when porting to Linux, they changed this code (because they won't use C99) and broke it.

lundman commented 9 years ago

Yes, I can see your signed int in zdb.c and can certainly fix our code here. (We can-do and want, C99, as we only use clang, but there is no upstream repo to switch to yet :) Getting tired of changing all zsb back to zfsvfs.. :) )

But I still need to work out why zdb trips over the assert about large_blocks refcount, since the pool does not have large_blocks enabled, it should skip that feature. I will compare what I have against the original patch and see if anything stands out.

lundman commented 9 years ago

@ahrens Using just IllumOS kernels, I can make this happen:

Create pool with zpool without large_block support:

ftp01.dw:~$ uname -a
SunOS ftp01 5.11 oi_151a6 i86pc i386 i86pc Solaris
root@ftp01:~# mkfile 100M /var/tmp/filepool.bin
root@ftp01:~# zpool create BOOM /var/tmp/filepool.bin
root@ftp01:~# mkdir /BOOM/helloworld
root@ftp01:~# zpool export BOOM
root@ftp01:~# scp /var/tmp/filepool.bin ssh01.newer:/var/tmp/

Import pool on system with large_block support, but do not upgrade

root@ssh01:/export/home/lundman# uname -a
SunOS ssh01 5.11 omnios-170cea2 i86pc i386 i86pc
root@ssh01:/export/home/lundman# zpool import -d /var/tmp/ BOOM
root@ssh01:/export/home/lundman# zdb -ddddddd BOOM
[snip]
Indirect blocks:
               0 L0 0:ba00:200 0:120ba00:200 200L/200P F=1 B=7/7

                segment [0000000000000000, 0000000000000200) size   512

large_blocks feature refcount mismatch: expected 0 != actual -2748781167608
lundman commented 9 years ago

@behlendorf @ahrens So changing just

    for (i = 0; i <= last_nonzero; i++) {

to

    for (i = 0; (int64_t)i <= last_nonzero; i++) {

Make it work, so it is interesting that my compiler silently opts to do an unsigned comparison, and IllumOS presumably stays as signed. Either way, since it is so subtle, perhaps we should avoid code that relies on the compiler preference?

But that aside, that above is not what the issue is about. That part is easy to fix, its the problem that comes after fixing it

large_blocks feature refcount mismatch: expected 0 != actual 140489770666256

The procedure to make zdb fail on just IllumOS is listed above. It applies to all versions, and we were hoping for some assistance in fixing it :)

lundman commented 9 years ago

https://github.com/openzfsonosx/zfs/commit/972e8dabf7696ca082e60cc5ac0e32ed28834174