@uwkhahn This sure feels like another corrupted SA to me. If you could find the inode number of the etc directory, could you try running zdb -ddddd <pool>/<file_system> <inode_number_of_etc> on it? Unfortunately, if the SA is corrupted, it can be difficult to get its inode number. I'd start with ls -di etc in its parent directory. If all else fails, you can use zdb -ddddd <pool>/<file_system> <inode_number_of_parent_directory_of_etc>, which will dump the directory with a line for each entry of the form "name = inode (type)"; that will let you get the inode number of "etc" (look for a line of the form "etc = <inode> (type)").
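Putting that together, roughly (a sketch; the pool, dataset, and mount paths below are placeholders, not from this report):
cd /pool/file_system/path/to/parent_of_etc    # the directory that contains "etc"
ls -di etc                                    # prints "<inode> etc" if the lookup still works
zdb -ddddd pool/file_system <inode>           # dump the dnode/SA for "etc"
# fallback if "etc" itself can't be stat'ed: dump the parent directory object instead
ls -di .
zdb -ddddd pool/file_system <parent_inode>    # then look for the "etc = <inode> (type)" line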
@dweeezil Here's what I got:
Dataset tank/rsyncbackup/byservice/web [ZPL], ID 675, cr_txg 1785867, 79.9G, 1998692 objects, rootbp DVA[0]=<5:4801057a400:600> DVA[1]=<6:4800f73c200:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=23637893L/23637893P fill=1998692 cksum=18879dda00:76d0cb72a58:138747d1a301a:24c9fb0cbd6157
Object lvl iblk dblk dsize lsize %full type
2687072 1 16K 512 4.50K 512 100.00 ZFS directory
196 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED SPILL_BLKPTR
dnode maxblkid: 0
path /web/snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc
uid 5257
gid 20150
atime Thu Oct 30 11:45:23 2014
mtime Mon Sep 6 20:10:21 2010
ctime Thu Oct 30 11:45:23 2014
crtime Thu Oct 30 11:45:23 2014
gen 1794735
mode 42775
size 2
parent 2686699
links 2
pflags 40800000044
microzap: 512 bytes, 0 entries
Indirect blocks:
0 L0 4:2ce76bee600:600 200L/200P F=1 B=1794735/1794735
segment [0000000000000000, 0000000000000200) size 512
One other bit of info: before the crash, I see this output from the rsync, which I'm assuming is related:
rsync: get_acl: sys_acl_get_file(snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc, ACL_TYPE_ACCESS): Input/output error (5)
rsync: get_acl: sys_acl_get_file(snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc, ACL_TYPE_ACCESS): Input/output error (5)
rsync: set_acl: sys_acl_set_file(snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc, ACL_TYPE_ACCESS): Bad address (14)
@uwkhahn Sorry to say, but you've got the corrupted SA (actually, more correctly, a corrupted dnode) problem. I've been trying to track this down for the better part of a month or two but have not been able to reproduce it.
I've been working on an overhaul of the way these SAs are created, but it's proving to be quite a bit more involved than I initially expected due to the differences in the way POSIX ACLs and SELinux xattrs interface with the kernel.
Although I'm 99% sure I know exactly what form the corruption takes, you could confirm it by providing the output of zdb -dddd <pool>/<filesystem> 5 6 and then building dweeezil/zfs@1019e58 (but not installing it) and, from the build directory, running cmd/zdb/zdb -ddddddd <pool>/<filesystem> <inode_of_etc_directory>. That will give the complete picture.
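For what it's worth, that build sequence looks roughly like this (a sketch; --with-config=user skips the kernel modules and assumes the userland build dependencies are installed, and the commit is the one linked above):
git clone https://github.com/dweeezil/zfs.git
cd zfs
git checkout 1019e58
./autogen.sh && ./configure --with-config=user && make
# run zdb straight from the build tree; nothing gets installed
cmd/zdb/zdb -ddddddd <pool>/<filesystem> <inode_of_etc_directory>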
The problem is that in some cases, the SA layout in the dnode is not updated when an SA xattr is kicked out to the spill block. This is likely caused by a race condition and/or made more probable under memory pressure. Was there a lot else going on while this "etc" directory was created?
Finally, I'd like to ask if you know whether this directory has been corrupted since its creation or whether some other operation may have been done to it after-the-fact.
Here's the first bit you asked for.
zdb -dddd tank/rsyncbackup/byservice/web 5 6
Dataset tank/rsyncbackup/byservice/web [ZPL], ID 675, cr_txg 1785867, 80.1G, 1999823 objects, rootbp DVA[0]=<8:47bf33ac600:600> DVA[1]=<9:28340417e00:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=23673413L/23673413P fill=1999823 cksum=13f177984f:67b05bc25f1:12617fe700ca3:254acc5b7f6c65
Object lvl iblk dblk dsize lsize %full type
5 1 16K 1.50K 2.50K 1.50K 100.00 SA attr registration
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
microzap: 1536 bytes, 21 entries
ZPL_DACL_COUNT = 8000010 : [8:0:16]
ZPL_GEN = 8000004 : [8:0:4]
ZPL_SYMLINK = 30011 : [0:3:17]
ZPL_LINKS = 8000008 : [8:0:8]
ZPL_ATIME = 10000000 : [16:0:0]
ZPL_UID = 800000c : [8:0:12]
ZPL_RDEV = 800000a : [8:0:10]
ZPL_DACL_ACES = 40013 : [0:4:19]
ZPL_PAD = 2000000e : [32:0:14]
ZPL_XATTR = 8000009 : [8:0:9]
ZPL_DXATTR = 30014 : [0:3:20]
ZPL_ZNODE_ACL = 5803000f : [88:3:15]
ZPL_CRTIME = 10000003 : [16:0:3]
ZPL_FLAGS = 800000b : [8:0:11]
ZPL_SCANSTAMP = 20030012 : [32:3:18]
ZPL_SIZE = 8000006 : [8:0:6]
ZPL_MODE = 8000005 : [8:0:5]
ZPL_PARENT = 8000007 : [8:0:7]
ZPL_CTIME = 10000002 : [16:0:2]
ZPL_MTIME = 10000001 : [16:0:1]
ZPL_GID = 800000d : [8:0:13]
Object lvl iblk dblk dsize lsize %full type
6 1 16K 16K 11.5K 32K 100.00 SA attr layouts
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 1
Fat ZAP stats:
Pointer table:
1024 elements
zt_blk: 0
zt_numblks: 0
zt_shift: 10
zt_blks_copied: 0
zt_nextblk: 0
ZAP entries: 4
Leaf blocks: 1
Total blocks: 2
zap_block_type: 0x8000000000000001
zap_magic: 0x2f52ab2ab
zap_salt: 0x40f6f138d
Leafs with 2^n pointers:
9: 1 *
Blocks with n*5 entries:
0: 1 *
Blocks n/10 full:
1: 1 *
Entries with n chunks:
3: 1 *
4: 3 ***
Buckets with n entries:
0: 508 ****************************************
1: 4 *
4 = [ 5 6 4 12 13 7 11 0 1 2 3 8 16 19 20 ]
3 = [ 5 6 4 12 13 7 11 0 1 2 3 8 16 19 17 ]
2 = [ 5 6 4 12 13 7 11 0 1 2 3 8 16 19 ]
5 = [ 20 ]
Here's the custom zdb output:
cmd/zdb/zdb -ddddddd tank/rsyncbackup/byservice/web 2687072
Dataset tank/rsyncbackup/byservice/web [ZPL], ID 675, cr_txg 1785867, 80.1G, 1999823 objects, rootbp DVA[0]=<8:47bf33ac600:600> DVA[1]=<9:28340417e00:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=23673413L/23673413P fill=1999823 cksum=13f177984f:67b05bc25f1:12617fe700ca3:254acc5b7f6c65
Object lvl iblk dblk dsize lsize %full type
2687072 1 16K 512 4.50K 512 100.00 ZFS directory (K=inherit) (Z=inherit)
196 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED SPILL_BLKPTR
dnode maxblkid: 0
Spill blkptr:
4:2ce76beec00:600 5:2ce732b2c00:600 200L/200P F=1 B=1794735/1794735
Spill blkptr dump: \003\000\000\000\004\000\000\000\166\137\073\147\001\000\000\000\003\000\000\000\005\000\000\000\226\225\071\147\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\002\007\054\200\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\257\142\033\000\000\000\000\000\001\000\000\000\000\000\000\000\334\201\257\326\016\000\000\000\171\354\354\031\331\005\000\000\313\276\240\013\376\056\001\000\152\134\007\344\313\354\051\000
SA hdrsize 16
SA layout 4
path /web/snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc
uid 5257
gid 20150
atime Thu Oct 30 11:45:23 2014
mtime Mon Sep 6 20:10:21 2010
ctime Thu Oct 30 11:45:23 2014
crtime Thu Oct 30 11:45:23 2014
gen 1794735
mode 42775
size 2
parent 2686699
links 2
pflags 40800000044
ndacl 3
dump_znode_sa_xattr: sa_xattr_size=20 sa_size error=0
SA packed dump sa_xattr_size=20: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\000\003\000\000\000
SA xattr dump:
dump_znode_sa_xattr: nvlist_unpack error=14
microzap: 512 bytes, 0 entries
Indirect blocks:
0 L0 4:2ce76bee600:600 5:2ce732b2600:600 200L/200P F=1 B=1794735/1794735
segment [0000000000000000, 0000000000000200) size 512
Be aware that I'm anonymizing the paths I'm posting here, so the pathnames are slightly different in length and content, but unless you tell me otherwise I'm going to assume that's not an issue.
At the times that these issues occur, it appears that zfs is using most of the system memory (there's 128GB RAM).
I'm going to move the affected directory and then try to do the rsync again to see if the path is created corrupted again. I'll let you know what I see.
Ah, one further addition: there is a zvol being used over iSCSI that is active at the same time. I note that the zvol seems to add memory usage beyond what I would have expected to run into. Hmmm...
@uwkhahn Thanks, the problem is exactly the same as others have been having. The SA layout is wrong. It's 4 but should be 2.
Could you please check the source file system and see whether the parent directory of "etc" has a default ACL? I'm looking for any clues which might help me reproduce this problem.
I'm going to look into how hard it would be to whip up a gross hack to detect and work around this problem after the fact.
Here's a "getfacl -n ." on the parent directory.
# file: .
# owner: 5257
# group: 20150
# flags: -s-
user::rwx
user:801:rwx
group::rwx
mask::rwx
other::r-x
default:user::rwx
default:user:801:rwx
default:group::rwx
default:mask::rwx
default:other::r-x
So if I destroy this filesystem and re-create it with xattr=on, will that be at risk of hitting this same bug? (OK, maybe we'll rename it instead so that we can keep it around for reference.)
Also, with 128GB of RAM I've set zfs_arc_max to 107374182400 (100GB). I note that I see significant usage beyond 100GB (which goes down when the zvol is not being written to actively), so I'm contemplating reducing it further. You were saying that you believed this might be related to memory pressure, so I wonder if you think this might help in our case.
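(For reference, the cap is just the zfs_arc_max module parameter; a sketch of how we apply it, assuming the conventional /etc/modprobe.d/zfs.conf location:)
# /etc/modprobe.d/zfs.conf -- cap the ARC at 100GB; takes effect when the zfs module is (re)loaded
options zfs zfs_arc_max=107374182400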
BTW Thanks for helping figure this out.
@uwkhahn xattr=on should be perfectly safe. The normal caveats regarding dir-style xattrs do apply, but at least they don't involve filesystem corruption.
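In other words, something along these lines (a sketch; the archive dataset name and acltype=posixacl are assumptions based on this thread, adjust as needed):
zfs rename tank/rsyncbackup/byservice/web tank/archive/web                     # keep the corrupted copy for later inspection
zfs create -o xattr=on -o acltype=posixacl tank/rsyncbackup/byservice/web      # start fresh with dir-style xattrs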
Thinking about the ARC oversubscription for a moment: I just re-read this thread and realized you're running a post-3.12 kernel and stock 0.6.3 packages, which means you don't have the shrinker fixes. You might want to try cherry-picking ed6e9cc and zfsonlinux/spl@802a4a2 (assuming you're using DKMS they should patch cleanly into 0.6.3) and see whether the ARC stays under control better.
Thanks,
I've not done a patched DKMS package before (I have done patched kernel packages directly, but not DKMS). I'll go figure that out (I'd take pointers to any good instructions). If we start going down that path, I think there are some other patches of interest to us as well, but we tend to avoid patched packages if we can (generally much more work to maintain).
I don't suppose ZOL is close to any new release coming out the door, is it? :) Thanks again.
Ah, I see that there are no instructions on how to build the Debian DKMS packages, yet they exist for download. Any clue as to how they are generated currently?
You could just patch the sources in /usr/src and then dkms remove spl/0.6.3 ; dkms build spl/0.6.3 ; dkms install spl/0.6.3, and then the same for zfs.
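For example, roughly (a sketch assuming the DKMS trees live in /usr/src/spl-0.6.3 and /usr/src/zfs-0.6.3; the .patch URLs are just one way to fetch the commits mentioned above):
cd /usr/src/spl-0.6.3
wget -qO- https://github.com/zfsonlinux/spl/commit/802a4a2.patch | patch -p1
cd /usr/src/zfs-0.6.3
wget -qO- https://github.com/zfsonlinux/zfs/commit/ed6e9cc.patch | patch -p1
dkms remove spl/0.6.3 --all && dkms build spl/0.6.3 && dkms install spl/0.6.3
dkms remove zfs/0.6.3 --all && dkms build zfs/0.6.3 && dkms install zfs/0.6.3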
Hmm, the patch for spl didn't apply cleanly. The one for zfs did.
Oh well, I thought they'd apply cleanly. They're pretty small. You can definitely try it with just the zfs change.
Last night I removed the parent directory of the one with the bad SA; rsync ran again and indeed caused the I/O errors followed by the same crash. I then tried rsyncing data out of there, so I was only reading; I also got the immediate I/O error and then, later, the same kernel crash.
For now I've kept the same version of everything and moved the filesystems with the xattr=sa setting to an archived location.
We're recompiling a Debian DKMS package using the developer build instructions, but note that the current head doesn't apply the Debian patches cleanly (apparently the Debian daily build is from a month ago).
I've started rsyncing the old data to new filesystems with xattr=on instead of sa. That has not been crashing, with the exception of when I rsync from the filesystem I've been talking with you about.
We'll be trying newer versions soon, but I wanted to get our backups back on track; the affected filesystems are just moved aside so we can look at them later.
@uwkhahn So you have saved the corrupted filesystem? Good. I'm about halfway done with a patch that will implement a simple heuristic to detect this and work around the problem. I've resisted doing this over the past few weeks but figured that it'll get my eyes in front of the SA code yet again and maybe I'll get an idea as to what's actually causing this corruption. And it might help others recover their files more easily.
@uwkhahn Could you please run zdb -R tank/rsyncbackup/byservice/web 4:2ce76beec00:600
and post the output? I'm starting to gather enough debugging data to formulate an idea of what's going on. For reference, here are the related issues: #2663, #2700, #2717 and zfsonlinux/spl#390.
While working on the hack alluded to in my previous post, I realized that your corrupted dnode actually contains an empty, but corrupted packed nvlist dxattr which has pointed me in a slightly different direction. That said, I'm fairly certain the work in #2809 will eliminate this whole issue.
I moved the filesystem, so the command is slightly different, but here's the output:
zdb -R tank/archive/web 4:2ce76beec00:600
Found vdev type: raidz
4:2ce76beec00:600
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 00c40405002f505a 0000000000000101 ZP/.............
000010: 5800000001000000 1800000060000000 .......X...`....
000020: 702e6d6574737973 6c63615f7869736f system.posix_acl
000030: 746c75616665645f 2c0000000a000000 _default.......,
000040: 0007000100000002 00070002ffffffff ................
000050: 0007000400000321 00070010ffffffff !...............
000060: 00050020ffffffff 58000000ffffffff .... ..........X
000070: 1700000058000000 702e6d6574737973 ...X....system.p
000080: 6c63615f7869736f 007373656363615f osix_acl_access.
000090: 2c0000000a000000 0007000100000002 .......,........
0000a0: 00070002ffffffff 0007000400000321 ........!.......
0000b0: 00000010ffffffff 00000020ffffffff ............ ...
0000c0: 00000000ffffffff 0000000000000000 ................
0000d0: 0000000000000000 0000000000000000 ................
0000e0: 0000000000000000 0000000000000000 ................
0000f0: 0000000000000000 0000000000000000 ................
000100: 0000000000000000 0000000000000000 ................
000110: 0000000000000000 0000000000000000 ................
000120: 0000000000000000 0000000000000000 ................
000130: 0000000000000000 0000000000000000 ................
000140: 0000000000000000 0000000000000000 ................
000150: 0000000000000000 0000000000000000 ................
000160: 0000000000000000 0000000000000000 ................
000170: 0000000000000000 0000000000000000 ................
000180: 0000000000000000 0000000000000000 ................
000190: 0000000000000000 0000000000000000 ................
0001a0: 0000000000000000 0000000000000000 ................
0001b0: 0000000000000000 0000000000000000 ................
0001c0: 0000000000000000 0000000000000000 ................
0001d0: 0000000000000000 0000000000000000 ................
0001e0: 0000000000000000 0000000000000000 ................
0001f0: 0000000000000000 0000000000000000 ................
000200: 8000000000000003 0000000a561da7eb ...........V....
000210: 0000000000000000 0000000000000000 ................
000220: 0000000000000000 0000000000000000 ................
000230: 0000000000000000 0000000000000000 ................
000240: 400000000029006d 732e000000000000 m.)....@.......s
000250: 0000000000006e76 0000000000000000 vn..............
000260: 0000000000000000 0000000000000000 ................
000270: 0000000000000000 0000000000000000 ................
000280: 400000000029006e 6461000000000000 n.)....@......ad
000290: 00000000006e696d 0000000000000000 min.............
0002a0: 0000000000000000 0000000000000000 ................
0002b0: 0000000000000000 0000000000000000 ................
0002c0: 400000000029006f 6e69000000000000 o.)....@......in
0002d0: 0000736564756c63 0000000000000000 cludes..........
0002e0: 0000000000000000 0000000000000000 ................
0002f0: 0000000000000000 0000000000000000 ................
000300: 4000000000290070 696c000000000000 p.)....@......li
000310: 0000000000000062 0000000000000000 b...............
000320: 0000000000000000 0000000000000000 ................
000330: 0000000000000000 0000000000000000 ................
000340: 0000000000000000 0000000000000000 ................
000350: 0000000000000000 0000000000000000 ................
000360: 0000000000000000 0000000000000000 ................
000370: 0000000000000000 0000000000000000 ................
000380: 0000000000000000 0000000000000000 ................
000390: 0000000000000000 0000000000000000 ................
0003a0: 0000000000000000 0000000000000000 ................
0003b0: 0000000000000000 0000000000000000 ................
0003c0: 0000000000000000 0000000000000000 ................
0003d0: 0000000000000000 0000000000000000 ................
0003e0: 0000000000000000 0000000000000000 ................
0003f0: 0000000000000000 0000000000000000 ................
000400: 8000000000000003 0000000a561da7eb ...........V....
000410: 0000000000000000 0000000000000000 ................
000420: 0000000000000000 0000000000000000 ................
000430: 0000000000000000 0000000000000000 ................
000440: 400000000029006d 732e000000000000 m.)....@.......s
000450: 0000000000006e76 0000000000000000 vn..............
000460: 0000000000000000 0000000000000000 ................
000470: 0000000000000000 0000000000000000 ................
000480: 400000000029006e 6461000000000000 n.)....@......ad
000490: 00000000006e696d 0000000000000000 min.............
0004a0: 0000000000000000 0000000000000000 ................
0004b0: 0000000000000000 0000000000000000 ................
0004c0: 400000000029006f 6e69000000000000 o.)....@......in
0004d0: 0000736564756c63 0000000000000000 cludes..........
0004e0: 0000000000000000 0000000000000000 ................
0004f0: 0000000000000000 0000000000000000 ................
000500: 4000000000290070 696c000000000000 p.)....@......li
000510: 0000000000000062 0000000000000000 b...............
000520: 0000000000000000 0000000000000000 ................
000530: 0000000000000000 0000000000000000 ................
000540: 0000000000000000 0000000000000000 ................
000550: 0000000000000000 0000000000000000 ................
000560: 0000000000000000 0000000000000000 ................
000570: 0000000000000000 0000000000000000 ................
000580: 0000000000000000 0000000000000000 ................
000590: 0000000000000000 0000000000000000 ................
0005a0: 0000000000000000 0000000000000000 ................
0005b0: 0000000000000000 0000000000000000 ................
0005c0: 0000000000000000 0000000000000000 ................
0005d0: 0000000000000000 0000000000000000 ................
0005e0: 0000000000000000 0000000000000000 ................
0005f0: 0000000000000000 0000000000000000 ................
@uwkhahn Thanks. Except for the ZAP remnants in the slack area, it looks to be OK to me. I'll run a standalone decoder program on it later today to verify.
Hey, just for reference, I followed the same process for another dir that is giving the Input/output error. Its dump looks like this (I can give you the steps I used to get there, but I'm just following the recipe you gave above):
zdb -R tank/archive/web 8:48fcab22200:600
Found vdev type: raidz
8:48fcab22200:600
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 00c40405002f505a 0000000000000101 ZP/.............
000010: 5800000001000000 1800000060000000 .......X...`....
000020: 702e6d6574737973 6c63615f7869736f system.posix_acl
000030: 746c75616665645f 2c0000000a000000 _default.......,
000040: 0007000100000002 00070002ffffffff ................
000050: 0007000400000321 00070010ffffffff !...............
000060: 00050020ffffffff 58000000ffffffff .... ..........X
000070: 1700000058000000 702e6d6574737973 ...X....system.p
000080: 6c63615f7869736f 007373656363615f osix_acl_access.
000090: 2c0000000a000000 0007000100000002 .......,........
0000a0: 00070002ffffffff 0007000400000321 ........!.......
0000b0: 00000010ffffffff 00000020ffffffff ............ ...
0000c0: 00000000ffffffff 0000000000000000 ................
0000d0: 0000000000000000 0000000000000000 ................
0000e0: 0000000000000000 0000000000000000 ................
0000f0: 0000000000000000 0000000000000000 ................
000100: 0000000000000000 0000000000000000 ................
000110: 0000000000000000 0000000000000000 ................
000120: 0000000000000000 0000000000000000 ................
000130: 0000000000000000 0000000000000000 ................
000140: 0000000000000000 0000000000000000 ................
000150: 0000000000000000 0000000000000000 ................
000160: 0000000000000000 0000000000000000 ................
000170: 0000000000000000 0000000000000000 ................
000180: 0000000000000000 0000000000000000 ................
000190: 0000000000000000 0000000000000000 ................
0001a0: 0000000000000000 0000000000000000 ................
0001b0: 0000000000000000 0000000000000000 ................
0001c0: 0000000000000000 0000000000000000 ................
0001d0: 0000000000000000 0000000000000000 ................
0001e0: 0000000000000000 0000000000000000 ................
0001f0: 0000000000000000 0000000000000000 ................
000200: 7ce8937fab5b1b37 129d66279c3ee5d5 7.[....|..>.'f..
000210: 9150d286eb2f192f 5c69aa8a7766d9a4 /./...P...fw..i\
000220: 05a895c6195f2ef3 b338788f73fba470 .._.....p..s.x8.
000230: 7780c41187c2a869 dca85f5e96075f2e i......w._..^_..
000240: ee72bade6a2cbcf4 1cf591acbfb9dd87 ..,j..r.........
000250: 53832ae2b8f8b3d2 729d4f83e4a9634a .....*.SJc...O.r
000260: 54754ebb98ee10d6 d0eae8a1bbda98cd .....NuT........
000270: 031817c20127cac2 4bb5006514b03ded ..'......=..e..K
000280: 5c8d0870fc38d2ae 09b140db3998ca0e ..8.p..\...9.@..
000290: 60f913f3204fb7f5 fafe1ca9dedb060c ..O ...`........
0002a0: 0cc8ef8264f8ba94 cc38007338303db2 ...d.....=08s.8.
0002b0: 830d521d3791871b 258480e8703de5d8 ...7.R....=p...%
0002c0: b05ebfb868c6921a 333178eb70cb7022 ...h..^."p.p.x13
0002d0: 0fdbb98f1a73ba9c 2f30d76e19d2fca6 ..s.........n.0/
0002e0: 1399a692d8c2aab4 d38b309db8d3d204 .............0..
0002f0: f7eedfc137e1a34e 9044153549ea1517 N..7.......I5.D.
000300: 7460c05b3cbed0e0 c6ee84f93d3a6247 ...<[.`tGb:=....
000310: c2042d662926b015 1ec29863dba51605 ..&)f-......c...
000320: d9491f10a3c7e30d 8a920a3b12d1d448 ......I.H...;...
000330: ee0560c2f3f4cfc7 d1c1c511f3e65fe7 .....`..._......
000340: 12f599a58a78c7ec 5066fd0b0d0d49b0 ..x......I....fP
000350: 2a129ba8efaf91b6 641e4fcd71e7aa07 .......*...q.O.d
000360: 8c8061656acc53d0 72f1fb552e590f5f .S.jea.._.Y.U..r
000370: d96cab7d22e8ded0 a253565296803d16 ..."}.l..=..RVS.
000380: 62e3f86051bcb36e 504167b10b280d94 n..Q`..b..(..gAP
000390: de77a540ecb4dad3 93dbd47eb77376a9 ....@.w..vs.~...
0003a0: a3ae0b6204d918d3 9b4f4c8d17c5490b ....b....I...LO.
0003b0: 024684c802369f13 1ef6e7314e55bbe2 ..6...F...UN1...
0003c0: 86a35670c8da0184 4ef63bb7e9f8f033 ....pV..3....;.N
0003d0: 93f99ab4a2ecd360 2cde98d37ccf0426 `.......&..|...,
0003e0: a2cce6f77a7f84ab 4abebe447ae2b4fa ...z.......zD..J
0003f0: 82e1dbe93372c3bd 1b0bf889b7f93f24 ..r3....$?......
000400: 98ccb1d1303b641a 388ba75491b23725 .d;0....%7..T..8
000410: c19bb1b078a2d595 12b8674a27c74691 ...x.....F.'Jg..
000420: 17be22959c663a35 499af4a147951475 5:f.."..u..G...I
000430: 0296690f67f8f144 6471049a96329736 D..g.i..6.2...qd
000440: 5b7e52e39cb18e97 106344a3cd2dcf9b .....R~[..-..Dc.
000450: 8147e683618762c6 b05a02f6cc742540 .b.a..G.@%t...Z.
000460: 664c6f7d21a52e83 5bae34b54980ab16 ...!}oLf...I.4.[
000470: 65931ee3d205781b aea48c026883ddce .x.....e...h....
000480: 5672eb42460eca8d f8171fa702fad4a0 ...FB.rV........
000490: faf0d39ec74b1438 7fff72977ad03c64 8.K.....d<.z.r..
0004a0: 48ebeffb402ca576 d13e4a4109adcb26 v.,@...H&...AJ>.
0004b0: 68bc938dd379f86d ac501b05f9dc6fd9 m.y....h.o....P.
0004c0: 5d459299ad1f2b06 46777627492b434e .+....E]NC+I'vwF
0004d0: a6b460434df2ebf8 062ae0c76c3bbe14 ...MC`....;l..*.
0004e0: a37d5b857a95feb5 f1ee40c178d3f55a ...z.[}.Z..x.@..
0004f0: 987a193142ecbe8c 35d06b67d745c203 ...B1.z...E.gk.5
000500: 3ccc3e6d1704af63 8cc316fb01646a3a c...m>.<:jd.....
000510: 9adf720155b27791 6b9609664e120925 .w.U.r..%..Nf..k
000520: 049cfea873f25ddd 442b1eb917993db5 .].s.....=....+D
000530: 79243102d41f610e cfa85fb424b29ad8 .a...1$y...$._..
000540: 2b9d92bf7b630adf b0ceed1782a18f72 ..c{...+r.......
000550: 6d19456286171c33 386d7536ce5884ca 3...bE.m..X.6um8
000560: cad6c580a622440c d12fa558af1d1e39 .D".....9...X./.
000570: 6b867e89539adda1 a4340ed9fd9a6ba3 ...S.~.k.k....4.
000580: 73d5b6bb5d3b2845 81408384384ff71e E(;]...s..O8..@.
000590: 63ec89250dc3a863 9687ce1aea719132 c...%..c2.q.....
0005a0: 44012a4ac7a298f8 aa2c9cb647acf88a ....J*.D...G..,.
0005b0: 54379337be5e9e6b ef42829bce5cb494 k.^.7.7T..\...B.
0005c0: 03d36fb54cc5be7f 35da40903edbcc64 ...L.o..d..>.@.5
0005d0: 5b1bcc53c1800777 45e6c81b6e4d4a80 w...S..[.JMn...E
0005e0: 73da46d03576b180 04cc4afc64548af1 ..v5.F.s..Td.J..
0005f0: 546fce4f2fbf7a6d 4cc3a8a2872c05f3 mz./O.oT..,....L
Big breakthrough this weekend: I've got a 100% reliable reproducer and am hot on the trail of the problem. More information later. I think it will explain all currently reported instances of SA corruption.
Update (now that I'm not on a mobile device): The problem is that a transition from a dnode with a spill block to a dnode without a spill block doesn't work properly if the dnode isn't synced in between. The cause seems to be that the dbuf corresponding to the spill block (db_blkid of DMU_SPILL_BLKID) is still lying around to be synced even after the spill block has been removed at higher levels.
The effect is that a perfectly valid dnode is written with the exception that it has an unnecessary spill block attached to it. This, in itself, doesn't cause much of a problem (but it is detected by my extra zdb debugging) but when subsequent changes are made to the dnode, such as re-extending an xattr into the spill area, fatal corruption can occur.
I'm hoping to have a better description and solution within the next day or so. As best as I can tell at the moment, this problem is not specific to SA xattrs, but is exacerbated by their use because without them, the spill code within ZFS doesn't get a whole lot of use.
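To make that concrete, here's a rough illustration of the kind of sequence involved (this is not my actual reproducer, just the spill/no-spill transition described above exercised under xattr=sa and acltype=posixacl, with made-up names):
zfs create -o xattr=sa -o acltype=posixacl tank/satest
mkdir /tank/satest/d
# grow the SA xattrs past the dnode's bonus area so a spill block gets attached
setfattr -n user.pad -v "$(head -c 4000 /dev/zero | tr '\0' x)" /tank/satest/d
# remove it again, so the dnode should transition back to having no spill block...
setfattr -x user.pad /tank/satest/d
# ...then dirty the SA xattrs once more before that dnode has been synced out
setfacl -m u:801:rwx,d:u:801:rwx /tank/satest/d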
Very interesting. It sounds like the spill block dbuf isn't getting properly undirtied in this case and is being left around. It wouldn't shock me if all the ZFS implementations suffer from this. We make much more use of the SA code and are more likely to notice it.
The acltype=posixacl and xattr=sa combination gives the SA code quite a workout. The key bit of information is that in several of the extended zdb debugging dumps provided in these related issues, there was an almost perfectly valid non-spill xattr SA present, but its size was, not so coincidentally, exactly what it would be with zero nvpairs in the packed nvlist (cf. the sa_xattr_size=20 and nvlist_unpack error in the custom zdb output above).
I think I'm getting pretty close to the root of the problem on my test system on which I'm running the reproducers.
Further tracing shows that the dbufs for both the dnode and the spill block are being dirtied properly. It looks like I went down the wrong track here; the problem is more likely in the main SA code, but it is definitely dependent on the flushing of earlier data.
Ugh, you were right @behlendorf (as was my initial hunch; I was confusing dirtied with undirtied). Patch to be posted shortly.
Thanks for all the work. Will this show up as a daily build at some point?
I took some of bpkroth's work to try to get a Debian package with the latest code, and it hasn't gone well (zfs 7b2d78a and spl 917fef2).
I then tried turning on xattr=sa on a new filesystem and rsyncing old data over. It crashed pretty soon after the rsync started (rsync -aviPHAX), and now I can't even traverse the directory structure without it eventually crashing.
It's a very strange crash: initially the program traversing the filesystem (e.g. find, ls, rm) is reported as using 100% of the CPU, then the machine eventually freezes. zpool iostat 2 doesn't really show anything going on.
Hmm
Hi, I've been working with Ken on this issue as well. Here's some other details we have to offer:
We were testing the git checkout build ids Ken mentioned, first with a local build, then again with the packages in the wheezy-daily repository (details below).
The testing procedure goes something like this:
When doing that, we'll end up with a crash in a couple of different places, as noted above.
In the case of the zfs destroy, we just got this:
[ 349.160037] VERIFY(size != 0) failed
[ 349.160122] SPLError: 14585:0:(range_tree.c:172:range_tree_add()) SPL PANIC
[ 349.160205] SPL: Showing stack for process 14585
[ 349.160271] CPU: 8 PID: 14585 Comm: z_fr_iss_6/3 Tainted: P O 3.16-0.bpo.3-amd64 #1 Debian 3.16.5-1~bpo70+1
[ 349.160346] Hardware name: Supermicro X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.0a 02/06/2014
[ 349.160420] 0000000000000000 ffffffffa0f3cfc3 ffffffff8154144f 0000000000000000
[ 349.160598] ffffffffa02d669c ffffffffa02eab4d 000000000000000a 000000000000ffff
[ 349.160601] 000000000000ffff 00000000000000ac ffffffffa02e2746 0000000000000028
[ 349.160605] Call Trace:
[ 349.160625] [<ffffffff8154144f>] ? dump_stack+0x41/0x51
[ 349.160637] [<ffffffffa02d669c>] ? spl_debug_bug+0x7c/0xe0 [spl]
[ 349.160647] [<ffffffffa02e2746>] ? spl_PANIC+0xb6/0x100 [spl]
[ 349.160654] [<ffffffffa02dbbeb>] ? spl_kmem_cache_alloc+0x9b/0xcf0 [spl]
[ 349.160661] [<ffffffffa02dc98f>] ? spl_kmem_cache_free+0x14f/0x3e0 [spl]
[ 349.160667] [<ffffffffa028018a>] ? avl_find+0x5a/0xa0 [zavl]
[ 349.160674] [<ffffffffa0ec46e1>] ? range_tree_add+0x291/0x2d0 [zfs]
[ 349.160700] [<ffffffffa0ec09f0>] ? metaslab_free_dva+0x1d0/0x3b0 [zfs]
[ 349.160721] [<ffffffffa0ec38f4>] ? metaslab_free+0x94/0xe0 [zfs]
[ 349.160740] [<ffffffffa0f2323c>] ? zio_dva_free+0x1c/0x30 [zfs]
[ 349.160758] [<ffffffffa0f24269>] ? zio_execute+0xa9/0x140 [zfs]
[ 349.160775] [<ffffffffa02df0c6>] ? taskq_thread+0x236/0x4c0 [spl]
[ 349.160781] [<ffffffff8109f330>] ? try_to_wake_up+0x310/0x310
[ 349.160787] [<ffffffffa02dee90>] ? task_done+0x150/0x150 [spl]
[ 349.160792] [<ffffffff8108f491>] ? kthread+0xc1/0xe0
[ 349.160797] [<ffffffff8108f3d0>] ? flush_kthread_worker+0xb0/0xb0
[ 349.160801] [<ffffffff8154787c>] ? ret_from_fork+0x7c/0xb0
[ 349.160805] [<ffffffff8108f3d0>] ? flush_kthread_worker+0xb0/0xb0
For the crash while walking the filesystem (either via find or rsync) we were basically getting "BUG: soft lockup" and/or "INFO: rcu_sched detected stalls on CPUs/tasks", and the rsync or find process would appear to be spinning at 100% in top (though I'd have expected it to be in D state or something like that, spinning in kernel land instead). Anyways, those netconsole dumps are pretty long since they repeat until the panic occurs and dump for every CPU on the box, so I'll only post a snippet here:
Nov 20 11:23:33 randal netconsole: 683.784535] INFO: rcu_sched detected stalls on CPUs/tasks:
Nov 20 11:23:33 randal netconsole: by 14, t=36762 jiffies, g=11502, c=11501, q=137251)
Nov 20 11:23:33 randal netconsole: 683.784842] NMI backtrace for cpu 8
Nov 20 11:23:33 randal netconsole: 683.784913] CPU: 8 PID: 15162 Comm: find Tainted: P W O 3.14-0.bpo.1-amd64 #1 Debian 3.14.12-1~bpo70+1
Nov 20 11:23:33 randal netconsole: 683.785007] Hardware name: Supermicro X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.0a 02/06/2014
Nov 20 11:23:33 randal netconsole: 683.785101] task: ffff880ffd3de9e0 ti: ffff880fc141e000 task.ti: ffff880fc141e000
Nov 20 11:23:33 randal netconsole: 683.785191] RIP: 0010:[<ffffffff814f297d>]
Nov 20 11:23:33 randal netconsole[<ffffffff814f297d>]: _raw_spin_lock_irq+0x1d/0x30
Nov 20 11:23:33 randal netconsole: 683.785337] RSP: 0018:ffff880fc141faa8 EFLAGS: 00000006
Nov 20 11:23:33 randal netconsole: 683.785409] RAX: 000000000000000a RBX: ffff880ffd3de9e0 RCX: 0000000000000016
Nov 20 11:23:33 randal netconsole: 683.785484] RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff88203149a488
Nov 20 11:23:33 randal netconsole: 683.785559] RBP: ffff88203149a480 R08: 008293cec78f2ba7 R09: ffff88203aee9000
Nov 20 11:23:33 randal netconsole: 683.785634] R10: ffff88203149a480 R11: ffff882010d44038 R12: ffff88203149a480
Nov 20 11:23:33 randal netconsole: 683.785709] R13: 0000000000000002 R14: ffff880fc141fe58 R15: 0000000000000000
Nov 20 11:23:33 randal netconsole: 683.785786] FS: 00007fd5a2a1b700(0000) GS:ffff88207fc00000(0000) knlGS:0000000000000000
Nov 20 11:23:33 randal netconsole: 683.785876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 20 11:23:33 randal netconsole: 683.785948] CR2: 00007f14a0496610 CR3: 0000000fbeb89000 CR4: 00000000001407e0
Nov 20 11:23:33 randal netconsole: 683.786023] Stack:
Nov 20 11:23:33 randal netconsole: 683.786089] ffffffff814f1f7c
Nov 20 11:23:33 randal netconsole: 683.786372] ffff880ffd3de9e0
Nov 20 11:23:33 randal netconsole: 683.786655] ffff88203149a480
Nov 20 11:23:33 randal netconsole: 683.786939] Call Trace:
Nov 20 11:23:33 randal netconsole: 683.787008] [<ffffffff814f1f7c>] ? rwsem_down_read_failed+0x2c/0xf0
Nov 20 11:23:33 randal netconsole: 683.787101] [<ffffffffa0d4d3bc>] ? dmu_buf_hold_noread+0xfc/0x190 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787179] [<ffffffff812a53b4>] ? call_rwsem_down_read_failed+0x14/0x30
Nov 20 11:23:33 randal netconsole: 683.787255] [<ffffffff814f1a57>] ? down_read+0x17/0x20
Nov 20 11:23:33 randal netconsole: 683.787341] [<ffffffffa0da5fa0>] ? zap_get_leaf_byblk+0xe0/0x300 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787425] [<ffffffffa0daa070>] ? zap_match.part.5+0x40/0x40 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787508] [<ffffffffa0da622a>] ? zap_deref_leaf+0x6a/0x90 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787593] [<ffffffffa0da7c1f>] ? fzap_cursor_retrieve+0xdf/0x290 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787678] [<ffffffffa0dab31f>] ? zap_cursor_retrieve+0x6f/0x330 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787761] [<ffffffffa0d4d1a3>] ? dmu_prefetch+0x1c3/0x2e0 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787847] [<ffffffffa0dcc75f>] ? zfs_readdir+0x11f/0x460 [zfs]
Nov 20 11:23:33 randal netconsole: 683.787925] [<ffffffff811a7b81>] ? path_openat+0xd1/0x630
Nov 20 11:23:33 randal netconsole: 683.788001] [<ffffffff810a6519>] ? remove_wait_queue+0x19/0x60
Nov 20 11:23:33 randal netconsole: 683.788084] [<ffffffffa0de3e11>] ? zpl_iterate+0x41/0x70 [zfs]
Nov 20 11:23:33 randal netconsole: 683.788159] [<ffffffff811aaa76>] ? iterate_dir+0xa6/0xe0
Nov 20 11:23:33 randal netconsole: 683.788234] [<ffffffff811a2299>] ? generic_permission+0xf9/0x1a0
Nov 20 11:23:33 randal netconsole: 683.788309] [<ffffffff811aabd3>] ? SyS_getdents+0x93/0x120
Nov 20 11:23:33 randal netconsole: 683.788382] [<ffffffff811aa8c0>] ? filldir64+0x110/0x110
Nov 20 11:23:33 randal netconsole: 683.788455] [<ffffffff814faf79>] ? system_call_fastpath+0x16/0x1b
Something else interesting to note that's not necessarily related to the zfs xattr=sa stuff:
In our initial tests we rsynced from another zfs fs (e.g. /tank/working with xattr=on) but neglected to --exclude .zfs. To our surprise, not only did /tank/test/.zfs/snapshot/ contain the data from /tank/working/.zfs/snapshot/, but zfs list actually reported them as snapshots. Of course they weren't real snapshots of this filesystem (this wasn't a cloned fs, nor did we take any snapshots of it), so the zfs destroy command had some trouble with those. We eventually just rm -rf /tank/test/.zfs/snapshot/* and were able to zfs destroy tank/test. We were wondering if the following commit might be related to that:
https://github.com/zfsonlinux/zfs/commit/c944be5d7e619ad7259fdb5802f0f9bd90a22488
Here are the (even more) broken package and kernel version details:
ii debian-zfs 7~wheezy amd64 Native ZFS filesystem metapackage for Debian.
ii dkms 2.2.0.3-1.2+zfs6 all Dynamic Kernel Module Support Framework
ii libzfs2 0.6.3-21~7b2d78~wheezy amd64 Native ZFS filesystem library for Linux
ii spl 0.6.3-13~917fef~wheezy amd64 Solaris Porting Layer user-space utilities for Linux
ii spl-dkms 0.6.3-13~917fef~wheezy all Solaris Porting Layer kernel modules for Linux
ii zfs-dbg 0.6.3-21~7b2d78~wheezy amd64 Debugging symbols for ZFS userland libraries and tools
ii zfs-dkms 0.6.3-21~7b2d78~wheezy all Native ZFS filesystem kernel modules for Linux
ii zfsonlinux 3~wheezy all archive.zfsonlinux.org trust package
ii zfsutils 0.6.3-21~7b2d78~wheezy amd64 command-line tools to manage ZFS filesystems
Linux randal 3.16-0.bpo.3-amd64 #1 SMP Debian 3.16.5-1~bpo70+1 (2014-11-02) x86_64 GNU/Linux
Anyways, we're going to downgrade back to the "stable" 0.6.3 release in your wheezy (non-daily) repo and then attempt to apply the patches you suggested (for the shrinker fix and the xattr=sa fix) separately, directly to the local DKMS source, to see how that responds.
Let us know if you need any more details.
Thanks, Brian
I was not able to cleanly apply the patch that you provided against the dev source (https://github.com/zfsonlinux/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd) to the current stable source. Having a look at it, I came up with the following; I was curious whether it looks correct to you:
--- dbuf.c 2014-05-20 02:25:26.000000000 -0500
+++ kens_dbuf.c 2014-11-21 13:37:38.224882327 -0600
@@ -872,18 +872,19 @@
{
dmu_buf_impl_t *db, *db_next;
uint64_t txg = tx->tx_txg;
+ boolean_t freespill = (start == DMU_SPILL_BLKID || end == DMU_SPILL_BLKID);
int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
uint64_t first_l1 = start >> epbs;
uint64_t last_l1 = end >> epbs;
- if (end > dn->dn_maxblkid && (end != DMU_SPILL_BLKID)) {
+ if ( end > dn->dn_maxblkid && !freespill ) {
end = dn->dn_maxblkid;
last_l1 = end >> epbs;
}
dprintf_dnode(dn, "start=%llu end=%llu\n", start, end);
mutex_enter(&dn->dn_dbufs_mtx);
- if (start >= dn->dn_unlisted_l0_blkid * dn->dn_datablksz) {
+ if (start >= dn->dn_unlisted_l0_blkid * dn->dn_datablksz && !freespill) {
/* There can't be any dbufs in this range; no need to search. */
mutex_exit(&dn->dn_dbufs_mtx);
return;
@@ -919,7 +920,7 @@
if (db->db_level != 0)
continue;
dprintf_dbuf(db, "found buf %s\n", "");
- if (db->db_blkid < start || db->db_blkid > end)
+ if (db->db_blkid < start || db->db_blkid > end && !freespill)
continue;
/* found a level 0 buffer in the range */
@uwkhahn Yes, that patch looks good.
@bpkroth I'd be very interested to hear if applying this patch resolves the issue for you.
As for the .zfs/snapshot issue, I can explain what's going on there. A lesser-known feature of ZFS is that you can create/destroy/rename snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory. Because you must have had snapdir=visible set, rsync traversed into this directory and then created it at the destination with mkdir, which created a snapshot. That snapshot of course doesn't match what was at the source, but that's why it happened.
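For example (illustrative dataset name; snapdir=visible just makes the control directory show up in listings):
mkdir /tank/test/.zfs/snapshot/demo     # actually creates the snapshot tank/test@demo
zfs list -t snapshot                    # ...and it shows up here
rmdir /tank/test/.zfs/snapshot/demo     # destroys the snapshot again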
Brian Behlendorf notifications@github.com 2014-12-01 13:43:
@uwkhahn Yes, that patch looks good.
@bpkroth I'd be very interested to hear if applying this patch resolves the issue for you.
We're working on testing it in a VM now. Having a little bit of trouble reproducing the original issue without the patch so far, but with the patch it doesn't appear to fail catastrophically, so that's good at least.
That said, his patch is just the one for fixing the dirty spill block corruption.
It seemed like the rest of the issues we were experiencing with the zfs 7b2d78a and spl 917fef2 builds were related to other code updates that were brought along with them.
As for the .zfs/snapshot issue, I can explain what's going on there. A lesser-known feature of ZFS is that you can create/destroy/rename snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory. Because you must have had snapdir=visible set, rsync traversed into this directory and then created it at the destination with mkdir, which created a snapshot. That snapshot of course doesn't match what was at the source, but that's why it happened.
Huh, that's kinda cool/weird. I'd have expected .zfs/snapshot and all of its children to be read-only.
...
I was about to say that we didn't need snapdir=visible anyways, since the point was to allow users to browse backups for themselves via NFS and that was broken, but it looks like that's about to be fixed too: https://github.com/zfsonlinux/zfs/pull/2797
Thanks for the info.
Cheers, Brian
FYI, we've been running with Ken's patch in [1] and all volumes resynced fresh with xattr=sa now for about a week. Things seem stable and there's no zfs_iput_taskq task spinning.
Thanks, Brian
[1] https://github.com/zfsonlinux/zfs/issues/2863#issuecomment-64027283
We use ZFS in a home-grown backup solution that previously ran on ZFS on Solaris. It is a series of scripts doing periodic rsyncs and ZFS snapshot creation.
Now, in moving it to ZFS on Debian (Wheezy) Linux, I'm seeing a number of errors where a directory created by rsync gives an error when I try to cd to it. For example:
When I try to take a closer look with strace ls etc, I see the following:
The rsync we use looks like:
zfs properties on the filesystem:
Debian relevant zfs packages:
uname info:
Not long after seeing this problem, I will see the kernel crash: