openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

xattr=sa: listxattr EFAULT #1978

Closed: chrisrd closed this 10 years ago

chrisrd commented 10 years ago

Unfortunately it seems there's still something wrong with xattr=sa :-(

I had given up on using xattrs at all for a while, but with the recent commits to resolve #1890 and #1648 I was optimistic that xattr=sa would be trouble-free. However, I'm running into #503 again.

On linux-3.10.22 with zfsonlinux/zfs@c1ab64d3931cbab45fbb197588cb27cb6fd10c33 plus cherry-picked nedbass/zfs@915aa8eda7158c69a7cd0f5621e7707eb6deafd2 and nedbass/zfs@00edcdc31a6297cd4255750a10ab4e6d54ef775f (the build was made before those commits landed in master), I'm getting:

$ getfattr -d file
getfattr: file: Bad address
$ strace getfattr -d file
...
listxattr("file", (nil), 0) = -1 EFAULT (Bad address)
...

All the xattrs on the filesystem are the result of rsync --fake-super, which means each file and directory has a single xattr that looks like user.rsync.%stat="100666 0,0 65534:65534".

The filesystem was created after installing the above ZoL, and with xattr=sa set at create. There are no symlinks in the filesystem, as shown by:

$ find * -type l | wc -l
0

Per #503, I used this stap script:

function trace(entry_p, extra) {
  %( $# > 1 %? if (tid() in trace) %)
  printf("%s%s%s %s\n",
         thread_indent (entry_p),
         (entry_p>0?"->":"<-"),
         probefunc (),
         extra)
}

%( $# > 1 %?
global trace
probe $2.call {
  trace[tid()] = 1
}
probe $2.return {
  delete trace[tid()]
}
%)

probe module("spl").function("*").call        { trace(1, $$parms) }
probe module("spl").function("*").return      { trace(-1, $$return) }

probe module("zavl").function("*").call       { trace(1, $$parms) }
probe module("zavl").function("*").return     { trace(-1, $$return) }

probe module("zcommon").function("*").call    { trace(1, $$parms) }
probe module("zcommon").function("*").return  { trace(-1, $$return) }

probe module("zfs").function("*").call        { trace(1, $$parms) }
probe module("zfs").function("*").return      { trace(-1, $$return) }

probe module("znvpair").function("*").call    { trace(1, $$parms) }
probe module("znvpair").function("*").return  { trace(-1, $$return) }

probe module("zunicode").function("*").call   { trace(1, $$parms) }
probe module("zunicode").function("*").return { trace(-1, $$return) }

...plus this patch ('cos I couldn't get stap to dump the info itself):

--- a/module/nvpair/nvpair.c
+++ b/module/nvpair/nvpair.c
@@ -3272,7 +3272,10 @@ nvs_xdr_nvpair(nvstream_t *nvs, nvpair_t *nvp, size_t *size)
                        return (EFAULT);

                if (*size > NVS_XDR_MAX_LEN(bytesrec.xc_num_avail))
+{
+printk("<4>otndbg 3: size=%ld > NVS_XDR_MAX_LEN=%ld\n", *size, NVS_XDR_MAX_LEN(bytesrec.xc_num_avail));
                        return (EFAULT);
+}
                break;
        }

The stap trace of a failing getfattr shows:

     0 getfattr(882):->zfs_sa_get_xattr zp=?
     8 getfattr(882): ->sa_size hdl=? attr=? size=?
    12 getfattr(882):  ->sa_attr_op hdl=? bulk=? count=? data_op=? tx=?
    15 getfattr(882):  <-sa_size return=0x0
    17 getfattr(882): <-zfs_sa_get_xattr return=0x0
    21 getfattr(882): ->sa_spill_alloc flags=?
   298 getfattr(882): <-zfs_sa_get_xattr return=0xffffc90049651158
   305 getfattr(882): ->sa_lookup hdl=? attr=? buf=? buflen=?
   310 getfattr(882):  ->sa_lookup_impl hdl=? bulk=? count=?
   314 getfattr(882):   ->sa_attr_op hdl=? bulk=? count=? data_op=? tx=?
   318 getfattr(882):   <-sa_lookup_impl return=0x0
   321 getfattr(882):  <-sa_lookup return=0x0
   324 getfattr(882): <-zfs_sa_get_xattr return=0x0
   330 getfattr(882): ->nvlist_unpack buf=? buflen=? nvlp=? kmflag=?
   335 getfattr(882):  ->nvlist_xunpack buf=? buflen=? nvlp=? nva=?
   340 getfattr(882):   ->nvlist_xalloc nvlp=? nvflag=? nva=?
   345 getfattr(882):    ->nv_alloc_sleep_spl nva=? size=?
   350 getfattr(882):    <-nvlist_xalloc return=0xffff8801c380e240
   355 getfattr(882):    ->nv_mem_zalloc.isra.13 size=? nvp=?
   358 getfattr(882):     ->nv_alloc_sleep_spl nva=? size=?
   362 getfattr(882):     <-nv_mem_zalloc.isra.13 return=0xffff8814d9de9320
   366 getfattr(882):    <-nvlist_xalloc return=0xffff8814d9de9320
   369 getfattr(882):   <-nvlist_xunpack return=0x0
   373 getfattr(882):   ->nvlist_common nvl=? buf=? buflen=? encoding=? nvs_op=?
   378 getfattr(882):    ->nvs_operation nvs=? nvl=? buflen=?
   383 getfattr(882):     ->nvs_xdr_nvlist nvs=? nvl=? size=?
   387 getfattr(882):     <-nvs_operation return=0x0
   392 getfattr(882):     ->nvs_xdr_nvpair nvs=? nvp=? size=?
   431 getfattr(882):     <-nvs_operation return=0xe                     #### oops
   434 getfattr(882):    <-nvlist_common return=0xe
   436 getfattr(882):   <-nvlist_xunpack return=0xe
   439 getfattr(882):   ->nvlist_free nvl=?
   443 getfattr(882):    ->nv_free_spl nva=? buf=? size=?
   445 getfattr(882):    <-nvlist_free
   447 getfattr(882):    ->nv_free_spl nva=? buf=? size=?
   449 getfattr(882):    <-nvlist_free
   451 getfattr(882):   <-nvlist_xunpack
   453 getfattr(882):  <-nvlist_unpack return=0xe
   455 getfattr(882): <-zfs_sa_get_xattr return=0xe
   460 getfattr(882): ->sa_spill_free obj=?
   462 getfattr(882): <-zfs_sa_get_xattr
   465 getfattr(882):<-zpl_xattr_list return=0xe

And kern.log has:

Dec 16 22:45:01 b5 kernel: [12651.703724] <4>otndbg 3: size=64 > NVS_XDR_MAX_LEN=24

The filesystem was all written in a single rsync session and ended up with 713474 files and 109375 directories, each with a single xattr as above; 88 of the files (and no directories) have corrupt xattrs. I have other filesystems created in exactly the same way whose xattrs are all good.

86 of the 88 bad files have size=64 per the debug message above; the other 2 have size=56.
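
For what it's worth, the affected files are easy to enumerate because getfattr fails loudly on them; a scan along these lines is enough (a sketch; the mountpoint here is illustrative):

# corrupt xattrs show up as getfattr errors on stderr, so discard stdout and
# keep only the error lines
getfattr -R -d --absolute-names /pool/corrupt 2>&1 >/dev/null | grep 'Bad address'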

I'll put the filesystem aside for the moment in case there's anything else useful to be dug out of it.

dweeezil commented 10 years ago

@chrisrd Could you run zdb -dddd <pool>/<fs> <inode_number> on a few of the bad ones? I'd be interested to know the dnode flags and whether zdb has any success in decoding the SA xattr.

behlendorf commented 10 years ago

@chrisrd Thanks for the detailed debugging; I'd also be very curious to see the zdb output. I suspect it may be able to decode the xattr, because the stack you posted suggests a bug in the SPL's XDR implementation, specifically in xdr_int(). A different implementation is used in the user-space library, which may not suffer from this.

chrisrd commented 10 years ago

Zdb output below.

I had this occur whilst trying to dump:

# time zdb -dddd pool/corrupt 528903
zdb: can't open 'pool/corrupt': Input/output error
real    2m55.939s

I then tried to do exactly the same under strace, and it succeeded. And running it again without strace also succeeded. The other dumps each took around 1m30 - 1m45, so the time taken for the failed dump was also an outlier. There were no kernel messages.

# zdb -dddd pool/corrupt 48147
Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     48147    1    16K  14.0K  11.5K  14.0K  100.00  ZFS plain file
                                        208   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /c/xxxxxx/zzzzzzzzzzzzz/aaaaaaaaaaaaa/bbbbbbbbbbb/ccccccccccccccccccccccccccccc.xlsx
        uid     1068
        gid     1000
        atime   Thu Dec 12 12:38:52 2013
        mtime   Fri Aug 23 11:46:46 2013
        ctime   Thu Dec 12 12:38:52 2013
        crtime  Thu Dec 12 12:38:52 2013
        gen     260973
        mode    100676
        size    14174
        parent  47114
        links   1
        pflags  40800000004

# zdb -dddd pool/corrupt 1126091
Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
   1126091    1    16K    512  1.50K    512  100.00  ZFS plain file
                                        208   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /c/xxxxxx/yyyyy/ddd/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/ffff/gggggggg.dwl
        uid     1068
        gid     1000
        atime   Thu Dec 12 14:59:42 2013
        mtime   Thu Aug 29 08:45:09 2013
        ctime   Thu Dec 12 14:59:42 2013
        crtime  Thu Dec 12 14:59:42 2013
        gen     264669
        mode    100676
        size    52
        parent  1125032
        links   1
        pflags  40800000004

# zdb -dddd pool/corrupt 528903
Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
    528903    1    16K    512  1.50K    512  100.00  ZFS plain file
                                        208   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /c/xxxxxx/yyyyy/hhhhhhhhhhh/iiiiiiiiiiiiiiiiiiiiii/jjjjjjjjjjjjj/kkkkkkkkkkk/llllllll.dwl2
        uid     1068
        gid     1000
        atime   Thu Dec 12 14:32:28 2013
        mtime   Tue Aug 13 13:54:12 2013
        ctime   Thu Dec 12 14:32:28 2013
        crtime  Thu Dec 12 14:32:28 2013
        gen     263787
        mode    100676
        size    201
        parent  525019
        links   1
        pflags  40800000004

c.f. non-corrupt:

# zdb -dddd pool/corrupt 1170239
Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
   1170239    2    16K   128K   705K   768K  100.00  ZFS plain file
                                        256   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 5
        path    /c/xxxxxx/yyyyy/mmmmmmmmmmmmmmm/nnnnnnnnnnnnnnn/oooooooooooooo/ppppppp/qqqqqqqqq/rrrrrrrrrrrrrrrrrrrrrrrrrrr.pdf
        uid     1068
        gid     1000
        atime   Thu Dec 12 15:12:42 2013
        mtime   Wed Oct  9 12:26:00 2013
        ctime   Thu Dec 12 15:12:42 2013
        crtime  Thu Dec 12 15:12:42 2013
        gen     265032
        mode    100666
        size    723562
        parent  1168977
        links   1
        pflags  40800000004
        SA xattrs: 80 bytes, 1 entries

                user.rsync.%stat = 100666 0,0 65534:65534

I hadn't noticed before, but the inode numbers come in groups: I imagine that's because they're allocated in sequential order (are they?), so the groups are time-correlated. Perhaps this relates to activity on the system when they were being created, e.g. a higher-than-normal load (the box is generally relatively lightly loaded). If so, this may point to some failure under memory pressure, or a locking problem?

All corrupt inodes:

48147 48882 48885 48886 48887 48888 48889 49829 49835 49836 49840 51000 51001 
51002 62052 62157 62342 62343 62347 63175
511698 511699 528084 528085 528088 528089 528092 528093 528096 528097 528100 
528101 528902 528903 
1126091 1126092 1128335 1129232 1129233 1129700 1129701 1130283 1130284 1133232 
1133233 1133609 1133610 1143125 1143394 1143395 1145824 1145825 1157664 1157665 
1158676 1158677 1159241 1159242 1160401 1160402 1160906 1160907 1163982 1163983 
1167063 1167064 1167684 1167685 1167686 1167687 1167818 1167819 1169091 1169092 
1170100 1170237 1170345 1170352 1170875 1170876 1177353 1177873 1181391 1181392 
1181624 1181625 1182908 1182909

Also worthy of note, I recreated the entire filesystem last night in exactly the same manner as the original, and the new version doesn't have any xattr corruption.

dweeezil commented 10 years ago

@chrisrd Could you please try "zdb -dddddd" (6 d's) from dweeezil/zfs@803578e on some of the bad ones?

dweeezil commented 10 years ago

@chrisrd As a follow-on to my request: since zdb isn't even trying to print any SA xattrs on the bad objects, the nvlist is likely corrupted, and I'm hoping we can figure out the nature of the corruption by dumping its contents. Unfortunately, this simplistic debugging code assumes we are getting a reasonable length, but it's a start. One other useful piece of information from your zdb output is that we don't seem to have any spill blocks in use (which I'd not expect to see with the relatively short xattrs used by rsync).

It looks like all rsync ought to be doing with fake-super is a whole bunch of lsetxattr() calls. I do find it very interesting that you couldn't reproduce the problem after trying the exact same thing again.

As to the inode number, on a clean filesystem with xattr=sa, I'd expect them to be allocated pretty much sequentially. The fact that your bad objects were in clusters is interesting and likely relevant but it's not much to go on at this point.

Can you think of anything else your system may have been doing at the time you ran the first rsync that created the corrupted filesystem? Also, was that first rsync writing to a brand-new, clean filesystem?

chrisrd commented 10 years ago

G'day @dweeezil - zdb -dddddd output below, for the same inodes per previous.

Interestingly (scarily?!?), I got an error the first time dumping the first inode (and there was no other output):

zdb: ../../module/zfs/dsl_deadlist.c:103: Assertion `0 == dmu_bonus_hold(os, object, dl, &dl->dl_dbuf) (0x0 == 0x34)' failed.

The next dump of the same inode worked.

For each rsync the filesystem was freshly created just prior to the rsync (it's part of a script so it's consistent).

The box only receives these rsyncs, and runs ceph on some other disks (4 x osds, with the storage attached to VMs running on a separate box). The first (corrupted) rsync was running at 8am Sunday - there wasn't much else going on!

zdb -dddddd output:

Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     48147    1    16K  14.0K  11.5K  14.0K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        208   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /c/xxxxxx/aaaaaaaaaaaaa/bbbbbbbbbbbbb/ccccccccccc/ddddddddddddddddddddddddddddd.xlsx
    uid     1068
    gid     1000
    atime   Thu Dec 12 12:38:52 2013
    mtime   Fri Aug 23 11:46:46 2013
    ctime   Thu Dec 12 12:38:52 2013
    crtime  Thu Dec 12 12:38:52 2013
    gen 260973
    mode    100676
    size    14174
    parent  47114
    links   1
    pflags  40800000004
    dump_znode_sa_xattr: sa_xattr_size=24 sa_size error=0
    SA packed dump sa_xattr_size=24: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\074\000\000\000\100\000\000\000\020
    SA xattr dump:
    dump_znode_sa_xattr: nvlist_unpack error=14
Indirect blocks:
               0 L0 1:480119cb800:4000 3800L/2c00P F=1 B=260973/260973

        segment [0000000000000000, 0000000000003800) size 14.0K

Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
    528903    1    16K    512  1.50K    512  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        208   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /c/xxxxxx/eeeee/fffffffffff/gggggggggggggggggggggg/hhhhhhhhhhhhh/iiiiiiiiiiiiiiiiiiii.dwl2
    uid     1068
    gid     1000
    atime   Thu Dec 12 14:32:28 2013
    mtime   Tue Aug 13 13:54:12 2013
    ctime   Thu Dec 12 14:32:28 2013
    crtime  Thu Dec 12 14:32:28 2013
    gen 263787
    mode    100676
    size    201
    parent  525019
    links   1
    pflags  40800000004
    dump_znode_sa_xattr: sa_xattr_size=24 sa_size error=0
    SA packed dump sa_xattr_size=24: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\074\000\000\000\100\000\000\000\020
    SA xattr dump:
    dump_znode_sa_xattr: nvlist_unpack error=14
Indirect blocks:
               0 L0 1:4881970a000:800 200L/200P F=1 B=263787/263787

        segment [0000000000000000, 0000000000000200) size   512

Dataset pool/corrupt [ZPL], ID 9742, cr_txg 260211, 235G, 822855 objects, rootbp DVA[0]=<2:5ba3cfed000:800> DVA[1]=<3:5db5f9bd000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=433336L/433336P fill=822855 cksum=1c78f55814:983527c54d9:1b62513eb4d86:37e8fdb1a6b035

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
   1126091    1    16K    512  1.50K    512  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        208   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /c/xxxxxx/jjjjj/kkk/llllllllllllllllllllllllllllllllllll/mmmm/nnnnnnnn.dwl
    uid     1068
    gid     1000
    atime   Thu Dec 12 14:59:42 2013
    mtime   Thu Aug 29 08:45:09 2013
    ctime   Thu Dec 12 14:59:42 2013
    crtime  Thu Dec 12 14:59:42 2013
    gen 264669
    mode    100676
    size    52
    parent  1125032
    links   1
    pflags  40800000004
    dump_znode_sa_xattr: sa_xattr_size=24 sa_size error=0
    SA packed dump sa_xattr_size=24: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\074\000\000\000\100\000\000\000\020
    SA xattr dump:
    dump_znode_sa_xattr: nvlist_unpack error=14
Indirect blocks:
               0 L0 4:43ad7e7a800:800 200L/200P F=1 B=264669/264669

        segment [0000000000000000, 0000000000000200) size   512

dweeezil commented 10 years ago

@chrisrd That's good information and shows that the xattr SA is being truncated in some manner. The part that is there appears to be totally correct but it's cut off where the actual string array should begin. I'll poke around the xattr SA code a bit to see whether I can identify any place where this might happen. Since you were able to perform the rsync again from scratch with no problems, however, it sounds like we're looking at a race condition or some other odd corner case.
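
For reference, here is a byte-by-byte reading of the 24-byte SA packed dump above, treating it as an XDR-encoded nvlist; the labels are my interpretation of the encoding, not output from any tool, and the printf/od line is just a convenient way to re-dump the bytes:

printf '\001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\074\000\000\000\100\000\000\000\020' | od -An -tx1
# 01 01 00 00   nvlist pack header: encoding NV_ENCODE_XDR, host-endian flag, 2 reserved bytes
# 00 00 00 00   nvlist version (0)
# 00 00 00 01   nvl_nvflag (NV_UNIQUE_NAME)
# 00 00 00 3c   first nvpair: encoded size = 60
# 00 00 00 40   first nvpair: decoded size = 64  (the "size=64" in the kern.log message above)
# 00 00 00 10   name length = 16 = strlen("user.rsync.%stat")
# ...and the stream ends there, before the name and the string value, i.e. the
# nvlist really is cut off right where the string data should begin.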

One other interesting thing is that the two corrupted objects you looked at have 0676 perms, which is a bit odd. I don't suppose that, by some chance, all the corrupted objects have that mode? There's no reason that a particular mode should cause this problem, but it might be a clue that points us in the right direction.

chrisrd commented 10 years ago

@dweeezil Good pickup. All of the source files with mode 0676 or 0670 (find * -type f \( -perm 0676 -o -perm 0670 \)), and only those files, have corrupt xattrs on the destination.

Note: the "source" files have been dumped onto an ext4 fs from external sources, using rsync --fake-super, thus creating user.rsync.%stat xattrs. These files are then copied to the ZFS filesystems using rsync -XX to copy the user.rsync.%stat xattrs (and any other xattrs that might have snuck in - but there aren't any others on the filesystems in question).

The plot thickens...

I started seeing corruption on the filesystem I reported above as having been recreated with no corruption. I thought I might have gone nuts somewhere along the line and somehow reported the wrong thing, so I once again recreated that filesystem (zfs create + rsync), along with 2 other filesystems that were also experiencing the corruption. After these filesystems were filled I checked for corruption (using getfattr -Rd *) and found none.

Then, after a reboot, all of those filesystems are exhibiting corrupt xattrs!

I.e. something in memory is not making it to disk, or is only partly making it to disk.

Hmmm, that's ringing some bells... I vaguely recall some other issue where this was happening. @behlendorf - do you remember something like this?

More weirdness...

The first-reported filesystem is still exhibiting the same symptoms after the recreation and reboot, i.e. all and only source files with mode 0676 or 0670 have corrupted xattrs on the destination. One of the other filesystems is the same (except there are 9 source files with mode 0676 or 0670 without xattrs, so these files don't have any xattrs on the destination to be corrupt). For the remaining filesystem, it's all and only files with mode 0650 that are corrupt, and files with 0676 or 0670 aren't corrupt!

It doesn't appear to be the length of the xattr. I investigated one of the filesystems (with mode 0676 or 0670 corruptions) and found (counting the entire length of name and value dumped by getfattr -d, e.g. user.rsync.%stat="100676 0,0 1003:104" == 38):

Corrupt xattrs (length on source)
--------------
  Count Length
      3 34
   1380 38

Non-corrupt xattrs (length on source)
--------------
  Count Length
    943 33
    168 34
    472 35
  75856 38
      1 39
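
For the record, the tallies above can be produced with something along these lines (a sketch: the staging path is made up, and the count is of the whole name="value" line as dumped by getfattr -d, matching the 38-character example above):

getfattr -R -d --absolute-names /ext4/staging 2>/dev/null \
  | awk '/^user\.rsync\.%stat=/ { print length($0) }' \
  | sort -n | uniq -c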

Where to from here??

dweeezil commented 10 years ago

@chrisrd That's all very good information. After noticing the file mode issue, I did actually try recreating the problem by hand but wasn't able to. I think the next step is to strace rsync to find its exact sequence of system calls involving the creation of the file, the setting of the xattr and any other inode-changing operations such as chmod, etc. I gather the first rsync (with fake-super) is running as a non-root user. Is the second (-XX) rsync running as root?

Your observation about seeing the corruption following a reboot certainly indicates cached-but-not-flushed metadata. You could check the output of dbufstat.py -bx or one of its other forms and see what's being cached (look for objects matching the bad inodes).

I'm going to try to reproduce this today.

chrisrd commented 10 years ago

@dweeezil Yup, first rsync is non-root, second is root.

I removed one of the files ("UPS.txt") with the corruption, then straced an rsync of just that file using the same args as for the full set (-aH --delete --numeric-ids -XX --inplace). I think the relevant part of the strace (egrep 'UPS\.txt|chmod|xattr|stat|open' - let me know if you'd like more) is:

10604 01:53:41.340033 open("UPS.txt", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000077>
10604 01:53:41.340159 open("UPS.txt", O_WRONLY|O_CREAT, 0600) = 3 <0.000248>
10604 01:53:41.340787 close(3)          = 0 <0.000010>
10604 01:53:41.340835 lstat("UPS.txt", {st_mode=S_IFREG|0600, st_size=52, ...}) = 0 <0.000011>
10604 01:53:41.340898 llistxattr("UPS.txt", 0xc0a410, 1024) = 0 <0.000013>
10604 01:53:41.340938 lsetxattr("UPS.txt", "user.rsync.%stat", "100650 0,0 43874288:43874288", 28, 0) = 0 <0.000037>
10604 01:53:41.341007 utimensat(AT_FDCWD, "UPS.txt", {UTIME_NOW, {1295479844, 0}}, AT_SYMLINK_NOFOLLOW) = 0 <0.000016>
10604 01:53:41.341051 lchown("UPS.txt", 1079, 1000) = 0 <0.000019>
10604 01:53:41.341092 chmod("UPS.txt", 0650) = 0 <0.000023>

OK, it looks like I can consistently reproduce the problem with that file. I'm not too sure what to look at in the dbufstat.py -bx output, but here's a record of removing the file, recreating it with rsync, seeing the good xattr, re-mounting the filesystem, and then seeing a corrupt xattr.

...actually, the zdb output shows it's already stuffed even before the remount.

# f="/path/UPS.txt"
# rm "${f}"
# rsync -aH --delete --numeric-ids -XX --inplace --protect-args host:"${f}" "${f}"
# getfattr -d "${f}"
getfattr: Removing leading '/' from absolute path names
# file: /path/UPS.txt
user.rsync.%stat="100650 0,0 43874288:43874288"
# ino=$(stat -c %i "${f}"); echo ino=$ino
ino=24577
# dbufstat.py -bx | egrep -w "objset|${ino}"
           pool  objset      object  level     blkid        offset   dbsize  meta  state  dbholds  list    atype  index     flags  count    asize      access    mru   gmru    mfu   gmfu     l2  l2_dattr  l2_asize                l2_comp  aholds                        dtype                        btype  data_bs  meta_bs   bsize    lvls  dholds    blocks         dsize  
           pool   13879       24577      0         0             0      512     0      4        0     1        0     -1     0x260      1      512  4298179477      0      0      0      0      0  16270848       512       ZIO_COMPRESS_OFF       0   DMU_OT_PLAIN_FILE_CONTENTS                    DMU_OT_SA      512      16K     208       1       2         1           512  
# time zdb -dddddd pool/filesystem ${ino}
Dataset filesystem [ZPL], ID 13879, cr_txg 476433, 13.5G, 22432 objects, rootbp DVA[0]=<5:64000131800:800> DVA[1]=<0:66000131000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=488220L/488220P fill=22432 cksum=16d458dc08:76bbfba9d76:14a5aedb85836:28a1281bede5b2

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     24577    1    16K    512  1.50K    512  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        208   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /path/UPS.txt
    uid     1079
    gid     1000
    atime   Thu Dec 19 02:31:48 2013
    mtime   Thu Jan 20 10:30:44 2011
    ctime   Thu Dec 19 02:31:48 2013
    crtime  Thu Dec 19 02:31:48 2013
    gen 488220
    mode    100650
    size    52
    parent  61
    links   1
    pflags  40800000004
    dump_znode_sa_xattr: sa_xattr_size=24 sa_size error=0
    SA packed dump sa_xattr_size=24: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\100\000\000\000\110\000\000\000\020
    SA xattr dump:
    dump_znode_sa_xattr: nvlist_unpack error=14
Indirect blocks:
               0 L0 5:6400012c800:800 200L/200P F=1 B=488220/488220

        segment [0000000000000000, 0000000000000200) size   512

# zfs unmount pool/filesystem
# zfs mount pool/filesystem
# getfattr -d "${f}"
getfattr: /path/UPS.txt: Bad address
# ino=$(stat -c %i "${f}"); echo ino=$ino
ino=24577
# zdb -dddddd pool/filesystem ${ino}
Dataset pool/filesystem [ZPL], ID 13879, cr_txg 476433, 13.5G, 22432 objects, rootbp DVA[0]=<5:64000131800:800> DVA[1]=<0:66000131000:800> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=488220L/488220P fill=22432 cksum=16d458dc08:76bbfba9d76:14a5aedb85836:28a1281bede5b2

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     24577    1    16K    512  1.50K    512  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        208   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /path/UPS.txt
    uid     1079
    gid     1000
    atime   Thu Dec 19 02:31:48 2013
    mtime   Thu Jan 20 10:30:44 2011
    ctime   Thu Dec 19 02:31:48 2013
    crtime  Thu Dec 19 02:31:48 2013
    gen 488220
    mode    100650
    size    52
    parent  61
    links   1
    pflags  40800000004
    dump_znode_sa_xattr: sa_xattr_size=24 sa_size error=0
    SA packed dump sa_xattr_size=24: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\100\000\000\000\110\000\000\000\020
    SA xattr dump:
    dump_znode_sa_xattr: nvlist_unpack error=14
Indirect blocks:
               0 L0 5:6400012c800:800 200L/200P F=1 B=488220/488220

        segment [0000000000000000, 0000000000000200) size   512

# dbufstat.py -bx | egrep -w "objset|${ino}"
           pool  objset      object  level     blkid        offset   dbsize  meta  state  dbholds  list    atype  index     flags  count    asize      access    mru   gmru    mfu   gmfu     l2  l2_dattr  l2_asize                l2_comp  aholds                        dtype                        btype  data_bs  meta_bs   bsize    lvls  dholds    blocks         dsize  
#

dweeezil commented 10 years ago

@chrisrd Thanks for the detailed information. I've made a trivial reproducer and am looking into it now.

dweeezil commented 10 years ago

@chrisrd I found the cause of the problem: the xattr SA is clobbered when a mode change occurs that requires additional (or possibly fewer) DACL entries. Unfortunately, the stock zdb program doesn't show them, but my WIP enhanced version does (at least the count).

A mode 0644 file looks like this:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        11    1    16K    512    512    512  100.00  ZFS plain file
                                        168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /blah644
    uid     0
    gid     0
    atime   Wed Dec 18 11:23:10 2013
    mtime   Wed Dec 18 11:23:10 2013
    ctime   Wed Dec 18 11:23:10 2013
    crtime  Wed Dec 18 11:23:10 2013
    gen 61547
    mode    100644
    size    5
    parent  4
    links   1
    pflags  40800000004
    ndacl   3

but changing the mode to 0606 requires 2 additional DACL entries:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        12    1    16K    512    512    512  100.00  ZFS plain file
                                        184   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /blah606
    uid     0
    gid     0
    atime   Wed Dec 18 11:23:27 2013
    mtime   Wed Dec 18 11:23:27 2013
    ctime   Wed Dec 18 11:23:27 2013
    crtime  Wed Dec 18 11:23:27 2013
    gen 61551
    mode    100606
    size    5
    parent  4
    links   1
    pflags  40800000004
    ndacl   5

which is why the bonus size increases from 168 to 184.

I've not yet figured out why this breaks the xattr but it's most certainly the cause.

EDIT: This is likely the cause of some of the other issues related to a corrupted bonus buffer when SA xattrs are in use.

dweeezil commented 10 years ago

I've got a fix for this. Pull request to come shortly.

chrisrd commented 10 years ago

@dweeezil Fantastic! I can stop trying to find my way through the twisty little passages of the ZFS SA code and on-disk layout. For what it's worth, a simple reproducer I've been using, with your zdb patch dweeezil/zfs@803578e, is:

zfs=pool/fs
f=/${zfs}/foo 

rm $f
touch $f
setfattr -n 'user.rsync.%stat' -v '100650 0,0 43874288:43874288' $f
zdb -dddddd ${zfs} $(stat -c %i $f)   ### before corruption
chmod 0650 $f
zdb -dddddd ${zfs} $(stat -c %i $f)   ### after corruption

behlendorf commented 10 years ago

@chrisrd @dweeezil Thanks for getting to the root of this issue! I'm happy to have another variable length SA issue put to rest.

chrisrd commented 10 years ago

@behlendorf I'm kinda hoping that might be the last of them!

behlendorf commented 10 years ago

@chrisrd Me too.