@ltaulell Were any of the filesystems created as ZPL versions < 5 and then upgraded to version 5? If so, I'd say there's a good chance this is zfsonlinux/zfs#2025. The SA magic is totally bogus, so this is not a simple single-bit memory error (and in any case, a system like this surely has ECC).
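For context, the assertion that fires is the SA header magic check in zfs_space_delta_cb(): the first 32 bits of the bonus buffer must equal SA_MAGIC (0x2F505A, decimal 3100762, which matches the VERIFY3 output in the log below). Here is a minimal user-space sketch of that comparison, assuming the sa_hdr_phys_t layout from the ZFS headers of that era; it is an illustration, not the actual kernel code path:

/* sa_magic_check.c - illustrates why the reported value cannot be a
 * single-bit flip of SA_MAGIC.  Build: cc -o sa_magic_check sa_magic_check.c */
#include <stdio.h>
#include <stdint.h>

#define SA_MAGIC 0x2F505A   /* expected SA header magic (decimal 3100762) */

/* Minimal stand-in for the on-disk SA header (cf. sa_impl.h). */
typedef struct sa_hdr_phys {
	uint32_t sa_magic;
	uint16_t sa_layout_info;
	uint16_t sa_lengths[1];
} sa_hdr_phys_t;

/* Count set bits, to measure how far the observed value is from the magic. */
static int popcount32(uint32_t x)
{
	int n = 0;
	for (; x; x &= x - 1)
		n++;
	return n;
}

int main(void)
{
	sa_hdr_phys_t sa = { .sa_magic = 1383495966 };  /* value from the panic */

	printf("expected: 0x%06X (%u)\n", SA_MAGIC, SA_MAGIC);
	printf("observed: 0x%08X (%u)\n", sa.sa_magic, sa.sa_magic);
	/* A single-bit memory error would differ in exactly one bit. */
	printf("bits differing: %d\n", popcount32(sa.sa_magic ^ SA_MAGIC));
	return 0;
}

Twelve bits differ between the observed value and SA_MAGIC, so a flipped bit in RAM could not produce it; the bogus value looks like structured data rather than corruption of the magic field itself.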
@dweeezil: Yes, nearly all filesystems were created on Solaris (zfs v3 or v4), then upgraded after send/recv onto the new systems (Debian/ZoL).
And yes, all the R720xd servers have ECC memory.
Despite the small difference in line numbers (zfs_vfsops.c:351 here vs. zfs_vfsops.c:390 there), I agree with you: this looks like a good match for https://github.com/zfsonlinux/zfs/issues/2025.
Again, if you need anything additional, please ask.
@ltaulell I've not yet had a chance to look into Andriy's analysis in zfsonlinux/zfs#2025, but at first glance it sounds reasonable (I've not studied the code path required to upgrade from DMU_OT_ZNODE to DMU_OT_SA).
In the meantime, given that this is almost certainly not an SPL bug and is very likely the same issue as zfsonlinux/zfs#2025, my suggestion would be to close this issue (zfsonlinux/spl#352) and add your information to zfsonlinux/zfs#2025.
Who does that? 0;-)
Copied to zfsonlinux/zfs#2025 => closing.
SPLError: 2657:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC
These are production servers (HPC center, /home NFS servers) that had been running for about a year and a half without problems; recently we got these messages (this is the one I was able to save):
Mar 31 19:30:59 r720data3 kernel: [ 7563.266511] VERIFY3(sa.sa_magic == 0x2F505A) failed (1383495966 == 3100762)
Mar 31 19:30:59 r720data3 kernel: [ 7563.266599] SPLError: 2593:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC
Mar 31 19:30:59 r720data3 kernel: [ 7563.266630] SPL: Showing stack for process 2593
Mar 31 19:30:59 r720data3 kernel: [ 7563.266639] Pid: 2593, comm: txg_sync Tainted: P W O 3.2.0-4-amd64 #1 Debian 3.2.54-2
Mar 31 19:30:59 r720data3 kernel: [ 7563.266644] Call Trace:
Mar 31 19:30:59 r720data3 kernel: [ 7563.266730] [] ? spl_debug_dumpstack+0x24/0x2a [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266737] [] ? spl_debug_bug+0x7f/0xc8 [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266767] [] ? zfs_space_delta_cb+0xcf/0x150 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266782] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266798] [] ? dmu_objset_userquota_get_ids+0x1b4/0x2ae [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266805] [] ? mutex_lock+0xd/0x2d
Mar 31 19:30:59 r720data3 kernel: [ 7563.266808] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266823] [] ? dnode_sync+0x8d/0x78a [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266833] [] ? buf_hash_remove+0x65/0x91 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266837] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266841] [] ? _cond_resched+0x7/0x1c
Mar 31 19:30:59 r720data3 kernel: [ 7563.266844] [] ? mutex_lock+0xd/0x2d
Mar 31 19:30:59 r720data3 kernel: [ 7563.266857] [] ? dmu_objset_sync_dnodes+0x6f/0x88 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266869] [] ? dmu_objset_sync+0x1f3/0x263 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266878] [] ? arc_cksum_compute+0x83/0x83 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266887] [] ? arc_hdr_destroy+0x1b6/0x1b6 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266903] [] ? dsl_pool_sync+0xbf/0x475 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266923] [] ? spa_sync+0x4f4/0x90b [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266927] [] ? ktime_get_ts+0x5c/0x82
Mar 31 19:30:59 r720data3 kernel: [ 7563.266949] [] ? txg_sync_thread+0x2bd/0x49a [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266969] [] ? txg_thread_wait.isra.2+0x23/0x23 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266975] [] ? thread_generic_wrapper+0x6a/0x75 [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266981] [] ? __thread_create+0x2be/0x2be [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266986] [] ? kthread+0x76/0x7e
Mar 31 19:30:59 r720data3 kernel: [ 7563.266991] [] ? kernel_thread_helper+0x4/0x10
Mar 31 19:30:59 r720data3 kernel: [ 7563.266994] [] ? kthread_worker_fn+0x139/0x139
Mar 31 19:30:59 r720data3 kernel: [ 7563.266997] [] ? gs_change+0x13/0x13
Then all txg_sync threads hang, all knfsd threads hang in uninterruptible state, and the load average goes through the roof => hard reboot.
I can't reproduce the bug on demand, but it appears randomly (on three different servers, all with the same hardware and software configuration) as NFS usage comes and goes (once a week on one server, twice a day on another).
I scrubbed all pools after hangs/reboots => "No known data errors".
Data were imported from older pools (Solaris x86 -> Debian x86_64) via zfs send/recv, then upgraded (via zpool/zfs upgrade).
Maybe related to https://github.com/zfsonlinux/zfs/issues/1303 and https://github.com/zfsonlinux/zfs/issues/2025?
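If the #2025 theory applies here too, the bogus "magic" would simply be old znode-format data being read through the SA header layout: in a pre-v5 bonus buffer the first field is zp_atime, and on little-endian hardware its low 32 bits land exactly where sa_magic sits. Below is a hypothetical illustration (struct heads abridged from the ZPL/SA on-disk headers; interpreting the panic value this way is an assumption on my part, not something verified on these pools):

/* znode_overlay.c - decode the bogus sa_magic as znode atime data.
 * Build: cc -o znode_overlay znode_overlay.c */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

/* Leading fields of the old-style (ZPL < 5) znode bonus buffer. */
typedef struct znode_phys_head {
	uint64_t zp_atime[2];   /* [0] = seconds, [1] = nanoseconds */
	uint64_t zp_mtime[2];
	/* ... remaining znode_phys_t fields omitted ... */
} znode_phys_head_t;

/* SA header that replaces it at ZPL version 5 (cf. sa_impl.h). */
typedef struct sa_hdr_phys {
	uint32_t sa_magic;      /* should be 0x2F505A */
	uint16_t sa_layout_info;
	uint16_t sa_lengths[1];
} sa_hdr_phys_t;

int main(void)
{
	/* Suppose the bonus buffer still holds znode data when it is
	 * (mis)read as an SA header: use the value from the panic as the
	 * atime seconds and see what comes out as sa_magic. */
	znode_phys_head_t zn = { .zp_atime = { 1383495966ULL, 0 } };
	sa_hdr_phys_t sa;
	time_t t = (time_t)1383495966;

	memcpy(&sa, &zn, sizeof (sa));
	printf("sa_magic read from znode data: %u\n", sa.sa_magic);
	printf("same value as a timestamp:     %s", ctime(&t));
	return 0;
}

On a little-endian machine this prints exactly the value from the panic, and as a timestamp it decodes to early November 2013, about a week after baie1/users/phys was created (see the zfs get output below), which is at least consistent with the stale-znode-data theory.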
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)
NAME     SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
baie1   43,5T  21,6T  21,9T  49%  1.00x  ONLINE  -
baie2   43,5T  8,43T  35,1T  19%  1.00x  ONLINE  -
baie3   21,8T  7,57T  14,2T  34%  1.00x  ONLINE  -
front1  21,8T   561G  21,2T   2%  1.00x  ONLINE  -
zpool status baie1
  pool: baie1
 state: ONLINE
  scan: scrub repaired 0 in 17h17m with 0 errors on Sat Mar 29 05:30:59 2014
config:
errors: No known data errors
zpool get all baie1
NAME   PROPERTY               VALUE                 SOURCE
baie1  size                   43,5T                 -
baie1  capacity               49%                   -
baie1  altroot                -                     default
baie1  health                 ONLINE                -
baie1  guid                   14312441928248404290  default
baie1  version                -                     default
baie1  bootfs                 -                     default
baie1  delegation             on                    default
baie1  autoreplace            off                   default
baie1  cachefile              -                     default
baie1  failmode               wait                  default
baie1  listsnapshots          off                   default
baie1  autoexpand             off                   default
baie1  dedupditto             0                     default
baie1  dedupratio             1.00x                 -
baie1  free                   21,9T                 -
baie1  allocated              21,6T                 -
baie1  readonly               off                   -
baie1  ashift                 0                     default
baie1  comment                -                     default
baie1  expandsize             0                     -
baie1  freeing                0                     default
baie1  feature@async_destroy  enabled               local
baie1  feature@empty_bpobj    active                local
baie1  feature@lz4_compress   enabled               local
zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
baie1                14,4T  14,1T  53,9K  none
baie1/users          14,4T  14,1T  53,9K  none
baie1/users/phys     14,4T  5,58T  11,8T  /users/phys
baie2                5,62T  22,9T  53,9K  none
baie2/users          5,62T  22,9T  53,9K  none
baie2/users/geol     5,62T  10,4T  5,60T  /users/geol
baie3                5,04T  9,22T  53,9K  none
baie3/users          5,04T  9,22T  53,9K  none
baie3/users/ilm      2,87T  1,13T  2,83T  /users/ilm
baie3/users/insa     42,0K  1024G  42,0K  /users/insa
baie3/users/ipag      113K  1024G   113K  /users/ipag
baie3/users/lasim    1,49T   522G  1,49T  /users/lasim
baie3/users/lmfa     42,0K  1024G  42,0K  /users/lmfa
baie3/users/lmfaecl   694G   330G   604G  /users/lmfaecl
front1                466G  17,3T  44,8K  none
front1/tmp            466G  17,3T   466G  none
zfs get all baie1/users/phys
NAME              PROPERTY              VALUE                    SOURCE
baie1/users/phys  type                  filesystem               -
baie1/users/phys  creation              dim. oct. 27 10:47 2013  -
baie1/users/phys  used                  14,4T                    -
baie1/users/phys  available             5,58T                    -
baie1/users/phys  referenced            11,8T                    -
baie1/users/phys  compressratio         1.00x                    -
baie1/users/phys  mounted               yes                      -
baie1/users/phys  quota                 20T                      local
baie1/users/phys  reservation           none                     default
baie1/users/phys  recordsize            128K                     default
baie1/users/phys  mountpoint            /users/phys              local
baie1/users/phys  sharenfs              off                      default
baie1/users/phys  checksum              on                       default
baie1/users/phys  compression           off                      default
baie1/users/phys  atime                 off                      inherited from baie1
baie1/users/phys  devices               on                       default
baie1/users/phys  exec                  on                       default
baie1/users/phys  setuid                on                       default
baie1/users/phys  readonly              off                      default
baie1/users/phys  zoned                 off                      default
baie1/users/phys  snapdir               hidden                   default
baie1/users/phys  aclinherit            restricted               default
baie1/users/phys  canmount              on                       default
baie1/users/phys  xattr                 on                       default
baie1/users/phys  copies                1                        default
baie1/users/phys  version               5                        -
baie1/users/phys  utf8only              off                      -
baie1/users/phys  normalization         none                     -
baie1/users/phys  casesensitivity       sensitive                -
baie1/users/phys  vscan                 off                      default
baie1/users/phys  nbmand                off                      default
baie1/users/phys  sharesmb              off                      default
baie1/users/phys  refquota              none                     default
baie1/users/phys  refreservation        none                     default
baie1/users/phys  primarycache          all                      default
baie1/users/phys  secondarycache        all                      default
baie1/users/phys  usedbysnapshots       2,62T                    -
baie1/users/phys  usedbydataset         11,8T                    -
baie1/users/phys  usedbychildren        0                        -
baie1/users/phys  usedbyrefreservation  0                        -
baie1/users/phys  logbias               latency                  default
baie1/users/phys  dedup                 off                      default
baie1/users/phys  mlslabel              none                     default
baie1/users/phys  sync                  standard                 default
baie1/users/phys  refcompressratio      1.00x                    -
baie1/users/phys  written               36,0G                    -
baie1/users/phys  snapdev               hidden                   default
These are production servers, so I can't play with debugging, but if you need any additional data, please ask; I'll do what I can.
Regards, Loïs