openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Kernel General Protection Fault #2471

Closed BobBurrow closed 10 years ago

BobBurrow commented 10 years ago

Hi,

I'm running ZoL (munix9 0.6.3+git.1404249623 from http://software.opensuse.org/package/zfs) on an openSUSE 12.3 SMP machine (i7-2600K, 16 GB memory, kernel 3.7.10-1.36-default) using raidz (5 x 2 TB) with about 3 TB of data. I'm getting frequent general protection faults (GPFs).

The information from the rpm package:

    Name        : zfs
    Version     : 0.6.3+git.1404249623
    Release     : 2.1
    Architecture: x86_64
    Install Date: Mon 07 Jul 2014 02:55:10 PM BRT
    Group       : System/Filesystems
    Size        : 845791
    License     : CDDL-1.0
    Signature   : RSA/SHA1, Wed 02 Jul 2014 01:27:02 PM BRT, Key ID 8e1d7d30529999cc
    Source RPM  : zfs-0.6.3+git.1404249623-2.1.src.rpm
    Build Date  : Wed 02 Jul 2014 01:26:37 PM BRT
    Build Host  : cloud101
    Relocations : (not relocatable)
    Vendor      : obs://build.opensuse.org/home:munix9:zfs
    URL         : http://zfsonlinux.org/
    Summary     : Commands to control the kernel modules

The kernel trace:

    [20196.311120] general protection fault: 0000 [#1] SMP
    [20196.332122] Modules linked in: fuse bnep bluetooth rfkill af_packet sep3_15(O) pax(O) w83627ehf hwmon_vid quota_v2 quota_tree snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss acpi_cpufreq mperf coretemp snd kvm_intel kvm ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 i915 xts iTCO_wdt drm_kms_helper drm i2c_algo_bit gf128mul cypress_m8 joydev hid_logitech usbserial gpio_ich soundcore iTCO_vendor_support video firewire_ohci sg snd_page_alloc sata_sil24 firewire_core aic7xxx crc_itu_t scsi_transport_spi lpc_ich mei mfd_core button i2c_i801 pcspkr e1000e serio_raw shpchp pci_hotplug microcode zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) autofs4 hid_generic hid_logitech_dj btrfs zlib_deflate libcrc32c usbhid dm_mod linear raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 raid10 crc32c_intel xhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh fan processor ata_piix ata_generic thermal thermal_sys
    [20196.461045] CPU 0
    [20196.461185] Pid: 8273, comm: rsync Tainted: P O 3.7.10-1.36-default #1 /DH67GD
    [20196.516542] RIP: 0010:[] [] spl_kmem_cache_alloc+0x43/0x840 [spl]
    [20196.544519] RSP: 0018:ffff88016bc59638 EFLAGS: 00010292
    [20196.572368] RAX: 00000000bc51ef78 RBX: 906604ebe10ab941 RCX: ffffffffa040b120
    [20196.600770] RDX: 00000001f39be000 RSI: 0000000000000230 RDI: 906604ebe10ab941
    [20196.628937] RBP: 0000000000000230 R08: 7d9ab06276beabce R09: ffff88027b0041c0
    [20196.657079] R10: ffff88016bc596c0 R11: 906604ebe10ac9a9 R12: ffff8803e2bff270
    [20196.685330] R13: ffff8803e2bff2a0 R14: ffffc9006ed5d180 R15: 0000000000000000
    [20196.713739] FS: 00002aac522ad680(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
    [20196.742562] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [20196.770960] CR2: 00002b645d8c9000 CR3: 00000001b7930000 CR4: 00000000000407f0
    [20196.799886] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [20196.828291] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [20196.856923] Process rsync (pid: 8273, threadinfo ffff88016bc58000, task ffff88027b0041c0)
    [20196.885847] Stack:
    [20196.914190] ffff88041fdf6c20 0000000000000000 0000000000000202 ffffffffa033893c
    [20196.943586] 0000000000000000 0000000000df6c28 0000000000000246 ffffffffa0338ecb
    [20196.972379] ffff8803f02bd2c8 ffffffffa0139165 ffff8802ed4115d8 ffff8801d5d2db80
    [20197.001510] Call Trace:
    [20197.030529] [] arc_get_data_buf.isra.21+0x365/0x4c0 [zfs]
    [20197.060109] [] arc_buf_alloc+0xe4/0x120 [zfs]
    [20197.088948] [] arc_read+0x218/0xa10 [zfs]
    [20197.117489] [] dbuf_read+0x28b/0x830 [zfs]
    [20197.145484] [] dmu_spill_hold_by_dnode+0x4e/0x170 [zfs]
    [20197.173551] [] dmu_spill_hold_existing+0x15b/0x170 [zfs]
    [20197.201752] [] sa_get_spill.part.10+0x20/0x80 [zfs]
    [20197.229861] [] sa_attr_op+0x30f/0x420 [zfs]
    [20197.257985] [] sa_lookup+0x40/0x60 [zfs]
    [20197.286175] [] zfs_dirent_lock+0x322/0x640 [zfs]
    [20197.314674] [] zfs_get_xattrdir+0x68/0x160 [zfs]
    [20197.343166] [] zfs_lookup+0x1dd/0x350 [zfs]
    [20197.371343] [] __zpl_xattr_get+0x85/0x220 [zfs]
    [20197.399524] [] zpl_xattr_get+0x66/0x140 [zfs]
    [20197.426893] [] zpl_get_acl+0xe6/0x230 [zfs]
    [20197.453484] [] zpl_xattr_acl_get+0x48/0x100 [zfs]
    [20197.479967] [] generic_getxattr+0x50/0x80
    [20197.506105] [] vfs_getxattr+0xa9/0xd0
    [20197.532288] [] getxattr+0xb2/0x1f0
    [20197.558497] [] sys_getxattr+0x4d/0x80
    [20197.584746] [] system_call_fastpath+0x1a/0x1f
    [20197.610974] [<00002aac515ef0b9>] 0x2aac515ef0b8
    [20197.636870] Code: ac 24 90 00 00 00 48 89 fb 4c 89 a4 24 98 00 00 00 4c 89 ac 24 a0 00 00 00 89 f5 4c 89 b4 24 a8 00 00 00 4c 89 bc 24 b0 00 00 00 ff 87 68 10 00 00 f6 87 48 10 00 00 80 0f 84 91 00 00 00 4c
    [20197.664984] RIP [] spl_kmem_cache_alloc+0x43/0x840 [spl]
    [20197.692580] RSP
    [20197.923717] ---[ end trace 110658111cda017b ]---

I had been using ZFS for several months with a raid10 configuration without any problems. Recently I switched to raidz and set xattr=sa and acltype=posixacl.
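For reference, the property changes described above amount to something like the following: standard `zfs set` invocations against the Data pool shown below. This is a sketch, not necessarily the exact commands that were run:

    # Store extended attributes as system attributes (SAs) in the
    # file's dnode rather than in a hidden xattr directory; this
    # exercises the SA/spill-block code path seen in the trace above.
    zfs set xattr=sa Data

    # POSIX ACLs are stored as xattrs, so this makes ACL lookups
    # (e.g. by rsync or ls -l) go through the same path.
    zfs set acltype=posixacl Data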

The zfs configuration:

      pool: Data
     state: ONLINE
      scan: scrub in progress since Mon Jul  7 14:21:03 2014
        744G scanned out of 1.58T at 10.8M/s, 23h3m to go
        0 repaired, 45.94% done
    config:

    NAME                                          STATE     READ WRITE CKSUM
    Data                                          ONLINE       0     0     0
      raidz1-0                                    ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WMC4M2517589  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0722959  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68AX9N0_WD-WCC300084811  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68AX9N0_WD-WCC300078775  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3147849  ONLINE       0     0     0

errors: No known data errors

zfs get all Data:

    NAME  PROPERTY              VALUE                  SOURCE
    Data  type                  filesystem             -
    Data  creation              Fri Jul  4 22:39 2014  -
    Data  used                  3.25T                  -
    Data  available             5.82T                  -
    Data  referenced            2.00T                  -
    Data  compressratio         1.50x                  -
    Data  mounted               yes                    -
    Data  quota                 none                   default
    Data  reservation           none                   default
    Data  recordsize            128K                   default
    Data  mountpoint            /Data                  default
    Data  sharenfs              off                    default
    Data  checksum              on                     default
    Data  compression           lz4                    local
    Data  atime                 on                     default
    Data  devices               on                     default
    Data  exec                  on                     default
    Data  setuid                on                     default
    Data  readonly              off                    default
    Data  zoned                 off                    default
    Data  snapdir               hidden                 default
    Data  aclinherit            restricted             default
    Data  canmount              on                     default
    Data  xattr                 sa                     local
    Data  copies                1                      default
    Data  version               5                      -
    Data  utf8only              off                    -
    Data  normalization         none                   -
    Data  casesensitivity       sensitive              -
    Data  vscan                 off                    default
    Data  nbmand                off                    default
    Data  sharesmb              off                    local
    Data  refquota              none                   default
    Data  refreservation        none                   default
    Data  primarycache          all                    default
    Data  secondarycache        all                    default
    Data  usedbysnapshots       0                      -
    Data  usedbydataset         2.00T                  -
    Data  usedbychildren        1.25T                  -
    Data  usedbyrefreservation  0                      -
    Data  logbias               latency                default
    Data  dedup                 on                     local
    Data  mlslabel              none                   default
    Data  sync                  standard               default
    Data  refcompressratio      1.58x                  -
    Data  written               2.00T                  -
    Data  logicalused           4.65T                  -
    Data  logicalreferenced     3.02T                  -
    Data  snapdev               hidden                 default
    Data  acltype               posixacl               local
    Data  context               none                   default
    Data  fscontext             none                   default
    Data  defcontext            none                   default
    Data  rootcontext           none                   default
    Data  relatime              off                    default

zfs list:

    NAME         USED   AVAIL  REFER  MOUNTPOINT
    Data         3.25T  5.82T  2.00T  /Data
    Data/Frames  602G   5.82T  601G   /Frames
    Data/home    668G   5.82T  668G   /home-zfs

A Google search turned up old issues related to Xen, but I'm running a stock kernel.

Any solutions?

BobBurrow commented 10 years ago

Just to update: after a kernel and zfs package update, and after setting xattr=off on the zfs pool, I still got a kernel general protection fault.

Kernel:

    Linux ewald.base.ufsm.br 3.7.10-1.36-default #1 SMP Thu Jun 12 10:14:12 UTC 2014 (fcb6f8f) x86_64 x86_64 x86_64 GNU/Linux

    Source Timestamp: 2014-05-08 02:09:34 +0200
    GIT Revision: 5978d00ca60519ecd208c475160e6b57b244419d
    GIT Branch: openSUSE-12.3
    Distribution: openSUSE 12.3

    Name        : kernel-default
    Version     : 3.7.10
    Release     : 1.36.1
    Architecture: x86_64
    Install Date: Mon 07 Jul 2014 01:32:15 PM BRT
    Group       : System/Kernel
    Size        : 160254303
    License     : GPL-2.0
    Signature   : RSA/SHA256, Tue 01 Jul 2014 06:13:53 AM BRT, Key ID b88b2fd43dbdc284
    Source RPM  : kernel-default-3.7.10-1.36.1.nosrc.rpm
    Build Date  : Fri 13 Jun 2014 12:09:57 PM BRT
    Build Host  : build33
    Relocations : (not relocatable)
    Packager    : http://bugs.opensuse.org
    Vendor      : openSUSE
    URL         : http://www.kernel.org/
    Summary     : The Standard Kernel
    Description : The standard kernel for both uniprocessor and multiprocessor systems.

ZFS:

    Name        : zfs
    Version     : 0.6.3+git.1404335069
    Release     : 3.1
    Architecture: x86_64
    Install Date: Thu 10 Jul 2014 08:25:37 PM BRT
    Group       : System/Filesystems
    Size        : 845791
    License     : CDDL-1.0
    Signature   : RSA/SHA1, Wed 09 Jul 2014 11:21:14 AM BRT, Key ID 8e1d7d30529999cc
    Source RPM  : zfs-0.6.3+git.1404335069-3.1.src.rpm
    Build Date  : Wed 09 Jul 2014 11:20:48 AM BRT
    Build Host  : build22
    Relocations : (not relocatable)
    Vendor      : obs://build.opensuse.org/home:munix9:zfs
    URL         : http://zfsonlinux.org/
    Summary     : Commands to control the kernel modules
    Description : This package contains support utilities for the zfs file system.
    Distribution: home:munix9:zfs / openSUSE_12.3

    [  747.143234] general protection fault: 0000 [#1] SMP
    [  747.156482] Modules linked in: af_packet sep3_15(O) pax(O) w83627ehf hwmon_vid quota_v2 quota_tree snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss acpi_cpufreq mperf coretemp kvm_intel snd i915 kvm drm_kms_helper ghash_clmulni_intel drm aesni_intel firewire_ohci cypress_m8 ablk_helper cryptd lrw soundcore aes_x86_64 aic7xxx xts i2c_algo_bit iTCO_wdt zfs(PO) video gf128mul usbserial snd_page_alloc gpio_ich iTCO_vendor_support sg scsi_transport_spi firewire_core crc_itu_t sata_sil24 button serio_raw i2c_i801 e1000e lpc_ich pcspkr mfd_core shpchp mei pci_hotplug microcode zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) autofs4 btrfs zlib_deflate libcrc32c usbhid dm_mod linear raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 raid10 crc32c_intel xhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh fan processor ata_piix ata_generic thermal thermal_sys
    [  747.299985] CPU 1
    [  747.300119] Pid: 3658, comm: rsync Tainted: P O 3.7.10-1.36-default #1 /DH67GD
    [  747.334111] RIP: 0010:[] [] spl_kmem_cache_alloc+0x43/0x840 [spl]
    [  747.351819] RSP: 0018:ffff8804058b7638 EFLAGS: 00010292
    [  747.369518] RAX: 00000001f39be000 RBX: dededede00000000 RCX: ffffffffa0591120
    [  747.387478] RDX: ffff880159ab6900 RSI: 0000000000000230 RDI: dededede00000000
    [  747.405460] RBP: 0000000000000230 R08: 7d9ab06276beabce R09: ffff8804054b26c0
    [  747.423568] R10: ffff8804058b76c0 R11: dededede00001068 R12: ffff880159cc9660
    [  747.441564] R13: ffff880159cc9690 R14: ffffc90070878180 R15: 0000000000000000
    [  747.459514] FS: 00002aae9516f680(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000
    [  747.477294] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  747.494811] CR2: 00002aaea1c6efb8 CR3: 0000000405967000 CR4: 00000000000407e0
    [  747.512369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  747.530042] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [  747.547346] Process rsync (pid: 3658, threadinfo ffff8804058b6000, task ffff8804054b26c0)
    [  747.564903] Stack:
    [  747.582167] ffff88041fdf6c20 0000000000000000 0000000000000202 ffffffffa04be93c
    [  747.600119] ffff88040b328200 0000000000df6c28 0000000000000246 ffffffffa04beecb
    [  747.617884] ffff8803ed4aa5d8 ffffffffa0111165 ffff8801b44bfad8 ffff88015aeada18
    [  747.635680] Call Trace:
    [  747.653379] [] arc_get_data_buf.isra.21+0x365/0x4c0 [zfs]
    [  747.671660] [] arc_buf_alloc+0xe4/0x120 [zfs]
    [  747.689960] [] arc_read+0x218/0xa10 [zfs]
    [  747.708168] [] dbuf_read+0x28b/0x830 [zfs]
    [  747.726297] [] dmu_spill_hold_by_dnode+0x4e/0x170 [zfs]
    [  747.744582] [] dmu_spill_hold_existing+0x15b/0x170 [zfs]
    [  747.762901] [] sa_get_spill.part.10+0x20/0x80 [zfs]
    [  747.781249] [] sa_attr_op+0x30f/0x420 [zfs]
    [  747.799549] [] sa_lookup+0x40/0x60 [zfs]
    [  747.817748] [] zfs_dirent_lock+0x322/0x640 [zfs]
    [  747.836025] [] zfs_get_xattrdir+0x68/0x160 [zfs]
    [  747.854217] [] zfs_lookup+0x1dd/0x350 [zfs]
    [  747.872420] [] __zpl_xattr_get+0x85/0x220 [zfs]
    [  747.890774] [] zpl_xattr_get+0x66/0x140 [zfs]
    [  747.909108] [] zpl_get_acl+0xe6/0x230 [zfs]
    [  747.927490] [] zpl_xattr_acl_get+0x48/0x100 [zfs]
    [  747.945933] [] generic_getxattr+0x50/0x80
    [  747.964318] [] vfs_getxattr+0xa9/0xd0
    [  747.982556] [] getxattr+0xb2/0x1f0
    [  748.000361] [] sys_getxattr+0x4d/0x80
    [  748.017776] [] system_call_fastpath+0x1a/0x1f
    [  748.034519] [<00002aae944b10b9>] 0x2aae944b10b8
    [  748.050758] Code: ac 24 90 00 00 00 48 89 fb 4c 89 a4 24 98 00 00 00 4c 89 ac 24 a0 00 00 00 89 f5 4c 89 b4 24 a8 00 00 00 4c 89 bc 24 b0 00 00 00 ff 87 68 10 00 00 f6 87 48 10 00 00 80 0f 84 91 00 00 00 4c
    [  748.083651] RIP [] spl_kmem_cache_alloc+0x43/0x840 [spl]
    [  748.099355] RSP
    [  748.188531] ---[ end trace d2a157239599df08 ]---

BobBurrow commented 10 years ago

Looking further into this problem, I came across a file on which ls -l and getfacl always cause a general protection fault; it might be related to issue #2021.

stat Artigos

      File: ‘Artigos’
      Size: 72         Blocks: 129        IO Block: 16384  directory
    Device: 20h/32d    Inode: 3218536     Links: 6
    Access: (2770/drwxrws---)  Uid: ( 1050/  rafael)   Gid: ( 1050/  rafael)
    Access: 2014-07-07 14:02:09.751559886 -0300
    Modify: 2013-05-22 08:46:09.000000000 -0300
    Change: 2014-07-05 03:48:10.084292280 -0300
     Birth: -

zdb -dddd Data/home 3218536

Dataset Data/home [ZPL], ID 40, cr_txg 194, 670G, 1783055 objects, rootbp DVA[0]=<0:210015d4000:2000> DVA[1]=<0:38002c8e000:2000> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=142347L/142347P fill=1783055 cksum=188f7e8bc9:82202a0d000:17a43bb9da846:30feafd39a37ac

Object  lvl   iblk   dblk  dsize  lsize   %full  type

    3218536    2    16K    16K    64K    32K  100.00  ZFS directory
                                        292   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED SPILL_BLKPTR
        dnode maxblkid: 1
        path    /lmi/Rafael/Artigos
        uid     1050
        gid     1050
        atime   Mon Jul  7 14:02:09 2014
        mtime   Wed May 22 08:46:09 2013
        ctime   Sat Jul  5 03:48:10 2014
        crtime  Sat Jul  5 03:47:43 2014
        gen     9333
        mode    42770
        size    72
        parent  2664342
        links   6
        pflags  40800000044
    Segmentation fault

dmesg shows:

    [ 1317.240178] zdb[4040]: segfault at 2ab18576a158 ip 00002ab18545d587 sp 00007fff9d98bc80 error 4 in libzpool.so.2.0.0[2ab1853b6000+121000]

It appears the ACL is corrupted for this file.
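One possible way to look for other affected objects without risking further kernel faults is to probe each inode with zdb, which runs in userspace and, as shown above, merely segfaults on a corrupt dnode. A rough sketch, assuming inode numbers reported by find correspond to zdb object IDs for this dataset (normally the case for ZPL datasets) and that Data/home is mounted at /home-zfs:

    # Probe every object in Data/home with zdb; a non-zero exit
    # status (such as the segfault above) flags a suspect dnode.
    find /home-zfs -xdev -printf '%i %p\n' | while read -r obj path; do
        zdb -dddd Data/home "$obj" >/dev/null 2>&1 \
            || echo "suspect object $obj: $path"
    done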

dweeezil commented 10 years ago

@BobBurrow There were a number of bugs related to variable-sized system attributes (SAs) prior to 0.6.3 which have since been fixed. Any pool written by a pre-0.6.3 release could potentially have some corrupted SAs, which is what I suspect is the case with yours. The main clue in your case is "SPILL_BLKPTR": the last major bug fixed involved certain SAs which needed to be expanded to include a spill block.

Your only fix is to use 0.6.3 and copy the files to a new filesystem (skipping the troublesome files). You don't need to create a new pool; the problem is in the individual files' dnodes. Other classic symptoms are symlinks with bogus sizes or link counts, and any other file that causes a panic when you access it.
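A sketch of that recovery path, assuming the old filesystem is mounted at /home-zfs and that the known-bad paths have been collected in an exclude list (the names Data/home-new and bad-files.txt are illustrative):

    # Create a fresh filesystem on the same pool; dnodes written by
    # the fixed 0.6.3 code will be laid out correctly.
    zfs create Data/home-new

    # Copy everything across, skipping the files known to trip the
    # corrupted-SA code path. -A/-X preserve ACLs and xattrs;
    # --exclude-from reads one pattern per line.
    rsync -aAXv --exclude-from=bad-files.txt /home-zfs/ /Data/home-new/

    # Once the copy is verified, retire the old filesystem.
    zfs destroy Data/home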

BobBurrow commented 10 years ago

Thanks a lot for the tip, @dweeezil. I believe I was already using ZoL 0.6.3 to create and populate the filesystem (Version : 0.6.3+git.1404249623). Maybe fixes were made that munix9 didn't incorporate into his rpm.

I still have all the data on the original drives. I can destroy the zfs pool and recreate it, copying the files to the new pool/filesystem, even dropping the ACLs, which I can easily recreate.
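If the existing ACLs on the healthy files are worth keeping rather than recreating by hand, one possible approach is to dump them with getfacl and replay them with setfacl after the copy. A sketch (the file name home-acls.txt is illustrative, and the dump should only be taken after the troublesome paths have been moved aside, since reading their ACLs panics the kernel):

    # Dump all ACLs beneath the mount point; -R recurses and -p
    # keeps absolute paths so the dump can be replayed from anywhere.
    getfacl -R -p /home-zfs > home-acls.txt

    # After the pool is recreated and the data copied back:
    setfacl --restore=home-acls.txt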

behlendorf commented 10 years ago

This issue is believed to be resolved in master.