openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS data corruption #3990

Closed gkkovacs closed 5 years ago

gkkovacs commented 8 years ago

I have installed Proxmox 4 (zfs 0.6.5) on a server using ZFS RAID10 in the installer. The disks are brand new (4x 2TB, attached to the Intel motherboard SATA connectors), and there are no SMART errors / reallocated sectors on them. I have run a memtest for 30 minutes, and everything seems fine hardware-wise.

zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             92.8G  3.42T    96K  /rpool
rpool/ROOT        59.8G  3.42T    96K  /rpool/ROOT
rpool/ROOT/pve-1  59.8G  3.42T  59.8G  /
rpool/swap        33.0G  3.45T   144K  -

After restoring a few VMs (a hundred or so gigabytes), the system reported read errors in some files. Scrubbing the pool shows permanent read errors in the recently restored guest files:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h4m with 1 errors on Thu Nov  5 21:30:02 2015
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/501/vm-501-disk-1.qcow2

If I delete the VMs and scrub the pool again, the errors are gone. If I restore new VMs, the errors are back. Anybody have any idea what could be happening here?

zdb -mcv rpool
Traversing all blocks to verify checksums and verify nothing leaked ...

loading space map for vdev 1 of 2, metaslab 30 of 116 ...
50.1G completed ( 143MB/s) estimated time remaining: 0hr 01min 09sec        zdb_blkptr_cb: Got error 52 reading <50, 61726, 0, 514eb>  -- skipping
59.8G completed ( 145MB/s) estimated time remaining: 0hr 00min 00sec
Error counts:

    errno  count
       52  1

    No leaks (block sum matches space maps exactly)

    bp count:          928688
    ganged count:           0
    bp logical:    115011845632      avg: 123843
    bp physical:   62866980352      avg:  67694     compression:   1.83
    bp allocated:  64258899968      avg:  69193     compression:   1.79
    bp deduped:             0    ref>1:      0   deduplication:   1.00
    SPA allocated: 64258899968     used:  1.61%

    additional, non-pointer bps of type 0:       4844
    Dittoed blocks on same vdev: 297
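
For reference, the restore / scrub / delete cycle described above boils down to something like this (a sketch only; the VMID, archive path and storage name are placeholders, assuming the standard Proxmox qmrestore / qm tools):

    # restore a guest backup onto the ZFS-backed storage, then scrub and inspect
    qmrestore /mnt/backup/vzdump-qemu-501.vma.lzo 501 --storage local
    zpool scrub rpool
    zpool status -v rpool                        # CKSUM counters plus the "Permanent errors" file list

    # remove the guest again; once the corrupt blocks are freed the errors clear on the next scrubs
    qm destroy 501
    zpool scrub rpool && zpool status -v rpool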
kernelOfTruth commented 8 years ago

What's new in Proxmox VE 4.0

  • Debian Jessie 8.2 and 4.2 Linux kernel
  • Linux Containers (LXC)
kernelOfTruth commented 8 years ago

any mcelog or edac-utils errors ? dmesg errors for ECC RAM ?

are the SATA-cables fine ? (your SMART tests indicate: yes)

any known NCQ and/or firmware issues with the drives ? what drives ?

hardware (mainboard) info ? kernel info ?

update to 0.6.5.2 or 0.6.5.3 available ?
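
For reference, most of the information asked for above can be collected with something like the following (a sketch; package names are the usual Debian ones and device names are examples):

    mcelog --client || cat /var/log/mcelog      # machine-check events, if the daemon is running
    edac-util -v                                # ECC/EDAC counters (needs edac-utils and a supported chipset)
    dmesg | grep -iE 'mce|edac|ecc|ata[0-9]'    # kernel-side RAM / SATA error messages
    smartctl -a /dev/sda | grep -iE 'model|firmware|reallocated|pending|crc'
    dmidecode -t baseboard -t memory            # mainboard revision and DIMM details
    uname -a; dmesg | grep -i 'ZFS: Loaded'     # kernel and ZFS module versions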

gkkovacs commented 8 years ago

I have installed mcelog, and repeated the restore of a couple of VMs until the errors surface again. /var/log/mcelog is empty.

dmesg has nothing suspicious, here is /var/log/messages from last boot: http://pastebin.com/7PkNUnxr

SATA cables should be fine; I have even initiated SMART self-diagnostics of the drives, and nothing pops up. The drives are brand new 2TB Toshiba DT01ACA200 models. Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU. There is also a 3TB Toshiba HDD (not part of the pool), and a Samsung SSD for ZIL / L2ARC (not used at the moment).

According to messages: ZFS: Loaded module v0.6.5.2-55_g61d75b3. Will try upgrading to a later release and retest.

tomposmiko commented 8 years ago

You use zfs 0.6.5.

0d00e812d9f3780803e390ab52a6963482f0ab1d

Isn't that what you're looking for?

gkkovacs commented 8 years ago

@kernelOfTruth I have upgraded to 0.6.5.3, rebooted the system and repeated the restores on an error-free pool.

Nov  6 13:14:28 proxmox3 kernel: [    5.358851] ZFS: Loaded module v0.6.5.3-1, ZFS pool version 5000, ZFS filesystem version 5

Unfortunately, the file corruption is happening again, mcelog is still empty.

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h3m with 1 errors on Fri Nov  6 13:38:47 2015
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/400/vm-400-disk-1.qcow2

Any other ideas?

tomposmiko commented 8 years ago

OK, I wasn't attentive enough and thought the error was inside the VM... and in addition you use a file image for the VM. Sorry about that.

Are you sure it's not a HW failure? My bet also would be on that.

gkkovacs commented 8 years ago

@kernelOfTruth @tomposmiko

After the upgrade to 0.6.5.3 proved ineffective, I have removed the trays and backplane from the server, and connected the drives directly to the mainboard with new, shorter SATA3 cables. Still, the errors come back.

What I don't understand is this: if a cable were bad and only introduced an error occasionally, it still would not be able to corrupt both copies of the actual data (we are running RAID10, so all data is mirrored). So how is it possible that after copying a single large file to the array, there is suddenly a "permanent error" in it, meaning ZFS is unable to correct it even though it has TWO COPIES of it? Thinking about it logically, this issue can't be caused by cables or drives, since the chance of two cables or two drives causing an error in the exact same place at the same time is essentially zero (hence the corrupted data would be correctable).

On the Proxmox forum people keep saying this could easily be a memory (or even CPU) error, but I kind of doubt that, since this server has been running Windows with Intel RAID for two years without any issues, and nothing has changed apart from the drives.

I'm baffled at the moment, not sure what to do next. If you (or anyone) got any more ideas about what and how to test, I certainly welcome them.

kernelOfTruth commented 8 years ago

@gkkovacs I had similar errors in the past when my motherboard and processor weren't fully supported by Linux (Haswell and the EDAC components, ECC RAM), and I got errors almost everywhere on mirrors and additional backup media - there were even DIMM errors in dmesg.

1-2 kernel releases later it went away and never came back

Are you running the HD Graphics 3000 of the GPU ?

Please configure your memory and MTRR optimally

like e.g. so via kernel append

enable_mtrr_cleanup mtrr_spare_reg_nr=1 mtrr_gran_size=64K mtrr_chunk_size=16M

so that "lose cover RAM" equals 0G

I doubt that this leads to errors but I've had weird behavior with the iGPU enabled and these errors being shown.
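
A sketch of how that append line can be applied and verified on a Debian/Proxmox install (the parameter values are the ones suggested above):

    # add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
    #   GRUB_CMDLINE_LINUX_DEFAULT="quiet enable_mtrr_cleanup mtrr_spare_reg_nr=1 mtrr_gran_size=64K mtrr_chunk_size=16M"
    update-grub && reboot

    # after the reboot:
    cat /proc/cmdline        # confirm the parameters took effect
    dmesg | grep -i mtrr     # MTRR cleanup result, including the "lose cover RAM" size
    cat /proc/mtrr           # resulting variable-range register layout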

Still you might want to do some memory, CPU or system stress tests,

ZFS is known to stress all components in more intense ways than usual and thus expose hardware defects early on.

Googling for

zdb_blkptr_cb: Got error 52 reading -- skipping

led to a few hits about metadata corruption and some older reports, but it wasn't so clear ...

30 minutes of mem test is clearly too short, it needs at least 3 passes (memtest86+ or other) [http://forum.proxmox.com/threads/24358-BUG-ZFS-data-corruption-on-Proxmox-4?p=122583#post122583],

I've read recommendations of 24-72 hours to be on the safe side (depending on the amount of RAM of course and a sane number of passes, stressing)

Other factors:

Check: Mainboard, PSU, RAM, CPU, iommu, dedicated GPU instead of iGPU, driver issues that lead to data or metadata corruption, slub_nomerge memory_corruption_check=1, NCQ, libata driver, ...

tomposmiko commented 8 years ago

I just noticed, what you wrote:

Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU.

Check your memory seriously. But I personally would not use ZFS with non-ECC RAM.

Anyway, I have similar errors in a machine with poor Seagate SATA disks. In that case those disks go bad and need replacing from time to time.

gkkovacs commented 8 years ago

@kernelOfTruth @tomposmiko

I have done the MTRR kernel configuration, but unfortunately there was no way to get "lose cover RAM" down to zero (the CPU only has 7 MTRR registers and would need 9 or 10 for that). In any case, it did not help the data corruption issue.

Here is the latest /var/log/messages http://pastebin.com/VJRfF3U0

I have tested the RAM with memtest for another 9 hours (no errors of course), but I have found something: if I only use 2 DIMMs, there is no data corruption error. RAM speed (1066 or 1333), size (8 GB, 16 GB, 24 GB, 32 GB) and timings (CL9, CL10) were all tested but don't matter, only the number of DIMMs.

4 DIMMs = data corruption, 2 DIMMs = no problem. The rate is approximately 4-6 checksum errors per 100 GB written (the checksum errors appear on all member drives, therefore they are uncorrectable). Have you seen anything like this?

Later I read about the H67 SATA port issue that plagued early revisions of this chipset; not sure if it applies here, but I have ordered a new motherboard. Fingers crossed.

shoeper commented 8 years ago

Does it matter in which slots you put the DIMMs? Maybe one of the 4 slots is faulty while the RAM itself is fine.

kernelOfTruth commented 8 years ago

Agreed or perhaps the two pairs are incompatible ? Are they both of the same model ? Is the mainboard chipset (or type) known to have issues with the specific RAM you're using ?

gkkovacs commented 8 years ago

@kernelOfTruth @tomposmiko @shoeper

Things are getting more interesting. I have installed a replacement motherboard (Q77 this time), and lo and behold the corruption is still there! So it's not the motherboard / chipset then.

I have done some RAM LEGO again: with 1 or 2 DIMMs (doesn't matter what size or speed or slot, tried several different pairs) there is no corruption, with 4 DIMMs the data gets corrupted. So it's not the RAM then.

What's left, really? The CPU? I started to fiddle with BIOS settings, and when I disabled Turbo and EIST (Enhanced Intel SpeedStep Technology), the corruption became much less likely (at first it seemed to stop entirely)! So it's either a kernel / ZFS regression that happens when the CPU scales up and down AND all the RAM slots are in use, or my CPU is defective. Will test with another CPU this weekend.

aikudinov commented 8 years ago

I've seen a box with 2 identical Kingston DIMMs about 5-7 years ago that caused Windows to crash to a BSOD randomly (maybe once a day, sometimes more or less often); removing either one of them solved the problem. It was a custom-built PC, but it had a warranty, so they said it was some kind of incompatibility and exchanged both DIMMs for a different brand.

gkkovacs commented 8 years ago

@aikudinov

As I wrote above, I have tried several different DIMMs, with varying sizes, speeds and manufacturers. All of them produce the data corruption issue if 4 are installed, and none of them if 1 or 2.

dswartz commented 8 years ago

That has got to be some kind of chipset or BIOS issue?

gkkovacs notifications@github.com wrote:

@aikudinov

As I wrote above, I have tried several different DIMMs, with varying sizes, speeds and manufacturers. All of them produce the data corruption issue if 4 are installed, and none of them if 1 or 2.


gkkovacs commented 8 years ago

@dswartz

Should you have read the thread first, it would have become apparent that I have since replaced the motherboard with another model (first it was H67, now it's Q77), and I also tried a number of BIOS settings. It's NOT a RAM issue, and it's not a chipset / BIOS issue. Will see if it's a CPU issue...

fling- commented 8 years ago

I'm experiencing I/O errors with 4.2, but they disappear after a reboot. No errors with 4.1.

dswartz commented 8 years ago

I thought it might be a RAM boundary issue, but you said you tried different sizes of RAM, so... yeah, at this point a different CPU is the only thing I can think of...

gkkovacs notifications@github.com wrote:

@dswartz

Should you have read the thread first, it would have become apparent that I have since replaced the motherboard with another model (first it was H67, now it's Q77), and I also tried a number of BIOS settings. It's NOT a RAM issue, and it's not a chipset / BIOS issue. Will see if it's a CPU issue...


Stoatwblr commented 8 years ago

On 28/11/15 14:07, dswartz wrote:

Should you have read the thread first, it would have become apparent that I have since replaced the motherboard with another model (first it was H67, now it's Q77), and I also tried a number of BIOS settings. It's NOT a RAM issue, and it's not a chipset / BIOS issue. Will see if it's a CPU issue...

These are consumer-grade chipsets, CPUs, etc. and are more likely to have bit-flip errors than server-grade ones.

Have you run full sets of memory checks? (memtest86+ and friends, multiple iterations over a few days)

Then there's the PSU. I've seen a number of issues of random data loss which were fixed by replacing this. The ability of many to cope with load spikes is surprisingly poor.

gkkovacs commented 8 years ago

@Stoatwblr @kernelOfTruth @behlendorf @dswartz

After weeks of testing, I have concluded that this is most likely a software issue: ZFS on the 4.2 kernel produces irreparable checksum errors on a Sandy Bridge CPU (i5-2500K and i7-2600) and compatible chipset (H67 and Q77) whenever tens to hundreds of gigabytes are written with 4 DIMMs installed; with 2 DIMMs the errors do not appear.

I can't reproduce it on another, similar box with different drives, but another user reported a very similar issue: http://list.zfsonlinux.org/pipermail/zfs-discuss/2015-November/023883.html

Why it's not a faulty disk / cable: All the disks are brand new and have passed SMART self-tests several times (not a single error); the cables were also replaced early on. Please note that checksum errors get created in the same numbers on the mirror members, so the blocks that get written out are already corrupted in memory.

Why it's not a faulty memory module: I have run 4 hours of SMP and 9 hours of single-core memtest86 on the originally installed memory. Also, the errors appear with any kind of memory modules; I have tested at least 5 different pairs of DDR3 DIMMs (3 manufacturers, 4GB and 8GB sizes) in several configurations.

Why it's not a faulty motherboard / chipset / CPU / PSU: I have ordered an Intel Q77 motherboard to replace the ASUS H67 motherboard used previously. After it produced the same errors I have tried another, different CPU as well, same result. I even replaced the PSU with another one, no luck.

I have replaced every single piece of hardware in my machine apart from the drives. The only hardware that is connected to this issue for sure is the number of memory modules installed: 4 DIMMs produce the checksum errors, 2 DIMMs do not.

I am out of options, still looking for ideas on what to test. If a ZFS developer wants to look into this system, I can keep it online for a few days, otherwise I will accept defeat and reinstall it with a HW RAID card and ext4/LVM.

kernelOfTruth commented 8 years ago

@gkkovacs I'm sure the following issue should be fixed by now, right ?

http://techreport.com/news/20326/intel-finds-flaw-in-sandy-bridge-chipsets-halts-shipments

also it was related to the S-ATA bus - so it's highly unlikely that it's that issue

also it occurs on the Q77 chipset and i7-2600 ...

Yes, would be interesting to see what Nemesiz did to fix his problem

gkkovacs commented 8 years ago

@kernelOfTruth

Yes, the H67 SATA issue is fixed, not applicable to Q77. Since last time I tested the following things:

Disable C-states: I have put the intel_idle.max_cstate=0 kernel option into grub, and verified with i7z that the CPU did not go below C1 at all. Unfortunately, the checksum errors still get created.
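
A couple of extra sanity checks for the C-state limit besides i7z (standard cpuidle sysfs paths; nothing here is specific to this board):

    cat /proc/cmdline                                         # confirm intel_idle.max_cstate=0 is on the boot line
    cat /sys/devices/system/cpu/cpuidle/current_driver        # which cpuidle driver is actually in charge
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name   # C-states still offered to the kernel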

Adaptec controller instead of Intel ICH: I have installed an Adaptec SAS RAID controller, configured the disks as simple volumes, and reinstalled Proxmox with ZFS RAID10 (the original setup), so I could eliminate the Intel ICH SATA controller from the mix. Unfortunately, the checksum errors are still there.

behlendorf commented 8 years ago

@gkkovacs I know I'm late to the conversation but it definitely looks like you've eliminated almost everything except software. Have you tried using older Linux kernels and older versions of ZoL to determine when and where this issue was introduced?

gkkovacs commented 8 years ago

@behlendorf

Regarding kernels: I only used the kernel (4.2) that comes with Proxmox 4, because that's the platform our servers are based on. This particular server has since been reinstalled with LVM/ext4, so I can't test it with ZFS anymore.

However, we have another very similar server that's going to be reinstalled, and since Proxmox 3.4 also supports ZFS with its RHEL6-based 2.6.32 kernel, I can try to reproduce the problem on it under both kernels. Will report back in a few days.

kernelOfTruth commented 8 years ago

@gkkovacs the only way to rule out a hardware error (close to 100%; not certain whether Btrfs would stress all components in a similarly strenuous way) would be to have another checksumming filesystem in a similar configuration, one which checksums not only metadata but also data (e.g. Btrfs)

behlendorf commented 8 years ago

@gkkovacs please let us know what you determine because if there is a software bug in ZFS or in the Linux 4.2 kernel we definitely want to know about it. And the best way to determine that is to roll back to an older Linux kernel and/or version of ZFS. Finally, and I'm sure you're aware of this, but if this is a 4.2 kernel issue then you may end up having a similar problem with LVM/ext4 and just no way to detect it.

gkkovacs commented 8 years ago

@kernelOfTruth Unfortunately the Proxmox installer does not support Btrfs, and I don't really have the time or the motivation to test it beyond that, since Btrfs has many other issues that exclude it from our use-case.

@behlendorf I have tested the LVM/ext4 setup extensively by copying several hundred gigabytes to the filesystem (just like I did with ZFS), and compared checksums of the files with checksums computed on the source. Not a single checksum difference was detected, while with ZFS there were already dozens on the same size of data.
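
For reference, that source-vs-destination comparison can be scripted roughly like this (paths are placeholders):

    cd /mnt/source && find . -type f -print0 | xargs -0 sha256sum > /tmp/src.sha256
    rsync -a /mnt/source/ /mnt/destination/
    cd /mnt/destination && sha256sum -c /tmp/src.sha256 | grep -v ': OK$'   # prints only mismatching files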

priyadarshan commented 8 years ago

I am experiencing similar symptoms on an HP z620 workstation with 16GB of ECC RAM.

We have a 12TB pool, created on FreeBSD 10.2. I scrubbed it, with no issues. As we need to use Linux, I did the following:

  1. I have installed Ubuntu 15.10, with latest stable ZOL from ppa, kernel 4.2, on a second boot drive
  2. Imported the pool
  3. Moved some data to it
  4. Scrubbed it again. zpool status reports one drive as unrecoverable. Also, the boot filesystem (ext4) suddenly becomes read-only, cutting off the ssh connection. Even sudo from a local terminal becomes impossible.
  5. But after rebooting in single-user mode, an fsck reports no issues at all.
  6. I then rebooted, this time into FreeBSD, and imported the pool again. zpool status reported a resilvering process (a few hundred KBs).
  7. Then I scrubbed again, and this time zpool status reports no defects.
    sudo zpool status

      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME                                          STATE     READ WRITE CKSUM
        tank                                          ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX11D557YNXY  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX41D95PASRY  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX41D95PAHV2  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNKXL  ONLINE       0     0     0

    errors: No known data errors

I repeated this three times (from 1 to 7), with the exact same results.

For now, we can't use Linux, since ZFS access is a must for us. We need to stay on FreeBSD for a little while longer, but I wanted to report it here for others.

gkkovacs commented 8 years ago

@behlendorf @kernelOfTruth @priyadarshan

To test this issue extensively on different ZFS and kernel versions, we pulled this server from production once again, and thanks to the help of Dietmar Maurer from Proxmox we were able to test the following configurations on the same hardware:

After a clean install of Proxmox on a four-disk ZFS RAID10, we restored a 126GB OpenVZ container backup from NFS, then scrubbed the pool. It looks like the checksum error issue affects all of the tested kernel and ZFS versions, so the problem is most likely in ZFS.
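
The check after each restore (the restores themselves were done through the Proxmox web interface) is just a scrub plus reading the per-vdev counters, roughly:

    zpool scrub rpool
    zpool status rpool | grep scan:      # wait for "scrub repaired ... with N errors"
    zpool status -v rpool                # READ/WRITE/CKSUM per vdev plus the list of damaged files
    zpool clear rpool                    # reset the counters before testing the next configuration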

Hardware is same as before: Q77 motherboard, Core i7-2600 CPU, 4x 8GB RAM, Adaptec 6805E controller used in JBOD/simple volume mode, 4x 2TB Toshiba HDD.

Linux proxmox 2.6.32-39-pve #1 SMP Fri May 8 11:27:35 CEST 2015 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5

     NAME                                             STATE     READ WRITE CKSUM
     rpool                                            ONLINE       0     0    35
       mirror-0                                       ONLINE       0     0    34
         scsi-SAdaptec_Morphed_JBOD_00FABE6527-part2  ONLINE       0     0    42
         scsi-SAdaptec_Morphed_JBOD_01E1CE6527-part2  ONLINE       0     0    44
       mirror-1                                       ONLINE       0     0    36
         scsi-SAdaptec_Morphed_JBOD_025EDA6527        ONLINE       0     0    48
         scsi-SAdaptec_Morphed_JBOD_0347E66527        ONLINE       0     0    45

Linux proxmox 2.6.32-44-pve #1 SMP Sun Jan 17 15:59:36 CET 2016 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.5.4-1, ZFS pool version 5000, ZFS filesystem version 5

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0    30
      mirror-0  ONLINE       0     0    16
        sda2    ONLINE       0     0    27
        sdb2    ONLINE       0     0    31
      mirror-1  ONLINE       0     0    44
        sdc     ONLINE       0     0    53
        sdd     ONLINE       0     0    54

Linux proxmox 4.2.6-1-pve #1 SMP Thu Jan 21 09:34:06 CET 2016 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.5.4-1, ZFS pool version 5000, ZFS filesystem version 5

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0    11
      mirror-0  ONLINE       0     0     8
        sda2    ONLINE       0     0    11
        sdb2    ONLINE       0     0    11
      mirror-1  ONLINE       0     0    14
        sdc     ONLINE       0     0    18
        sdd     ONLINE       0     0    19
kernelOfTruth commented 8 years ago

@gkkovacs it could just be statistical noise & variation, but it's interesting that the newest kernel has fewer checksum errors

also in total the checksum errors appear to have increased significantly compared to your initial report

Thanks for the extensive testing !

Some points:

:balloon: does the possibility exist to connect one or two drives directly to the motherboard, leaving the Adaptec 6805E out of the equation ? ( already has )
:balloon: is there a newer firmware available for that controller ?
:balloon: are we talking about the same system all the time ? (since the beginning the number of checksum errors appears to have increased significantly - which could indicate hardware aging)
:balloon: you mentioned that the harddrives were connected to the motherboard directly (https://github.com/zfsonlinux/zfs/issues/3990#issue-115404878) - still exhibiting this behavior
:balloon: changing the motherboard and processor didn't stop the errors from showing

:balloon: there exist similar (silent data corruption) issues with the same controller - but then turning out to be kernel, VM related data corruption: https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/eHadbC9QR6c , http://list.zfsonlinux.org/pipermail/zfs-discuss/2013-September/011175.html

:question: a more "exotic" question: is the server located near a nuclear or coal power plant somewhere ? ;)
:question: attempt to lessen the memory timings in the BIOS to be less strict; in dual channel (4 RAM modules) the stress is the highest and sometimes tends to cause issues
:question: since the IMC (integrated memory controller) has been on the CPU since Sandy Bridge and Ivy Bridge: could recent microcode updates for the processor help ?

:balloon: I guess testing one harddrive directly connected to the SATA connector of the motherboard would again show checksum errors
:balloon: btrfs probably also would show checksum errors

:exclamation: looks like RAM or flaky hardware (RAM timings, memory controller), etc. to me:

I have done some RAM LEGO again: with 1 or 2 DIMMs (doesn't matter what size or speed or slot, tried several different pairs) there is no corruption, with 4 DIMMs the data gets corrupted. So it's not the RAM then.

:exclamation: suggestion: either try a different brand of memory or memory kits that are said to work together (if not already done), lessen the memory timings, apply a microcode update, etc.

:arrow_backward: ZFS appears to detect a hardware or otherwise caused issue that destroys data and might go unnoticed without further checksumming

(the possibility, of course, exists that it's something else but I wouldn't trust that RAM stack (4 DIMMs) in dual channel in combination with the other hardware)

proceed at own peril

@priyadarshan:

thanks for the report !

Your test-case and usage pattern might expose a hidden data corruption issue in the kernel, or other component: an interesting case was reported in https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/eHadbC9QR6c

Testing with only a single distribution and kernel release limits how conclusively one can claim that ZFS (or here: zfsonlinux) is "not ready" for production, etc.

Hopefully you'll have more luck on your next attempt to test it on Linux

gkkovacs commented 8 years ago

@kernelOfTruth @behlendorf

The number of checksum errors seems to vary with the amount of restored data and the number of files: a 100GB file produces more errors than a 10GB file, but hundreds of files produce even more errors. In the past I tested with restores of KVM machines (a single large qcow2 file), but yesterday I tested all 3 setups by restoring the same 126GB OpenVZ backup, which effectively restores a filesystem to a folder on ZFS, hence the much larger number of errors.

So to reiterate: this is not a hardware issue, this is not a kernel issue, this is most likely a ZFS issue that has come up for other people as well (like @priyadarshan). Instead of blaming hardware, I would welcome more ideas of what to test for, as we will need to put this server back in production pretty soon.

kernelOfTruth commented 8 years ago

@gkkovacs I'm currently out of other ideas, but have you tried current master to see whether that also shows errors ?

alternatively: have you tried building the same modules on your own (thinking about compiler and CFLAGS related issues here) ?

edit:

this

RAM and flaky hardware was discussed and tested for before, I have tried at least three different brands of RAM modules, with different sizes and speeds, in MANY configurations (4x4GB, 4x8GB, 2x4GB + 2x8GB, etc.) and all exhibited the same issue when FOUR modules were installed

and this

haven't tested with btrfs, but have tested with ext4 (copied 1TB+, about 10x as much data as with ZFS testing, and checksummed both the source and the destination, no errors what so ever)

don't match up - ZFS indicates that there's a memory (hardware) issue when all memory banks are occupied,

but ext4 doesn't (checksums before and after)

I wonder why that is

dweeezil commented 8 years ago

Poking my nose into this issue with a bit of trepidation :) After reading the complete set of issue comments, the one statement from @gkkovacs that sticks out to me is "restored from NFS". Is the ZFS system to which they're being restored the NFS client or is it the server (are the files being pulled onto the server or are they being pushed)? It might be interesting to try using SSH for the copy or something else to eliminate NFS completely.

gkkovacs commented 8 years ago

@dweeezil

During testing, we tried moving data to the ZFS pool from NFS (restoring virtual machines and containers via Proxmox web interface), file copy with MC from a single disk, and copy through SSH from another server as well.

@kernelOfTruth @behlendorf

I decided to take the plunge and test with btrfs as well. I have installed Debian Jessie on an SSD, and created an identical 4 disk RAID10 array with btrfs (same server, same disks). Copied 124 GB of VM images to it through SSH, then scrubbed the filesystem:

 # uname -a
 Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux

 # mkfs.btrfs -f -m raid10 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd

 # btrfs scrub status /mnt/bt
 scrub status for c38ef7d3-d42a-4c2d-9907-bd2879a7dbe0
         scrub started at Mon Jan 25 08:58:51 2016 and finished after 340 seconds
         total bytes scrubbed: 244.34GiB with 6 errors
         error details: csum=6
         corrected errors: 2, uncorrectable errors: 4, unverified errors:
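
For completeness, the steps between the mkfs and the scrub status above look roughly like this (the mount point and source host are placeholders; mounting any one member device mounts the whole btrfs array, and -B keeps the scrub in the foreground until it finishes):

 # mkdir -p /mnt/bt && mount /dev/sda /mnt/bt
 # rsync -a -e ssh root@source-host:/var/lib/vz/images/ /mnt/bt/
 # btrfs scrub start -B /mnt/bt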

So it looks like the checksum errors get in fact created on this very system, and for some reason they happen on ZFS much more often than btrfs, and rarely (if ever) on ext4. But the even more interesting questions are:

I am out of ideas. I can try to test with different disks...

richardelling commented 8 years ago

FYI, btrfs has a weaker checksum, crc32c, than any checksum available in ZFS. The weaker checksum is faster, but has a smaller Hamming distance, so one could argue it is best when the blocksize is 64k or less.
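
For context, the checksum algorithm in ZFS is a per-dataset property and can be switched to a stronger one if desired (fletcher4 is the default; a change only affects newly written blocks):

    zfs get checksum rpool
    zfs set checksum=sha256 rpool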

Drives can fail such that one head or bank can have more difficulty reading and correcting data than the others. These failure modes are particularly annoying because the file systems are largely unaware of the physical to logical mapping, even if it can be determined from the OS.

You can also try to look at the drive error counters, using smartctl or sg3_utils

Finally, before replacing the drives, you can try ddrescue to /dev/null and see if it is able to read everything without errors.
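
A plain sequential read pass does much the same job as the ddrescue suggestion and is non-destructive (device names are examples):

    for d in /dev/sd[abcd]; do
        dd if=$d of=/dev/null bs=1M iflag=direct conv=noerror
    done
    smartctl -A /dev/sda | grep -iE 'reallocated|pending|uncorrect|crc'   # re-check the counters afterwards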

gkkovacs commented 8 years ago

@richardelling You should have read this page thoroughly first... your advice (while sound) has nothing to do with the issue that we are facing. We are way beyond smartctl here.

TLDR: We have 4 brand new drives in a server, all showing checksum errors after tens or hundreds of GB are written, but only when 4 memory modules are installed. With only 2 DIMMs there is no error. Motherboard, CPU, PSU and cables were all replaced, at least 4 different speed/size/manufacturer RAM modules were tried in many variations, and both Intel ICH and Adaptec controllers were used, yet the checksum errors are still there when a lot of data is written. Tried different kernels (2.6.32, 3.10 and 4.2), different ZFS versions, and btrfs: all affected.

richardelling commented 8 years ago

I was posting for posterity.

Clearly, your hardware is broken or out of spec.

gkkovacs commented 8 years ago

@richardelling If it's so clear to you, would you mind telling me which part of my hardware is broken? Or alternatively, could you please not hijack this thread with irrelevant comments? This is an issue that proves very hard to diagnose, and we could really use some new ideas.

kernelOfTruth commented 8 years ago

@gkkovacs can you post some more specific data ?

specific motherboard model (revision), specific cpu model (cpuid, /proc/cpuinfo), harddrives, RAM, etc.

Concerning the RAM and motherboard: are the RAM modules on the compatibility list of the mainboard manufacturer ?

Okay, you lessened the timings on the RAM, but did you also try running the RAM slower than spec ?

I've read (anecdotally) in the past about cases where RAM run in a dual channel setup would cause data corruption issues ( not so sure if that was on Linux or Windows ) so running it not at specs but slower would allow it to work again,

not wanting to play the blame game but if all possible combinations don't work - a different motherboard brand (or processor, since we're talking about sandy bridge here all the time) could make a difference.

If I recall correctly the 2 (or few more) examples I read about, the solutions were as follows:

:balloon: running the RAM below spec (timings or frequency)

:balloon: switching to a different motherboard brand

(in some seldom cases a BIOS update could help, well, since the BIOS is a black box and sometimes BUGs are squashed which aren't officially publicized)

@richardelling interesting bit about the crc32c checksum !

It's also important to take btrfs' and ZFS' very nature into account (COW, additional redundancy - in the case of ZFS 2-3 copies of metadata - thus more writes & reads; I don't know the specifics of the internals): they tend to put significantly more stress on the underlying harddrives, which on the one hand ages the components somewhat faster, but also unearths dormant and otherwise unknown data-integrity issues before e.g. S.M.A.R.T. or other tools show them.

priyadarshan commented 8 years ago

@gkkovacs if you are in a position to do it, you could try to install FreeBSD 10.2 (or PC-BSD, which has an easier installer) on a different/temporary disk, import your zpool, and see if that happens again.

As said earlier on, in our case, a 12TB zpool, we were observing the exact same symptoms as you, but switching to FreeBSD fixed them.

richardelling commented 8 years ago

Also, sometimes people set overclocking and forget that it impacts the memory performance as you add DIMMs.

gkkovacs commented 8 years ago

@priyadarshan Can you post your hardware? CPU, motherboard, RAM and HDD specs please! Will see if I have the time to check FreeBSD.

@richardelling @kernelOfTruth When I first encountered this issue, this server had the following specs: ASUS P8H67-V motherboard, rev. 1.04, Intel Core i7-2600 CPU, 4x 8GB DDR3-1333 RAM.

During the investigation of this issue I have upgraded to an Intel DQ77-MK motherboard, tried out an i5-2500K CPU, and installed an Adaptec 6805E SAS RAID controller and a brand new PSU. Both motherboards and the Adaptec controller had their BIOS flashed to latest, also CPU microcode was updated. None of these helped.

Currently there are 4x 8GB DDR3-1333 modules installed, 2 Kingston and 2 Corsair DIMMs IIRC. I had tried 4x 4GB 1333 configuration (Kingmax), 2x 4GB 1600 (Kingston IIRC), 4x 8GB 1600 (Kingston), and many combinations of these. 1 or 2 DIMMs never showed checksum errors, 4 always.

Disks are brand new 2TB Toshiba DT01ACA200 drives. Here is complete lshw output: http://pastebin.com/6D3AnUFS

I am not sure what you mean by running RAM slower: I had certainly tried to run DDR3-1600 and DDR3-1333 RAM modules at 1066 MHz and 800 MHz, also with relaxed timings (CAS latency). None of these had any effect on the errors, only the number of DIMMs installed.

kernelOfTruth commented 8 years ago

@gkkovacs

I am not sure what you mean by running RAM slower: I had certainly tried to run DDR3-1600 and DDR3-1333 RAM modules at 1066 MHz and 800 MHz, also with relaxed timings (CAS latency). None of these had any effect on the errors, only the number of DIMMs installed.

yeah, that :)

Running memory modules of two different brands can work, but doesn't have to.

What type of RAM is that ? (price range: low range, mid range, high end enthusiast),

Value RAM or most consumer RAM is usually fine, but I've read about cases where it tended to cause trouble when fully stressed, which is what @Stoatwblr wrote

What's left, really? The CPU? I started to fiddle with BIOS settings, and when I disabled Turbo and EIST (Enhanced Intel SpeedStep Technology), the corruption became much less likely (at first it seemed to stop entirely)! So it's either a kernel / ZFS regression that happens when the CPU scales up and down AND all the RAM slots are in use, or my CPU is defective. Will test with another CPU this weekend.

To be honest, it looked like that,

but:

Sandy Bridge CPU (i5-2500K and i7-2600) and compatible chipset (H67 and Q77)

What the ... ? Is that a kind of design issue of the Sandy Bridge CPU series ?

Well, then I took a look at:

http://ark.intel.com/products/52209/Intel-Core-i5-2500-Processor-6M-Cache-up-to-3_70-GHz http://ark.intel.com/products/52214/Intel-Core-i7-2600K-Processor-8M-Cache-up-to-3_80-GHz

Run your kernel with

intel_iommu=on intel_pstate=disable transparent_hugepage=never slub_nomerge memory_corruption_check=1

, ...

this takes into account several factors that could mess up the system and isolates and/or disables them (on the memory_corruption_check: https://bbs.archlinux.org/viewtopic.php?pid=1473020#p1473020 )

Taking possible NCQ firmware issues into account, you could also try

libata.force=noncq
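
Whether NCQ actually ended up disabled can be verified per device (a queue depth of 1 means NCQ is off; this only applies to drives driven by libata):

    cat /sys/block/sd?/device/queue_depth
    dmesg | grep -i ncq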

Take a look at the options of the iGPU and change the acceleration method - I've often read about the integrated GPU being in a position to cause trouble

https://wiki.gentoo.org/wiki/Intel https://wiki.archlinux.org/index.php/intel_graphics http://www.thinkwiki.org/wiki/Intel_HD_Graphics

There are other kernel module options (aspm, pci/iommu, etc. related) but I seemingly couldn't find a page dedicated to those right now

Otherwise, if there are still problems with MTRR memory assignment,

plug in a dedicated low performance GPU - that might help (it did for me)

7Z0t99 commented 8 years ago

Hi guys, I found it surprising to see mtrr_cleanup at all in the first /var/log/messages, as the default in the Kconfig files is to disable it. I mean, the goal of the feature seems to be to fix the MTRRs on broken BIOSes, but if your BIOS isn't broken, why try to fix it? Some distros seem to enable it though, including Proxmox. Anyway, I would be very interested to see if the errors go away when you add disable_mtrr_cleanup to your kernel command line.

kernelOfTruth commented 8 years ago

+1

gkkovacs commented 8 years ago

@kernelOfTruth Thanks for your suggestions! I have rebooted with your kernel options, unfortunately there is no change, checksum errors everywhere. IIRC I have already tested with a PCI-E videocard before (with integrated graphics turned off), but it did not help then, might try it again though. I reckon disabling NCQ is unnecessary on an Adaptec controller, since it's not controlled by libata.

I'm not sure what you mean by Sandy Bridge design issue, but the thought of trying an Ivy Bridge CPU crossed my mind.

Will plug in different disks tonight, test and report back after.

lnxbil commented 8 years ago

Hi, I followed the discussion and I'm also very interested in the problem. I encountered similar problems but did not dig too deep; I ended up buying a used HP DL360 and MSA60, and since then a lot of ZFS-related problems are gone. These machines are not very expensive and cost less than a recent Sandy Bridge build. Maybe you can try that too.

gkkovacs commented 8 years ago

@kernelOfTruth @behlendorf I have installed four different disks in the server to put an end to the (weak) argument that the disks may be defective. Needless to say, the checksum errors are still there, so basically I have replaced every single piece of hardware in my system and it still creates checksum errors.

Let's recap: motherboard (to different chipset and brand), RAM (multiple speeds and sizes), CPU (to another Sandy Bridge), PSU, cables, disks were all replaced. Dedicated GPU was tested as well.

Conclusion so far: a Sandy Bridge desktop CPU with 4 memory DIMMs installed, running Linux 2.6.32 to 4.2, has an IO stability issue that causes data corruption detectable by checksumming filesystems. When only 2 DIMMs are installed, there is no data corruption.

There surely must be another hidden variable or two, since the internet is not full of reports of similar issues. Interestingly, when we tested ext4 over Adaptec RAID there was not a single error in over a terabyte written, while ZFS/Btrfs produce several errors per hundred gigabytes, so they probably stress the hardware more. I opened this issue close to 3 months ago, and I still haven't the faintest idea of what could be happening here.

7Z0t99 commented 8 years ago

It might make sense to contact the Linux kernel or btrfs mailing lists, especially since you can reproduce it using btrfs. I guess kernel developers are always a bit skeptical about out-of-tree modules like ZFS.