openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS data corruption #3990

Closed gkkovacs closed 5 years ago

gkkovacs commented 8 years ago

I have installed Proxmox 4 (ZFS 0.6.5) on a server using ZFS RAID10 in the installer. The disks are brand new (4x 2TB, attached to the Intel motherboard SATA connectors), and there are no SMART errors or reallocated sectors on them. I have run memtest for 30 minutes, and everything seems fine hardware-wise.

zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             92.8G  3.42T    96K  /rpool
rpool/ROOT        59.8G  3.42T    96K  /rpool/ROOT
rpool/ROOT/pve-1  59.8G  3.42T  59.8G  /
rpool/swap        33.0G  3.45T   144K  -

After restoring a few VMs (a hundred or so gigabytes), the system reported read errors in some files. Scrubbing the pool shows permanent read errors in the recently restored guest files:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h4m with 1 errors on Thu Nov  5 21:30:02 2015
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/501/vm-501-disk-1.qcow2

If I delete the VMs and scrub the pool again, the errors are gone. If I restore new VMs, the errors are back. Does anybody have any idea what could be happening here?

zdb -mcv rpool
Traversing all blocks to verify checksums and verify nothing leaked ...

loading space map for vdev 1 of 2, metaslab 30 of 116 ...
50.1G completed ( 143MB/s) estimated time remaining: 0hr 01min 09sec
zdb_blkptr_cb: Got error 52 reading <50, 61726, 0, 514eb>  -- skipping
59.8G completed ( 145MB/s) estimated time remaining: 0hr 00min 00sec
Error counts:

    errno  count
       52  1

    No leaks (block sum matches space maps exactly)

    bp count:          928688
    ganged count:           0
    bp logical:    115011845632      avg: 123843
    bp physical:   62866980352      avg:  67694     compression:   1.83
    bp allocated:  64258899968      avg:  69193     compression:   1.79
    bp deduped:             0    ref>1:      0   deduplication:   1.00
    SPA allocated: 64258899968     used:  1.61%

    additional, non-pointer bps of type 0:       4844
    Dittoed blocks on same vdev: 297
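
For reference, the delete/scrub cycle described above boils down to roughly the following (pool name and image path taken from the outputs in this report; the restore step itself goes through the Proxmox web interface):

    rm /var/lib/vz/images/501/vm-501-disk-1.qcow2   # delete the affected VM image
    zpool clear rpool                               # reset the error counters
    zpool scrub rpool                               # re-scrub: no errors remain
    zpool status -v rpool                           # restore a new VM and scrub again to bring the errors back
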
shoeper commented 8 years ago

What about row hammering? Could the ZFS workflow possibly lead to RAM bitflips? Maybe you could test it with https://github.com/google/rowhammer-test or some other test.
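
For anyone wanting to try it, a rough sketch of running that test follows; the build script and binary name are assumptions based on the repository layout and may have changed:

    git clone https://github.com/google/rowhammer-test
    cd rowhammer-test
    ./make.sh           # assumed build script; check the repo's README
    ./rowhammer_test    # hammers DRAM rows and exits once it detects a bit flip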

kernelOfTruth commented 8 years ago

@shoeper good idea,

thought about that too, but then discarded the idea (it's just a simple "transfer", right?)

but let's see what that leads to

There's still the factor of NFS, which dweezil also mentioned:

During testing, we tried moving data to the ZFS pool from NFS (restoring virtual machines and containers via Proxmox web interface), file copy with MC from a single disk, and copy through SSH from another server as well.

@gkkovacs so that means you already tried to copy files on the server locally and then verify whether the checksums changed, correct?

and that also led to the issues?

if not, it could be an issue with the network (ethernet adapter) driver ...
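
A minimal sketch of that kind of local check, with placeholder paths, which takes the network out of the picture entirely:

    sha256sum /mnt/source/vm-501-disk-1.qcow2          # checksum on the source disk
    cp /mnt/source/vm-501-disk-1.qcow2 /rpool/test/    # local copy onto the ZFS pool
    sha256sum /rpool/test/vm-501-disk-1.qcow2          # compare the two sums, then scrub
    # a mismatch here (or a later scrub error on the copy) would rule out the NIC and its driver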

shoeper commented 8 years ago

@kernelOfTruth

if not, it could be an issue with the network (ethernet adapter) driver ...

wouldn't ZFS in this case also calculate a checksum of the wrong file and think the file is correct afterwards? How would ZFS know a checksum it never had?

gkkovacs commented 8 years ago

@kernelOfTruth I have tried copying files from an external SATA drive to the ZFS and Btrfs pools, thereby completely bypassing the network.

@7Z0t99 Good idea, I might write an essay about this for the Btrfs developers.

@shoeper I have tried rowhammer-test; it exited (detected a bit flip) after 110 seconds on the first run.

 Iteration 98 (after 110.02s)
   Took 101.7 ms per address set
   Took 1.01711 sec in total for 10 address sets
   Took 23.544 nanosec per memory access (for 43200000 memory accesses)
   This gives 339786 accesses per address per 64 ms refresh period
 error at 0x7f0c84d2a808: got 0xffffffffefffffff
   Checking for bit flips took 0.104848 sec
 ** exited with status 256 (0x100)

If I understand correctly, this does not mean that my RAM is defective, since rowhammer puts extremely high stress on a single row of DRAM (about 100 thousand accesses a second), which would never happen during a normal workload. According to the information I came across, this is a general vulnerability of DDR3 that was addressed in the DDR4 standard.

drescherjm commented 8 years ago

Does this memory test also fail with only 2 of the 4 DIMMs installed? You may have to run it much longer.

kernelOfTruth commented 8 years ago

@shoeper I was thinking along the lines of issues in a broader sense:

general memory corruption due to a buggy driver,

in the case of file transfers - yeah, there should be checksums & thus correct data

slub_nomerge should partially account for buggy components - but in this case it didn't offer any help

@drescherjm

Currently there are 4x 8GB DDR3-1333 modules installed, 2 Kingston and 2 Corsair DIMMs IIRC. I had tried a 4x 4GB 1333 configuration (Kingmax), 2x 4GB 1600 (Kingston IIRC), 4x 8GB 1600 (Kingston), and many combinations of these. 1 or 2 DIMMs never showed checksum errors; 4 always did.

@gkkovacs how about 3 modules? (e.g. 2x4 + 1x8, if you have those available)

Are the 2-DIMM and 4-DIMM setups each placed in a dual-channel configuration?

That's the thing I keep in the back of my head: trouble with dual-channel memory under full utilization.

heyjonathan commented 8 years ago

On Fri, Nov 6, 2015 at 6:17 AM, gkkovacs notifications@github.com wrote:

Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU

I admit I know almost nothing here, but want to double check. All of the testing you've done to eliminate bad RAM has involved swapping various combinations of non-ECC RAM with other combinations of non-ECC RAM?

What happens when you use ECC RAM?

Jonathan

gkkovacs commented 8 years ago

@kernelOfTruth I think @drescherjm meant testing rowhammer with 2 DIMMs only. I can certainly try that, although failing rowhammer does not mean your RAM is defective; it simply exploits a weakness in the DDR3 design. What rowhammer does never happens in real-life workloads, it's kind of a DDoS attack against your RAM. Also, running it damages the RAM (overheats some row lines), so I would like to keep it to a minimum and not run it for long.

@heyjonathan There is no ECC support on H67/Q77 chipsets, I have only tested non-ECC DIMMs in many configurations.

kernelOfTruth commented 8 years ago

@gkkovacs you did every test in JBOD configuration ?

does the VIA VT6415 controller exhibit this behavior? (specs: http://www.asus.com/Motherboards/P8H67V/specifications/ )

Could be what @drescherjm meant, but please clarify nonetheless how the DIMMs are installed in relation to dual-channel operation,

and if applicable, test a 3-DIMM configuration (not with rowhammer, but by copying over data)

gkkovacs commented 8 years ago

@kernelOfTruth I have tested with the H67 and Q77 on-board Intel (ICH) controllers in AHCI mode, and with the Adaptec SAS RAID controller in several modes.

All modes exhibited the checksum errors, although in HW RAID mode there were considerably fewer errors for the same amount of data written (10x fewer with Btrfs, extremely rare with ext4).

I did not test the VIA controller, nor can I, since that motherboard has been replaced.

drescherjm commented 8 years ago

Yes, I meant to test rowhammer with 2 DIMMs. At work and elsewhere I have seen quite a few RAM problems over the years with all slots populated, especially when using DIMMs of higher density than the system initially supported.

7Z0t99 commented 8 years ago

I'm afraid I don't see why rowhammer is relevant here; e.g. what would a shorter or longer time until the first error tell us? We know that probably any DDR3 module made on a small process geometry is susceptible to rowhammer, and as a countermeasure some vendors updated their BIOSes to refresh the modules more often. Newer processors might have more and better countermeasures, though.

gkkovacs commented 8 years ago

@7Z0t99 I agree

@drescherjm As I wrote above, passing or failing rowhammer is not indicative of DRAM stability or defects; it simply shows that DDR3 is vulnerable by design to row overload. I'm not saying this issue is not a memory problem, but rowhammer results won't get us closer to solving it.

I have tried RAM modules of many speeds, sizes and manufacturers while investigating this issue, and I tried underclocking the RAM as well to put much less strain on it; none of this helped.

drescherjm commented 8 years ago

I saw that but I do not agree with the conclusion.

7Z0t99 commented 8 years ago

I'm not sure if this has been answered yet, but I would like to know whether the errors get introduced during writing or during reading. One way to test this would be to write the data on Linux and reboot into e.g. FreeBSD to do the scrub, and vice versa. Or you could move the disks between the problematic machine and a known-working one.
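
Assuming a scratch pool rather than the root pool, and a FreeBSD live environment with ZFS support, that test could look something like this:

    # on Linux: write the test data, optionally scrub, then export the pool
    zpool scrub testpool
    zpool export testpool
    # reboot into a FreeBSD live environment with ZFS, then:
    zpool import testpool
    zpool scrub testpool
    zpool status -v testpool   # errors on both systems point at the write path,
                               # errors on only one side point at that system's read path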

gkkovacs commented 8 years ago

@7Z0t99 I'm pretty sure the errors get there during writing. Repeated scrubs turn up the same number of errors on the same disks, at the same places.

gkkovacs commented 8 years ago

@kernelOfTruth @7Z0t99 I have been testing the server in production for 4 days now with 2 DIMMs only (2x 8GB DDR3-1333 in dual-channel mode), and it has been rock solid. Before putting the real workload back, I wrote over 2TB of test data to it, and there was not a single checksum error. TBH I'm still baffled by all this; still no clue whether it's a hardware or software error.
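
As an aside, a bulk-write test like that can be scripted in a few lines; the dataset name and file sizes below are only illustrative:

    zfs create rpool/stress
    for i in $(seq 1 20); do
        dd if=/dev/urandom of=/rpool/stress/blob$i bs=1M count=102400   # ~100 GB per file
    done
    zpool scrub rpool
    zpool status -v rpool   # any checksum errors introduced by the writes show up here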

7Z0t99 commented 8 years ago

Well, since you tried so many different combinations of hardware, and since you say there are no errors when using FreeBSD, I am leaning more towards a software bug. I can only reiterate that contacting the btrfs / kernel mailing lists might be a good idea, since there is a lot more expertise there about the internals of the Linux kernel.

ryao commented 8 years ago

@gkkovacs The only time I have ever seen something as bizarre as what is described here was when my ATX case had a standoff making contact with the motherboard right behind the DIMM slots. Could something like that be happening here?

gkkovacs commented 8 years ago

@ryao The thought has crossed my mind as well, and I remember testing the first (H67) motherboard outside the case to check for this. I haven't tested the second (Q77) board this way yet, but I have ordered an i7-3770 CPU (to see whether this is a Sandy Bridge IMC problem or not), and when I replace it, I will certainly do more tests outside the case.

BTW the server has been in production for a month now with 2 DIMMs (2x 8GB DDR3-1333), using ZFS on Adaptec JBOD, and there has not been a single error during that time.

gkkovacs commented 8 years ago

@ryao @kernelOfTruth @7Z0t99 @behlendorf

So I have tested the very same server with an i7-3770 (Ivy Bridge) CPU to eliminate Sandy Bridge from the mix. Needless to say, the ZFS checksum errors still happen in newly created files.

Let's recap: on the hardware side, two motherboards (H67 first, Q77 now), three processors (i7-2600 first, i5-2500K after, i7-3770 now), two power supplies, two SATA controllers (motherboard ICH, Adaptec PCIe), different SATA cabling (backplane, regular cables), a PCIe GPU, and different sets of disks and RAM modules were all tested. The motherboard was tested outside the case again. On the software side: different ZFS versions (currently running the latest), btrfs, different kernels (2.6.32 and 4.2), and many kernel options were tested.

None of the above made any difference: when using 3 or 4 RAM modules (regardless of 12, 16, 24 or 32 GB), the system creates checksum errors in newly created files. With only 2 RAM modules installed, there are no errors.

This is starting to drive me mad. Does anyone have any ideas remaining?

Should I buy an expensive, overclockable kit of 4 identical DDR3-1866 modules? (An inexpensive kit of 4 identical DDR3-1333 modules was already tested.)

ghfields commented 8 years ago

I am curious whether you can indeed make the checksum errors occur with only two modules if you place them both in the same memory channel.

On your Intel motherboard, you usually install memory in matched colored slots (blacks, then blues), which populates both memory channels evenly. Could you NOT do that, and instead place one module in the first black slot and another in the first blue slot? This will load both onto a single memory channel, which could help identify whether the problem is related to the total quantity of modules or to the quantity of modules per channel.

(Sorry if you have reported this already and I missed it in the previous 70 comments.)
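
For what it's worth, a quick way to see which slots are currently populated (assuming dmidecode is available; how slots map to channels depends on the board manual / DMI tables):

    dmidecode -t memory | grep -E 'Locator|Size|Speed'   # lists each DIMM slot, its bank/locator and installed size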

ryao commented 8 years ago

@gkkovacs This sounds somehow power related. The only thing you do not appear to have tried is a power line conditioner:

https://www.tripplite.com/products/power-conditioners~23

You have not stated whether you use a UPS, although your average UPS model does not actually do anything for spikes and drops that last <10ms.

Before you go out and buy another piece of hardware, I would like some more information on the PSU and its replacement. What are their model numbers? What case do you have, and how is the motherboard mounted inside it? What do the temperature sensors inside your case read when the system is warm?

This is bizarre enough that I am thinking about how a combination of many different sub-optimal things might add up to cause what you describe. So far, I am thinking maybe you have a combination of bad power, excessive heat, and a PSU model that is not designed to supply sufficient voltage on the 3.3V rail.

Lastly, if it is not too much trouble, would you post a list of the exact parts you used, as if you were telling me how to build a complete replica of your system?
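
On the temperature question, assuming lm-sensors and smartmontools are installed, the readings could be gathered with something like:

    sensors                                        # CPU / motherboard temperatures and voltage rails, where exposed
    smartctl -a /dev/sda | grep -i temperature     # per-drive temperature from SMART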

gkkovacs commented 8 years ago

@ghfields Great idea! I will test the single-memory-channel setup tonight or tomorrow and report back.

@ryao The server is in a data center, so I believe power is top notch; they provide UPS and generators for backup, and cooling is a constant 20 degrees IIRC. The current PSU supplies 24A on the 3.3V rail; the previous PSU was a Chieftec 400W model (I don't have specifics). The GPU tested was a Sapphire Radeon HD6870.

Current hardware parts list below.

System

Drives: Other manufacturers' drives and many combinations (RAID1, RAID0, RAID10, RAIDZ) were also tested, with and without SSD caching, on both the Intel and the Adaptec controller.

Memory: Most possible combinations were tested. Possibly other kits I have since forgotten.

sanyo-ok commented 8 years ago

Could it be an electromagnetic issue, with some kind of electromagnetic noise coming from the power line or somewhere else?

Do you have good grounding?

How much would a meter like this one indicate on its display? http://www.ebay.com/itm/Electromagnetic-Radiation-Tester-Detector-EMF-Meter-Dosimeter-Digital-No-Error-B-/391313457457?hash=item5b1c198131:g:GXEAAOSwGzlTtoMs

It should show roughly zero, or small values like 1-10, for a well-grounded computer.

Otherwise it may show 1500-2000, which can be a reason for badly working Gigabit Ethernet, PCI slots, USB, SATA, etc.

I guess lower values like 1000-1500 may lead to other, less noticeable issues; maybe that is your situation?

ethchest commented 6 years ago

Was there ever any update/solution here? @gkkovacs Personally I would have just invested in a board with ECC RAM, especially as you bought so much new stuff anyway.

gkkovacs commented 6 years ago

@ethchest The final conclusion was that the memory modules were either faulty or simply unstable (probably because of power delivery) in a dual-channel, quad-DIMM configuration. Since then we have decommissioned this server and have only been using dual-Xeon motherboards with fully buffered ECC RAM.

ethchest commented 6 years ago

Thanks for the reply/update!

ghfields commented 5 years ago

This issue can be closed.