gkkovacs closed this issue 5 years ago
What about row hammering? Could the ZFS workflow possibly lead to RAM bitflips? Maybe you could test it with https://github.com/google/rowhammer-test or some other test.
@shoeper good idea,
thought about that too, but then discarded the idea (it's just a simple "transfer", right?)
but let's see where that leads
There's still the factor of NFS, which dweezil also mentioned
During testing, we tried moving data to the ZFS pool from NFS (restoring virtual machines and containers via Proxmox web interface), file copy with MC from a single disk, and copy through SSH from another server as well.
@gkkovacs so that means you already tried to copy files on the server locally and then verified whether the checksums changed, correct?
and that also led to the issues?
if not it could be an issue with the network (ethernet adapter) driver ...
@kernelOfTruth
> if not it could be an issue with the network (ethernet adapter) driver ...
wouldn't ZFS in this case also calculate a checksum of the wrong file and think the file is correct afterwards? How should ZFS know a checksum it never had?
@kernelOfTruth I have tried copying files from external SATA drive to the ZFS and Btrfs pools, thereby completely escaping the network.
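The local-copy test discussed above (copy a file entirely on the server, bypassing the network, and check whether its checksum changed) can be sketched roughly as follows. This is an illustrative sketch, not anything from the thread; paths and the helper names are made up.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path, chunk=1 << 20):
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def copy_and_verify(src, dst):
    """Copy src to dst on the local machine, then compare digests.

    A mismatch here would mean corruption happened somewhere in the
    local write/read path (RAM, controller, disk), not the network.
    """
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    return before == after, before, after
```

In practice you would run this over many large files (or loop it over a multi-gigabyte test set), since the errors described in this thread only show up after substantial amounts of data are written.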
@7Z0t99 Good idea, I might write an essay about this for the Btrfs developers.
@shoeper I have tried rowhammer-test, it exited (detected a bit-flip) after 110 seconds on the first run.
```
Iteration 98 (after 110.02s)
Took 101.7 ms per address set
Took 1.01711 sec in total for 10 address sets
Took 23.544 nanosec per memory access (for 43200000 memory accesses)
This gives 339786 accesses per address per 64 ms refresh period
error at 0x7f0c84d2a808: got 0xffffffffefffffff
Checking for bit flips took 0.104848 sec
** exited with status 256 (0x100)
```
If I understand correctly, this does not mean that my RAM is defective, since rowhammer puts extremely high stress on a single row of DRAM (on the order of a hundred thousand accesses per second), which would never happen during a normal workload. According to the information I came across, this is a general vulnerability of DDR3 that is mitigated in the DDR4 standard.
Does this memory test also fail with only 2 of the 4 dimms installed? You may have to run it much longer.
@shoeper I was thinking of issues in a broader sense:
general memory corruption due to a buggy driver,
in the case of file transfers - yeah, there should be checksums & thus correct data,
slub_nomerge should partially account for buggy components - but in this case it didn't offer any help
@drescherjm
Currently there are 4x 8GB DDR3-1333 modules installed, 2 Kingston and 2 Corsair DIMMs IIRC. I had tried 4x 4GB 1333 configuration (Kingmax), 2x 4GB 1600 (Kingston IIRC), 4x 8GB 1600 (Kingston), and many combinations of these. 1 or 2 DIMMs never showed checksum errors, 4 always.
@gkkovacs how about 3 modules? (e.g. 2x4 + 1x8, if you have those available)
Were the 2-DIMM and 4-DIMM configurations each placed in a dual-channel setup?
that's the thing I'm permanently keeping in the back of my head: trouble with dual-channel memory under full utilization
On Fri, Nov 6, 2015 at 6:17 AM, gkkovacs notifications@github.com wrote:
> Motherboard is ASUS P8H67, 32GB (non-ECC) DDR3-1333 RAM, Core i7-2600 CPU
I admit I know almost nothing here, but I want to double check: all of the testing you've done to eliminate bad RAM has involved swapping various combinations of non-ECC RAM with other combinations of non-ECC RAM?
What happens when you use ECC RAM?
Jonathan
@kernelOfTruth I think @drescherjm meant testing rowhammer with 2 DIMMs only. I can certainly try that, although failing rowhammer does not mean your RAM is defective; it simply exploits a weakness in the DDR3 design. What rowhammer does never happens in real-life workloads, it's kind of a DoS attack against your RAM. Also, running it stresses the RAM (overheats some row lines), so I would like to keep it to a minimum and am not going to run it for long.
@heyjonathan There is no ECC support on H67/Q77 chipsets, I have only tested non-ECC DIMMs in many configurations.
@gkkovacs you did every test in JBOD configuration ?
does the VIA VT6415 controller exhibit this behavior? (specs: http://www.asus.com/Motherboards/P8H67V/specifications/ )
Could be what @drescherjm meant, but please nonetheless clarify how the DIMMs were installed in relation to dual-channel status,
and if applicable, test a 3-DIMM configuration (not with rowhammer, but by copying over data)
@kernelOfTruth I have tested with the H67 and Q77 on-board Intel ICH controller in AHCI mode, and with the Adaptec SAS RAID controller in the following modes:
All modes exhibited the checksum errors, although in HW RAID mode there were considerably fewer errors for the same amount of data written (10x fewer with Btrfs, extremely rare with ext4).
I did not test the VIA controller, nor can I, since that motherboard has been replaced.
Yes, I meant to test rowhammer with 2 DIMMs. At work and elsewhere I have seen quite a few RAM problems over the years with all slots populated, especially when using DIMMs of higher density than the system initially supported.
I'm afraid I don't see why rowhammer is relevant here, e.g. what would a shorter or longer time until the first error tell us? I mean, we know that probably any DDR3 module made on a small process geometry is susceptible to rowhammer, and as a countermeasure some vendors updated their BIOSes to refresh the modules more often. Newer processors might have more and better countermeasures, though.
@7Z0t99 I agree
@drescherjm As I wrote above, passing or failing rowhammer is not indicative of DRAM stability or defects. It simply shows that DDR3 is vulnerable by design to a row overload. I'm not saying this issue is not a memory problem, but rowhammer results won't get us closer to solving it.
I have tried RAM modules of many speeds, sizes, and manufacturers during the investigation of this issue, and I tried underclocking the RAM as well to put much less strain on it; none of this helped.
I saw that but I do not agree with the conclusion.
I'm not sure if this has been answered yet, but I would like to know whether the errors are introduced during writing or reading. One way to test this would be to write the data on Linux and reboot into e.g. FreeBSD to do the scrub, and vice versa. Or you could move the disks between the problematic machine and a known working one.
@7Z0t99 I'm pretty sure the errors get there during writing. Repeated scrubs turn up the same amount of errors on the same disks, at the same places.
@kernelOfTruth @7Z0t99 I have been testing the server in production for 4 days now with 2 DIMMs only (2x 8GB DDR3-1333 in dual-channel mode), and it has been rock solid. Before putting the real workload back, I wrote over 2TB of test data on it, and there was not a single checksum error. TBH I'm still baffled by all this, and still have no clue whether it's a hardware or software error.
Well, since you tried so many different combinations of hardware, and since you say there are no errors when using FreeBSD, I am leaning more towards a software bug. I can just reiterate that contacting the btrfs / kernel mailing list might be a good idea, since there is a lot more expertise there about the internals of the Linux kernel.
@gkkovacs The only time I have ever seen something as bizarre as what is described here was when my ATX case had a standoff making contact with the motherboard right behind the DIMM slots. Could something like that be happening here?
@ryao The thought has crossed my mind as well, and I remember testing the first (H67) motherboard outside the case to check for this. I haven't tested the second (Q77) board this way yet, but I have ordered an i7-3770 CPU (to see if this is a Sandy Bridge IMC problem or not), and when I replace it, I will certainly do more tests outside the case.
BTW the server is in production for a month now with 2 DIMMs (2x 8GB DDR3-1333), using ZFS on Adaptec JBOD, and there was not a single error during that time.
@ryao @kernelOfTruth @7Z0t99 @behlendorf
So I have tested the very same server with an i7-3770 (Ivy Bridge) CPU to eliminate Sandy Bridge from the mix. Needless to say, the ZFS checksum errors still happen in newly created files.
Let's recap: on the hardware side two motherboards (H67 first, Q77 now), three processors (i7-2600 first, i5-2500K after, i7-3770 now), two power supplies, two SATA controllers (motherboard ICH, Adaptec PCIe), SATA cables (backplane, regular cables), PCIe GPU, different sets of disks and RAM modules were all tested. MB outside the case was tested again. On the software side: different ZFS versions (atm running the latest), btrfs, different kernels (2.6.32 and 4.2), and many kernel options were tested.
None of the above made any difference: when using 3 or 4 RAM modules (regardless of 12, 16, 24 or 32 GB), the system creates checksum errors in newly created files. With only 2 RAM modules installed, there are no errors.
This is starting to drive me mad; does anyone have any ideas remaining?
Should I buy an expensive, overclockable kit of 4 identical DDR3-1866 DIMMs? (An inexpensive kit of 4 identical DDR3-1333 DIMMs was already tested.)
I am curious if you can indeed make the checksum errors occur with only two modules if you place them both in the same memory channel.
On your Intel motherboard, you usually install memory in matching colored slots (blacks, then blues), which occupies both memory channels evenly. Could you NOT do that, and instead place one module in the first black slot and another in the first blue? This loads them both onto a single memory channel, and could help identify whether the problem is related to the total quantity of modules or the quantity of modules per channel.
(Sorry if you have reported this already and I missed it in the previous 70 comments)
@gkkovacs This sounds somehow power related. The only thing that you do not appear to have tried is using a power line conditioner:
https://www.tripplite.com/products/power-conditioners~23
You have not stated that you use a UPS, although your average UPS model does not actually do anything for spikes and drops that last <10ms.
Before you go out and buy another piece of hardware, I would like some more information on the PSU and its replacement. What are their model numbers? What case do you have and how is it mounted inside the case? What do the temperature sensors inside your case read when the system is warm?
This is bizarre enough that I am thinking about how the combination of many different sub-optimal things might combine to cause what you describe. So far, I am thinking maybe you have a combination of bad power, excessive heat, and a PSU model that is not designed to supply sufficient voltage on the 3.3V rail.
Lastly, if it is not too much trouble, would you post a list of the exact parts that you used as if you were telling me how to build a complete replica of your system?
@ghfields Great idea! Will test the single memory channel setup tonight or tomorrow and report back.
@ryao The server is in a data center, so I believe power is top notch; they provide UPS and generators for backup, and cooling is a constant 20 degrees IIRC. The current PSU supplies 24A on the 3.3V rail; the previous PSU was a Chieftec 400W model (don't have specifics). The GPU tested was a Sapphire Radeon HD6870.
Current hardware parts list below.
System
Drives: Other manufacturers' drives, and many combinations (RAID1, RAID0, RAID10, RAIDZ) were also tested, with and without SSD caching, on both the Intel and Adaptec controllers.
Memory: Most possible combinations were tested. Possibly other kits I have since forgotten.
Could it be an electromagnetic issue, with some kind of electromagnetic noise coming from the power line or somewhere else?
Do you have good grounding?
How much would a meter like this one indicate on its display: http://www.ebay.com/itm/Electromagnetic-Radiation-Tester-Detector-EMF-Meter-Dosimeter-Digital-No-Error-B-/391313457457?hash=item5b1c198131:g:GXEAAOSwGzlTtoMs
It should show about zero, or small values like 1-10, for a well-grounded computer.
Otherwise it may show 1500-2000, which can be a cause of badly working Gigabit Ethernet, PCI slots, USB, SATA, etc.
I guess lower values like 1000-1500 may lead to other, less noticeable issues; maybe that is your situation?
Was there ever any update/solution here? @gkkovacs Personally I would have just invested in a board with ECC RAM, especially as you bought so much new stuff anyway.
@ethchest The final conclusion was that the memory modules were either faulty or simply unstable (probably because of power delivery) in a dual-channel, quad-DIMM configuration. Since then we have decommissioned this server and have only been using dual Xeon motherboards with fully buffered ECC RAM.
Thanks for the reply/update!
This issue can be closed.
I have installed Proxmox 4 (ZFS 0.6.5) on a server using ZFS RAID10 in the installer. The disks are brand new (4x 2TB, attached to the Intel motherboard SATA connectors), and there are no SMART errors / reallocated sectors on them. I have run a memtest for 30 minutes; everything seems fine hardware-wise.
After restoring a few VMs (a hundred or so gigabytes), the system reported read errors in some files. Scrubbing the pool shows permanent read errors in the recently restored guest files:
If I delete the VMs and scrub the pool again, the errors are gone. If I restore new VMs, the errors are back. Does anybody have any idea what could be happening here?
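For anyone tracking which files a scrub flags between runs, the "Permanent errors have been detected in the following files" section of `zpool status -v` output can be extracted and diffed with a small script. This is a hypothetical helper, not part of any tool in the thread; the sample text below assumes the usual OpenZFS status layout.

```python
def permanent_error_files(zpool_status_text):
    """Return the file paths listed under the 'Permanent errors'
    section of `zpool status -v` output (layout assumed)."""
    paths = []
    collecting = False
    for line in zpool_status_text.splitlines():
        stripped = line.strip()
        if collecting:
            # Everything non-blank after the errors header is a path.
            if stripped:
                paths.append(stripped)
        elif stripped.startswith("errors:") and "Permanent errors" in stripped:
            collecting = True
    return paths
```

Comparing the returned lists across repeated scrubs would show whether the same files (and therefore likely the same on-disk blocks) are affected each time, as gkkovacs reported earlier in the thread.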