Edit: Tried switching off lz4 compression and setting checksums to sha256, just on the off chance the problem could be in the compression or fletcher implementations. It still happens, so it's not one of those.
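For reference, the property changes were along these lines (dataset name is illustrative, and only data written after the change picks up the new settings):
# zfs set compression=off sentinel
# zfs set checksum=sha256 sentinel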
@gordan-bobic this is a very interesting issue, could you upload somewhere the kernel module binaries? Also you say that
Taking out the SD card and scrubbing it on a different machine makes all the errors disappear
did you try to zdb -cc the pool from the ARMv5 box? Can you also post the compiler version/flags and configure options used to build ZoL?
Thanks
# gcc --version
gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
No special compiler flags were specified; the Makefile only seems to list -g -O2 as CFLAGS.
No configure options other than pointing at the kernel source/obj paths for the 3.16.39 kernel that I built manually, just to make sure it was using the correct ones for the kernel I was installing against.
Pool on the SD card:
# zdb -cc sentinel
Traversing all blocks to verify checksums and verify nothing leaked ...
loading space map for vdev 0 of 1, metaslab 64 of 116 ...
1.22G completed ( 6MB/s) estimated time remaining: 0hr 02min 19sec zdb_blkptr_cb: Got error 52 reading <104, 83962, 0, 0> -- skipping
2.15G completed ( 6MB/s) estimated time remaining: 0hr 00min 01sec
No leaks (block sum matches space maps exactly)
bp count: 156783
ganged count: 0
bp logical: 2961904128 avg: 18891
bp physical: 1746962432 avg: 11142 compression: 1.70
bp allocated: 2321174528 avg: 14805 compression: 1.28
bp deduped: 0 ref>1: 0 deduplication: 1.00
SPA allocated: 2321174528 used: 7.45%
Dittoed blocks on same vdev: 57683
The following is the pool on the eSATA disk (that pool got suspended at the end of zfs receive due to an error explosion). Note that in this particular case it is backed by a file on ext4 mounted via /dev/loop0, but otherwise the same thing happens when running on the raw device. The pool was not reimported after the crash; zdb was run directly without importing first, and it reported hundreds and hundreds of errors similar to the one above (there was only one zdb_blkptr_cb error on the SD pool; for some reason the problem isn't as severe when using the slow USB-attached SD card). There is nothing in dmesg or smartctl that even remotely indicates any kind of SATA error (no bus/DMA CRC errors showing up anywhere).
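For reference, a file-backed pool like this can be set up roughly as follows (paths and sizes are illustrative):
# truncate -s 10G /mnt/esata/zfs-test.img
# losetup /dev/loop0 /mnt/esata/zfs-test.img
# zpool create test /dev/loop0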
# zdb -cc test
[...]
No leaks (block sum matches space maps exactly)
bp count: 155558
ganged count: 0
bp logical: 2933120000 avg: 18855
bp physical: 1732285440 avg: 11135 compression: 1.69
bp allocated: 2303848448 avg: 14810 compression: 1.27
bp deduped: 0 ref>1: 0 deduplication: 1.00
SPA allocated: 2303848448 used: 22.01%
Dittoed blocks on same vdev: 57270
@gordan-bobic can you please also share the kernel module binaries? Or are they built into the kernel?
Sure, here is a link: http://ftp.redsleeve.org/pub/debug-tmp/zfs-modules-3.16.39.tar.gz
@gordan-bobic it would be interesting to see if you're able to reproduce the issue with 0.7.0-rc3. The updated code contains arm specific checksum optimizations.
More details available on this mailing list thread: http://list.zfsonlinux.org/pipermail/zfs-discuss/2017-January/027151.html In the end, it looks like the issue is reproducible (at least statistically). Higher load / slower hardware may be a factor.
I will try again with 0.7.0-rc3 at the earliest opportunity.
FWIW, I just remembered I tripped what feels like the exact same problem 1-2 years ago, when I was using a QNAP TS-421 NAS (also Marvell Kirkwood ARMv5). In testing, ZoL almost immediately trashed the pool (4 SATA drives). I figured it was all down to dodgy kernel sources (QNAP kernel sources are an unbelievable mess) resulting in some build time incompatibility. So I gave up, went back to zfs-fuse and didn't think any more of it. But now I wonder if I actually hit this exact bug. On the same machine, zfs-fuse has worked without a single spurious error for years.
So that's 3 separate cases with 3 different types of hardware: DreamPlug (3 different ones; kernels 3.10.x, 3.16.x; RSEL7), SheevaPlug (kernel 3.16.x, Fedora 18), and QNAP TS-421 (QNAP 3.4.x kernel, RSEL6). What they all have in common is that they are based around the Marvell Kirkwood armv5tel.
Anybody with a Raspberry Pi fancy giving this a shot? The original Pi is even slower than a Sheeva/Dream plug.
Just to add more info to the case: years ago I was running zfs-0.6.2 on the Pi1 Model B (ARMv6?) without this kind of issue (USB vdevs), but then I lost interest because it was too slow to use as a NAS.
Today I'm running zfs-0.6.5.8 on 2 different BananaPis (ARMv7?). One of those is running mysql, zabbix, samba, nfs and docker with vmalloc=512M IIRC, ARC limited to 200M, and 2 pools: one pool is a 3-way mirror of USB disks, the other is a "backup" pool (single USB disk, copies=2). Again, leaving aside performance, no issues.
I still have my first Pi sitting somewhere; I could probably try to resurrect it and test the current stable release, but without any real knowledge on the matter I'm starting to suspect the memory alignment thingy on the ARMv5.
I'm not convinced it's memory alignment because I am running with fixup enabled, and it gets set in the initrd. Also, presumably @Ringdingcoder has alignment fixup enabled before even loading spl and zfs modules. Alignment fixup should work for both kernel and userspace code.
It also doesn't explain why the errors per GB seem to be low on slow USB-attached devices and high on fast eSATA devices. @Ringdingcoder is seeing single-figure error counts on a USB-attached disk, I was seeing double to triple-figure error counts on a USB-attached SD card (fast read IOPS), and I'm seeing thousands of errors per second when scrubbing the same data set on an eSATA-attached SSD.
I would expect alignment issues to be consistent in counts for a given operation, and not be exacerbated by faster media. I still think this feels like a race condition or an atomic/mutex failure of some sort.
Also note that all ARMs except very recent ones have the same memory alignment requirement; the only difference is that the fixup is enabled by default on some SoCs.
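For anyone wanting to check: on ARM the fixup mode is exposed via /proc/cpu/alignment, roughly like this (writing 2 enables silent fixup, 3 additionally logs a warning per fault):
# cat /proc/cpu/alignment      # shows fault counters and the current mode
# echo 2 > /proc/cpu/alignment # silently fix up unaligned accesses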
I still think this feels like a race condition or an atomic/mutex failure of some sort.
You can rule out word-breaking issues with the SPL atomics by building the SPL with the --enable-atomic-spinlocks configure option. This effectively serializes everything, so it's not great for performance, but it will be correct. That said, this shouldn't explain the issues you're seeing; ZFS only uses these atomics for non-critical statistics tracking.
As for the mutexes, if they weren't working properly I'd expect far worse than checksum errors.
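For reference, building the SPL with that option would look roughly like this (kernel source/obj paths are illustrative):
# cd spl
# ./configure --enable-atomic-spinlocks --with-linux=/usr/src/linux-3.16.39 --with-linux-obj=/usr/src/linux-3.16.39
# make && make install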
I've just tested the kirkwood image/pool (on top of an ext4 formatted USB disk) from http://ftp.redsleeve.org/pub/el7/images/kirkwood.img.xz on my Pi1 Model B: no issues here.
root@raspberrypi:/mnt# cat /proc/cpuinfo
processor : 0
model name : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS : 2.00
Features : swp half thumb fastmult vfp edsp java tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb76
CPU revision : 7
Hardware : BCM2708
Revision : 000e
Serial : 0000000********
root@raspberrypi:/mnt# modinfo zfs | head -2
filename: /lib/modules/3.11.10/extra/zfs/zfs.ko
version: 0.6.5.8-1
root@raspberrypi:/mnt# zpool status -v
pool: kirkwood
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support
feature flags.
scan: scrub repaired 0 in 0h2m with 0 errors on Thu Jan 26 21:59:59 2017
config:
NAME STATE READ WRITE CKSUM
kirkwood ONLINE 0 0 0
loop0p1 ONLINE 0 0 0
errors: No known data errors
root@raspberrypi:/mnt#
Right, so thus far it is either armv5tel specific, or possibly even Marvell Kirkwood specific (it's not easy to find an ARMv5 that isn't Marvell Kirkwood).
I'll re-test with @behlendorf 's suggestions above on a QNAP TS-421.
Well it takes me half an hour to get one cksum error. And I never get one when the machine is otherwise idle. So I wouldn't consider one successful scrub an indicator of the problem's absence.
It would be interesting to run an armv5 kernel on armv6 (or v7) hardware, if this is even possible. I don't know enough about the arm system level architecture.
Basically the problem can be anywhere: the CPU itself, or the armv5 kernel, which nobody is likely to care about anymore. Also, from glancing over the ZFS source code, I'm not convinced that enough care has been taken to make everything work reliably on 32-bit architectures. It looks like code that has been written with a 64-bit architecture in mind, and even one with a rather well-behaved memory model (x86 and SPARC are both very convenient in this regard).
I'm not convinced that enough care has been taken to make everything work reliably on 32 bit architectures
32-bit systems were never the target platform, so it wouldn't surprise me if there are subtle issues which can crop up on certain architectures. That said, where possible we have layered on things like Linux's standard atomics/locks, which should be solid and may even be optimized for the given architecture.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
FS operations, particularly scrubs, incorrectly detect a huge number of checksum errors, seemingly proportional to the speed of the device being scrubbed.
On an SD card connected via USB, that may be 100-150 errors for a 2GB pool. On an eSATA-connected real SSD it is more like 150,000 for the same 2GB data set (created using zfs send|receive on a different system).
The problem is so severe that on the eSATA disk, the pool will become corrupt and completely unimportable at some point during the scrub.
Both storage devices scrub perfectly clean on different systems (a Chromebook 2 (ARMv7) running the same armv5tel userspace with 4.4.x series kernels, and an x86-64 machine running 3.18.x kernels).
The problem seems to be specific to running on ARMv5 hardware (guessing here, but maybe related to something like an atomic not being atomic and introducing a race).
Indicators that this is NOT a hardware stability issue:
1) zfs-fuse exhibits no similar problems on the same hardware.
2) Hardware instability has thus far been unreproducible after 24 hours of CPU, memory and disk stress-testing (generating, checksumming and re-checksumming random patterns in RAM, on a raw block device, and on top of different file systems).
3) Three different DreamPlugs have been tested, with different SSDs and SD cards, and they all exhibit the same spurious checksum failures.
Describe how to reproduce the problem
1) Get a Marvell Kirkwood based machine (DreamPlug, GuruPlug, SheevaPlug). (Note: I am happy to provide ssh access to such a device running the above OS, or I can post such a device to an interested ZoL contributor for troubleshooting purposes.)
2) Install a kernel and distro of your choice on it, and build ZoL 0.6.5.8.
3) Create a pool (tank) on whatever block device the machine has (USB stick, SD card, eSATA connected disk); example commands are sketched below this list.
4) Create a few GB of files from /dev/urandom on tank/test.
5) In one terminal window run watch zpool status -v tank; in another terminal run zpool scrub tank.
You will see the cksum error counts go through the roof in short order.
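A sketch of steps 3-5 with example commands (device path, dataset name and file sizes are illustrative):
# zpool create tank /dev/sdb
# zfs create tank/test
# for i in 1 2 3 4; do dd if=/dev/urandom of=/tank/test/file$i bs=1M count=1024; done
# zpool scrub tank
# watch zpool status -v tank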
I have tried reducing the max_sectors value (a minimum of 8 is allowed) when testing this on USB-attached storage devices. This slows down the scrub somewhat and correspondingly seems to reduce the number of errors encountered (perhaps by as much as half), but still results in many errors. Disabling NCQ on the eSATA port and limiting the SATA speed to 1.5Gbit results in no measurable reduction in the number of errors; the pool on the eSATA disk will typically end up completely trashed in less time than it takes to type zpool scrub -s to cancel the scrub. The pool ends up in a faulted state and gets suspended. Subsequent imports, even on another machine, fail even with zpool import -F.
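Roughly the kind of knobs referred to above (device names are illustrative and the exact sysfs paths may vary by kernel; the SATA speed cap can be applied with the libata.force=1.5Gbps kernel command line parameter):
# echo 8 > /sys/block/sda/device/max_sectors   # USB-attached device, the minimum allowed value
# echo 1 > /sys/block/sdb/device/queue_depth   # eSATA disk, effectively disables NCQ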
Given that this happens with both a USB device (ehci-orion) and eSATA (sata-mv), I originally suspected a DMA or PIC bug, but I cannot reproduce checksum failures manually, e.g. by repeatedly reading and checksumming files on the raw disk and looking for a checksum mismatch, which seems to imply that the problem really is somehow ZoL specific.
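For illustration, the kind of manual check meant here is repeatedly reading the same span of the raw device and comparing digests between passes (device and size are illustrative); if the hardware were silently corrupting reads, the digests should occasionally differ:
# dd if=/dev/sda bs=1M count=2048 iflag=direct 2>/dev/null | sha256sum
# dd if=/dev/sda bs=1M count=2048 iflag=direct 2>/dev/null | sha256sum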
Another thing worth pointing out is that the checksum errors aren't permanent. If I re-scrub, some files listed as corrupted will disappear from the list and others will appear instead, assuming the pool itself doesn't get trashed as described above (I have only seen it get bad enough to trash the pool on eSATA). Taking out the SD card and scrubbing it on a different machine makes all the errors disappear (there is no data redundancy in the pool, obviously, apart from metadata's default copies=2).