Edit: Tried switching off lz4 compression and setting checksums to sha256, just on the off chance the problem could be in the compression or fletcher implementations. It still happens, so it's not one of those.
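For reference, the property changes were along these lines (dataset name is illustrative, and only data written after the change picks up the new settings):
# zfs set compression=off sentinel
# zfs set checksum=sha256 sentinel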
@gordan-bobic this is a very interesting issue, could you upload somewhere the kernel module binaries? Also you say that
Taking out the SD card and scrubbing it on a different machine makes all the errors disappear
did you try to zdb -cc the pool from the ARMv5 box? Can you also post the compiler version/flags and configure options used to build ZoL?
Thanks
# gcc --version
gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
No special compiler flags were specified; the Makefile only seems to list -g -O2 as CFLAGS.
No configure options other than pointing at the kernel source/obj paths for the 3.16.39 kernel that I built manually, just to make sure it was using the correct ones for the kernel I was installing against.
Pool on the SD card:
# zdb -cc sentinel
Traversing all blocks to verify checksums and verify nothing leaked ...
loading space map for vdev 0 of 1, metaslab 64 of 116 ...
1.22G completed ( 6MB/s) estimated time remaining: 0hr 02min 19sec zdb_blkptr_cb: Got error 52 reading <104, 83962, 0, 0> -- skipping
2.15G completed ( 6MB/s) estimated time remaining: 0hr 00min 01sec
No leaks (block sum matches space maps exactly)
bp count: 156783
ganged count: 0
bp logical: 2961904128 avg: 18891
bp physical: 1746962432 avg: 11142 compression: 1.70
bp allocated: 2321174528 avg: 14805 compression: 1.28
bp deduped: 0 ref>1: 0 deduplication: 1.00
SPA allocated: 2321174528 used: 7.45%
Dittoed blocks on same vdev: 57683
The following is the pool on the eSATA disk (that pool got suspended at the end of zfs receive due to an error explosion). Note that in this particular case it is backed by a file on ext4 mounted via /dev/loop0, but otherwise the same thing happens when running on the raw device. The pool was not reimported after the crash; zdb was run directly without importing first, and it reported hundreds and hundreds of errors similar to the one above (there was only one zdb_blkptr_cb error on the SD pool; for some reason the problem isn't as severe when using the slow USB-attached SD card). There is nothing in dmesg or smartctl that even remotely indicates any kind of SATA error (no bus/DMA CRC errors showing up anywhere).
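For reference, a file-backed pool like this can be set up roughly as follows (paths and sizes are illustrative):
# truncate -s 10G /mnt/esata/zfs-test.img
# losetup /dev/loop0 /mnt/esata/zfs-test.img
# zpool create test /dev/loop0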
# zdb -cc test
[...]
No leaks (block sum matches space maps exactly)
bp count: 155558
ganged count: 0
bp logical: 2933120000 avg: 18855
bp physical: 1732285440 avg: 11135 compression: 1.69
bp allocated: 2303848448 avg: 14810 compression: 1.27
bp deduped: 0 ref>1: 0 deduplication: 1.00
SPA allocated: 2303848448 used: 22.01%
Dittoed blocks on same vdev: 57270
@gordan-bobic can you please also share the kernel module binaries? Or are they built into the kernel?
Sure, here is a link: http://ftp.redsleeve.org/pub/debug-tmp/zfs-modules-3.16.39.tar.gz
@gordan-bobic it would be interesting to see if you're able to reproduce the issue with 0.7.0-rc3. The updated code contains arm specific checksum optimizations.
More details available on this mailing list thread: http://list.zfsonlinux.org/pipermail/zfs-discuss/2017-January/027151.html In the end, it looks like the issue is reproducible (at least statistically). Higher load / slower hardware may be a factor.
I will try again with 0.7.0-rc3 at the earliest opportunity.
FWIW, I just remembered I tripped what feels like the exact same problem 1-2 years ago, when I was using a QNAP TS-421 NAS (also Marvell Kirkwood ARMv5). In testing, ZoL almost immediately trashed the pool (4 SATA drives). I figured it was all down to dodgy kernel sources (QNAP kernel sources are an unbelievable mess) resulting in some build time incompatibility. So I gave up, went back to zfs-fuse and didn't think any more of it. But now I wonder if I actually hit this exact bug. On the same machine, zfs-fuse has worked without a single spurious error for years.
So that's 3 separate cases with 3 different types of hardware: DreamPlug (3 different ones; kernels 3.10.x, 3.16.x; RSEL7), SheevaPlug (kernel 3.16.x, Fedora 18), and QNAP TS-421 (QNAP 3.4.x kernel, RSEL6). What they all have in common is that they are based around the Marvell Kirkwood armv5tel.
Anybody with a Raspberry Pi fancy giving this a shot? The original Pi is even slower than a Sheeva/Dream plug.
Just to add more info to the case: years ago I was running zfs-0.6.2 on the Pi1 Model B (ARMv6?) without this kind of issue (USB vdevs), but then I lost interest because it was too slow to use as a NAS.
Today I'm running zfs-0.6.5.8 on 2 different BananaPis (ARMv7?). One of those is running mysql, zabbix, samba, nfs and docker with vmalloc=512M IIRC, ARC limited to 200M, and 2 pools: one pool is a 3-way mirror of USB disks, the other is a "backup" pool (single USB disk, copies=2). Again, leaving aside performance, no issues.
I still have my first Pi sitting somewhere; I could probably try to resurrect it and test the current stable release, but without any real knowledge on the matter I'm starting to suspect the memory alignment thingy on the ARMv5.
I'm not convinced it's memory alignment because I am running with fixup enabled, and it gets set in the initrd. Also, presumably @Ringdingcoder has alignment fixup enabled before even loading spl and zfs modules. Alignment fixup should work for both kernel and userspace code.
It also doesn't explain why the errors per GB seem to be low on slow USB-attached devices and high on fast eSATA devices. @Ringdingcoder is seeing single-figure error counts on a USB-attached disk, I was seeing double to triple-figure error counts on a USB-attached SD card (fast read IOPS), and I'm seeing thousands of errors per second when scrubbing the same data set on an eSATA-attached SSD.
I would expect alignment issues to be consistent in counts for a given operation, and not be exacerbated by faster media. I still think this feels like a race condition or an atomic/mutex failure of some sort.
Also note that all ARMs except very recent ones have the same memory alignment requirement; the only difference is that the fixup is enabled by default on some SoCs.
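For anyone wanting to check: on ARM the fixup mode is exposed via /proc/cpu/alignment, roughly like this (writing 2 enables silent fixup, 3 additionally logs a warning per fault):
# cat /proc/cpu/alignment      # shows fault counters and the current mode
# echo 2 > /proc/cpu/alignment # silently fix up unaligned accesses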
I still think this feels like a race condition or an atomic/mutex failure of some sort.
You can rule out word-breaking issues with the SPL atomics by building the SPL with the --enable-atomic-spinlocks configure option. This effectively serializes everything, so it's not great for performance, but it will be correct. That said, this shouldn't explain the issues you're seeing; ZFS only uses these atomics for non-critical statistics tracking.
As for the mutexes, if they weren't working properly I'd expect far worse than checksum errors.
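For reference, building the SPL with that option would look roughly like this (kernel source/obj paths are illustrative):
# cd spl
# ./configure --enable-atomic-spinlocks --with-linux=/usr/src/linux-3.16.39 --with-linux-obj=/usr/src/linux-3.16.39
# make && make install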
I've just tested the kirkwood image/pool (on top of an ext4 formatted USB disk) from http://ftp.redsleeve.org/pub/el7/images/kirkwood.img.xz on my Pi1 Model B: no issues here.
root@raspberrypi:/mnt# cat /proc/cpuinfo
processor : 0
model name : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS : 2.00
Features : swp half thumb fastmult vfp edsp java tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb76
CPU revision : 7
Hardware : BCM2708
Revision : 000e
Serial : 0000000********
root@raspberrypi:/mnt# modinfo zfs | head -2
filename: /lib/modules/3.11.10/extra/zfs/zfs.ko
version: 0.6.5.8-1
root@raspberrypi:/mnt# zpool status -v
pool: kirkwood
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support
feature flags.
scan: scrub repaired 0 in 0h2m with 0 errors on Thu Jan 26 21:59:59 2017
config:
NAME STATE READ WRITE CKSUM
kirkwood ONLINE 0 0 0
loop0p1 ONLINE 0 0 0
errors: No known data errors
root@raspberrypi:/mnt#
Right, so thus far it is either armv5tel specific, or possibly even Marvell Kirkwood specific (it's not easy to find an ARMv5 that isn't Marvell Kirkwood).
I'll re-test with @behlendorf 's suggestions above on a QNAP TS-421.
Well it takes me half an hour to get one cksum error. And I never get one when the machine is otherwise idle. So I wouldn't consider one successful scrub an indicator of the problem's absence.
It would be interesting to run an armv5 kernel on armv6 (or v7) hardware, if this is even possible. I don't know enough about the arm system level architecture.
Basically the problem can be anywhere: the CPU itself, or the armv5 kernel, which nobody is likely to care about anymore. Also, from glancing over the ZFS source code, I'm not convinced that enough care has been taken to make everything work reliably on 32-bit architectures. It looks like code that has been written with a 64-bit architecture in mind, and even one with a rather well-behaved memory model (x86 and SPARC are both very convenient in this regard).
I'm not convinced that enough care has been taken to make everything work reliably on 32 bit architectures
32-bit systems were never the target platform, so it wouldn't surprise me if there are subtle issues which can crop up on certain architectures. That said, where possible we have layered on things like Linux's standard atomics/locks, which should be solid and may even be optimized for the given architecture.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
FS operations, particularly scrubs, incorrectly detect a huge number of checksum errors, seemingly proportional to the speed of the device being scrubbed.
On an SD card connected via USB, that may be 100-150 errors for a 2GB pool. On an eSATA-connected real SSD it is more like 150,000 for the same 2GB data set (created using zfs send|receive on a different system).
The problem is so severe that on the eSATA disk, the pool will become corrupt and completely unimportable at some point during the scrub.
Both storage devices scrub perfectly clean on different systems (a Chromebook 2 (ARMv7) running the same armv5tel userspace with 4.4.x series kernels, and an x86-64 machine running 3.18.x kernels).
The problem seems to be specific to running on ARMv5 hardware (guessing here, but maybe related to something like an atomic not being atomic and introducing a race).
Indicators that this is NOT a hardware stability issue:
1) zfs-fuse exhibits no similar problems on the same hardware.
2) Hardware instability has thus far been unreproducible after 24 hours of CPU, memory and disk stress-testing (generating, checksumming and re-checksumming random patterns in RAM, on a raw block device, and on top of different file systems).
3) Three different DreamPlugs have been tested, with different SSDs and SD cards, and they all exhibit the same spurious checksum failures.
Describe how to reproduce the problem
1) Get a Marvell Kirkwood based machine (DreamPlug, GuruPlug, SheevaPlug). (Note: I am happy to provide ssh access to such a device running the above OS, or I can post such a device to an interested ZoL contributor for troubleshooting purposes.)
2) Install a kernel and distro of your choice on it, and build ZoL 0.6.5.8.
3) Create a pool (tank) on whatever block device the machine has (USB stick, SD card, eSATA connected disk); example commands are sketched below this list.
4) Create a few GB of files from /dev/urandom on tank/test.
5) In one terminal window run watch zpool status -v tank; in another terminal run zpool scrub tank.
You will see the cksum error counts go through the roof in short order.
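A sketch of steps 3-5 with example commands (device path, dataset name and file sizes are illustrative):
# zpool create tank /dev/sdb
# zfs create tank/test
# for i in 1 2 3 4; do dd if=/dev/urandom of=/tank/test/file$i bs=1M count=1024; done
# zpool scrub tank
# watch zpool status -v tank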
I have tried reducing the max_sectors value (a minimum of 8 is allowed) when testing this on USB-attached storage devices. This slows down the scrub somewhat and correspondingly seems to reduce the number of errors encountered (perhaps by as much as half), but still results in many errors. Disabling NCQ on the eSATA port and limiting the SATA speed to 1.5Gbit results in no measurable reduction in the number of errors; the pool on the eSATA disk will typically end up completely trashed in less time than it takes to type zpool scrub -s to cancel the scrub. The pool ends up in a faulted state and gets suspended. Subsequent imports, even on another machine, fail even with zpool import -F.
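Roughly the kind of knobs referred to above (device names are illustrative and the exact sysfs paths may vary by kernel; the SATA speed cap can be applied with the libata.force=1.5Gbps kernel command line parameter):
# echo 8 > /sys/block/sda/device/max_sectors   # USB-attached device, the minimum allowed value
# echo 1 > /sys/block/sdb/device/queue_depth   # eSATA disk, effectively disables NCQ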
Given that this happens with both a USB device (ehci-orion) and eSATA (sata-mv), I originally suspected a DMA or PIC bug, but I cannot reproduce checksum failures manually, e.g. by repeatedly reading and checksumming files on the raw disk and looking for a checksum mismatch, which seems to imply that the problem really is somehow ZoL specific.
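For illustration, the kind of manual check meant here is repeatedly reading the same span of the raw device and comparing digests between passes (device and size are illustrative); if the hardware were silently corrupting reads, the digests should occasionally differ:
# dd if=/dev/sda bs=1M count=2048 iflag=direct 2>/dev/null | sha256sum
# dd if=/dev/sda bs=1M count=2048 iflag=direct 2>/dev/null | sha256sum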
Another thing worth pointing out is that the checksum errors aren't permanent. If I re-scrub, some files listed as corrupted will disappear from the list and others will appear instead, assuming the pool itself doesn't get trashed as described above (I have only seen it get bad enough to trash the pool on eSATA). Taking out the SD card and scrubbing it on a different machine makes all the errors disappear (there is no data redundancy in the pool, obviously, apart from metadata's default copies=2).