lanconnected / EnhanceIO

EnhanceIO Open Source for Linux

checksum error with btrfs (lru+wb, fifo+ro too) #20

Open d-a-v opened 7 years ago

d-a-v commented 7 years ago

Hello. On Ubuntu 16.04 (kernel 4.10, 32 GiB RAM), with a 64 GB SSD partition caching a btrfs filesystem on a spinning hard drive, running bonnie++ (creating 64 GiB of data) to check performance, I get these errors:

[ 1004.048598] BTRFS warning (device sdb3): csum failed ino 258 off 66056749056 csum 1995617934 expected csum 3912343800
[ 1055.395836] BTRFS warning (device sdb3): csum failed ino 258 off 1221169152 csum 1152196041 expected csum 3153844041
[ 1073.145194] BTRFS warning (device sdb3): csum failed ino 258 off 1963663360 csum 2858193683 expected csum 1022787776
[ 1074.487867] BTRFS warning (device sdb3): csum failed ino 258 off 2022420480 csum 3031311257 expected csum 355173908
[ 1075.927695] BTRFS warning (device sdb3): csum failed ino 258 off 2076913664 csum 1105526511 expected csum 2993740284
...
[ 1295.325269] BTRFS warning (device sdb3): csum failed ino 258 off 10913980416 csum 2406927829 expected csum 1599545349
[ 1300.186300] __readpage_endio_check: 8419 callbacks suppressed
[ 1300.186303] BTRFS warning (device sdb3): csum failed ino 258 off 10913980416 csum 2406927829 expected csum 1599545349
...

# dmesg | grep 'sdb3.*csum failed' | wc -l
230

and bonnie complains about corrupted files.
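
For reference, a setup along these lines would correspond to the environment described above. This is only a sketch: the SSD partition name, mount point, and cache name are assumptions, and the exact eio_cli invocation actually used appears later in this thread.

# hypothetical devices: /dev/sda4 = 64 GB SSD partition, /dev/sdb3 = btrfs partition on the spinning disk
mkfs.btrfs -f /dev/sdb3                                        # btrfs checksums data and metadata by default
eio_cli create -d /dev/sdb3 -s /dev/sda4 -p lru -m wb -c test  # lru + write-back, as in this report
mount /dev/sdb3 /mnt
cd /mnt && bonnie++ -u root                                    # bonnie++ defaults to a data set of 2x RAM (64 GiB here)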

d-a-v commented 7 years ago

configuration: lru + write-back.

d-a-v commented 7 years ago

I made further tests; the same checksum errors also show up with other policy/mode combinations (hence the issue title).

I guess (ro, fifo) is easier to debug. I'd be happy to provide more relevant information or run more tests.

d-a-v commented 7 years ago

Here is the stats log around an error. The three columns are snapshots taken at T, T+10s, and T+20s; the error (btrfs checksum failure) happens between columns 2 and 3. T is about 4000 s after the beginning of the test (bonnie). eio's error file reports nothing (all zeros).

The configuration is (ro, fifo) with a 4 GB zram block device as the SSD, and bonnie continuously restarting its tests with 14 GB of data (bonnie -r 14000). The host's memory was tested with memtest86 and the hard drive with smartctl, and everything works fine without eio. The hard drive partition is empty before starting eio and bonnie.

src_name   /dev/sdb3
ssd_name   /dev/zram0
src_size   808998960
ssd_size   972544
set_size          256
block_size       4096
mode                2
eviction            1
num_sets         3799
num_blocks     972544
metadata        large
state        normal
flags      0x00000020

stats at T, T+10s, T+20s:

reads                     2864       2864       4134032
writes                    977538488  979759640  981952584
read_hits                 16         16         4121752
read_hit_pct              0          0          99
write_hits                0          0          0
write_hit_pct             0          0          0
dirty_write_hits          0          0          0
dirty_write_hit_pct       0          0          0
cached_blocks             267        267        1339
rd_replace                0          0          0
wr_replace                0          0          0
noroom                    0          0          107
cleanings                 0          0          0
md_write_dirty            0          0          0
md_write_clean            0          0          0
md_ssd_writes             0          0          0
do_clean                  0          0          0
nr_blocks                 972544     972544     972544
nr_dirty                  0          0          0
nr_sets                   3799       3799       3799
clean_index               0          0          0
uncached_reads            29         29         38
uncached_writes           509497     510592     511801
uncached_map_size         508895     509990     511198
uncached_map_uncacheable  602        602        603
disk_reads                2848       2848       12280
disk_writes               977538488  979759640  981952584
ssd_reads                 16         16         4121760
ssd_writes                2848       2848       11424
ssd_readfills             2848       2848       11424
ssd_readfill_unplugs      29         29         38
readdisk                  29         29         38
writedisk                 29         29         38
readcache                 2          2          515220
readfill                  356        356        1428
writecache                356        356        1428
readcount                 31         31         515258
writecount                509497     510592     511801
kb_reads                  1432       1432       2067020
kb_writes                 488769244  489879820  490976292
rdtime_ms                 184        184        3672
wrtime_ms                 639811952  641276356  642510548
unaligned_ios             0          0          0
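
Snapshots like the three columns above can be collected with a small loop over the procfs stats file, for example (just a sketch; the cache name test matches the script given later in the thread, and the 10-second interval is an assumption):

# dump the EnhanceIO stats every 10 seconds into timestamped files
while true; do
    cat /proc/enhanceio/test/stats > /tmp/eio-stats-$(date +%s).txt
    sleep 10
done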

I'm going to provide some more (hopefully useful) data. Is all of this of any interest to anyone?

lanconnected commented 7 years ago

Hi! Thanks for testing! I'll have a look at the issue once I'm back from vacation. As always, it is very helpful to have precise steps to reproduce the issue, i.e. the commands for cache creation, filesystem creation, the bonnie test, etc.

d-a-v commented 7 years ago

For the record: same error happens with kernel 4.4.

@lanconnected Hi, thanks!

To reproduce the issue I use:

mount/umount hooks are currently not really working with CentOS 7, because zramctl is not available there. I will replace the zram commands with something compatible with all OSes. In any case, let me know if you need the exact walkthrough with regular eio_cli commands.
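
Without zramctl, the zram device can be set up directly through sysfs (the same commands appear in the full script in the next comment):

modprobe zram                                                   # creates /dev/zram0
echo $((1 * 1024 * 1024 * 1024)) > /sys/block/zram0/disksize    # 1 GiB zram device, no zramctl needed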

d-a-v commented 7 years ago

Hi,

Here are simple steps to reproduce the bug; only one unused partition is needed (here sdc1). In the example, sdc1's filesystem is btrfs (with its native CRC checking) and its size is greater than 2× the host's RAM. bonnie/bonnie++ must be installed. zramctl (unavailable on CentOS) is not needed.

#!/bin/bash

device=/dev/sdc1        # eio's HD
cache=/dev/zram0        # eio's ssd
cache_gb=1              # zram size used as eio's SSD

set -x
set -e

modprobe zram
echo $((cache_gb * 1024 * 1024 * 1024)) > /sys/block/zram0/disksize   # size the zram device used as eio's SSD

eio_cli create -d $device -s $cache -p fifo -m ro -c test             # read-only cache, FIFO replacement, named "test"

more /proc/enhanceio/test/{config,stats}

mount $device /mnt
cd /mnt
bonnie -u 0
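
After a run, the test setup can be torn down again; a sketch (the eio_cli delete syntax and the sysfs reset path are assumed to match the setup above):

cd /
umount /mnt
eio_cli delete -c test            # remove the EnhanceIO cache named "test"
echo 1 > /sys/block/zram0/reset   # release the zram device's memory
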
d-a-v commented 7 years ago

@lanconnected Were you able to reproduce this issue? Is there a way I can help?

lanconnected commented 7 years ago

Sorry for the delay, I'll get to it this week.

dmytroleonenko commented 6 years ago

Any updates? I'm not worried about btrfs as such, but rather about data consistency in general.

libgradev commented 4 years ago

Tried this yesterday, still broken. Arch 5.5.13 kernel, EnhanceIO from git.

Default cache options in WriteThrough mode led to a ton of CSUM errors on the underlying BTRFS partition.
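
For reference, that corresponds to an invocation along these lines (a sketch with placeholder device names; wt selects write-through mode, other options left at their defaults):

eio_cli create -d /dev/sdX -s /dev/nvme0n1pY -m wt -c btrfs_cache   # write-through cache in front of the btrfs partition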

hradec commented 3 years ago

Hi there. I'm having the same issues with BTRFS and eio in read-only mode. Is anyone working on it, or has it been stale since last year?

Ristovski commented 3 years ago

@hradec: Safe to assume it has been abandoned for well over a year. EnhanceIO does not even work on newer kernels anymore, due to a breaking change in the block IO subsystem.

hradec commented 3 years ago

@Ristovski: That's sad news! I really loved the simplicity and flexibility of EnhanceIO! Being able to add/remove SSD/NVMe caching to a filesystem on the fly is an amazing feature!

If I had enough kernel IO knowledge, I would try to help keep this alive... but unfortunately I'm not there yet.