kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

checksum verify failure ignored as "no error found" #816

Open tribbloid opened 2 weeks ago

tribbloid commented 2 weeks ago

Encountered on v6.2 (Debian 12) and v6.7 (Fedora 40)

Sample output:

$ sudo btrfs check --repair --check-data-csum /dev/disk/by-label/Home
enabling repair mode
WARNING:

    Do not use --repair unless you are advised to do so by a developer
    or an experienced user, and then only after having accepted that no
    fsck can successfully repair all types of filesystem corruption. E.g.
    some software or hardware bugs can fatally damage a volume.
    The operation will start in 10 seconds.
    Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/disk/by-label/Home
UUID: a5c81116-78a5-4edb-b57c-b08e90e1391b
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
checksum verify failed on 58720256 wanted 0x8f087114 found 0xb87d3f03
No device size related problem found
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking csums against data
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 423218954240 bytes used, no error found
total csum bytes: 19136412
total tree bytes: 2665676800
total fs tree bytes: 2427846656
total extent tree bytes: 210829312
btree space waste bytes: 549918529
file data blocks allocated: 488460259328
 referenced 428168486912

"no error found" contradicts "checksum verify failed on 58720256 wanted 0x8f087114 found 0xb87d3f03".

Also, no repair was attempted even though the --repair option was used.
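
The contradiction can also be detected mechanically from a saved log. The following is a minimal sketch, not part of btrfs-progs: the function name `check_log_is_contradictory` and the log file are hypothetical, and the patterns are simply the message strings that appear in the outputs in this thread.

```sh
# Hypothetical helper (not part of btrfs-progs).
# Succeeds (status 0) when a saved `btrfs check` log reports a checksum
# problem on some line but its summary still says "no error found".
check_log_is_contradictory() {
    log_file=$1
    # Message strings taken from the outputs quoted in this thread.
    grep -Eq 'checksum verify failed|expected csum' "$log_file" &&
        grep -q 'no error found' "$log_file"
}
```

One way to use it, assuming the check output was saved first with something like `sudo btrfs check --readonly --check-data-csum /dev/disk/by-label/Home 2>&1 | tee check.log`, is `check_log_is_contradictory check.log && echo "summary disagrees with log"`.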

adam900710 commented 2 weeks ago

No csum repair support, so it will do nothing.

I tried it with the latest v6.9 progs, and it correctly reports the csum mismatch as an error, with or without --repair:

[adam@btrfs-vm ~]$ btrfs check --check-data-csum --repair --force /dev/test/scratch1
enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/test/scratch1
UUID: 64e210b4-34f1-4b47-98cf-52ce991841e2
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
No device size related problem found
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking csums against data
mirror 1 bytenr 298848256 csum 0x13fec125 expected csum 0x98757625
ERROR: errors found in csum tree
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 134512640 bytes used, error(s) found
total csum bytes: 131072
total tree bytes: 294912
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 154164
file data blocks allocated: 134217728
 referenced 134217728
[adam@btrfs-vm ~]$ echo $?
1

And non-repair mode:

[adam@btrfs-vm ~]$ btrfs check --check-data-csum  /dev/test/scratch1
Opening filesystem to check...
Checking filesystem on /dev/test/scratch1
UUID: 64e210b4-34f1-4b47-98cf-52ce991841e2
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking csums against data
mirror 1 bytenr 298848256 csum 0x13fec125 expected csum 0x98757625
ERROR: errors found in csum tree
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 134512640 bytes used, error(s) found
total csum bytes: 131072
total tree bytes: 294912
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 154139
file data blocks allocated: 134217728
 referenced 134217728
[adam@btrfs-vm ~]$ echo $?
1

So I believe this is already fixed.
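
For scripts, the exit status shown above is easier to rely on than the summary text. A minimal sketch, assuming the installed progs version returns a non-zero status on errors as in the v6.9 runs above; `report_check_status` is a hypothetical wrapper, not a btrfs-progs command:

```sh
# Hypothetical wrapper: run the given command, discard its output, and
# summarize the result based on the exit status alone.
report_check_status() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 0 ]; then
        echo "clean"
    else
        echo "errors found (exit status $status)"
    fi
}
```

It would be called with the full command line, for example `report_check_status btrfs check --check-data-csum /dev/test/scratch1`.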

tribbloid commented 2 weeks ago

Hmm... let me try 6.9 later. At the moment it is not shipped in any distro, so it may have compound symptoms.

Forza-tng commented 2 weeks ago

Hmm... let me try 6.9 later. At the moment it is not shipped in any distro, so it may have compound symptoms.

You can try the statically built btrfs-progs available from github: https://github.com/kdave/btrfs-progs/releases/download/v6.9/btrfs.static

Zygo commented 2 weeks ago

Also, no repair was attempted even though the --repair option was used.

There is an underlying documentation / user expectation issue here, as check should never be used as described. Scrub is the appropriate tool for verifying csums and repairing failures at the device level.

Check and scrub have different goals with mutually exclusive assumptions. Check assumes that if a csum mismatch occurs, the data is correct and the csum is wrong, i.e. the csum failure is due to a kernel bug putting the wrong csum on the block, or a DRAM fault corrupting the data before the csum is calculated, or some other error which occurs above the device level.

Scrub assumes the opposite, that if a csum mismatch occurs, the csum is correct, and the data is wrong, i.e. the error occurs at or below the device level.

Scrub will read other mirror copies of the data and repair the bad copy if there's a recoverable good copy, or do no further harm if it is not possible to perform a correct repair. Check will try to incorporate the bad data into the filesystem, which will conceal errors at best, and catastrophically damage the filesystem at worst. In some cases this is desirable, as there are consistency checks within btrfs check that can repair old and well-understood kernel bugs, but most of the time, importing garbage metadata from a device doesn't end well.

Generally, if something goes wrong, the first step is to run scrub, and if that doesn't resolve the issue, escalate to other recovery methods in order of increasing risk of data loss. check --repair is somewhere in the middle of that list of methods.
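
As a rough illustration of "scrub first", the sketch below runs a foreground scrub on a mounted filesystem and then prints the per-device error counters. The helper names `run` and `scrub_first` and the `DRY_RUN` variable are hypothetical conveniences; the mount point is whatever the caller passes in. The later, riskier steps are deliberately left out.

```sh
# Hypothetical helpers. With DRY_RUN=1 the commands are printed instead
# of being executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

scrub_first() {
    mount_point=$1
    scrub_status=0
    # -B keeps scrub in the foreground and prints statistics when done.
    run btrfs scrub start -B "$mount_point" || scrub_status=$?
    # Show the error counters even if scrub reported a failure.
    run btrfs device stats "$mount_point"
    return "$scrub_status"
}
```

Running `DRY_RUN=1 scrub_first /mnt/home` first shows exactly what would be executed; the real run needs root and a mounted filesystem.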

There is a potential enhancement here, where check could get an option to strictly reject all blocks that fail the device-level consistency checks (csum failure, tree block address, parent transid, etc), try to do an in-place repair from a mirror, and if repair is not possible, abort its operation to avoid further damage (continuing is not possible until check learns how to reconstruct interior nodes of the metadata tree). That would allow check to be used safely on a filesystem that has had corruption at the device level, because it would have a built-in pre-scrub function filtering out bad data.