kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

recurrent csum error even after successfully corrected by btrfs scrub #817

Open tribbloid opened 2 weeks ago

tribbloid commented 2 weeks ago

sample output:

liveuser@localhost-live:~$ sudo btrfs scrub status /dev/disk/by-label/Home
UUID:             a5c81116-78a5-4edb-b57c-b08e90e1391b
Scrub started:    Wed Jun 12 22:48:20 2024
Status:           running
Duration:         0:01:45
Time left:        0:00:38
ETA:              Wed Jun 12 22:50:46 2024
Total to scrub:   396.64GiB
Bytes scrubbed:   290.10GiB  (73.14%)
Rate:             2.76GiB/s
Error summary:    csum=4
  Corrected:      4
  Uncorrectable:  0
  Unverified:     0
liveuser@localhost-live:~$ sudo btrfs scrub status /dev/disk/by-label/Home
UUID:             a5c81116-78a5-4edb-b57c-b08e90e1391b
Scrub started:    Wed Jun 12 22:48:20 2024
Status:           finished
Duration:         0:02:21
Total to scrub:   396.64GiB
Rate:             2.81GiB/s
Error summary:    csum=4
  Corrected:      4
  Uncorrectable:  0
  Unverified:     0

So the 4 errors were listed as "corrected", but running the same command again yields the same 4 errors.

dmesg indicates that these errors are on the same cluster:

[  883.471813] BTRFS info (device nvme0n1p7): scrub: started on devid 1
[  883.604772] BTRFS warning (device nvme0n1p7): tree block 58720256 mirror 1 has bad bytenr, has 67108864 want 58720256
[  883.605394] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[  883.605397] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[  883.605400] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[  883.605402] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[ 1025.166353] BTRFS info (device nvme0n1p7): scrub: finished on devid 1 with status: 0
[ 1230.506226] BTRFS info (device nvme0n1p7): scrub: started on devid 1
[ 1230.664557] BTRFS warning (device nvme0n1p7): tree block 58720256 mirror 1 has bad bytenr, has 67108864 want 58720256
[ 1230.665320] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[ 1230.665325] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[ 1230.665328] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[ 1230.665332] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864
[ 1371.847410] BTRFS info (device nvme0n1p7): scrub: finished on devid 1 with status: 0

It appears that btrfs scrub doesn't actually fix anything in this case.
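
The recurrence can also be checked mechanically from the kernel log. Below is a minimal Python sketch, not part of btrfs-progs: the function names `fixed_up_by_run` and `recurring_addresses` and both regular expressions are assumptions based only on the dmesg lines quoted above. It groups the "fixed up error" lines by scrub run and lists the logical addresses that were reported as fixed in more than one run.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the dmesg lines quoted in this report.
SCRUB_START = re.compile(r"scrub: started on devid (\d+)")
FIXED_UP = re.compile(r"fixed up error at logical (\d+) on dev (\S+) physical (\d+)")


def fixed_up_by_run(lines):
    """Return one Counter per scrub run, mapping logical address -> fix count."""
    runs = []
    for line in lines:
        if SCRUB_START.search(line):
            runs.append(Counter())  # a new scrub run begins here
            continue
        match = FIXED_UP.search(line)
        if match and runs:
            runs[-1][int(match.group(1))] += 1
    return runs


def recurring_addresses(runs):
    """Logical addresses reported as fixed in two or more runs."""
    seen = Counter()
    for run in runs:
        seen.update(run.keys())  # count each address once per run
    return sorted(addr for addr, count in seen.items() if count > 1)


if __name__ == "__main__":
    sample = [
        "[  883.471813] BTRFS info (device nvme0n1p7): scrub: started on devid 1",
        "[  883.605394] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864",
        "[ 1025.166353] BTRFS info (device nvme0n1p7): scrub: finished on devid 1 with status: 0",
        "[ 1230.506226] BTRFS info (device nvme0n1p7): scrub: started on devid 1",
        "[ 1230.665320] BTRFS error (device nvme0n1p7): fixed up error at logical 58720256 on dev /dev/nvme0n1p7 physical 67108864",
        "[ 1371.847410] BTRFS info (device nvme0n1p7): scrub: finished on devid 1 with status: 0",
    ]
    print(recurring_addresses(fixed_up_by_run(sample)))
```

Feeding it the full `dmesg` output (for example via a file read into a list of lines) should work the same way, provided the kernel's message wording matches the patterns above.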

adam900710 commented 2 weeks ago

Kernel version please?

I tried 6.10-rc and it correctly detected and fixed it.

[adam@btrfs-vm ~]$ sudo btrfs scrub start -fB /mnt/btrfs/
Starting scrub on devid 1
scrub done for b069e93c-fa69-4b46-ac41-27025aafe0eb
Scrub started:    Thu Jun 13 13:13:31 2024
Status:           finished
Duration:         0:00:00
Total to scrub:   128.34MiB
Rate:             128.34MiB/s
Error summary:    verify=1
  Corrected:      1
  Uncorrectable:  0
  Unverified:     0
WARNING: errors detected during scrubbing, 1 corrected
[adam@btrfs-vm ~]$ sudo btrfs scrub start -fB /mnt/btrfs/
Starting scrub on devid 1
scrub done for b069e93c-fa69-4b46-ac41-27025aafe0eb
Scrub started:    Thu Jun 13 13:13:34 2024
Status:           finished
Duration:         0:00:00
Total to scrub:   128.34MiB
Rate:             128.34MiB/s
Error summary:    no errors found

tribbloid commented 2 weeks ago

Fedora 40 should be using 6.8-something; let me double-check.

tribbloid commented 1 week ago

Found it: kernel 6.8.5-301.fc40.x86_64

Also found a new recurring problem:

liveuser@localhost-live:~$ sudo btrfs check --repair /dev/disk/by-label/Home
enabling repair mode
WARNING:

    Do not use --repair unless you are advised to do so by a developer
    or an experienced user, and then only after having accepted that no
    fsck can successfully repair all types of filesystem corruption. E.g.
    some software or hardware bugs can fatally damage a volume.
    The operation will start in 10 seconds.
    Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/disk/by-label/Home
UUID: a5c81116-78a5-4edb-b57c-b08e90e1391b
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
checksum verify failed on 58720256 wanted 0x8b416f75 found 0xf31ed09a
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 421089386496 bytes used, no error found
total csum bytes: 19137532
total tree bytes: 2635005952
total fs tree bytes: 2351742976
total extent tree bytes: 256081920
btree space waste bytes: 576841045
file data blocks allocated: 486477520896
 referenced 426042204160
liveuser@localhost-live:~$ sudo btrfs check --repair /dev/disk/by-label/Home
enabling repair mode
WARNING:

    Do not use --repair unless you are advised to do so by a developer
    or an experienced user, and then only after having accepted that no
    fsck can successfully repair all types of filesystem corruption. E.g.
    some software or hardware bugs can fatally damage a volume.
    The operation will start in 10 seconds.
    Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/disk/by-label/Home
UUID: a5c81116-78a5-4edb-b57c-b08e90e1391b
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
checksum verify failed on 58720256 wanted 0x3caae502 found 0x99f99bda
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 421089386496 bytes used, no error found
total csum bytes: 19137532
total tree bytes: 2635005952
total fs tree bytes: 2351742976
total extent tree bytes: 256081920
btree space waste bytes: 576841045
file data blocks allocated: 486477520896
 referenced 426042204160

The message "cache and super generation don't match, space cache will be invalidated" is totally useless.

Zygo commented 1 week ago

"space cache will be invalidated" is typically done by the kernel during the next mount. check --repair does not have to do anything special in that case (other than to check the metadata of the storage where the cache is located, which is done like any other nodatacow file in later stages). There's no need to check the cache contents since the kernel will wipe them out on the next mount anyway.

The message wording could be clarified.

adam900710 commented 1 week ago

Your original report is about scrub not fixing the corruption, so why involve btrfs check?

Anyway, btrfs-progs won't repair csum errors.

Just in case, would you mind running a memtest? Something weird is happening.
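
One oddity visible in the two `btrfs check` runs quoted earlier in the thread is that both the wanted and the found checksum for block 58720256 differ between runs. What that implies is not established here. The sketch below only makes the comparison explicit; the function names and the regular expression are assumptions based on the quoted output, not anything provided by btrfs-progs.

```python
import re

# Pattern is an assumption based on the btrfs check output quoted in this thread.
CSUM_FAIL = re.compile(
    r"checksum verify failed on (\d+) wanted (0x[0-9a-fA-F]+) found (0x[0-9a-fA-F]+)"
)


def parse_csum_failures(text):
    """Return a list of (bytenr, wanted, found) tuples found in the text."""
    return [
        (int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16))
        for m in CSUM_FAIL.finditer(text)
    ]


def unstable_blocks(first_run, second_run):
    """Blocks that failed in both runs but with different wanted/found values."""
    first = {b: (w, f) for b, w, f in parse_csum_failures(first_run)}
    second = {b: (w, f) for b, w, f in parse_csum_failures(second_run)}
    return sorted(b for b in first.keys() & second.keys() if first[b] != second[b])


if __name__ == "__main__":
    run1 = "checksum verify failed on 58720256 wanted 0x8b416f75 found 0xf31ed09a"
    run2 = "checksum verify failed on 58720256 wanted 0x3caae502 found 0x99f99bda"
    print(unstable_blocks(run1, run2))
```

A block that shows up here returned different results on two consecutive reads, which is the kind of instability a memtest is meant to help rule in or out.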

tribbloid commented 1 week ago

@adam900710 ah, sorry, that really belongs in a separate issue. Now that @Zygo has explained it, I need to verify it after a reboot.

adam900710 commented 1 week ago

Nope, the kernel won't really address it at mount.

The cache can only be rebuilt if some write operation is done to the offending block group.

The recommended approach is to just wipe the cache and switch to the v2 cache, which is safer and faster (that's why it's the default mkfs option now).
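
As a rough illustration of that suggestion, here is a dry-run sketch in Python that only prints the commands and runs nothing. It assumes the `--clear-space-cache v1` option of `btrfs check` and the `space_cache=v2` mount option; check the man pages for your btrfs-progs and kernel versions before relying on either. The function name `space_cache_v2_plan` and the mount point `/mnt/home` are hypothetical; the device path is the one from this report.

```python
import shlex


def space_cache_v2_plan(device, mountpoint):
    """Build (but do not run) commands to drop the v1 cache and mount with v2."""
    return [
        # Clear the v1 free space cache; the filesystem must be unmounted.
        ["btrfs", "check", "--clear-space-cache", "v1", device],
        # Mount with the free space tree (v2 cache) enabled.
        ["mount", "-o", "space_cache=v2", device, mountpoint],
    ]


if __name__ == "__main__":
    # The mount point is a placeholder chosen for illustration.
    for argv in space_cache_v2_plan("/dev/disk/by-label/Home", "/mnt/home"):
        print(shlex.join(argv))
```

Printing instead of executing is deliberate: both commands need root and an unmounted filesystem, so they are better reviewed and run by hand.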

I'm more interested in why the csum mismatch happened for the tree block and why scrub doesn't repair it.

On 6.8.x this may be a kernel bug, but since 6.8.x went EOL 3 weeks ago, I strongly recommend moving to 6.9.5 or newer, which fixes a kernel bug that can cause a race.
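
A small Python sketch for checking a kernel release string against that recommendation follows. The function names are hypothetical, and the default minimum of (6, 9, 5) is taken from the comment above. A plain version comparison cannot tell whether a distribution kernel carries backported fixes, so treat the result as a hint only.

```python
import re


def kernel_version_tuple(release):
    """Extract (major, minor, patch) from a release string such as `uname -r` output."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. "6.10-rc") is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_at_least(release, minimum=(6, 9, 5)):
    """True if the release is at or above the minimum recommended in this thread."""
    return kernel_version_tuple(release) >= minimum


if __name__ == "__main__":
    # Kernel release reported earlier in this thread.
    print(is_at_least("6.8.5-301.fc40.x86_64"))
```

On a live system the input could come from `platform.release()` in the standard library instead of a hard-coded string.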