kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

ERROR: btrfs inspect logical-resolve and inode-resolve can not find some corrupted files #724

Open Zesko opened 6 months ago

Zesko commented 6 months ago

Someone has corrupted files that were detected by the Btrfs filesystem, as shown below.

Btrfs warning messages from dmesg:

BTRFS warning (device dm-0): csum failed root 257 ino 3037134 off 253952 csum 0xc341dc86 expected csum 0xc341fc86 mirror 1
BTRFS error (device dm-0): bdev /dev/mapper/luks-e117a149-a0c5-401b-8032-d03580a55c9b errs: wr 0, rd 0, flush 0, corrupt 789628, gen 0
...
BTRFS warning (device dm-0): checksum verify failed on logical 180469760 mirror 2 wanted 0x31c6926e found 0x1bbec240 level 0

We tried to determine which file is damaged:

  1. $ sudo btrfs inspect-internal logical-resolve 180469760 /

The output:

ERROR: logical ino ioctl: No such file or directory
  2. $ sudo btrfs inspect inode-resolve 3037134 /

The output:

ERROR: ino paths ioctl: No such file or directory

Their system:

Kernel: 6.6.8, btrfs-progs v6.6.3

Source: https://forum.manjaro.org/t/update-ended-in-readonly-filesystem/154442/19

adam900710 commented 1 month ago

It's not that uncommon for corrupted sectors to no longer be referred to by any file: they can sit in a part of a data extent that nothing references any more (a btrfs extent bookend).
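
A rough sketch of how such a bookend can arise, assuming a default CoW mount with the usual 4 KiB sectorsize and a hypothetical test file under /mnt/test:

# write one 128 KiB data extent
$ dd if=/dev/urandom of=/mnt/test/file bs=128K count=1 && sync
# CoW-overwrite the first 4 KiB; btrfs allocates a new 4 KiB extent for it
$ dd if=/dev/urandom of=/mnt/test/file bs=4K count=1 conv=notrunc && sync
# the file now references only the last 124 KiB of the original extent; the
# overwritten 4 KiB is unreachable through any path, yet stays allocated on
# disk until every reference to the old extent is dropped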

Or the inode is already under deletion (unlinked but not yet fully deleted).

Zygo commented 1 month ago

The inode number given to inode-resolve only works with the subvol (root) named in the dmesg message, i.e. you'll need subvolid-resolve on root 257 to find the subvolume path to pass to inode-resolve. Deleted files can't be resolved with inode-resolve, but logical-resolve -o -P skips the path resolution entirely, so you can still get a complete picture of how many references there are when some of the references belong to deleted files.
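
A sketch of that lookup sequence, using the root/inode/logical values from the dmesg output above and assuming the filesystem is mounted at / (<subvol path> stands for whatever subvolid-resolve prints):

$ sudo btrfs inspect-internal subvolid-resolve 257 /
$ sudo btrfs inspect-internal inode-resolve 3037134 /<subvol path>
# -P prints root/inode/offset instead of paths, so it also covers references
# whose paths can no longer be resolved
$ sudo btrfs inspect-internal logical-resolve -o -P 180469760 /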

Unreachable sectors can be found with the logical-resolve -o option. This will return any file that references any part of the extent. Without -o, logical-resolve will return only references to the specific block requested. It's possible that no file references that specific block, but some file(s) have partial references to other blocks in the extent.
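
For comparison, the two lookups differ only in the -o flag (same logical address and mount point as in the report above):

# only references to the specific block at this logical address
$ sudo btrfs inspect-internal logical-resolve 180469760 /
# references to any part of the extent containing that block
$ sudo btrfs inspect-internal logical-resolve -o 180469760 /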

Note that when the corrupt block is unreachable, there may be no corrupted files, in the sense that no block of any file still reachable in the filesystem is corrupted. However, if any file still references any part of the extent, the extent containing the bad block will remain in the filesystem, where it will interfere with balance, resize, device remove, and scrub.

If the goal is to remove the corrupted blocks, all files that reference the extent containing the corrupted block must be removed or replaced, whether the files are corrupted or not.
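
A hedged sketch of that cleanup, assuming the filesystem is mounted at / and that good copies of any unreadable files exist elsewhere (file names are placeholders):

# list every file still referencing any part of the affected extent
$ sudo btrfs inspect-internal logical-resolve -o 180469760 /
# for each reported path: either delete it and restore it from backup, or,
# if the file itself still reads back cleanly, rewrite it so its data lands
# in freshly allocated extents
$ cp --reflink=never FILE FILE.new && mv FILE.new FILE
# snapshots that still reference the extent will keep it alive as well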

Zygo commented 1 month ago

Some more context from the upstream link:

csum 0xc341dc86 expected csum 0xc341fc86 bdev /dev/mapper/luks-e117a149-a0c5-401b-8032-d03580a55c9b

That's a single-bit error, which frequently means bad or misconfigured host RAM. The device looks encrypted, which means a single-bit error introduced by the NVMe is astronomically unlikely (any such error would be as wide as the encryption block and affect the entire CRC, not merely one bit). The specific error message can only appear after several levels of metadata validation in btrfs, which means that even a single-bit error on an unencrypted drive could not trigger this message.
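
The single-bit claim is easy to check by XORing the two checksums from the warning quoted above:

$ printf '0x%08x\n' $(( 0xc341dc86 ^ 0xc341fc86 ))
0x00002000

Only bit 13 differs between the stored and the expected CRC.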

The csum failure could be the tip of the iceberg: there could be a lot of filesystem metadata corruption that we're not seeing, and some of it is interfering with logical-resolve as well (e.g. missing or incorrect backrefs).

The other clue from the upstream link is that the NVMe device is new and recently added to the system. I've had systems with long and unblemished service histories start throwing orders-of-magnitude higher ECC error rates in the host RAM after adding a new NVMe device. Usually the problem is resolved by upgrading the power supply or using an adapter to power the NVMe directly. NVMe uses power rails closer to the CPU and RAM than SATA devices, so NVMe device power usage is much more likely to interfere with the host than SATA devices (at least those using traditional SATA power connectors--M.2 SATA might have the same problem, but I don't have any of those to test with).

There's also:

Upon further research, the drive has failed and will need to be replaced.

This is not the conclusion I'd draw from the information presented, but an NVMe failure could also point to a power issue (low voltage is as disruptive for NVMe devices as it is for host RAM), and such an issue might well break the NVMe device. On the systems where I observed the elevated ECC error rates, the NVMe devices frequently disconnected from the bus, most likely because the embedded controller CPU had crashed. I didn't look at the SMART data for my devices because upgrading the power supply resolved all of the problems.

In the absence of any other information, I'd remove the 'bug' tag here.