koverstreet / bcachefs

Other
643 stars 71 forks source link

Fsck aborts on orphaned subvolume #634

Closed boomshroom closed 5 months ago

boomshroom commented 6 months ago

Versions: Linux 6.7.0 with bcachefs 1.3 (NixOS unstable 24.05); bcachefs-tools v1.4.1.

To reproduce: Create subvolume; create child subvolume within the previous one; delete the parent subvolume; unmount if necessary on kernels without online fsck; run fsck on the filesystem.

Expected behavior: Fsck succeeds. Ideally the subvolume would be moved to lost+found and adopted by the root filesystem.

Actual behavior: Fsck aborts with:

...
check_directory_structure...check_path(): error ENOENT_bkey_type_mismatch
bch2_check_directory_structure(): error ENOENT_bkey_type_mismatch
bch2_fs_recovery(): error ENOENT_bkey_type_mismatch
bch2_fs_start(): error ENOENT_bkey_type_mismatch
shutting down
...
done going read-only, filesystem not clean
shutdown complete

Additional info: This is a small part of my very unprofessional issue on bcachefs-tools and is responsible for fsck failing to repair the bigger problems I was experiencing. I decided to test this piece in isolation and was surprised by how easy it was. I also remember when I first created the orphaned subvolume, though it didn't impact anything until I needed to repair the system for other reasons. I was able to make a crude hack to temporarily ignore the problem, though a more robust solution would be preferred.

koverstreet commented 6 months ago

Thanks for the bug report - I added a reproducer to ktest: https://evilpiepirate.org/git/ktest.git/commit/?id=777aa3cdfac16f639d44b721c8dfa8f2b6e97ee3

If you've got more bug reports keep them coming - it'll be a bit before I decide how to fix this properly, but the reports are very much welcome.

koverstreet commented 5 months ago

Just finished pushing the last of the fixes for this bug to the -testing branch.

That included a bunch of fixes for directory connectivity checking, plus fixing reattaching subvolumes - and a new btree, BTREE_ID_subvolume_children, which lets us walk subvolumes according to the filesystem path heirarchy.

For now it's just used to detect and disallow deleting a subvolume that has child subvolumes, but we'll probably add recursive subvolume deletion in the future if it's asked for.

boomshroom commented 5 months ago

Strictly speaking, the commit mentioning this issue doesn't technically fix this issue. I'm pretty sure this issue was actually fixed earlier. The commit in question instead fixes the situation that lead to this issue. It basically resolves it, but it's still good for fsck to be prepared for anything.

koverstreet commented 5 months ago

yup, that's my approach - always fix fsck first, then fix the issue that caused the corruption.

take advantage of the chaos monkey when he appears