kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

Attempted csum change, now CHANGING_FSID_V2 is stuck. I would like to unset it #808

Open aidan-gibson opened 3 weeks ago

aidan-gibson commented 3 weeks ago

I used btrfstune to change the csum on a 32TB striped HDD setup. It ran for 5 days and did not work. On trying to mount, I now get:

kernel: BTRFS error: device /dev/sdd has incomplete metadata_uuid change, please use btrfstune to complete
kernel: BTRFS error: device /dev/sdc1 has incomplete metadata_uuid change, please use btrfstune to complete

On running sudo btrfs inspect-internal dump-super /dev/sdc1, I get:

csum_type       0 (crc32c)
csum_size       4
csum            0x14973811 [match]
bytenr          65536
flags           0x1000000001
            ( WRITTEN |
              CHANGING_FSID_V2 )
magic           _BHRfS_M [match]

(and similarly for /dev/sdd)

I'd like to just nuke the fs at this point, but any data recovery would be nice. Is there a way I can manually disable CHANGING_FSID_V2? I'm thinking I can then mount without csum and recover the files. Thanks.

adam900710 commented 3 weeks ago

You are not changing btrfs csum but fsid?

Please provide the full command you used for the conversion. Since csum change is still only available in experimental builds, unless you built btrfs-progs yourself, I'm wondering what you are really doing.

aidan-gibson commented 3 weeks ago

Original Command: sudo btrfstune --csum xxhash64 /dev/sdd

Let me know if I can provide any further context that would help.

adam900710 commented 3 weeks ago

And btrfs-progs version?

btrfstune --csum doesn't touch the CHANGING_FSID_V2 flag at all, it utilizes BTRFS_SUPER_FLAG_CHANGING_DATA_CSUM and BTRFS_SUPER_FLAG_CHANGING_META_CSUM instead.

Thus I'm wondering what version you're using.

aidan-gibson commented 3 weeks ago

Huh, that's odd. I didn't do anything else to it.

I used git version from this time.

I would really like to think carefully about this and try to fix it in place, as I don't have extra drives to store the 32TB.

Would using dd or something to edit CHANGING_FSID_V2 out of the superblock be completely hare-brained? I realize this is very much on me for using an experimental feature on such a large amount of data, and if it ends up irrecoverable it is what it is, but tips on the best approach forward to fix this in place would be very much appreciated.

adam900710 commented 3 weeks ago

You can try "btrfs check" first to see what the error is (it can be very slow and can even hit ENOMEM in original mode).

But first, can you try "btrfs ins dump-tree -t root"? IIRC for a real csum change we create an extra csum root, and if that's there we know it's really a csum change.

Otherwise I'm really confused about whether it's a csum change or an fsid change.

aidan-gibson commented 3 weeks ago

sudo btrfs ins dump-tree -t root /dev/sdd
btrfs-progs v6.8.1
root tree
leaf 72121592856576 items 13 free space 12220 generation 1007246 owner ROOT_TREE
leaf 72121592856576 flags 0x1(WRITTEN) backref revision 1
fs uuid 90a947f2-52cb-462b-8054-41f99bbcdbf9
chunk uuid d956c74b-41cf-4943-a130-fe54c4285f46
    item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
        generation 1007246 root_dirid 0 bytenr 72121592954880 byte_limit 0 bytes_used 1689305088
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 2 generation_v2 1007246
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 1 key (DEV_TREE ROOT_ITEM 0) itemoff 15405 itemsize 439
        generation 485351 root_dirid 0 bytenr 45465760776192 byte_limit 0 bytes_used 2686976
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 1 generation_v2 485351
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 2 key (FS_TREE INODE_REF 6) itemoff 15388 itemsize 17
        index 0 namelen 7 name: default
    item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
        generation 485128 root_dirid 256 bytenr 45479377813504 byte_limit 0 bytes_used 1529004032
        last_snapshot 480325 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 2 generation_v2 485128
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 485128 otransid 0 stransid 0 rtransid 0
        ctime 1717169428.385246885 (2024-05-31 08:30:28)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 4 key (ROOT_TREE_DIR INODE_ITEM 0) itemoff 14789 itemsize 160
        generation 2 transid 0 size 0 nbytes 16384
        block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
        sequence 0 flags 0x0(none)
        atime 1694601368.0 (2023-09-13 03:36:08)
        ctime 1694601368.0 (2023-09-13 03:36:08)
        mtime 1694601368.0 (2023-09-13 03:36:08)
        otime 1694601368.0 (2023-09-13 03:36:08)
    item 5 key (ROOT_TREE_DIR INODE_REF 6) itemoff 14777 itemsize 12
        index 0 namelen 2 name: ..
    item 6 key (ROOT_TREE_DIR DIR_ITEM 2378154706) itemoff 14740 itemsize 37
        location key (FS_TREE ROOT_ITEM 18446744073709551615) type DIR
        transid 0 data_len 0 name_len 7
        name: default
    item 7 key (CSUM_TREE ROOT_ITEM 0) itemoff 14301 itemsize 439
        generation 1007245 root_dirid 0 bytenr 45465768902656 byte_limit 0 bytes_used 47835578368
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 3 generation_v2 1007245
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 8 key (UUID_TREE ROOT_ITEM 0) itemoff 13862 itemsize 439
        generation 440215 root_dirid 0 bytenr 64998765494272 byte_limit 0 bytes_used 16384
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 0 generation_v2 440215
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 9 key (FREE_SPACE_TREE ROOT_ITEM 0) itemoff 13423 itemsize 439
        generation 1007246 root_dirid 0 bytenr 72121593069568 byte_limit 0 bytes_used 11763712
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 2 generation_v2 1007246
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 10 key (BLOCK_GROUP_TREE ROOT_ITEM 0) itemoff 12984 itemsize 439
        generation 1007246 root_dirid 0 bytenr 72121597050880 byte_limit 0 bytes_used 1392640
        last_snapshot 0 flags 0x0(none) refs 0
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 1 generation_v2 1007246
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)
    item 11 key (CSUM_CHANGE TEMPORARY_ITEM 1) itemoff 12984 itemsize 0
        temporary item objectid CSUM_CHANGE offset 1
        target csum type xxhash64 (1)
    item 12 key (DATA_RELOC_TREE ROOT_ITEM 2) itemoff 12545 itemsize 439
        generation 480331 root_dirid 256 bytenr 48007776305152 byte_limit 0 bytes_used 16384
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 0 generation_v2 480331
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1969-12-31 16:00:00)
        otime 0.0 (1969-12-31 16:00:00)
        stime 0.0 (1969-12-31 16:00:00)
        rtime 0.0 (1969-12-31 16:00:00)

adam900710 commented 3 weeks ago

It's really csum change, there is an item for it:

    item 11 key (CSUM_CHANGE TEMPORARY_ITEM 1) itemoff 12984 itemsize 0
        temporary item objectid CSUM_CHANGE offset 1
        target csum type xxhash64 (1)

And I figured out the root cause, we have conflicting flags! The CHANGING_FSID_V2 is the same bit as CHANGING_DATA_CSUM!!!! Damn it!

The flags are defined in two different files, no wonder it can lead to conflicts.
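
To make the collision concrete, here is a minimal sketch of how one bit can decode to two different names. Only the bit positions mentioned in this thread are used (WRITTEN as bit 0, CHANGING_FSID_V2 as 1ULL << 36, and CHANGING_DATA_CSUM reported to share that bit at the time); the `decode` helper and the two lookup tables are hypothetical and are not btrfs-progs code.

```python
# Bit positions taken from this thread: WRITTEN is bit 0 and
# BTRFS_SUPER_FLAG_CHANGING_FSID_V2 is (1ULL << 36). The thread reports that
# CHANGING_DATA_CSUM used the same bit in btrfs-progs at the time.
WRITTEN = 1 << 0
CHANGING_FSID_V2 = 1 << 36
CHANGING_DATA_CSUM = 1 << 36  # the conflicting definition described above


def decode(flags, names):
    """Return the names whose bit is set in flags, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if flags & bit]


# Hypothetical table: how the fsid-change code would name the bits.
FSID_VIEW = {WRITTEN: "WRITTEN", CHANGING_FSID_V2: "CHANGING_FSID_V2"}
# Hypothetical table: how the csum-change code would name the very same bits.
CSUM_VIEW = {WRITTEN: "WRITTEN", CHANGING_DATA_CSUM: "CHANGING_DATA_CSUM"}

flags = 0x1000000001  # value from the dump-super output earlier in the thread
print(decode(flags, FSID_VIEW))
print(decode(flags, CSUM_VIEW))
```

Both views accept the same superblock value, which is consistent with the mount error complaining about a metadata_uuid change even though only a csum change was started.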

For now you can only resume the change with btrfstune --csum xxhash64. The root dump shows that the conversion is still in the data csum stage. (The process is slow because we read all the data, verify the old csum, and calculate the new csum.)

If you don't feel safe or want a progress bar, I can definitely add it very soon.

I'll definitely fix the conflicting flags first.

aidan-gibson commented 3 weeks ago

Ah that makes sense! Cool! Lines up with my check output

❯ sudo btrfs check --progress /dev/sdd
[sudo] password for bruh:
Opening filesystem to check...
Checking filesystem on /dev/sdd
UUID: 90a947f2-52cb-462b-8054-41f99bbcdbf9
[1/7] checking root items                      (0:08:32 elapsed, 18677485 items checked)
[2/7] checking extents                         (0:06:45 elapsed, 3117277 items checked)
[3/7] checking free space tree                 (0:07:18 elapsed, 27918 items checked)
[4/7] checking fs roots                        (1:44:29 elapsed, 93045 items checked)
ERROR: csum overlap, current bytenr=21068614664192 prev_end=32768402849792, eb=72121594576896 slot=0
[5/7] checking csums (without verifying data)  (0:02:18 elapsed, 6263434 items checked)
ERROR: errors found in csum tree
[6/7] checking root refs                       (0:00:00 elapsed, 3 items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 31314030084096 bytes used, error(s) found
total csum bytes: 39665946740
total tree bytes: 51073302528
total fs tree bytes: 1529020416
total extent tree bytes: 1689354240
btree space waste bytes: 8067682213
file data blocks allocated: 438905287233536
 referenced 31827732127744

adam900710 commented 3 weeks ago

That overlapping csum is the new csum item. And that's expected to overlap.

We generate new csums (using a different objectid, so the old and new ones coexist). The generation also reads the data (to be extra safe, making sure the old csum matches).

After all new csums are generated, we delete the old ones.

Then we rename the new ones to use the regular objectid.

Then we rewrite all metadata using the new csum (in place, which is a little dangerous).

So far it's the data csum generation that takes the most time.
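
The stages above can be sketched as a toy model. This is not how btrfstune is implemented: the "csum tree" is a plain dict, `OLD_ID` and `NEW_ID` are hypothetical labels for the regular and temporary objectids, and the two checksum functions are standard-library stand-ins (plain CRC-32 instead of crc32c, an 8-byte BLAKE2b digest instead of xxhash64). The metadata rewrite stage is not modelled.

```python
import hashlib
import zlib

OLD_ID = "OLD"  # hypothetical label for the regular csum objectid
NEW_ID = "NEW"  # hypothetical label for the temporary objectid


def old_csum(data: bytes) -> bytes:
    # Stand-in for crc32c: zlib.crc32 is plain CRC-32, not crc32c.
    return zlib.crc32(data).to_bytes(4, "little")


def new_csum(data: bytes) -> bytes:
    # Stand-in for xxhash64: an 8-byte BLAKE2b digest from the stdlib.
    return hashlib.blake2b(data, digest_size=8).digest()


def convert(blocks, csum_tree):
    """blocks: {bytenr: data}; csum_tree: {(objectid, bytenr): csum}."""
    # Stage 1: read every block, verify the old csum, and store the new
    # csum under the temporary id. Old and new entries now coexist.
    for bytenr, data in blocks.items():
        if csum_tree[(OLD_ID, bytenr)] != old_csum(data):
            raise ValueError(f"old csum mismatch at {bytenr}")
        csum_tree[(NEW_ID, bytenr)] = new_csum(data)
    peak_entries = len(csum_tree)
    # Stage 2: delete the old csums.
    for bytenr in blocks:
        del csum_tree[(OLD_ID, bytenr)]
    # Stage 3: rename the new csums to the regular id.
    for bytenr in blocks:
        csum_tree[(OLD_ID, bytenr)] = csum_tree.pop((NEW_ID, bytenr))
    return peak_entries


blocks = {0: b"a" * 4096, 4096: b"b" * 4096}
tree = {(OLD_ID, nr): old_csum(data) for nr, data in blocks.items()}
peak = convert(blocks, tree)
print(peak, sorted(tree))
```

In this model the tree briefly holds two entries per block at the end of stage 1, covering the same byte ranges, which mirrors why extra metadata space is needed and why a checker that is unaware of the temporary entries could see them as overlapping.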

adam900710 commented 3 weeks ago

You can try this branch: https://github.com/adam900710/btrfs-progs/tree/csum_convert_progress

It adds extra progress output (--progress option) and doesn't change the conflicting flags.

aidan-gibson commented 3 weeks ago

Excellent! Will build and get running now, thank you!

adam900710 commented 3 weeks ago

More importantly, I should add kernel rescue support for the half converted fs.

And of course, a way to cancel the conversion.

aidan-gibson commented 3 weeks ago

Immediately ran into an issue actually

sudo ./btrfstune --csum xxhash64 --progress /dev/sdd
[sudo] password for bruh:
WARNING: Experimental build with unstable or unfinished features
WARNING: Switching checksums is experimental, do not use for valuable data!

Proceed to switch checksums
ERROR: failed to start transaction: Unknown error -28
ERROR: failed to start transaction: No space left on device
ERROR: failed to generate new data csums: No space left on device
ERROR: failed to resume data checksum change: No space left on device
ERROR: failed to resume unfinished csum change: No space left on device
ERROR: btrfstune failed
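
As a side note on the first error line: the "Unknown error -28" message presumably comes from formatting a negative errno value directly, and 28 is the errno that the later lines spell out. A small check, assuming a Linux system:

```python
import errno
import os

code = -28  # the value printed in "Unknown error -28"

# Kernel-style code conventionally returns negative errno values,
# so negate before looking the number up.
name = errno.errorcode[-code]
print(name, "-", os.strerror(-code))
```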

This kinda makes sense, as before I attempted this conversion I had about 100GB and change free (again, total 32TB). In hindsight this feels pretty stupid of me.

The question is: what can I do about this now?

aidan-gibson commented 3 weeks ago

Oh wait sorry I'm silly, I couldn't mount prior because of the whole CHANGING_FSID_V2 bit stuff which is presumably fixed now; just gotta update my system btrfs-progs, then I can mount, delete some stuff, and presumably continue(?)

I'm on Arch; for the conversion I've been using btrfstune from a manual build while there is a separate btrfs-progs my system uses.

adam900710 commented 3 weeks ago

The ENOSPC problem is a separate issue I'm looking into; it will need some time to get a proper fix.

Furthermore, you SHOULD NOT update btrfs-progs to any version with the conflicting flags fix. Or you will no longer be able to use btrfstune to resume the conversion.

Finally, even if I implement kernel rescue support, it would only mean READ-ONLY support: you cannot touch anything on the fs, only read important data out.

So for now your best chance is to wait for another testing branch and then try to resume the conversion.

aidan-gibson commented 3 weeks ago

Got it, at this point it would be totally acceptable if I only got READONLY so I could save the important stuff and nuke the rest. Thanks again for your efforts on this.

aidan-gibson commented 3 weeks ago

Is there a way to manually change superblock bits though? With dd or something? So I could disable BTRFS_SUPER_FLAG_CHANGING_FSID_V2 (1ULL << 36)? I'm aware the issue with that is this version of btrfs-progs won't know I'm in the middle of a convert as well, but I plan on just recovering some files and nuking the filesystem.

If I'm not mistaken, the fs is still mountable mid-checksum change(?) And the current thing preventing me from mounting is that bit, throwing the error BTRFS error: device /dev/sdc1 has incomplete metadata_uuid change, please use btrfstune to complete (and same error for /dev/sdd, as the fs is striped across both devices).
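
For context on what such a manual edit would involve, here is a read-only sketch that locates the flags field. It assumes the superblock sits at byte 65536 (the "bytenr 65536" shown in the dump-super output above) and that the flags field follows a 32-byte csum, a 16-byte fsid and an 8-byte bytenr, stored little-endian; that field layout is an assumption of this sketch, not something stated in the thread. It runs against a synthetic buffer rather than a real device, and `read_super_flags` is a hypothetical helper.

```python
import struct

SUPER_OFFSET = 65536          # "bytenr 65536" from the dump-super output
FLAGS_OFFSET = 32 + 16 + 8    # assumed layout: csum[32], fsid[16], bytenr (u64)
CHANGING_FSID_V2 = 1 << 36    # (1ULL << 36), as quoted in this thread


def read_super_flags(image: bytes) -> int:
    """Read the 64-bit little-endian flags field from a superblock image."""
    (flags,) = struct.unpack_from("<Q", image, SUPER_OFFSET + FLAGS_OFFSET)
    return flags


# Synthetic image: zeros, with the flags value reported earlier in the thread.
image = bytearray(SUPER_OFFSET + 4096)
struct.pack_into("<Q", image, SUPER_OFFSET + FLAGS_OFFSET, 0x1000000001)

flags = read_super_flags(bytes(image))
cleared = flags & ~CHANGING_FSID_V2  # what the field would hold with the bit off
print(hex(flags), bool(flags & CHANGING_FSID_V2), hex(cleared))
```

Note that the flags field lies inside the region covered by the superblock's own checksum (the csum line in the dump-super output), so a bare dd edit of these bytes would likely leave the superblock failing its checksum unless the csum were recomputed as well.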

adam900710 commented 2 weeks ago

It's really stage dependent.

If you're still at the stage of generating new data csums, your fs is totally fine, with the new data csums taking extra metadata space. But it's still not good for an RW mount, because the new data csums won't be deleted/added/modified at all.

If you're at the stage of deleting old data csums, a lot of data reads would fail unless you use rescue mode to ignore data csums completely. The same goes for the data csum rename stage.

If you're at the metadata csum rewrite stage, without special kernel handling, tons of metadata reads would just fail.

adam900710 commented 2 weeks ago

If you can build a testing kernel, please try this branch:

https://github.com/adam900710/linux/tree/rescue_changes

Then mount the interrupted fs with "rescue=all,ro"; you should be able to salvage all files.

aidan-gibson commented 2 weeks ago

It seems a safe bet it's at the generating new datacsum stage, since it ENOSPC-ed(?) Am only interested in RO at this point as I intend to nuke the fs.

If it comes down to building a testing kernel I think I need to prepare another drive to not risk my main btrfs root partition, which is currently working fine and I don't want messed with.

adam900710 commented 2 weeks ago

The new code only affects the "rescue=" mount option group, so you should be fine even using the existing btrfs partition.

But being safe is always much better than regretting.

aidan-gibson commented 2 weeks ago

Sounds good then, I'll get to work on that and report back

adam900710 commented 2 weeks ago

If you can wait for 2 or 3 days, I can create a btrfs-progs for you to cancel the conversion (without any leftover), so you do not need to nuke the fs.

To be honest, I really want a huge fs for us to test the csum conversion on. (Unfortunately I do not have several TBs of space to test locally.)

aidan-gibson commented 2 weeks ago

Sure, I'll just wait for that rather than kernel stuff, thanks.

adam900710 commented 2 weeks ago

Here you go:

https://github.com/adam900710/btrfs-progs/tree/csum_conversion_cancel

I tested on my local fs with 5G of data and an emulated interrupted conversion, and everything works fine. (And it is way faster than the csum generation part.)

The usage is ./btrfs-corrupt-block -X <dev>.

aidan-gibson commented 2 weeks ago

This worked!!!! We're back! Thank you!

aidan-gibson commented 2 weeks ago

Was pretty fast; still took a fair amount of run time, but nowhere near as long as generating the checksums.

adam900710 commented 2 weeks ago

I'd prefer to run a full "btrfs check --readonly" just to be sure, but that can be very slow and memory consuming.

aidan-gibson commented 2 weeks ago

I've run that 3 times over this process and honestly I'm just going to be nuking the filesystem, starting over.