Open Architector4 opened 6 months ago
This is all running on Linux 6.7.2; let me try to build archiso with Linux 6.7.5 and run off that...
edit: no meaningful change as far as i can tell; though a run without --reconstruct_alloc now hangs forever instead of printing loads of Input/output error messages and quitting lol
here is a run of latest git, built in debug mode with optimizations disabled, and with latest Arch packages and kernel; the backtrace seems to be more populated with proper debuginfo
And stdout/stderr:
...in my case here, i wonder if running reset-counters would help, but i don't think i want to do something stupid that may cause even more damage, so please tell me if that would be a good idea or a bad one or irrelevant
i don't mind losing some data if i can at least restore a meaningful amount of it to be honest
edit: nevermind, i made a full copy of the filesystem partition on another drive and did bcachefs reset-counters and it had no effect
the segfault seems to be at inserting a key into the journal, which suggests to me the journal itself might have gotten messed up
is there a way to just delete the journal entries without committing them or doing anything with them? i don't care much if this messes up some filesystem structure if i can get this thing to run and make it at least mountable; and if anything i made a full copy of the filesystem partition to test on lol
Issue persists with latest commit https://github.com/koverstreet/bcachefs-tools/commit/25e84a9917fc8c2f1c7d2976e946c5e5a22b3589
stdout/stderr:
gdb:
it appears the issue is that in bch2_journal_key_insert_take, when move_gap is used, the keys->gap value for some reason is way higher than keys->nr, meaning the gap starts farther than the last element in the allocation and hence ends after it, which is what causes the segfault.
Adding a small if statement right before the move_gap invocation seems to fix it and let fsck continue, but may just be a horrible hack that worsens everything. At least I have my separate backup copy of the partition in case it does.
if (keys->gap > keys->nr) // bad hack?
	keys->gap = keys->nr;
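For anyone following along, here is a minimal sketch of why a gap index larger than the element count overruns the allocation. This is not the bcachefs code: `struct gap_buf`, `gap_end`, `gap_is_valid`, `gap_buf_move_gap` and `gap_buf_insert` are hypothetical names for a simplified gap buffer of ints, and I'm assuming the real structure follows the same general idea (live elements on both sides of one contiguous gap of `size - nr` free slots).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified gap buffer; NOT the bcachefs structure. */
struct gap_buf {
	int    data[8]; /* backing storage */
	size_t size;    /* total slots in data[] */
	size_t nr;      /* number of live elements */
	size_t gap;     /* index where the gap starts; must be <= nr */
};

/* The gap covers [gap, gap + (size - nr)), i.e. all the free slots. */
static size_t gap_end(const struct gap_buf *b)
{
	return b->gap + (b->size - b->nr);
}

/* gap <= nr is exactly the condition that keeps gap_end() <= size. */
static int gap_is_valid(const struct gap_buf *b)
{
	return b->nr <= b->size && b->gap <= b->nr;
}

/* Move the gap so it starts at new_gap; returns -1 instead of touching
 * memory when the state is inconsistent. */
static int gap_buf_move_gap(struct gap_buf *b, size_t new_gap)
{
	if (!gap_is_valid(b) || new_gap > b->nr)
		return -1;

	size_t end = gap_end(b);

	if (new_gap < b->gap) {
		/* Shift elements [new_gap, gap) to just before the old gap end. */
		size_t move = b->gap - new_gap;
		memmove(&b->data[end - move], &b->data[new_gap],
			move * sizeof(b->data[0]));
	} else if (new_gap > b->gap) {
		/* Shift elements after the gap down into its start. */
		size_t move = new_gap - b->gap;
		memmove(&b->data[b->gap], &b->data[end],
			move * sizeof(b->data[0]));
	}
	b->gap = new_gap;
	return 0;
}

/* Insert v at logical position idx by moving the gap there first. */
static int gap_buf_insert(struct gap_buf *b, size_t idx, int v)
{
	if (b->nr == b->size || gap_buf_move_gap(b, idx))
		return -1;
	b->data[b->gap++] = v;
	b->nr++;
	assert(gap_is_valid(b));
	return 0;
}
```

With `gap > nr`, `gap_end()` lands past `size`, so an unchecked `memmove` would read or write outside the array. Clamping `gap` to `nr` as in the hack above avoids the crash, but it only hides whatever left `gap` stale in the first place; the sketch refuses to proceed instead, which is a different design choice, not a claim about what the real fix should be.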
I'll wait for it to finish now...
...Now it looks like it deadlocked on futexes. I think I experienced the same problem before, but didn't bother with it and just reran fsck with different params until it worked, but that doesn't seem feasible here, as I need to make it finish with --reconstruct_alloc specifically.
Interrupting and rerunning it seems to just give me the same effect.
Here's gdb backtraces:
May be related to https://github.com/koverstreet/bcachefs-tools/issues/118 ?
edit: it appears reset-counters might have helped, as now it printed 3 more lines of output (below) and is slowly blipping some CPU usage from the bch_reclaim//de and timers threads... I hope I'm getting anywhere with this
check_lrus... done
check_btree_backpointers... done
check_backpointers_to_extents... done
I suppose it was deadlocking while checking lrus and is now very slowly munching through checking extents to backpointers lol
It's another deadlock. bch_reclaim//de is spinning in the function bch2_journal_reclaim_thread, but otherwise everything is locked in futexes and no progress has happened in the last 14 hours, as far as I can tell. It's a 1TB filesystem on an NVMe SSD and with no subvolumes, so it shouldn't be taking that long I assume lol
the gdb:
edit: i wrote the backup (made before reset-counters) onto the raw nvme partition (removing the LUKS layer), and it now prints the next 3 lines (as in the comment above) too, but still deadlocks.
edit: i copied the recompiled tool to a fat32 partition, did a reboot, and without mounting any bcachefs, got the tool from that partition and tried fsck --reconstruct_alloc -fy again, and this time i didn't get the next 3 lines. Those lines may just be random luck i guess lol
IT MOUNTED! after trying fsck with and without reconstruct_alloc a few times, and running into the deadlock every single time, i decided to try and just mount it, and it worked.
...god, now to clean up all the mess i've done lol
Currently in a bit of a pickle here. I ran bcachefs fsck --reconstruct_alloc -pf /dev/myroot out of boredom, saw it printing a load of messages quickly, assumed that's a normal part of operation of this mode, and decided to ctrl+c it and run it again with -r. Now the bcachefs tool segfaults when I try to do it again and it won't mount. Both latest stable in Arch Linux repos (3:1.6.2-1) and latest master commit 6ff5313cbe0432 segfault.
Writing from my phone and operating from the Arch Linux installer ISO environment at the moment, as I don't have another machine to do stuff with.
Here's a run with gdb, with thr apply all bt
Or building in debug mode and running the same thing (random output messages from the tool itself are missing, I guess gdb logging doesn't capture stderr):
Here's output (stdout+stderr) with just -p (it doesn't segfault but may still be of note)
I really hope I can get this data un-eaten lol