kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

check/main.c:2003: add_missing_dir_index: BUG_ON `ret` triggered, value -17 #212

Closed elliotclee closed 2 years ago

elliotclee commented 5 years ago

Using kdave/btrfs-progs devel branch as of today.

(gdb) r check --repair /dev/md126p6
...
enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/md126p6
UUID: eb57a951-7723-4f5d-8b8e-a7cfb35b5600
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
Missing extent item in extent tree for disk_bytenr 1122072567808, num_bytes 516096
Missing extent item in extent tree for disk_bytenr 1121574080512, num_bytes 520192
Missing extent item in extent tree for disk_bytenr 1007552684032, num_bytes 516096
Missing extent item in extent tree for disk_bytenr 1007550844928, num_bytes 520192
Missing extent item in extent tree for disk_bytenr 1027975229440, num_bytes 126976
Missing extent item in extent tree for disk_bytenr 1027862646784, num_bytes 913408
Missing extent item in extent tree for disk_bytenr 1027901038592, num_bytes 1044480
Missing extent item in extent tree for disk_bytenr 1027974176768, num_bytes 1044480
Missing extent item in extent tree for disk_bytenr 1027975229440, num_bytes 126976
Missing extent item in extent tree for disk_bytenr 1027975356416, num_bytes 913408
Missing extent item in extent tree for disk_bytenr 1027859095552, num_bytes 1044480
Missing extent item in extent tree for disk_bytenr 1027860144128, num_bytes 1044480
Missing extent item in extent tree for disk_bytenr 2350452736, num_bytes 516096
Missing extent item in extent tree for disk_bytenr 2351005696, num_bytes 520192
Missing extent item in extent tree for disk_bytenr 2351755264, num_bytes 516096
Missing extent item in extent tree for disk_bytenr 2352414720, num_bytes 520192
repairing missing dir index item for inode 58352347
check/main.c:2003: add_missing_dir_index: BUG_ON ret triggered, value -17
...
(gdb) where

#0 0x00007ffff7c82e35 in raise () from /lib64/libc.so.6
#1 0x00007ffff7c6d895 in abort () from /lib64/libc.so.6
#2 0x0000000000457f7b in bugon_trace (val=, line=2003, func=, filename=0x4ac639 "check/main.c", assertion=0x48a945 "ret") at ./kerncompat.h:123
#3 add_missing_dir_index (backref=0xadc4e30, rec=0x4fad8e10, inode_cache=0x7fffffffd108, root=0x133f500) at check/main.c:2003
#4 repair_inode_backrefs (delete=0, inode_cache=0x7fffffffd108, rec=0x4fad8e10, root=0x133f500) at check/main.c:2152
#5 check_inode_recs (inode_cache=0x7fffffffd108, root=0x133f500) at check/main.c:2909
#6 check_fs_root (wc=0x7fffffffd080, root_cache=0x7fffffffd658, root=) at check/main.c:3630
#7 check_fs_roots (root_cache=0x7fffffffd658, fs_info=0x4d9e20) at check/main.c:3709
#8 do_check_fs_roots (fs_info=fs_info@entry=0x4d9e20, root_cache=root_cache@entry=0x7fffffffd658) at check/main.c:3826
#9 0x0000000000460741 in cmd_check (cmd=, argc=, argv=) at check/main.c:10233
#10 0x000000000040d185 in cmd_execute (argv=0x7fffffffd7d0, argc=3, cmd=0x4cf8a0 ) at cmds/commands.h:125
#11 main (argc=3, argv=0x7fffffffd7d0) at btrfs.c:386

Attached is the complete output of 'btrfs check --readonly /dev/md126p6': repo-check.out.gz
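For readers decoding the BUG_ON value: assuming `ret` follows the usual negative-errno convention, -17 corresponds to errno 17, which on Linux is EEXIST ("File exists"). A minimal sketch of that decoding follows; `describe_ret` is a hypothetical helper written for illustration, not a btrfs-progs function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * describe_ret() is a hypothetical helper, not part of btrfs-progs.
 * It assumes ret is zero or a negated errno value.
 */
const char *describe_ret(int ret)
{
	assert(ret <= 0);	/* negative-errno convention assumed */
	return strerror(-ret);	/* -(-17) is 17, i.e. EEXIST on Linux */
}
```

Calling `describe_ret(-17)` on a glibc system should yield the EEXIST message, which would suggest the repair tried to insert a dir index item whose key was already present. That reading is an inference from the error code, not something confirmed in this report.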

elliotclee commented 5 years ago

With the latest devel branch, I get:

enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/md126p6
UUID: eb57a951-7723-4f5d-8b8e-a7cfb35b5600
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
bad extent [1004374429696, 1004374437888), type mismatch with chunk
bad extent [1004374544384, 1004374695936), type mismatch with chunk
bad extent [1004374704128, 1004374749184), type mismatch with chunk
bad extent [1004374749184, 1004374777856), type mismatch with chunk
bad extent [1004374777856, 1004374790144), type mismatch with chunk
bad extent [1004374790144, 1004374810624), type mismatch with chunk
bad extent [1004374810624, 1004374814720), type mismatch with chunk
bad extent [1004374814720, 1004374818816), type mismatch with chunk
bad extent [1004374818816, 1004374827008), type mismatch with chunk
bad extent [1004374827008, 1004374831104), type mismatch with chunk
bad extent [1004374831104, 1004374835200), type mismatch with chunk
bad extent [1004374835200, 1004374839296), type mismatch with chunk
bad extent [1004374839296, 1004458291200), type mismatch with chunk
extent-tree.c:3332: btrfs_fix_block_accounting: BUG_ON ret triggered, value -1
./btrfs[0x417aef]
./btrfs(btrfs_fix_block_accounting+0x1dc)[0x41e4b8]
./btrfs[0x45e4ec]
./btrfs[0x460384]
./btrfs(main+0x94)[0x40d185]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f2a4fa1af43]
./btrfs(_start+0x2e)[0x40cd1e]
Aborted (core dumped)

adam900710 commented 4 years ago

Would you mind checking how `btrfs check --repair --mode=lowmem` works?

Of course, we need to fix this problem, but I'm a little curious about how well lowmem mode works.

elliotclee commented 4 years ago

When I run it in lowmem mode, it does start fixing up some stuff, but gets stuck in an infinite loop inside check_inode_ref(). It gets to

1207 if (tmp_err && repair) {
1208         ret = repair_ternary_lowmem(root, ref_key->offset,
1209                                     ref_key->objectid, index, namebuf,
1210                                     name_len, imode_to_type(mode),
1211                                     tmp_err);
1212         if (!ret) {
1213                 need_research = 1;
1214                 goto begin;
1215         }
1216 }

But repair_ternary_lowmem repeatedly returns 0 for the same item, so it keeps on goto-ing back to the begin label...

The printout at the end is a constant stream of:
Add ref/dir_item of inode 58316283 name V01tmp.log filetype 1

Thanks, Elliot

elliotclee commented 4 years ago

If at line 1212 I play with gdb and set ret to 1, the program then prints out:

ERROR: root 5 DIR INDEX[58316021 18446744073709551615] missing name V01tmp.log filetype 1
ERROR: errors found in fs roots
found 1211222577152 bytes used, error(s) found
total csum bytes: 1179626448
total tree bytes: 2969993216
total fs tree bytes: 1704935424
total extent tree bytes: 46497792
btree space waste bytes: 368202312
file data blocks allocated: 360056063328256
 referenced 1188951076864

and exits.
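The two comments above describe a retry loop that never ends because the repair call keeps reporting success (0) while the same error is found again on the next pass; forcing a non-zero `ret` breaks the cycle. The sketch below models that shape and shows one possible guard, a cap on the number of passes. Every name in it (`struct item`, `try_repair`, `check_item`, `max_passes`) is hypothetical and chosen for illustration; it is not the btrfs-progs code and not necessarily how the problem was or should be fixed upstream.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one item being checked. */
struct item {
	int errors;		/* non-zero while the item is still inconsistent */
	bool repair_sticks;	/* whether a repair actually clears the error */
};

/* Hypothetical repair routine: always reports success (0), even when
 * nothing was changed. This mirrors the behaviour described above. */
static int try_repair(struct item *it)
{
	if (it->repair_sticks)
		it->errors = 0;
	return 0;
}

/*
 * Check an item, retrying after each "successful" repair.
 * Returns the number of passes taken, or -1 once max_passes repair
 * attempts have been made without the error going away.
 */
int check_item(struct item *it, int max_passes)
{
	int passes = 0;

begin:
	passes++;
	if (it->errors) {
		int ret = try_repair(it);

		if (!ret) {
			/* Without this cap, a repair that reports success
			 * but changes nothing would loop forever. */
			if (passes >= max_passes)
				return -1;
			goto begin;
		}
		return -1;	/* repair itself reported failure */
	}
	return passes;
}
```

With a repair that sticks, `check_item` returns after the second pass; with one that does not, it gives up after `max_passes` attempts instead of spinning. An alternative guard would compare the detected error before and after the repair and stop when nothing changed.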

elliotclee commented 4 years ago

lowmem mode definitely works better - it made it possible to delete some weird invalid directory entries, for example - but I have to run it several times and each time it seems to pick up a few more issues, and it has problems fixing that V01tmp.log entry:

[root@shiny Windows]# ls -li /repo/Users/lees/foo/AppData/Local/Microsoft/Windows/WebCache
ls: cannot access '/repo/Users/lees/foo/AppData/Local/Microsoft/Windows/WebCache/V01tmp.log': No such file or directory
total 0
? -????????? ? ? ? ? ? V01tmp.log

elliotclee commented 4 years ago

I wound up running 'btrfs check' in lowmem mode, alternating with deleting the files/dirs that were newly deletable.

Did this about 20 times and the FS is now back to sanity. 'btrfs check' in non-lowmem mode now completes successfully.

adam900710 commented 4 years ago

I should have asked for a btrfs-image/binary dump in the first place...

elliotclee commented 4 years ago

Filesystem is 1.3T. I don't have a place to host it, and my 3 Mbps uplink speed means uploading it would take around 52 days.

Will try to see how big the btrfs-image output is.
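As a rough cross-check of the upload estimate, here is a minimal sketch of the arithmetic. It assumes the link sustains its full nominal rate with no protocol overhead, so it gives a lower bound; both function names are hypothetical.

```c
#include <assert.h>

/* Seconds needed to send `bytes` over a link of `bits_per_sec`,
 * ignoring protocol overhead and retransmissions. */
double upload_seconds(double bytes, double bits_per_sec)
{
	assert(bits_per_sec > 0.0);
	return bytes * 8.0 / bits_per_sec;
}

/* Convert seconds to days (86400 seconds per day). */
double seconds_to_days(double seconds)
{
	return seconds / 86400.0;
}
```

Under these assumptions, `seconds_to_days(upload_seconds(1.3e12, 3e6))` comes to roughly 40 days for 1.3 TB in decimal units, or about 44 days if the size is read as 1.3 TiB. Real-world overhead and a link that rarely holds its nominal rate push the figure higher, so an estimate of many weeks is the right order of magnitude.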

elliotclee commented 4 years ago

So the filesystem is giving those errors again in other places, as well as some even worse ones, and 'btrfs check --repair' from the latest devel tree, in both lowmem and regular modes, will not even find the problems, let alone fix them.

I made a btrfs-image, which at 1.4G is a lot better than 1.3T.

In addition to the types of problems already mentioned, there are now also a few dirs that, when accessed, generate kernel messages like the following:

Jan 15 16:10:18 shiny kernel: BTRFS critical (device md126p6): corrupt leaf: root=5 block=1814589652992 slot=79 ino=61781978 file_offset=0, invalid ram_bytes for file extent, have 7474, should be aligned to 4096
Jan 15 16:10:18 shiny kernel: BTRFS error (device md126p6): block=1814589652992 read time tree block corruption detected
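The first kernel message reports an alignment failure: a `ram_bytes` value of 7474 where a multiple of 4096 was expected. The sketch below shows the kind of power-of-two alignment test that produces such a verdict. `is_aligned` is a hypothetical helper written for illustration, not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when value is a whole multiple of align.
 * align must be a non-zero power of two (4096 in the message above).
 */
bool is_aligned(uint64_t value, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value & (align - 1)) == 0;
}
```

Here `is_aligned(7474, 4096)` is false, since 7474 leaves a remainder of 3378, while a value such as 8192 passes. Whether the rejected value reflects real on-disk damage or a check that is stricter than what older code wrote cannot be determined from this thread alone.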

I've attached the output of btrfs check...

btrfs-check.out.gz

Am uploading the btrfs image to my Google Drive - will post a link here when it's done in an hour and a half.

elliotclee commented 4 years ago

https://drive.google.com/open?id=1eAtjvJdLqCDB0C0OedzKuSX3TJ8uz0eT is the btrfs-image file.

marcosps commented 4 years ago

@adam900710 trying to restore this dumped image makes btrfs-image crash (both stable and devel branches):

WARNING: cannot find a chunk, using logical
WARNING: cannot find a chunk, using logical
WARNING: cannot find a chunk, using logical
ctree.c:2912: btrfs_del_leaf: Warning: assertion btrfs_header_generation(leaf) != trans->transid failed, value 1
./btrfs-image(+0x2cd52)[0x564d6856bd52]
./btrfs-image(btrfs_del_items+0x2d0)[0x564d6856c070]
./btrfs-image(+0x1141b)[0x564d6855041b]
./btrfs-image(main+0x325)[0x564d685516e3]
/lib64/libc.so.6(__libc_start_main+0xeb)[0x7f5e529a0ceb]
./btrfs-image(_start+0x2a)[0x564d6854c93a]
ERROR: failed to insert dev extent 1 1225173106688: File exists
ERROR: failed to fix chunks and devices mapping, the fs may not be mountable: File exists
WARNING: reserved space leaked, flag=0x4 bytes_reserved=114688
extent buffer leak: start 1812585250816 len 16384
extent buffer leak: start 1812592427008 len 16384
WARNING: dirty eb leak (aborted trans): start 1812592427008 len 16384
extent buffer leak: start 2124003966976 len 16384
WARNING: dirty eb leak (aborted trans): start 2124003966976 len 16384
extent buffer leak: start 2124003983360 len 16384
WARNING: dirty eb leak (aborted trans): start 2124003983360 len 16384
extent buffer leak: start 1812601438208 len 16384
WARNING: dirty eb leak (aborted trans): start 1812601438208 len 16384
extent buffer leak: start 1812596998144 len 16384
WARNING: dirty eb leak (aborted trans): start 1812596998144 len 16384
ERROR: restore failed: -17

Any ideas? Thanks in advance!

adam900710 commented 4 years ago

@elliotclee Do you still have the corrupted fs? If so, could you please provide the following dump?

# btrfs ins dump-tree -b 1814589652992 <device>

Thanks

elliotclee commented 4 years ago

Sorry, I have long since wiped the fs and restored the files. I will keep an eye out for problems on the new fs though.

nphantasm commented 10 months ago

I came across this issue with btrfs-progs 6.6.2, and it can probably be hit with 6.6.3 too. btrfs check crashed only when "repairing missing dir index item for inode x" was printed twice. After a few runs, btrfs check no longer returns any errors, and the non-existent directory it was trying to fix is no longer accessible, as it should be.

[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
Deleting bad dir index [24986570,96,44] root 271
repairing missing dir index item for inode 26950385
Deleting bad dir index [24986570,96,44] root 16238
repairing missing dir index item for inode 26950385
Deleting bad dir index [24986570,96,44] root 16854
repairing missing dir index item for inode 26950385
Deleting bad dir index [24986570,96,44] root 16984
repairing missing dir index item for inode 26950385
Deleting bad dir index [24986570,96,44] root 17010
repairing missing dir index item for inode 26950385
Deleting bad dir index [24986570,96,44] root 17030
repairing missing dir index item for inode 26950385
Deleting bad dir index [24986570,96,44] root 17064
repairing missing dir index item for inode 26950385
repairing missing dir index item for inode 26950385
check/main.c:2133: add_missing_dir_index: BUG_ON ret triggered, value -17
btrfs(+0x1d4fd)[0x55e34ca514fd]
btrfs(+0x9d153)[0x55e34cad1153]
btrfs(+0x9f56e)[0x55e34cad356e]
btrfs(+0xaf9c1)[0x55e34cae39c1]
btrfs(main+0x99)[0x55e34ca4e179]
/usr/lib/libc.so.6(+0x27cd0)[0x7fdca6cb1cd0]
/usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7fdca6cb1d8a]
[1] 1179 IOT instruction (core dumped) btrfs check --repair /dev/mapper/cryptroot

Root 271 in the output above is a subvolume, and all the other roots are older snapshots of that subvolume. What btrfs check is trying to fix here is a directory entry that should no longer exist ("No such file or directory" is returned to any program that tries to access it, borgbackup for example).

I didn't make an image before trying to repair the FS, so I can't really help more. The binary is from the Arch Linux ISO for 2023-12-01.