koverstreet / bcachefs

Very long fsck when fs contains a lot of inodes #576

Open marcin-github opened 1 year ago

marcin-github commented 1 year ago

This is the same fs as in issue #546.

Userland fsck takes ages to check the fs. This is what I get on a "clean" fs:

lip 24 14:10:27 mounting version 1.1: snapshot_skiplists opts=metadata_replicas=2,compression=lz4,background_compression=zstd:12,degraded,journal_reclaim_delay=1000,fsck,fix_errors=yes
lip 24 14:10:27 recovering from clean shutdown, journal seq 21102714
lip 24 14:12:14 journal read done, replaying entries 21102714-21102714
lip 24 14:12:15 alloc_read... done
lip 24 14:12:15 stripes_read... done
lip 24 14:12:15 snapshots_read... done
lip 24 14:40:48 check_allocations... done
lip 24 14:40:48 journal_replay... done
lip 24 14:41:39 check_alloc_info... done
lip 24 14:41:39 check_lrus... done
lip 24 14:45:49 check_btree_backpointers... done
lip 24 14:53:19 check_backpointers_to_extents... done
lip 24 14:54:23 check_extents_to_backpointers... done
lip 24 14:54:24 check_alloc_to_lru_refs... done
lip 24 14:54:24 check_snapshot_trees... done
lip 24 14:54:24 check_snapshots... done
lip 24 14:54:24 check_subvols... done
lip 24 14:54:24 delete_dead_snapshots...going read-write
lip 24 15:02:26  done
lip 24 15:08:36 check_inodes... done
lip 24 15:23:14 check_extents... done
lip 24 15:52:48 check_dirents... done
lip 24 15:52:48 check_xattrs... done
lip 24 15:52:48 check_root... done
lip 24 16:26:30 check_directory_structure... done
lip 24 16:40:22 check_nlinks... done
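For reference, the per-phase wall-clock cost of the clean-fs run above can be pulled out of the timestamps with a small throwaway script (a hypothetical helper, not part of bcachefs-tools). Each phase's cost is its "done" time minus the previous phase's "done" time; check_allocations, check_dirents and check_directory_structure each take roughly half an hour:

```python
# Throwaway script: derive per-phase durations for the clean-fs fsck
# run above from the "... done" timestamps in the log.
from datetime import datetime

# "done" timestamps, in log order
done_times = [
    ("alloc_read", "14:12:15"),
    ("stripes_read", "14:12:15"),
    ("snapshots_read", "14:12:15"),
    ("check_allocations", "14:40:48"),
    ("journal_replay", "14:40:48"),
    ("check_alloc_info", "14:41:39"),
    ("check_lrus", "14:41:39"),
    ("check_btree_backpointers", "14:45:49"),
    ("check_backpointers_to_extents", "14:53:19"),
    ("check_extents_to_backpointers", "14:54:23"),
    ("check_alloc_to_lru_refs", "14:54:24"),
    ("check_snapshot_trees", "14:54:24"),
    ("check_snapshots", "14:54:24"),
    ("check_subvols", "14:54:24"),
    ("delete_dead_snapshots", "15:02:26"),
    ("check_inodes", "15:08:36"),
    ("check_extents", "15:23:14"),
    ("check_dirents", "15:52:48"),
    ("check_xattrs", "15:52:48"),
    ("check_root", "15:52:48"),
    ("check_directory_structure", "16:26:30"),
    ("check_nlinks", "16:40:22"),
]

def durations(entries):
    """Attribute to each phase the time since the previous phase finished."""
    out = {}
    prev = None
    for name, ts_s in entries:
        ts = datetime.strptime(ts_s, "%H:%M:%S")
        if prev is not None:
            out[name] = ts - prev
        prev = ts
    return out

# Print phases, slowest first
for name, dur in sorted(durations(done_times).items(),
                        key=lambda kv: kv[1], reverse=True):
    print(f"{name:32} {dur}")
```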

This is a snippet from the log of a previous fsck:

lip 24 11:17:05 mounting version 1.1: snapshot_skiplists opts=metadata_replicas=2,compression=lz4,background_compression=zstd:12,degraded,journal_reclaim_delay=1000,fsck,fix_errors=yes
lip 24 11:17:05 recovering from clean shutdown, journal seq 21100641
lip 24 11:18:20 journal read done, replaying entries 21100641-21100641
lip 24 11:18:21 alloc_read... done
lip 24 11:18:21 stripes_read... done
lip 24 11:18:21 snapshots_read... done
lip 24 11:44:16 check_allocations... done
lip 24 11:44:16 journal_replay... done
lip 24 11:44:49 check_alloc_info... done
lip 24 11:44:49 check_lrus...incorrect lru entry: lru fragmentation time 2097152
lip 24 11:44:49   u64s 5 type set 18446462598734938112:3341630:0 len 0 ver 0
lip 24 11:44:49   for u64s 12 type alloc_v4 0:3341630:0 len 0 ver 0:
lip 24 11:44:49   gen 3 oldest_gen 0 data_type free
lip 24 11:44:49   journal_seq       21084211
lip 24 11:44:49   need_discard      0
lip 24 11:44:49   need_inc_gen      0
lip 24 11:44:49   dirty_sectors     0
lip 24 11:44:49   cached_sectors    0
lip 24 11:44:49   stripe            0
lip 24 11:44:49   stripe_redundancy 0
lip 24 11:44:49   io_time[READ]     22647534491
lip 24 11:44:49   io_time[WRITE]    17197109852
lip 24 11:44:49   fragmentation     0
lip 24 11:44:49   bp_start          7
lip 24 11:44:49 , fixing
lip 24 11:44:49 going read-write
lip 24 11:44:49 incorrect lru entry: lru fragmentation time 2097152
lip 24 11:44:49   u64s 5 type set 18446462598734938112:3341690:0 len 0 ver 0
lip 24 11:44:49   for u64s 12 type alloc_v4 0:3341690:0 len 0 ver 0:

[ snip, billions of similar entries ...]

lip 24 11:44:57   fragmentation     0
lip 24 11:44:57   bp_start          7
lip 24 11:44:57 , fixing
lip 24 11:44:57  done
lip 24 11:48:48 check_btree_backpointers... done
lip 24 11:56:13 check_backpointers_to_extents... done
lip 24 11:57:14 check_extents_to_backpointers... done
lip 24 11:57:15 check_alloc_to_lru_refs... done
lip 24 11:57:15 check_snapshot_trees... done
lip 24 11:57:15 check_snapshots... done
lip 24 11:57:15 check_subvols... done
lip 24 12:04:15 delete_dead_snapshots... done
lip 24 12:09:36 check_inodes... done
lip 24 12:23:07 check_extents... done
lip 24 12:48:47 check_dirents... done
lip 24 12:48:47 check_xattrs... done
lip 24 12:48:47 check_root... done
lip 24 12:49:06 check_directory_structure...unreachable inode 460535:4294967295, type dir nlink 1 backptr 0:0, fixing
lip 24 12:49:16 unreachable inode 788718:4294967295, type dir nlink 0 backptr 0:0, fixing
lip 24 13:34:38 unreachable inode 537220982:4294967295, type dir nlink 0 backptr 0:0, fixing
lip 24 13:35:24 unreachable inode 537777947:4294967295, type dir nlink 0 backptr 0:0, fixing
lip 24 13:36:16 unreachable inode 539971551:4294967295, type dir nlink 0 backptr 0:0, fixing
lip 24 13:45:38 unreachable inode 1082763977:4294967295, type dir nlink 0 backptr 0:0, fixing
lip 24 13:56:10  done
lip 24 14:10:06 check_nlinks... done
# bcachefs fs usage /dane/kopie
Filesystem: 743efdb6-5ee2-4d9a-b1ba-d2efb23511a8
Size:                  1703040432640
Used:                  1647081457664
Online reserved:                   0

Data type       Required/total  Devices
btree:          1/1             [dm-7]                   21498167296
user:           1/1             [dm-7]                 1621284653056

(no label) (device 0):          dm-7              rw
                                data         buckets    fragmented
  free:                            0          257565
  sb:                        3149824               7        520192
  journal:                4294967296            8192
  btree:                 21498167296           54057    6843269120
  user:                1621284653056         3210931   62167939072
  cached:                          0               0
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               0
  erasure coded:                   0               0
  capacity:            1851130904576         3530752
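As a consistency check on the usage table above (assuming the 512 KiB bucket size reported by show-super further down), data + fragmented should equal buckets × bucket size for each row, and it does; fragmentation is about 24% of allocated btree space but under 4% of allocated user space:

```python
# Sanity-check the "bcachefs fs usage" numbers above:
# data + fragmented == buckets * bucket_size for each data type.
BUCKET_SIZE = 512 * 1024  # 512 KiB, from the show-super output

rows = {
    # name: (data bytes, buckets, fragmented bytes)
    "btree": (21498167296, 54057, 6843269120),
    "user": (1621284653056, 3210931, 62167939072),
}

for name, (data, buckets, fragmented) in rows.items():
    allocated = buckets * BUCKET_SIZE
    assert data + fragmented == allocated, name
    pct = 100 * fragmented / allocated
    print(f"{name}: {pct:.1f}% of allocated bucket space is fragmentation")
```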
# bcachefs show-super /dev/dm-7
External UUID:                              743efdb6-5ee2-4d9a-b1ba-d2efb23511a8
Internal UUID:                              1f3647f9-b8fe-4ca2-b6c0-e0f1803e96ca
Device index:                               0
Label:                                      burp
Version:                                    1.1: snapshot_skiplists
Version upgrade complete:                   1.1: snapshot_skiplists
Oldest version on disk:                     0.17: subvol_dirent
Created:                                    Fri Oct 29 19:50:46 2021
Sequence number:                            2153
Superblock size:                            70616
Clean:                                      0
Devices:                                    1
Sections:                                   journal,members,replicas_v0,clean,journal_seq_blacklist,counters
Features:                                   lz4,gzip,zstd,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,incompressible,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                            alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                               512 B
  btree_node_size:                          256 KiB
  errors:                                   continue [ro] panic
  metadata_replicas:                        2
  data_replicas:                            1
  metadata_replicas_required:               1
  data_replicas_required:                   1
  encoded_extent_max:                       64.0 KiB
  metadata_checksum:                        none [crc32c] crc64 xxhash
  data_checksum:                            none [crc32c] crc64 xxhash
  compression:                              lz4
  background_compression:                   zstd:12
  str_hash:                                 crc32c crc64 [siphash]
  metadata_target:                          none
  foreground_target:                        none
  background_target:                        none
  promote_target:                           none
  erasure_code:                             0
  inodes_32bit:                             1
  shard_inode_numbers:                      1
  inodes_use_key_cache:                     1
  gc_reserve_percent:                       8
  gc_reserve_bytes:                         0 B
  root_reserve_percent:                     0
  wide_macs:                                0
  acl:                                      1
  usrquota:                                 0
  grpquota:                                 0
  prjquota:                                 0
  journal_flush_delay:                      1000
  journal_flush_disabled:                   0
  journal_reclaim_delay:                    1000
  journal_transaction_names:                1
  version_upgrade:                          [compatible] incompatible none
  nocow:                                    0
members (size 64):
  Device:                                   0
    UUID:                                   f0abad2f-9f32-4065-8a32-62c54a35adfb
    Size:                                   1.68 TiB
    Bucket size:                            512 KiB
    First bucket:                           0
    Buckets:                                3530752
    Last mount:                             Tue Jul 25 20:12:20 2023
    State:                                  rw
    Label:                                  (none)
    Data allowed:                           journal,btree,user
    Has data:                               journal,btree,user
    Discard:                                0
    Freespace initialized:                  1

There are about 18M files & dirs. The host has 16 GB of RAM. At peak, fsck used ~12 GB RES.
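Back of the envelope (using the reported numbers; treating the 12 GB figure as GiB is an assumption): that peak works out to roughly 700 bytes of fsck state per inode:

```python
# Rough estimate of peak fsck memory per inode, from the figures
# reported above (~12 GB RES, ~18M files & dirs).
peak_res = 12 * 1024**3   # ~12 GiB, assumed binary units
n_inodes = 18_000_000     # ~18M files & dirs
per_inode = peak_res / n_inodes
print(f"~{per_inode:.0f} bytes of peak fsck memory per inode")
```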

koverstreet commented 1 year ago

So this is odd. When I run fsck on your filesystem, it takes 4 minutes running in the kernel and 37 minutes in userspace.

Will look into it more.

marcin-github commented 1 year ago

I just ran fsck to check how much data it reads. fsck from -tools read 119 GB of data while doing fsck with "-n". Also, I get a lot of "incorrect lru entry: lru fragmentation time 2145386496" errors; this appears very often on this fs.

Update: I removed some data from this fs. fsck from -tools started with "-y" read 128 GB from the HDD. Next I ran the kernel fsck. The kernel's fsck (fsck,fix_errors,ro) read 70 GB of data and took an hour less (1.5 h instead of 2.5 h).