bcachefs_write_ref high CPU use (doing LRU invalidation?) #627

Closed: g2p closed this issue 3 months ago

g2p commented 10 months ago

I'm getting a lot of error spam in dmesg, of the form:

bcachefs (c827fefd-… inum 11… offset 11…): __bch2_write(): error: ENOSPC_bucket_alloc
__bch2_write: 8105 callbacks suppressed

With this kernel thread spinning at around 60% CPU: kworker/8:2+bcachefs_write_ref

I'm using bcachefs with an SSD, the filesystem was created like so (writeback caching, metadata replicas to force metadata caching, foreground and background compression):

bcachefs format --encrypted --metadata_replicas=2 \
  --compression=lz4 --background_compression=zstd \
  --fs_label=Vol --discard \
  --label=hdd /dev/VG/Vol --label=ssd /dev/VG/VolCache \
  --foreground_target=ssd --background_target=hdd --promote_target=ssd

I've grown the cache device, but that only stopped the errors for a short while.
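For reference, per-device fill levels can be checked with the fs usage subcommand (-h for human-readable sizes):

bcachefs fs usage -h /path/to/mountpoint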

I've been more or less following bcachefs/master (but staying before the split brain superblock change), regularly rebased on v6.7 rcs; I've pushed the versions as tags in a GitHub fork. The errors appeared within git range-diff 5dddeba898a0...5a8ba0aecf8a, which leads me to suspect the "Fix promotes" commit of at least revealing the problem.

One potentially unusual detail: the root of the volume is mounted, but subpaths of it are also bind mounted elsewhere.

g2p commented 10 months ago

perf top -e 'bcachefs:*' shows a lot of bcachefs:bucket_alloc_fail, twice as many as bcachefs:read_promote:

0,26%  reserve stripe bucket 1:0 free 65620 avail 0 copygc_wait 416517960/270458800 seen 0 open 0 need_journal_commit 0 nouse 0 nocow 0 nonblocking 1 err freelist_empty  
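To put numbers on that ratio, both tracepoints can be counted system-wide over a fixed window with something like:

perf stat -a -e 'bcachefs:bucket_alloc_fail,bcachefs:read_promote' -- sleep 10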
koverstreet commented 10 months ago

It is promote related, but it's not the "fix promotes" patch: I added that error message so that we'd log an error for anything being reported up to userspace, but forgot that promotes use BCH_WRITE_ALLOC_NOWAIT, so allocation errors are expected there.

I've pushed a patch to the master branch that fixes up the logging; can you test it and report back?

g2p commented 10 months ago

I've cherry-picked the commit ("bcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT", 3bf0400008e02dad9941875e82ad96b0bb89ff3e; the final tree is bug/__bch2_write/silence-logs) and it removed all the log spam. Sorry it took me a while to reboot and test.
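For reference, the cherry-pick was just:

git cherry-pick 3bf0400008e02dad9941875e82ad96b0bb89ff3e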

However, I still have the high CPU use from kworker/1:1+bcachefs_write_ref; is that expected? That kernel thread is permanently spinning at 50-70% CPU.

perf top -p $pid says:

Samples: 185K of event 'cycles:P', 4000 Hz, Event count (approx.): 33602846651 lost: 0/0 drop: 0/0                                                                           
Overhead  Shared Object                           Symbol                                                                                                                     
  23,61%  [kernel]                                [k] bch2_key_sort_fix_overlapping                                                                                          
  20,75%  [kernel]                                [k] validate_bset_keys                                                                                                     
  14,44%  [kernel]                                [k] bch2_bkey_cmp_packed                                                                                                   
  12,58%  [kernel]                                [k] bset_key_invalid.constprop.0                                                                                           
   9,20%  [kernel]                                [k] __bch2_bkey_unpack_key                                                                                                 
   3,01%  [kernel]                                [k] __bch2_bkey_invalid                                                                                                    
   1,90%  [kernel]                                [k] bch2_bkey_in_btree_node                                                                                                
   1,86%  [kernel]                                [k] srso_return_thunk                                                                                                      
   1,80%  [kernel]                                [k] srso_untrain_ret                                                                                                       
   1,26%  [kernel]                                [k] clear_page_rep                                                                                                         
   0,14%  [kyber_iosched]                         [k] 0x0000000000006a8c                                                                                                     
   0,11%  [kernel]                                [k] scatterwalk_copychunks                                                                                                 
   0,10%  [kernel]                                [k] bch2_checksum                                                                                                          
   0,10%  [kernel]                                [k] vsnprintf                     

perf top -p $pid -e 'bcachefs:*' shows only btree_node_read, always with trans_fn=bch2_do_invalidates_work and btree_id=10 (lru?):

Samples: 14K of event 'bcachefs:btree_node_read', 1 Hz, Event count (approx.): 609 lost: 0/0 drop: 0/0                                                                       
Overhead  Trace output                                                                                                                                                       
   0,16%  [FAILED TO PARSE] dev=265289744 trans_fn=bch2_do_invalidates_work level=0 btree_id=10 pos_inode=281563013609296 pos_offset=281474977147745 pos_snapshot=0          
koverstreet commented 10 months ago

So it appears bch2_do_invalidates_work isn't just spinning; it's reading in new btree nodes?

Are you on IRC, or could you join the IRC channel - irc.oftc.net#bcache?

The next thing we'll want to see is the full output of those btree node read tracepoints: perf trace -e bcachefs:btree_node_read
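For example, something like the following should capture them (the duration and output file are arbitrary):

perf trace -a -e 'bcachefs:btree_node_read' -o btree_node_read.txt -- sleep 30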

g2p commented 10 months ago

It seems like the LRU btree on that filesystem is borked.

perf trace -p $pid_of_write_ref_kworker -e 'bcachefs:*'

gives a lot of lines looking like:

1.439 bcachefs:btree_node_read(dev: 265289744, trans_fn: "bch2_do_invalidates_work", btree_id: 10, pos_inode: 282118289160088, pos_offset: 281474977460346)

(only pos_inode and pos_offset change).

bcachefs list --btree lru --verbose $devices

gets stuck instead of listing keys:

mounting version 1.4: member_seq opts=ro,errors=continue,metadata_replicas=2,compression=lz4,background_compression=zstd,foreground_target=ssd,background_target=hdd,promote_target=ssd,degraded,verbose,nochanges,norecovery,read_only
recovering from clean shutdown, journal seq 36376981
alloc_read... done
stripes_read... done
snapshots_read... done
done starting filesystem
WARNING at libbcachefs/btree_iter.c:2823: btree trans held srcu lock (delaying memory reclaim) for 46 seconds

All the other btrees, by contrast, display their keys immediately.

g2p commented 6 months ago

The LRU btree seems to get unreasonably large at times. I don't know how to get its total size (at least until the accounting rework lands), but here it is ballooning in the btree_cache statistics:

cat /sys/fs/bcachefs/*/internal/btree_cache                           
total:                 2.57 GiB (43290)                                                                                                                                      
nr dirty:              22.3 MiB (89)                                                                                                                                         
cannibalize lock:      0000000000000000                                               

extents                11.3 MiB (45)                                                                                                                                         
inodes                 256 KiB (1)                                                                                                                                           
dirents                256 KiB (1)                                                                                                                                           
xattrs                 256 KiB (1)     
alloc                  65.0 MiB (260) 
quotas                 256 KiB (1)     
stripes                256 KiB (1)
reflink                768 KiB (3)   
subvolumes             256 KiB (1) 
snapshots              256 KiB (1)
lru                    2.46 GiB (42866)
freespace              2.00 MiB (8)  
need_discard           512 KiB (2)
backpointers           19.3 MiB (77)
bucket_gens            4.25 MiB (17)
snapshot_trees         256 KiB (1)
deleted_inodes         256 KiB (1)
logged_ops             256 KiB (1)    
rebalance_work         256 KiB (1)  
subvolume_children     256 KiB (1)

An updated kernel version also reports the "not freed" reasons:

/sys/fs/bcachefs/9986b9d8-9f64-43c3-9a1d-cd397dcf3643/internal/btree_cache 
total:                         2.52 GiB (43100)
nr dirty:                      3.99 GiB (16356)
cannibalize lock:              0000000000000000
extents                        1.43 GiB (5879)
inodes                         1.50 MiB (6)
dirents                        256 KiB (1)
xattrs                         256 KiB (1)
alloc                          778 MiB (3112)
quotas                         256 KiB (1)
stripes                        256 KiB (1)
reflink                        75.5 MiB (302)
subvolumes                     256 KiB (1)
snapshots                      256 KiB (1)
lru                            2.54 GiB (26801)
freespace                      2.75 MiB (11)
need_discard                   512 KiB (2)
backpointers                   1.70 GiB (6954)
bucket_gens                    5.25 MiB (21)
snapshot_trees                 256 KiB (1)
deleted_inodes                 256 KiB (1)
logged_ops                     256 KiB (1)
rebalance_work                 512 KiB (2)
subvolume_children             256 KiB (1)
freed:                         46674527
not freed:
  dirty                        55685898
  write in flight              49
  read in flight               0
  lock intent failed           0
  lock write failed            0
  access bit                   53738470
  no evict failed              0
  write blocked                0
  will make reachable          0
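To watch whether the lru btree keeps ballooning, polling those counters is enough (assuming the same sysfs layout as above):

watch -n 10 "grep -E '^(total|nr dirty|lru)' /sys/fs/bcachefs/*/internal/btree_cache"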

My workload keeps a lot of data mmapped; maybe that's causing the trouble?

g2p commented 6 months ago

Here is a relevant SRCU warning from that same workqueue; 675 seconds is a recent example, but it's been as high as 1062 seconds.

May 07 18:07:53 btree trans held srcu lock (delaying memory reclaim) for 675 seconds
May 07 18:07:53 WARNING: CPU: 8 PID: 25447 at fs/bcachefs/btree_iter.c:2999 bch2_trans_srcu_unlock+0x149/0x160
May 07 18:07:53 Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) xfs(E) essiv(E) authenc(E) dm_crypt(E) cmac(E) ccm(E) kyber_iosched(E) nls_utf8(E) wireguard(E) curve25519_x86_64(E) libcurve25519_generic(E) libchacha20poly1305(E) chacha_x86_64(E) poly1305_x86_64(E) ip6_udp_tunnel(E) udp_tunnel(E) ip6t_REJECT(E) nf_reject_ipv6(E) xt_hl(E) ip6t_rt(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_multiport(E) xt_recent(E) nft_limit(E) xt_limit(E) xt_addrtype(E) xt_tcpudp(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) binfmt_misc(E) nf_tables(E) btrfs(E) blake2b_generic(E) nls_iso8859_1(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_scodec_component(E) snd_hda_codec_hdmi(E) iwlmvm(E) snd_hda_intel(E) intel_rapl_msr(E) snd_intel_dspcfg(E) intel_rapl_common(E) snd_hda_codec(E) snd_hwdep(E) mac80211(E) snd_hda_core(E) kvm_amd(E) libarc4(E) snd_pcm(E) wmi_bmof(E) kvm(E) snd_seq(E) snd_seq_device(E) rapl(E) snd_timer(E) iwlwifi(E) snd(E) soundcore(E) i2c_piix4(E) k10temp(E) cfg80211(E)
May 07 18:07:53  wmi(E) mac_hid(E) drm(E) efi_pstore(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) bcache(E) bridge(E) stp(E) pata_acpi(E) llc(E) crct10dif_pclmul(E) nvme(E) crc32_pclmul(E) igb(E) polyval_clmulni(E) polyval_generic(E) xhci_pci(E) ghash_clmulni_intel(E) i2c_algo_bit(E) nvme_core(E) ahci(E) ccp(E) xhci_pci_renesas(E) sha512_ssse3(E) libahci(E) dca(E) pata_jmicron(E) dm_mirror(E) dm_region_hash(E) dm_log(E) msr(E) autofs4(E) aesni_intel(E) crypto_simd(E) cryptd(E) z3fold(E) lz4(E)
May 07 18:07:53 CPU: 8 PID: 25447 Comm: kworker/8:49 Tainted: G        W   E      6.9.0-rc7-g2p #67
May 07 18:07:53 Hardware name: To Be Filled By O.E.M. X570 Phantom Gaming 4/X570 Phantom Gaming 4, BIOS P5.61 02/22/2024
May 07 18:07:53 Workqueue: bcachefs_write_ref bch2_do_invalidates_work
May 07 18:07:53 RIP: 0010:bch2_trans_srcu_unlock+0x149/0x160
May 07 18:07:53 Code: fd 09 9c 01 48 c7 c7 48 ef 4a ab 48 b8 cf f7 53 e3 a5 9b c4 20 48 29 ca 48 d1 ea 48 f7 e2 48 89 d6 48 c1 ee 04 e8 a7 ec a5 ff <0f> 0b e9 3c ff ff ff 0f 0b e9 4c ff ff ff 66 0f 1f 84 00 00 00 00
May 07 18:07:53 RSP: 0018:ffffbb96917f7b78 EFLAGS: 00010246
May 07 18:07:53 RAX: 0000000000000000 RBX: ffff9e7c3ead8000 RCX: 0000000000000000
May 07 18:07:53 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 07 18:07:53 RBP: ffffbb96917f7b90 R08: 0000000000000000 R09: 0000000000000000
May 07 18:07:53 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9e7a225c0000
May 07 18:07:53 R13: ffff9e7c3ead8000 R14: ffff9e7a225c0000 R15: 00001fad8ab8ee30
May 07 18:07:53 FS:  0000000000000000(0000) GS:ffff9e7caf000000(0000) knlGS:0000000000000000
May 07 18:07:53 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 07 18:07:53 CR2: 00007fff9e9cb327 CR3: 00000002ccef6000 CR4: 0000000000350ef0
May 07 18:07:53 Call Trace:
May 07 18:07:53  <TASK>
May 07 18:07:53  ? show_regs+0x70/0x90
May 07 18:07:53  ? __warn+0x8c/0x150
May 07 18:07:53  ? bch2_trans_srcu_unlock+0x149/0x160
May 07 18:07:53  ? report_bug+0x186/0x1b0
May 07 18:07:53  ? handle_bug+0x48/0x90
May 07 18:07:53  ? exc_invalid_op+0x1c/0x80
May 07 18:07:53  ? asm_exc_invalid_op+0x1f/0x30
May 07 18:07:53  ? bch2_trans_srcu_unlock+0x149/0x160
May 07 18:07:53  ? bch2_trans_srcu_unlock+0x149/0x160
May 07 18:07:53  bch2_trans_begin+0x6ab/0x7a0
May 07 18:07:53  bch2_do_invalidates_work+0x159/0xa70
May 07 18:07:53  process_one_work+0x18d/0x3f0
May 07 18:07:53  worker_thread+0x308/0x440
May 07 18:07:53  ? rescuer_thread+0x410/0x410
May 07 18:07:53  kthread+0xe8/0x120
May 07 18:07:53  ? kthread_complete_and_exit+0x30/0x30
May 07 18:07:53  ret_from_fork+0x4b/0x70
May 07 18:07:53  ? kthread_complete_and_exit+0x30/0x30
May 07 18:07:53  ret_from_fork_asm+0x11/0x20
May 07 18:07:53  </TASK>
May 07 18:07:53 ---[ end trace 0000000000000000 ]---
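How often this warning fires can be counted straight from the kernel log, e.g.:

journalctl -k | grep -c 'btree trans held srcu lock'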
g2p commented 6 months ago

Here is alloc_debug; dev-0 is the HDD, dev-1 is the SSD:

dev-0/alloc_debug
                     buckets         sectors      fragmented
free                 1045092               0               0
sb                         7            6152            1016
journal                 8192         8388608               0
btree                 180855       112442880        72752640
user                20456990     20915610879        32697996
cached                     0               0               0
parity                     0               0               0
stripe                     0               0               0
need_gc_gens               0               0               0
need_discard               0               0               0

reserves:
stripe                677876
normal                338952
copygc                    28
btree                     14
btree_copygc               0
reclaim                    0
interior_updates           0

open buckets               1
buckets to invalidate      0

dev-1/alloc_debug
                     buckets         sectors      fragmented
free                   80674               0               0
sb                        13            6152             504
journal                 8192         4194304               0
btree                 452858       231863296               0
user                      49           17611            7477
cached               1555366       795244249               0
parity                     0               0               0
stripe                     0               0               0
need_gc_gens               0               0               0
need_discard               0               0               0

reserves:
stripe                 65592
normal                 32824
copygc                    56
btree                     28
btree_copygc               0
reclaim                    0
interior_updates           0

open buckets              42
buckets to invalidate   1302

internal/alloc_debug
hidden                 12596736
btree                  344306176
data                   20915628490
cached                 4979405911
reserved               92522888
online_reserved        0
nr_inodes              313677

freelist_wait          empty
open buckets allocated 44
open buckets total     1024
open_buckets_wait      empty
open_buckets_btree     29
open_buckets_user      14
btree reserve cache    29
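The per-device invalidation backlog can be pulled out of the same sysfs tree directly:

grep -H 'buckets to invalidate' /sys/fs/bcachefs/*/dev-*/alloc_debug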
g2p commented 3 months ago

This has stopped happening now; I'm not sure exactly when. Possibly when these errors got fixed:

Jul 11 22:38:01 moulinex kernel: bucket 0:2675359 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
Jul 11 22:38:09 moulinex kernel: bucket 0:5542735 gen 1 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
Jul 11 22:38:21 moulinex kernel: bucket 0:10749926 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
Jul 11 22:38:21 moulinex kernel: bucket 0:10796970 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
Jul 11 22:38:21 moulinex kernel: bucket 0:10848350 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing

See b5cbb42dc59f519fa3cf49b9afbd5ee4805be01b and #668.
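Those fragmentation_lru repairs come from a recovery/fsck pass; assuming the usual fsck and fix_errors mount options, one can be forced on the next mount (device list as in the format command above):

mount -t bcachefs -o fsck,fix_errors /dev/VG/Vol:/dev/VG/VolCache /mnt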

Other known errors from show-super:

alloc_key_fragmentation_lru_wrong           20              Fri Jul 12 10:35:17 2024
accounting_key_junk_at_end                  72              Sun Aug 11 10:32:56 2024
accounting_key_replicas_nr_devs_0           28              Sun Aug 11 10:32:56 2024

The SSD device used to have

buckets to invalidate   1302

which is now back in the 0-1 range.