perf top -e 'bcachefs:*' shows a lot of bcachefs:bucket_alloc_fail, twice as many as bcachefs:read_promote:
0,26% reserve stripe bucket 1:0 free 65620 avail 0 copygc_wait 416517960/270458800 seen 0 open 0 need_journal_commit 0 nouse 0 nocow 0 nonblocking 1 err freelist_empty
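For what it's worth, a minimal sketch for quantifying that ratio over a fixed window with a system-wide perf stat run (event names as above; the 30 second window is arbitrary):
# Count both tracepoints system-wide for 30 seconds and compare the totals.
perf stat -a -e 'bcachefs:bucket_alloc_fail,bcachefs:read_promote' -- sleep 30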
It is promote related, but it's not the "fix promotes" patch - I added that error message so that we'd log an error for anything that gets reported up to userspace, but forgot that promotes use BCH_WRITE_ALLOC_NOWAIT, so those errors are expected.
I've got a patch in the master branch that fixes the logging - can you test that and report back?
I've cherry-picked the commit ("bcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT", 3bf0400008e02dad9941875e82ad96b0bb89ff3e; the final tree is bug/__bch2_write/silence-logs) and it did remove all the log spam. Sorry it took me a while to reboot and test.
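Roughly, the test tree was assembled like this (a sketch only - the remote name and base branch below are placeholders, not the exact ones I used):
# Sketch of the test build; $BCACHEFS_REMOTE and $MY_BASE are placeholders.
git fetch $BCACHEFS_REMOTE master
git checkout -b bug/__bch2_write/silence-logs $MY_BASE
git cherry-pick 3bf0400008e02dad9941875e82ad96b0bb89ff3e   # "bcachefs: Don't log errors if BCH_WRITE_ALLOC_NOWAIT"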
However, I still have the high CPU use from kworker/1:1+bcachefs_write_ref - is that expected? That kernel thread is permanently spinning at 50-70% CPU.
perf top -p $pid says:
Samples: 185K of event 'cycles:P', 4000 Hz, Event count (approx.): 33602846651 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
23,61% [kernel] [k] bch2_key_sort_fix_overlapping
20,75% [kernel] [k] validate_bset_keys
14,44% [kernel] [k] bch2_bkey_cmp_packed
12,58% [kernel] [k] bset_key_invalid.constprop.0
9,20% [kernel] [k] __bch2_bkey_unpack_key
3,01% [kernel] [k] __bch2_bkey_invalid
1,90% [kernel] [k] bch2_bkey_in_btree_node
1,86% [kernel] [k] srso_return_thunk
1,80% [kernel] [k] srso_untrain_ret
1,26% [kernel] [k] clear_page_rep
0,14% [kyber_iosched] [k] 0x0000000000006a8c
0,11% [kernel] [k] scatterwalk_copychunks
0,10% [kernel] [k] bch2_checksum
0,10% [kernel] [k] vsnprintf
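In case it helps reproduce this, $pid above is just the spinning kworker; a sketch for finding it (assuming ps shows the extended kworker/workqueue name, as in the listing above):
# Locate the kworker attached to bcachefs_write_ref and attach perf to it.
pid=$(ps -eo pid,comm | awk '/bcachefs_write_ref/ {print $1; exit}')
perf top -p "$pid"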
perf top -p $pid -e 'bcachefs:*' shows only btree_node_read, always with trans_fn=bch2_do_invalidates_work and btree_id=10 (lru?):
Samples: 14K of event 'bcachefs:btree_node_read', 1 Hz, Event count (approx.): 609 lost: 0/0 drop: 0/0
Overhead Trace output
0,16% [FAILED TO PARSE] dev=265289744 trans_fn=bch2_do_invalidates_work level=0 btree_id=10 pos_inode=281563013609296 pos_offset=281474977147745 pos_snapshot=0
So it appears bch2_do_invalidates_work is not just spinning; it's also reading in new btree nodes?
Are you on IRC, or could you join the IRC channel - irc.oftc.net#bcache?
The next thing we'll want to see is the full output of those btree node read tracepoints: perf trace -e bcachefs:btree_node_read
It seems like the LRU btree on that filesystem is borked.
perf trace -p $pid_of_write_ref_kworker -e 'bcachefs:*' gives a lot of lines looking like:
1.439 bcachefs:btree_node_read(dev: 265289744, trans_fn: "bch2_do_invalidates_work", btree_id: 10, pos_inode: 282118289160088, pos_offset: 281474977460346)
(only pos_inode and pos_offset change).
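If it's useful, a sketch for checking whether those reads keep hitting the same nodes or keep walking new ones (field name taken from the trace lines above; the 30 second window is arbitrary, and perf trace writes to stderr, hence the redirect):
# Count distinct pos_offset values reported by the tracepoint over 30 seconds.
timeout 30 perf trace -p $pid_of_write_ref_kworker -e bcachefs:btree_node_read 2>&1 \
    | grep -o 'pos_offset: [0-9]*' | sort | uniq -c | sort -rn | head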
bcachefs list --btree lru --verbose $devices gets stuck instead of listing keys:
mounting version 1.4: member_seq opts=ro,errors=continue,metadata_replicas=2,compression=lz4,background_compression=zstd,foreground_target=ssd,background_target=hdd,promote_target=ssd,degraded,verbose,nochanges,norecovery,read_only
recovering from clean shutdown, journal seq 36376981
alloc_read... done
stripes_read... done
snapshots_read... done
done starting filesystem
WARNING at libbcachefs/btree_iter.c:2823: btree trans held srcu lock (delaying memory reclaim) for 46 seconds
All other btrees, by contrast, display something immediately.
The LRU btree seems to get unreasonably large at times. I don't know how to get the total size of it (until the accounting rework lands), but here it is ballooning in btree_cache statistics:
cat /sys/fs/bcachefs/*/internal/btree_cache
total: 2.57 GiB (43290)
nr dirty: 22.3 MiB (89)
cannibalize lock: 0000000000000000
extents 11.3 MiB (45)
inodes 256 KiB (1)
dirents 256 KiB (1)
xattrs 256 KiB (1)
alloc 65.0 MiB (260)
quotas 256 KiB (1)
stripes 256 KiB (1)
reflink 768 KiB (3)
subvolumes 256 KiB (1)
snapshots 256 KiB (1)
lru 2.46 GiB (42866)
freespace 2.00 MiB (8)
need_discard 512 KiB (2)
backpointers 19.3 MiB (77)
bucket_gens 4.25 MiB (17)
snapshot_trees 256 KiB (1)
deleted_inodes 256 KiB (1)
logged_ops 256 KiB (1)
rebalance_work 256 KiB (1)
subvolume_children 256 KiB (1)
Updated output, this time with the "not freed" reasons:
/sys/fs/bcachefs/9986b9d8-9f64-43c3-9a1d-cd397dcf3643/internal/btree_cache
total: 2.52 GiB (43100)
nr dirty: 3.99 GiB (16356)
cannibalize lock: 0000000000000000
extents 1.43 GiB (5879)
inodes 1.50 MiB (6)
dirents 256 KiB (1)
xattrs 256 KiB (1)
alloc 778 MiB (3112)
quotas 256 KiB (1)
stripes 256 KiB (1)
reflink 75.5 MiB (302)
subvolumes 256 KiB (1)
snapshots 256 KiB (1)
lru 2.54 GiB (26801)
freespace 2.75 MiB (11)
need_discard 512 KiB (2)
backpointers 1.70 GiB (6954)
bucket_gens 5.25 MiB (21)
snapshot_trees 256 KiB (1)
deleted_inodes 256 KiB (1)
logged_ops 256 KiB (1)
rebalance_work 512 KiB (2)
subvolume_children 256 KiB (1)
freed: 46674527
not freed:
dirty 55685898
write in flight 49
read in flight 0
lock intent failed 0
lock write failed 0
access bit 53738470
no evict failed 0
write blocked 0
will make reachable 0
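A small sketch for tracking how the lru line in that file evolves over time (UUID path as in the dump above; the 30 second interval is arbitrary):
# Log the total and lru lines of the btree node cache stats every 30 seconds.
while sleep 30; do
    date
    grep -E 'total:|lru' /sys/fs/bcachefs/9986b9d8-9f64-43c3-9a1d-cd397dcf3643/internal/btree_cache
done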
My workload has a lot of data mmapped, maybe this is causing the trouble?
And here is a relevant srcu warning from that same workqueue; 675 seconds is a recent example but it's been as high as 1062 seconds.
mai 07 18:07:53 btree trans held srcu lock (delaying memory reclaim) for 675 seconds
mai 07 18:07:53 WARNING: CPU: 8 PID: 25447 at fs/bcachefs/btree_iter.c:2999 bch2_trans_srcu_unlock+0x149/0x160
mai 07 18:07:53 Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) xfs(E) essiv(E) authenc(E) dm_crypt(E) cmac(E) ccm(E) kyber_iosched(E) nls_utf8(E) wireguard(E) curve25519_x86_64(E) libcurve25519_generic(E) libchacha20poly1305(E) chacha_x86_64(E) poly1305_x86_64(E) ip6_udp_tunnel(E) udp_tunnel(E) ip6t_REJECT(E) nf_reject_ipv6(E) xt_hl(E) ip6t_rt(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_multiport(E) xt_recent(E) nft_limit(E) xt_limit(E) xt_addrtype(E) xt_tcpudp(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) binfmt_misc(E) nf_tables(E) btrfs(E) blake2b_generic(E) nls_iso8859_1(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_scodec_component(E) snd_hda_codec_hdmi(E) iwlmvm(E) snd_hda_intel(E) intel_rapl_msr(E) snd_intel_dspcfg(E) intel_rapl_common(E) snd_hda_codec(E) snd_hwdep(E) mac80211(E) snd_hda_core(E) kvm_amd(E) libarc4(E) snd_pcm(E) wmi_bmof(E) kvm(E) snd_seq(E) snd_seq_device(E) rapl(E) snd_timer(E) iwlwifi(E) snd(E) soundcore(E) i2c_piix4(E) k10temp(E) cfg80211(E)
mai 07 18:07:53 wmi(E) mac_hid(E) drm(E) efi_pstore(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) bcache(E) bridge(E) stp(E) pata_acpi(E) llc(E) crct10dif_pclmul(E) nvme(E) crc32_pclmul(E) igb(E) polyval_clmulni(E) polyval_generic(E) xhci_pci(E) ghash_clmulni_intel(E) i2c_algo_bit(E) nvme_core(E) ahci(E) ccp(E) xhci_pci_renesas(E) sha512_ssse3(E) libahci(E) dca(E) pata_jmicron(E) dm_mirror(E) dm_region_hash(E) dm_log(E) msr(E) autofs4(E) aesni_intel(E) crypto_simd(E) cryptd(E) z3fold(E) lz4(E)
mai 07 18:07:53 CPU: 8 PID: 25447 Comm: kworker/8:49 Tainted: G W E 6.9.0-rc7-g2p #67
mai 07 18:07:53 Hardware name: To Be Filled By O.E.M. X570 Phantom Gaming 4/X570 Phantom Gaming 4, BIOS P5.61 02/22/2024
mai 07 18:07:53 Workqueue: bcachefs_write_ref bch2_do_invalidates_work
mai 07 18:07:53 RIP: 0010:bch2_trans_srcu_unlock+0x149/0x160
mai 07 18:07:53 Code: fd 09 9c 01 48 c7 c7 48 ef 4a ab 48 b8 cf f7 53 e3 a5 9b c4 20 48 29 ca 48 d1 ea 48 f7 e2 48 89 d6 48 c1 ee 04 e8 a7 ec a5 ff <0f> 0b e9 3c ff ff ff 0f 0b e9 4c ff ff ff 66 0f 1f 84 00 00 00 00
mai 07 18:07:53 RSP: 0018:ffffbb96917f7b78 EFLAGS: 00010246
mai 07 18:07:53 RAX: 0000000000000000 RBX: ffff9e7c3ead8000 RCX: 0000000000000000
mai 07 18:07:53 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
mai 07 18:07:53 RBP: ffffbb96917f7b90 R08: 0000000000000000 R09: 0000000000000000
mai 07 18:07:53 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9e7a225c0000
mai 07 18:07:53 R13: ffff9e7c3ead8000 R14: ffff9e7a225c0000 R15: 00001fad8ab8ee30
mai 07 18:07:53 FS: 0000000000000000(0000) GS:ffff9e7caf000000(0000) knlGS:0000000000000000
mai 07 18:07:53 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
mai 07 18:07:53 CR2: 00007fff9e9cb327 CR3: 00000002ccef6000 CR4: 0000000000350ef0
mai 07 18:07:53 Call Trace:
mai 07 18:07:53 <TASK>
mai 07 18:07:53 ? show_regs+0x70/0x90
mai 07 18:07:53 ? __warn+0x8c/0x150
mai 07 18:07:53 ? bch2_trans_srcu_unlock+0x149/0x160
mai 07 18:07:53 ? report_bug+0x186/0x1b0
mai 07 18:07:53 ? handle_bug+0x48/0x90
mai 07 18:07:53 ? exc_invalid_op+0x1c/0x80
mai 07 18:07:53 ? asm_exc_invalid_op+0x1f/0x30
mai 07 18:07:53 ? bch2_trans_srcu_unlock+0x149/0x160
mai 07 18:07:53 ? bch2_trans_srcu_unlock+0x149/0x160
mai 07 18:07:53 bch2_trans_begin+0x6ab/0x7a0
mai 07 18:07:53 bch2_do_invalidates_work+0x159/0xa70
mai 07 18:07:53 process_one_work+0x18d/0x3f0
mai 07 18:07:53 worker_thread+0x308/0x440
mai 07 18:07:53 ? rescuer_thread+0x410/0x410
mai 07 18:07:53 kthread+0xe8/0x120
mai 07 18:07:53 ? kthread_complete_and_exit+0x30/0x30
mai 07 18:07:53 ret_from_fork+0x4b/0x70
mai 07 18:07:53 ? kthread_complete_and_exit+0x30/0x30
mai 07 18:07:53 ret_from_fork_asm+0x11/0x20
mai 07 18:07:53 </TASK>
mai 07 18:07:53 ---[ end trace 0000000000000000 ]---
Here is alloc_debug; dev-0 is the HDD, dev-1 is the SSD:
dev-0/alloc_debug
buckets sectors fragmented
free 1045092 0 0
sb 7 6152 1016
journal 8192 8388608 0
btree 180855 112442880 72752640
user 20456990 20915610879 32697996
cached 0 0 0
parity 0 0 0
stripe 0 0 0
need_gc_gens 0 0 0
need_discard 0 0 0
reserves:
stripe 677876
normal 338952
copygc 28
btree 14
btree_copygc 0
reclaim 0
interior_updates 0
open buckets 1
buckets to invalidate 0
dev-1/alloc_debug
buckets sectors fragmented
free 80674 0 0
sb 13 6152 504
journal 8192 4194304 0
btree 452858 231863296 0
user 49 17611 7477
cached 1555366 795244249 0
parity 0 0 0
stripe 0 0 0
need_gc_gens 0 0 0
need_discard 0 0 0
reserves:
stripe 65592
normal 32824
copygc 56
btree 28
btree_copygc 0
reclaim 0
interior_updates 0
open buckets 42
buckets to invalidate 1302
internal/alloc_debug
hidden 12596736
btree 344306176
data 20915628490
cached 4979405911
reserved 92522888
online_reserved 0
nr_inodes 313677
freelist_wait empty
open buckets allocated 44
open buckets total 1024
open_buckets_wait empty
open_buckets_btree 29
open_buckets_user 14
btree reserve cache 29
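For completeness, a sketch for collecting all three alloc_debug files above in one go (paths as in the sysfs listings earlier):
# Dump per-device and filesystem-wide allocator state.
for f in /sys/fs/bcachefs/*/dev-*/alloc_debug /sys/fs/bcachefs/*/internal/alloc_debug; do
    echo "== $f"; cat "$f"
done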
This has stopped happening now; I'm not sure exactly when, but possibly when these errors got fixed:
juil. 11 22:38:01 moulinex kernel: bucket 0:2675359 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
juil. 11 22:38:09 moulinex kernel: bucket 0:5542735 gen 1 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
juil. 11 22:38:21 moulinex kernel: bucket 0:10749926 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
juil. 11 22:38:21 moulinex kernel: bucket 0:10796970 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
juil. 11 22:38:21 moulinex kernel: bucket 0:10848350 gen 0 data type user has wrong fragmentation_lru: got 2149580800, should be 0, fixing
See b5cbb42dc59f519fa3cf49b9afbd5ee4805be01b and #668.
Other known errors from show-super:
alloc_key_fragmentation_lru_wrong 20 Fri Jul 12 10:35:17 2024
accounting_key_junk_at_end 72 Sun Aug 11 10:32:56 2024
accounting_key_replicas_nr_devs_0 28 Sun Aug 11 10:32:56 2024
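A sketch of how those counters can be pulled out of the superblock (this assumes the errors section shows up in show-super's default output; the device path is a placeholder and the grep is only a crude filter):
# Dump the superblock and scan for the persistent error counters.
bcachefs show-super /dev/sdX | grep -iA4 errors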
The SSD device used to have "buckets to invalidate 1302", which is now back in the 0-1 range.
I'm getting a lot of error spam in dmesg (of the form quoted above), with this kernel thread spinning at around 60% CPU: kworker/8:2+bcachefs_write_ref
I'm using bcachefs with an SSD; the filesystem was created with writeback caching, metadata replicas to force metadata caching, and foreground and background compression.
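A hypothetical reconstruction of the format invocation, inferred only from the mount options shown further up (device paths and labels are placeholders; this is not the exact command I used):
# Hypothetical format command matching the mount options above
# (metadata_replicas=2, lz4/zstd compression, ssd/hdd targets).
bcachefs format \
    --metadata_replicas=2 \
    --compression=lz4 \
    --background_compression=zstd \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd \
    --label=ssd.ssd1 /dev/nvme0n1 \
    --label=hdd.hdd1 /dev/sdX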
I've grown the cache device, which only got the errors to stop for a short while.
I've been more or less following bcachefs/master (but staying before the split-brain superblock change), regularly rebased onto v6.7 rcs, and I've pushed the versions as tags in a GitHub fork; the errors appeared within git range-diff 5dddeba898a0...5a8ba0aecf8a. This leads me to suspect a "Fix promotes" commit of at least revealing the problem. As a potentially slightly unusual thing, the root of the volume is mounted but also has subpaths bind mounted elsewhere.