Open alkkkj opened 2 years ago
This regressed with backpointers: when backpointers no longer fit into the alloc key we have to spill over to the backpointers btree, which isn't cached in the btree key cache.
Can you try with a 128k bucket size? That will mean (with 4k block size) 32 blocks per bucket, which will fit in the alloc keys - that'll help confirm. I'm going to expand the size of bkey values to address this.
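The arithmetic behind that suggestion can be sketched quickly (a hypothetical helper, not bcachefs code; exactly how many backpointers fit inline in an alloc key is an implementation detail not shown here):

```python
# Sketch of the bucket-size arithmetic above: a bucket needs one
# backpointer per data block it can hold, so a 128 KiB bucket with
# 4 KiB blocks needs at most 32 backpointers.
KiB = 1024

def blocks_per_bucket(bucket_size: int, block_size: int) -> int:
    return bucket_size // block_size

print(blocks_per_bucket(128 * KiB, 4 * KiB))  # -> 32
```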
Tried, any more experiments needed?
[root@qs62 ~]# bcachefs show-super /dev/nvme0n1
External UUID: 148bf99b-5599-4f1d-aa3a-41a1931fb615
Internal UUID: e9db42fc-5c0e-4251-a80b-c8c783f8948f
Device index: 0
Label:
Version: backpointers
Oldest version on disk: backpointers
Created: Fri Sep 23 09:15:06 2022
Sequence number: 9
Superblock size: 3656
Clean: 0
Devices: 1
Sections: members,replicas_v0,clean,journal_v2,counters
Features: new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done
Options:
  block_size: 4.00 KiB
  btree_node_size: 128 KiB
  errors: continue [ro] panic
  metadata_replicas: 1
  data_replicas: 1
  metadata_replicas_required: 1
  data_replicas_required: 1
  encoded_extent_max: 64.0 KiB
  metadata_checksum: none [crc32c] crc64 xxhash
  data_checksum: none [crc32c] crc64 xxhash
  compression: [none] lz4 gzip zstd
  background_compression: [none] lz4 gzip zstd
  str_hash: crc32c crc64 [siphash]
  metadata_target: none
  foreground_target: none
  background_target: none
  promote_target: none
  erasure_code: 0
  inodes_32bit: 1
  shard_inode_numbers: 1
  inodes_use_key_cache: 1
  gc_reserve_percent: 8
  gc_reserve_bytes: 0 B
  root_reserve_percent: 0
  wide_macs: 0
  acl: 1
  usrquota: 0
  grpquota: 0
  prjquota: 0
  journal_flush_delay: 1000
  journal_flush_disabled: 0
  journal_reclaim_delay: 100
members (size 64):
  Device: 0
  UUID: e4db599b-ced1-4da6-ae97-8713b994e7f2
  Size: 1.46 TiB
  Bucket size: 128 KiB
  First bucket: 0
  Buckets: 12209482
  Last mount: Fri Sep 23 09:16:00 2022
  State: rw
  Label: (none)
  Data allowed: journal,btree,user
  Has data: journal,btree,user
  Discard: 0
  Freespace initialized: 1
[root@qs62 mnt]# cat fio.conf
[global]
ioengine=libaio
time_based
direct=1
thread
group_reporting
randrepeat=0
norandommap
numjobs=1
runtime=300
filename=/mnt/bfs-1/test1

[randwrite-4k-io32]
bs=4k
iodepth=64
rw=randwrite
[root@qs62 mnt]# fio fio.conf
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 thread
^Cbs: 1 (f=1): [w(1)][37.3%][r=0KiB/s,w=48.2MiB/s][r=0,w=12.3k IOPS][eta 03m:08s]
randwrite-4k-io32: (groupid=0, jobs=1): err= 0: pid=14997: Fri Sep 23 09:31:07 2022
  write: IOPS=12.3k, BW=47.0MiB/s (50.3MB/s)(5415MiB/112864msec)
    slat (usec): min=3, max=3216, avg=19.84, stdev=29.21
    clat (usec): min=55, max=182211, avg=5187.89, stdev=4393.18
     lat (usec): min=66, max=182225, avg=5208.04, stdev=4392.06
    clat percentiles (usec):
     |  1.00th=[  281],  5.00th=[  519], 10.00th=[  865], 20.00th=[ 1631],
     | 30.00th=[ 2376], 40.00th=[ 3195], 50.00th=[ 4080], 60.00th=[ 5145],
     | 70.00th=[ 6390], 80.00th=[ 8160], 90.00th=[10945], 95.00th=[13435],
     | 99.00th=[19268], 99.50th=[22152], 99.90th=[31327], 99.95th=[36439],
     | 99.99th=[54789]
   bw (  KiB/s): min=37640, max=60872, per=99.99%, avg=49130.42, stdev=3498.32, samples=225
   iops        : min= 9410, max=15218, avg=12282.56, stdev=874.57, samples=225
  lat (usec)   : 100=0.01%, 250=0.56%, 500=4.18%, 750=3.60%, 1000=3.54%
  lat (msec)   : 2=13.05%, 4=24.12%, 10=38.41%, 20=11.68%, 50=0.84%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=5.97%, sys=27.96%, ctx=843971, majf=0, minf=872
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1386357,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs): WRITE: bw=47.0MiB/s (50.3MB/s), 47.0MiB/s-47.0MiB/s (50.3MB/s-50.3MB/s), io=5415MiB (5679MB), run=112864-112864msec
Disk stats (read/write): nvme0n1: ios=2/2274235, merge=0/0, ticks=0/27154, in_queue=27155, util=98.69%
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.08    0.00    3.50    0.00    0.00   96.42
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 19806.00 0.00 140.08 14.48 0.24 0.01 0.00 0.01 0.05 98.57
Damn, that's even worse. I'll have to run some benchmarks of my own and investigate.
OS VERSION:
cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@xx bfs-4]# uname -a
Linux qs62 5.19.0 #1 SMP PREEMPT_DYNAMIC Wed Sep 21 09:59:57 CST 2022 x86_64 x86_64 x86_64 GNU/Linux
Just pushed some performance improvements - in my testing we were CPU bound, but I was seeing higher performance than you so you may be bottlenecked on something else.
Can you check if those patches helped performance for you? And see if you're CPU bound - on this test, htop should show one core pegged at 100% if we are.
How about the random read performance?
1) I did not check whether bcachefs was CPU bound - should I?
2) 30 s fio direct-I/O 4K random write results:
   numjobs 1, qdepth 64: IOPS 19.4k, avg latency 3292.05 us
   numjobs 4, qdepth 1:  IOPS 40.4k, avg latency 95.47 us
The write amplification is still about double (comparing iostat device throughput to fio throughput).
I also did testing with more numjobs at qdepth 1; the maximum is still 40.4k.
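That roughly-double figure can be checked directly from the numbers fio and iostat report. A minimal sketch (hypothetical helper; the write counts below are taken from the fio run and its "Disk stats" line pasted earlier in this thread):

```python
# Estimate write amplification as device-level writes (from iostat or
# fio's "Disk stats" line) divided by the writes fio itself issued.
def write_amplification(device_writes: float, fio_writes: float) -> float:
    return device_writes / fio_writes

# From the earlier run: disk ios=2274235 writes vs fio issued rwts=1386357.
wa = write_amplification(2274235, 1386357)
print(round(wa, 2))  # -> 1.64
```

Note the device-level count includes journal and btree writes, which is why the question of journal flush frequency below matters.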
More details:
[root@qs62 bfs-1]# bcachefs show-super /dev/nvme0n1
External UUID: 83548cc8-185b-4f70-b4cd-3f164f45bb74
Internal UUID: 805a32bb-d980-4e6d-9528-6cecc0d4b970
Device index: 0
Label:
Version: backpointers
Oldest version on disk: backpointers
Created: Wed Sep 28 10:50:41 2022
Sequence number: 9
Superblock size: 3640
Clean: 0
Devices: 1
Sections: members,replicas_v0,clean,journal_v2,counters
Features: new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done
Options:
  block_size: 512 B
  btree_node_size: 256 KiB
  errors: continue [ro] panic
  metadata_replicas: 1
  data_replicas: 1
  metadata_replicas_required: 1
  data_replicas_required: 1
  encoded_extent_max: 64.0 KiB
  metadata_checksum: none [crc32c] crc64 xxhash
  data_checksum: none [crc32c] crc64 xxhash
  compression: [none] lz4 gzip zstd
  background_compression: [none] lz4 gzip zstd
  str_hash: crc32c crc64 [siphash]
  metadata_target: none
  foreground_target: none
  background_target: none
  promote_target: none
  erasure_code: 0
  inodes_32bit: 1
  shard_inode_numbers: 1
  inodes_use_key_cache: 1
  gc_reserve_percent: 8
  gc_reserve_bytes: 0 B
  root_reserve_percent: 0
  wide_macs: 0
  acl: 1
  usrquota: 0
  grpquota: 0
  prjquota: 0
  journal_flush_delay: 1000
  journal_flush_disabled: 0
  journal_reclaim_delay: 100
members (size 64):
  Device: 0
  UUID: 1dfe8b33-ca39-4faf-9128-ef8b0a12ec11
  Size: 1.46 TiB
  Bucket size: 512 KiB
  First bucket: 0
  Buckets: 3052370
  Last mount: Wed Sep 28 10:51:03 2022
  State: rw
  Label: (none)
  Data allowed: journal,btree,user
  Has data: journal,btree,user
  Discard: 0
  Freespace initialized: 1
[global]
ioengine=libaio
time_based
direct=1
thread
group_reporting
randrepeat=0
norandommap
numjobs=1
runtime=30
filename=/mnt/test1

[randwrite-4k-io32]
bs=4k
iodepth=64
rw=randwrite
stonewall
qdepth 64, num threads 1, random write IOPS=19.4k
fio ../fio.conf
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=75.7MiB/s][r=0,w=19.4k IOPS][eta 00m:00s]
randwrite-4k-io32: (groupid=0, jobs=1): err= 0: pid=5735: Wed Sep 28 11:06:20 2022
  write: IOPS=19.4k, BW=75.9MiB/s (79.6MB/s)(2277MiB/30003msec)
    slat (usec): min=3, max=1923, avg=15.11, stdev=15.40
    clat (usec): min=57, max=27392, avg=3276.65, stdev=2496.61
     lat (usec): min=70, max=27410, avg=3292.05, stdev=2495.38
    clat percentiles (usec):
     |  1.00th=[  285],  5.00th=[  457], 10.00th=[  701], 20.00th=[ 1172],
     | 30.00th=[ 1663], 40.00th=[ 2212], 50.00th=[ 2769], 60.00th=[ 3228],
     | 70.00th=[ 3949], 80.00th=[ 4948], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[11600], 99.50th=[12911], 99.90th=[16188], 99.95th=[17695],
     | 99.99th=[21890]
   bw (  KiB/s): min=70712, max=87792, per=100.00%, avg=77710.13, stdev=2718.97, samples=60
   iops        : min=17678, max=21948, avg=19427.53, stdev=679.74, samples=60
  lat (usec)   : 100=0.01%, 250=0.52%, 500=5.41%, 750=5.14%, 1000=5.45%
  lat (msec)   : 2=19.73%, 4=34.23%, 10=27.36%, 20=2.13%, 50=0.02%
  cpu          : usr=9.49%, sys=34.86%, ctx=482744, majf=0, minf=143
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,582889,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs): WRITE: bw=75.9MiB/s (79.6MB/s), 75.9MiB/s-75.9MiB/s (79.6MB/s-79.6MB/s), io=2277MiB (2388MB), run=30003-30003msec
Disk stats (read/write): nvme0n1: ios=0/751200, merge=0/0, ticks=0/12639, in_queue=12639, util=99.36%
But performance looks a little better when setting numjobs to 4 and iodepth to 1:
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 4 threads
Jobs: 4 (f=4): [w(4)][100.0%][r=0KiB/s,w=158MiB/s][r=0,w=40.4k IOPS][eta 00m:00s]
randwrite-4k-io32: (groupid=0, jobs=4): err= 0: pid=5905: Wed Sep 28 11:10:37 2022
  write: IOPS=41.3k, BW=161MiB/s (169MB/s)(4834MiB/30001msec)
    slat (usec): min=6, max=1256, avg= 9.31, stdev= 4.17
    clat (usec): min=2, max=3808, avg=85.95, stdev=88.31
     lat (usec): min=39, max=3826, avg=95.47, stdev=88.48
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   47], 10.00th=[   52], 20.00th=[   58],
     | 30.00th=[   63], 40.00th=[   67], 50.00th=[   70], 60.00th=[   74],
     | 70.00th=[   78], 80.00th=[   85], 90.00th=[  106], 95.00th=[  192],
     | 99.00th=[  347], 99.50th=[  396], 99.90th=[ 1516], 99.95th=[ 1696],
     | 99.99th=[ 2147]
   bw (  KiB/s): min=32232, max=51456, per=25.01%, avg=41270.00, stdev=2209.74, samples=236
   iops        : min= 8058, max=12864, avg=10317.47, stdev=552.43, samples=236
  lat (usec)   : 4=0.01%, 50=7.99%, 100=80.63%, 250=7.50%, 500=3.55%
  lat (usec)   : 750=0.10%, 1000=0.03%
  lat (msec)   : 2=0.18%, 4=0.02%
  cpu          : usr=3.56%, sys=11.77%, ctx=1238056, majf=0, minf=170
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1237578,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs): WRITE: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=4834MiB (5069MB), run=30001-30001msec
Disk stats (read/write): nvme0n1: ios=0/1425536, merge=0/0, ticks=0/22824, in_queue=22823, util=99.59%
I missed the fact that you're testing with iodepth=1 before. That explains the low performance, but not the write amplification.
There are various counters and time stats; in particular, the counters will tell you the amount of user data writes, btree writes, and journal writes. Check those.
I'd also suspect something is flushing the journal - your fio config doesn't specify sync. There should be a counter for number of journal writes, it should be on the order of one every 10-100 ms for a workload like this - if it's higher, that's the source of the write amplification.
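As a back-of-the-envelope check of that rule of thumb (a sketch only; the counter's exact name and sysfs location aren't shown in this thread, so read it from wherever your kernel exposes it):

```python
# Expected journal-write count for a run, given the rule of thumb above
# of one journal write every 10-100 ms (i.e. 10-100 per second).
# A measured counter well above the high end points at excessive flushing.
def expected_journal_writes(runtime_s: int) -> tuple[int, int]:
    return (runtime_s * 10, runtime_s * 100)  # (low, high)

lo, hi = expected_journal_writes(300)  # the 300 s fio runs in this thread
print(lo, hi)  # -> 3000 30000
```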
I did test with qdepth 64, numjobs 1: the IOPS is 19.4k, avg latency 3292.05 us.
What is the best fio direct random 4K write performance you have tested?
Thanks.
Using kernel version 6.0.0, I did the performance testing on the same HW platform (direct I/O, queue depth 256, numjobs 4, two truncated 50G files for random and sequential read/write):

                bcachefs   btrfs
4K rand write    37.6k     74.9k
4K rand read      264k      313k
64K seq write    25.0k     30.9k
64K seq read     51.5k     51.4k
Looks like the btrfs performance is a little better, especially for 4K random write.
Yeah, backpointers was a huge performance regression - I may have to make it optional. Right now I'm working on the btree write buffer code to see if I can get acceptable performance with that.
@koverstreet backpointers were solving a performance/scaling issue in copygc that would cripple a large HDD-based FS without them. I don't think making them optional is the best idea; perhaps with the new persistent work tree we could solve the issue another way and drop backpointers, or make them async, since we only really need them for copygc.
The rebalance work btree is completely unrelated.
Making it optional might be the best approach: as you noted, it's needed primarily for large HDD-based filesystems, which are slow enough that the performance impact won't affect them. We can let users enable it for those filesystems without affecting smaller SSD filesystems, where the performance impact is, as this bug report notes, a real issue.
I can do some experimental testing on an NVMe device without backpointers, if you're happy to point out how to disable this in the code.
You'd have to go to the first pre-backpointers commit - 0a10aa12ef
~45K random 4K write, direct I/O (5.17). I guess you did optimization on direct I/O after version 5.17. So I will test again if you make backpointers optional.
Btree write buffer has been merged - go ahead and retest
Post your fio config
The above testing used the following command:
fio -direct=1 -iodepth=64 -rw=randwrite -ioengine=libaio -bs=4k -thread -numjobs=4 -group_reporting -filename=/mnt/bfs-1/f1-50g -name=RandWrite_4K_Testing
I set numjobs to 2; performance is better.
More testing results with the write-buffer code merged:
BTRFS: fio -direct=1 -iodepth=1 -rw=randwrite -ioengine=libaio -bs=4k -thread -numjobs=1 -group_reporting -filename=/mnt/btrfs-1/f2-50g -name=RandWrite_4K_Testing -timeout=300
IOPS=36.1k lat (usec): min=20, max=1399, avg=26.21, stdev= 8.04
Bcachefs: fio -direct=1 -iodepth=1 -rw=randwrite -ioengine=libaio -bs=4k -thread -numjobs=1 -group_reporting -filename=/mnt/bfs-1/f2-50g -name=RandWrite_4K_Testing -timeout=300
IOPS=14.9k lat (usec): min=36, max=14937, avg=65.09, stdev=187.14
Mounting is fast on a large HDD when the FS is clean. But when the FS is not clean after an unexpected shutdown, mounting is still slow - often 15 minutes on a 6T HDD. It is much faster now, but not fast enough, even though it used to take 30 minutes to an hour.
I know HDDs are poor at random reads...
Btrfs basically does not need fsck - could bcachefs work like that? Thanks @koverstreet
mount with -o verbose and post the log
Mounting hangs for a long time at "checking for deleted inodes" when mounting a large HDD.
[ 8676.448183] bcachefs: loading out-of-tree module taints kernel.
[ 8676.541734] bcachefs: bch2_fs_open()
[ 8676.541742] bcachefs: bch2_read_super()
[ 8676.542168] bcachefs: bch2_read_super() ret 0
[ 8676.542173] bcachefs: bch2_fs_alloc()
[ 8676.573610] bcachefs: bch2_fs_journal_init()
[ 8676.573654] bcachefs: bch2_fs_journal_init() ret 0
[ 8676.573784] bcachefs: bch2_fs_btree_cache_init()
[ 8676.574229] bcachefs: bch2_fs_btree_cache_init() ret 0
[ 8676.576252] bcachefs: bch2_fs_encryption_init()
[ 8676.576276] bcachefs: bch2_fs_encryption_init() ret 0
[ 8676.576283] bcachefs: __bch2_fs_compress_init()
[ 8676.576309] bcachefs: __bch2_fs_compress_init() ret 0
[ 8676.576504] bcachefs: bch2_fs_fsio_init()
[ 8676.577286] bcachefs: bch2_fs_fsio_init() ret 0
[ 8676.577290] bcachefs: bch2_dev_alloc()
[ 8676.578187] bcachefs: bch2_dev_alloc() ret 0
[ 8676.581995] bcachefs: bch2_fs_alloc() ret 0
[ 8676.582501] bcachefs (sda): recovering from unclean shutdown
[ 8676.582529] bcachefs (sda): starting journal read
[ 8693.689501] bcachefs (sda): journal read done on device sda, ret 0
[ 8693.690201] bcachefs (sda): journal read done, replaying entries 95556109-95558024
[ 8693.862730] bcachefs (sda): starting alloc read
[ 8694.985790] bcachefs (sda): alloc read done
[ 8694.985798] bcachefs (sda): starting stripes_read
[ 8694.985817] bcachefs (sda): stripes_read done
[ 8694.985819] bcachefs (sda): reading snapshots table
[ 8694.985944] bcachefs (sda): reading snapshots done
[ 8694.985947] bcachefs (sda): starting journal replay, 9327 keys
[ 8694.987256] bcachefs (sda): going read-write
[ 8694.996949] ------------[ cut here ]------------
[ 8694.996952] Looking for class "b->c.lock" with key bch2_btree_node_lock_key [bcachefs], but found a different class "b->c.lock" with the same key
[ 8694.996987] WARNING: CPU: 7 PID: 23732 at kernel/locking/lockdep.c:937 look_up_lock_class+0x9c/0x110
[ 8694.996995] Modules linked in: bcachefs(O) mean_and_variance intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rfkill kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_rcore snd_hwdep aesni_intel crypto_simd snd_seq iTCO_wdt cryptd iTCO_vendor_support mxm_wmi snd_seq_device snd_pcm efi_pstore pcspkr i2c_i801 snd_timer i2c_smbus snd lpc_ich sg i2c_core soundcore mfd_core evdev wmi button ip_tables x_tables r8169 realtek mdio_devres libphy efivarfs autofs4
[ 8694.997047] CPU: 7 PID: 23732 Comm: kworker/7:0 Tainted: G O 6.1.0-uksm+ #50 346e44c1ce9f02bb71dda964ff5209e1818a5f27
[ 8694.997050] Hardware name: INTEL X99/X99, BIOS 5.11 12/30/2020
[ 8694.997052] Workqueue: events_long bch2_do_discards_work [bcachefs]
[ 8694.997079] RIP: 0010:look_up_lock_class+0x9c/0x110
[ 8694.997083] Code: 17 48 81 fa 80 2c e5 82 74 52 80 3d ca 26 d5 00 00 75 49 48 c7 c7 40 a5 54 82 48 89 45 f0 c6 05 b6 26 d5 00 01 e8 e4 cf f9 ff <0f> 0b 48 8b 45 f0 eb 2a 89 75 f0 e8 b4 b3 a0 ff 8b 75 f0 48 c7 c7
[ 8694.997085] RSP: 0018:ffffc9000b2a35b8 EFLAGS: 00010096
[ 8694.997087] RAX: 0000000000000085 RBX: 0000000000000000 RCX: 0000000000000000
[ 8694.997089] RDX: 0000000000000001 RSI: ffffffff825b1ec8 RDI: 00000000ffffffff
[ 8694.997090] RBP: ffffc9000b2a35c8 R08: 0000000000013ffb R09: 00000000ffffbfff
[ 8694.997091] R10: ffff88903fca0000 R11: ffff88903fca0000 R12: 0000000000000000
[ 8694.997093] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 8694.997094] FS: 0000000000000000(0000) GS:ffff889033a00000(0000) knlGS:0000000000000000
[ 8694.997096] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8694.997097] CR2: 00007f48850a2000 CR3: 00000001300f0002 CR4: 00000000003706e0
[ 8694.997098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8694.997100] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8694.997101] Call Trace:
[ 8694.997102] <TASK>
[ 8694.997105] register_lock_class+0x36/0x7d0
[ 8694.997109] ? __this_cpu_preempt_check+0x13/0x20
[ 8694.997115] __lock_acquire+0x60/0x1cd0
[ 8694.997120] lock_acquire+0xc0/0x2b0
[ 8694.997123] ? bch2_btree_path_traverse_cached_slowpath+0xd37/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997157] ? bch2_btree_path_traverse_cached_slowpath+0xd37/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997187] six_trylock_ip_intent+0x5d/0x160
[ 8694.997190] ? bch2_btree_path_traverse_cached_slowpath+0xd37/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997220] bch2_btree_path_traverse_cached_slowpath+0xd4d/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997249] ? bch2_btree_key_cache_find+0x226/0x3a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997281] bch2_btree_path_traverse_cached+0x1c0/0x240 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997310] bch2_btree_path_traverse_one+0x5cc/0xd60 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997339] ? bch2_do_discards_work+0x90a/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997366] ? bch2_btree_path_verify+0x87/0xc0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997395] bch2_btree_iter_peek_slot+0x12a/0x760 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997423] ? bch2_printbuf_exit+0x1c/0x30 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997463] ? bch2_trans_update_max_paths+0xd3/0xf2 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997507] ? bch2_path_get+0x2b3/0x4f0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997539] ? bch2_do_discards_work+0x810/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997570] bch2_do_discards_work+0x90a/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997608] ? bch2_do_discards_work+0x42/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997640] ? bch2_do_discards_work+0x810/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997675] ? bch2_do_discards_work+0x15a/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997710] process_one_work+0x27a/0x570
[ 8694.997717] worker_thread+0x52/0x3e0
[ 8694.997720] ? process_one_work+0x570/0x570
[ 8694.997722] kthread+0xf5/0x120
[ 8694.997725] ? kthread_complete_and_exit+0x20/0x20
[ 8694.997730] ret_from_fork+0x1f/0x30
[ 8694.997738] </TASK>
[ 8694.997739] irq event stamp: 26167
[ 8694.997741] hardirqs last enabled at (26167): [<ffffffff8131479a>] ___slab_alloc.constprop.0+0x91a/0x1090
[ 8694.997746] hardirqs last disabled at (26166): [<ffffffff81314c2f>] ___slab_alloc.constprop.0+0xdaf/0x1090
[ 8694.997749] softirqs last enabled at (26132): [<ffffffffa074b645>] __bch2_btree_node_hash_insert+0x3a5/0x6e0 [bcachefs]
[ 8694.997779] softirqs last disabled at (26130): [<ffffffffa074b3e6>] __bch2_btree_node_hash_insert+0x146/0x6e0 [bcachefs]
[ 8694.997809] ---[ end trace 0000000000000000 ]---
[ 8701.223951] bcachefs (sda): journal replay done
[ 8701.223969] bcachefs (sda): checking for deleted inodes
Ok that's fixable - we need to add an index for deleted inodes.
version: 5.19
[root@qs62 mnt]# bcachefs show-super /dev/nvme0n1
External UUID: 8a11ae86-5c3e-4426-92d2-9dc642f42cf4
Internal UUID: dc7919f5-7143-4cd9-95fe-01a144e796d9
Device index: 0
Label:
Version: backpointers
Oldest version on disk: backpointers
Created: Thu Sep 22 18:04:44 2022
Sequence number: 9
Superblock size: 3640
Clean: 0
Devices: 1
Sections: members,replicas_v0,clean,journal_v2,counters
Features: new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done
Options:
  block_size: 512 B
  btree_node_size: 512 KiB
  errors: continue [ro] panic
  metadata_replicas: 1
  data_replicas: 1
  metadata_replicas_required: 1
  data_replicas_required: 1
  encoded_extent_max: 64.0 KiB
  metadata_checksum: none [crc32c] crc64 xxhash
  data_checksum: none [crc32c] crc64 xxhash
  compression: [none] lz4 gzip zstd
  background_compression: [none] lz4 gzip zstd
  str_hash: crc32c crc64 [siphash]
  metadata_target: none
  foreground_target: none
  background_target: none
  promote_target: none
  erasure_code: 0
  inodes_32bit: 1
  shard_inode_numbers: 1
  inodes_use_key_cache: 1
  gc_reserve_percent: 8
  gc_reserve_bytes: 0 B
  root_reserve_percent: 0
  wide_macs: 0
  acl: 1
  usrquota:
fio config file:
[root@qs62 mnt]# cat fio.conf
[global]
ioengine=libaio
time_based
direct=1
thread
group_reporting
randrepeat=0
norandommap
numjobs=1
runtime=300
filename=/mnt/bfs-1/test1

[randwrite-4k-io32]
bs=4k
iodepth=64
rw=randwrite
stonewall
fio result:
[root@qs62 mnt]# fio fio.conf
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 thread
^Cbs: 1 (f=1): [w(1)][25.0%][r=0KiB/s,w=62.1MiB/s][r=0,w=15.9k IOPS][eta 03m:45s]
randwrite-4k-io32: (groupid=0, jobs=1): err= 0: pid=3634: Thu Sep 22 18:09:25 2022
  write: IOPS=17.1k, BW=66.9MiB/s (70.2MB/s)(5017MiB/74983msec)
    slat (usec): min=3, max=7374, avg=14.57, stdev=15.56
    clat (usec): min=46, max=40128, avg=3719.80, stdev=3697.97
     lat (usec): min=60, max=40143, avg=3734.66, stdev=3697.19
    clat percentiles (usec):
     |  1.00th=[  277],  5.00th=[  441], 10.00th=[  668], 20.00th=[ 1106],
     | 30.00th=[ 1598], 40.00th=[ 2180], 50.00th=[ 2638], 60.00th=[ 3032],
     | 70.00th=[ 3884], 80.00th=[ 5342], 90.00th=[ 8356], 95.00th=[12125],
     | 99.00th=[17695], 99.50th=[19792], 99.90th=[25035], 99.95th=[26870],
     | 99.99th=[31327]
   bw (  KiB/s): min=56120, max=93232, per=100.00%, avg=68511.07, stdev=7481.22, samples=149
   iops        : min=14030, max=23308, avg=17127.74, stdev=1870.32, samples=149
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.59%, 500=5.66%, 750=5.74%
  lat (usec)   : 1000=5.63%
  lat (msec)   : 2=19.47%, 4=33.92%, 10=21.39%, 20=7.14%, 50=0.45%
  cpu          : usr=8.19%, sys=29.58%, ctx=1042242, majf=0, minf=161
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1284258,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs): WRITE: bw=66.9MiB/s (70.2MB/s), 66.9MiB/s-66.9MiB/s (70.2MB/s-70.2MB/s), io=5017MiB (5260MB), run=74983-74983msec
Disk stats (read/write): nvme0n1: ios=0/1534109, merge=0/0, ticks=0/62994, in_queue=62993, util=99.00%
nvme device iostat:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.11    0.00    3.37    0.00    0.00   96.51

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 18806.33 0.00 124.82 13.59 0.78 0.04 0.00 0.04 0.05 99.33
Questions: