koverstreet / bcachefs


NVMe random write performance [5.19] #458

Open alkkkj opened 2 years ago

alkkkj commented 2 years ago

version: 5.19

[root@qs62 mnt]# bcachefs show-super /dev/nvme0n1
External UUID:           8a11ae86-5c3e-4426-92d2-9dc642f42cf4
Internal UUID:           dc7919f5-7143-4cd9-95fe-01a144e796d9
Device index:            0
Label:
Version:                 backpointers
Oldest version on disk:  backpointers
Created:                 Thu Sep 22 18:04:44 2022
Sequence number:         9
Superblock size:         3640
Clean:                   0
Devices:                 1
Sections:                members,replicas_v0,clean,journal_v2,counters
Features:                new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:         alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options: block_size: 512 B btree_node_size: 512 KiB errors: continue [ro] panic metadata_replicas: 1 data_replicas: 1 metadata_replicas_required: 1 data_replicas_required: 1 encoded_extent_max: 64.0 KiB metadata_checksum: none [crc32c] crc64 xxhash data_checksum: none [crc32c] crc64 xxhash compression: [none] lz4 gzip zstd background_compression: [none] lz4 gzip zstd str_hash: crc32c crc64 [siphash] metadata_target: none foreground_target: none background_target: none promote_target: none erasure_code: 0 inodes_32bit: 1 shard_inode_numbers: 1 inodes_use_key_cache: 1 gc_reserve_percent: 8 gc_reserve_bytes: 0 B root_reserve_percent: 0 wide_macs: 0 acl: 1 usrquota:

fio config file:

[root@qs62 mnt]# cat fio.conf
[global]
ioengine=libaio
time_based
direct=1
thread
group_reporting
randrepeat=0
norandommap
numjobs=1
runtime=300
filename=/mnt/bfs-1/test1

[randwrite-4k-io32]
bs=4k
iodepth=64
rw=randwrite
stonewall

fio result:

[root@qs62 mnt]# fio fio.conf
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 thread
^Cbs: 1 (f=1): [w(1)][25.0%][r=0KiB/s,w=62.1MiB/s][r=0,w=15.9k IOPS][eta 03m:45s]

randwrite-4k-io32: (groupid=0, jobs=1): err= 0: pid=3634: Thu Sep 22 18:09:25 2022
  write: IOPS=17.1k, BW=66.9MiB/s (70.2MB/s)(5017MiB/74983msec)
    slat (usec): min=3, max=7374, avg=14.57, stdev=15.56
    clat (usec): min=46, max=40128, avg=3719.80, stdev=3697.97
     lat (usec): min=60, max=40143, avg=3734.66, stdev=3697.19
    clat percentiles (usec):
     |  1.00th=[  277],  5.00th=[  441], 10.00th=[  668], 20.00th=[ 1106],
     | 30.00th=[ 1598], 40.00th=[ 2180], 50.00th=[ 2638], 60.00th=[ 3032],
     | 70.00th=[ 3884], 80.00th=[ 5342], 90.00th=[ 8356], 95.00th=[12125],
     | 99.00th=[17695], 99.50th=[19792], 99.90th=[25035], 99.95th=[26870],
     | 99.99th=[31327]
   bw (  KiB/s): min=56120, max=93232, per=100.00%, avg=68511.07, stdev=7481.22, samples=149
   iops        : min=14030, max=23308, avg=17127.74, stdev=1870.32, samples=149
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.59%, 500=5.66%, 750=5.74%
  lat (usec)   : 1000=5.63%
  lat (msec)   : 2=19.47%, 4=33.92%, 10=21.39%, 20=7.14%, 50=0.45%
  cpu          : usr=8.19%, sys=29.58%, ctx=1042242, majf=0, minf=161
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1284258,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs): WRITE: bw=66.9MiB/s (70.2MB/s), 66.9MiB/s-66.9MiB/s (70.2MB/s-70.2MB/s), io=5017MiB (5260MB), run=74983-74983msec

Disk stats (read/write): nvme0n1: ios=0/1534109, merge=0/0, ticks=0/62994, in_queue=62993, util=99.00%

nvme device iostat:

avg-cpu:  %user  %nice %system %iowait  %steal   %idle
           0.11   0.00    3.37    0.00    0.00   96.51

Device:   rrqm/s  wrqm/s   r/s       w/s  rMB/s   wMB/s avgrq-sz avgqu-sz await r_await w_await svctm  %util
nvme0n1     0.00    0.00  0.00  18806.33   0.00  124.82    13.59     0.78  0.04    0.00    0.04  0.05  99.33

Questions:

  1. Random write performance seems bad, given that fio with the same config on the raw NVMe device reaches 300K IOPS (a sketch of that raw-device command follows this list).
  2. Write amplification looks serious - almost double: fio on the bcachefs file gets about 66.9 MiB/s (~70 MB/s) of bandwidth while the backend NVMe device sees about 124 MB/s.
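
For reference, the raw-device baseline mentioned in question 1 could be reproduced with a command along these lines (a sketch only: it mirrors the job parameters from fio.conf above, the job name is made up, and writing directly to /dev/nvme0n1 destroys the filesystem on it):

fio --name=raw-randwrite-4k --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=64 --numjobs=1 --time_based --runtime=300 \
    --norandommap --randrepeat=0 --thread --group_reporting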
koverstreet commented 2 years ago

This regressed with backpointers: when backpointers no longer fit into the alloc key we have to spill over to the backpointers btree, which isn't cached in the btree key cache.

Can you try with a 128k bucket size? That will mean (with 4k block size) 32 blocks per bucket, which will fit in the alloc keys - that'll help confirm. I'm going to expand the size of bkey values to address this.
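
Testing that suggestion would mean reformatting with explicit bucket and block sizes; a minimal sketch, assuming the --block_size and --bucket flags of bcachefs-tools' format command and the device/mount point used above (formatting destroys existing data):

bcachefs format --block_size=4k --bucket=128k /dev/nvme0n1
mount -t bcachefs /dev/nvme0n1 /mnt/bfs-1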

alkkkj commented 2 years ago

Tried it - are any more experiments needed?

[root@qs62 ~]# bcachefs show-super /dev/nvme0n1
External UUID:           148bf99b-5599-4f1d-aa3a-41a1931fb615
Internal UUID:           e9db42fc-5c0e-4251-a80b-c8c783f8948f
Device index:            0
Label:
Version:                 backpointers
Oldest version on disk:  backpointers
Created:                 Fri Sep 23 09:15:06 2022
Sequence number:         9
Superblock size:         3656
Clean:                   0
Devices:                 1
Sections:                members,replicas_v0,clean,journal_v2,counters
Features:                new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:         alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options: block_size: 4.00 KiB btree_node_size: 128 KiB errors: continue [ro] panic metadata_replicas: 1 data_replicas: 1 metadata_replicas_required: 1 data_replicas_required: 1 encoded_extent_max: 64.0 KiB metadata_checksum: none [crc32c] crc64 xxhash data_checksum: none [crc32c] crc64 xxhash compression: [none] lz4 gzip zstd background_compression: [none] lz4 gzip zstd str_hash: crc32c crc64 [siphash] metadata_target: none foreground_target: none background_target: none promote_target: none erasure_code: 0 inodes_32bit: 1 shard_inode_numbers: 1 inodes_use_key_cache: 1 gc_reserve_percent: 8 gc_reserve_bytes: 0 B root_reserve_percent: 0 wide_macs: 0 acl: 1 usrquota: 0 grpquota: 0 prjquota: 0 journal_flush_delay: 1000 journal_flush_disabled: 0 journal_reclaim_delay: 100

members (size 64): Device: 0 UUID: e4db599b-ced1-4da6-ae97-8713b994e7f2 Size: 1.46 TiB Bucket size: 128 KiB First bucket: 0 Buckets: 12209482 Last mount: Fri Sep 23 09:16:00 2022 State: rw Label: (none) Data allowed: journal,btree,user Has data: journal,btree,user Discard: 0 Freespace initialized: 1

[root@qs62 mnt]# cat fio.conf
[global]
ioengine=libaio
time_based
direct=1
thread
group_reporting
randrepeat=0
norandommap
numjobs=1
runtime=300
filename=/mnt/bfs-1/test1

[randwrite-4k-io32]
bs=4k
iodepth=64
rw=randwrite

[root@qs62 mnt]# fio fio.conf
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 thread
^Cbs: 1 (f=1): [w(1)][37.3%][r=0KiB/s,w=48.2MiB/s][r=0,w=12.3k IOPS][eta 03m:08s]

randwrite-4k-io32: (groupid=0, jobs=1): err= 0: pid=14997: Fri Sep 23 09:31:07 2022
  write: IOPS=12.3k, BW=47.0MiB/s (50.3MB/s)(5415MiB/112864msec)
    slat (usec): min=3, max=3216, avg=19.84, stdev=29.21
    clat (usec): min=55, max=182211, avg=5187.89, stdev=4393.18
     lat (usec): min=66, max=182225, avg=5208.04, stdev=4392.06
    clat percentiles (usec):
     |  1.00th=[  281],  5.00th=[  519], 10.00th=[  865], 20.00th=[ 1631],
     | 30.00th=[ 2376], 40.00th=[ 3195], 50.00th=[ 4080], 60.00th=[ 5145],
     | 70.00th=[ 6390], 80.00th=[ 8160], 90.00th=[10945], 95.00th=[13435],
     | 99.00th=[19268], 99.50th=[22152], 99.90th=[31327], 99.95th=[36439],
     | 99.99th=[54789]
   bw (  KiB/s): min=37640, max=60872, per=99.99%, avg=49130.42, stdev=3498.32, samples=225
   iops        : min= 9410, max=15218, avg=12282.56, stdev=874.57, samples=225
  lat (usec)   : 100=0.01%, 250=0.56%, 500=4.18%, 750=3.60%, 1000=3.54%
  lat (msec)   : 2=13.05%, 4=24.12%, 10=38.41%, 20=11.68%, 50=0.84%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=5.97%, sys=27.96%, ctx=843971, majf=0, minf=872
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1386357,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs): WRITE: bw=47.0MiB/s (50.3MB/s), 47.0MiB/s-47.0MiB/s (50.3MB/s-50.3MB/s), io=5415MiB (5679MB), run=112864-112864msec

Disk stats (read/write): nvme0n1: ios=2/2274235, merge=0/0, ticks=0/27154, in_queue=27155, util=98.69%

avg-cpu:  %user  %nice %system %iowait  %steal   %idle
           0.08   0.00    3.50    0.00    0.00   96.42

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

nvme0n1 0.00 0.00 0.00 19806.00 0.00 140.08 14.48 0.24 0.01 0.00 0.01 0.05 98.57

koverstreet commented 2 years ago

Damn, that's even worse. I'll have to run some benchmarks of my own and investigate.

alkkkj commented 2 years ago

OS version:

cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

[root@xx bfs-4]# uname -a
Linux qs62 5.19.0 #1 SMP PREEMPT_DYNAMIC Wed Sep 21 09:59:57 CST 2022 x86_64 x86_64 x86_64 GNU/Linux

koverstreet commented 1 year ago

Just pushed some performance improvements - in my testing we were CPU bound, but I was seeing higher performance than you so you may be bottlenecked on something else.

Can you check if those patches helped performance for you? And see if you're CPU bound - on this test, htop should show one core pegged at 100% if we are.
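
A quick way to check for a single pegged core while fio is running (a sketch; assumes the sysstat package is installed - htop works just as well):

# per-CPU utilization, printed once a second; one CPU sitting near 100% suggests the test is CPU bound
mpstat -P ALL 1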

bhzhu203 commented 1 year ago

How about the random read performance?

alkkkj commented 1 year ago

1. I did not check whether bcachefs is CPU bound - should I?
2. 30s fio direct I/O 4K random write results:
   - numjobs 1, qdepth 64: IOPS 19.4k, avg latency 3292.05us
   - numjobs 4, qdepth 1: IOPS 40.4k, avg latency 95.47us

The write amplification is still roughly double (comparing iostat throughput against fio throughput).

I also tested with more numjobs at qdepth 1; the maximum is still 40.4k.

More details:

[root@qs62 bfs-1]# bcachefs show-super /dev/nvme0n1
External UUID:           83548cc8-185b-4f70-b4cd-3f164f45bb74
Internal UUID:           805a32bb-d980-4e6d-9528-6cecc0d4b970
Device index:            0
Label:
Version:                 backpointers
Oldest version on disk:  backpointers
Created:                 Wed Sep 28 10:50:41 2022
Sequence number:         9
Superblock size:         3640
Clean:                   0
Devices:                 1
Sections:                members,replicas_v0,clean,journal_v2,counters
Features:                new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:         alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options: block_size: 512 B btree_node_size: 256 KiB errors: continue [ro] panic metadata_replicas: 1 data_replicas: 1 metadata_replicas_required: 1 data_replicas_required: 1 encoded_extent_max: 64.0 KiB metadata_checksum: none [crc32c] crc64 xxhash data_checksum: none [crc32c] crc64 xxhash compression: [none] lz4 gzip zstd background_compression: [none] lz4 gzip zstd str_hash: crc32c crc64 [siphash] metadata_target: none foreground_target: none background_target: none promote_target: none erasure_code: 0 inodes_32bit: 1 shard_inode_numbers: 1 inodes_use_key_cache: 1 gc_reserve_percent: 8 gc_reserve_bytes: 0 B root_reserve_percent: 0 wide_macs: 0 acl: 1 usrquota: 0 grpquota: 0 prjquota: 0 journal_flush_delay: 1000 journal_flush_disabled: 0 journal_reclaim_delay: 100

members (size 64): Device: 0 UUID: 1dfe8b33-ca39-4faf-9128-ef8b0a12ec11 Size: 1.46 TiB Bucket size: 512 KiB First bucket: 0 Buckets: 3052370 Last mount: Wed Sep 28 10:51:03 2022 State: rw Label: (none) Data allowed: journal,btree,user Has data: journal,btree,user Discard: 0 Freespace initialized: 1

[global]
ioengine=libaio
time_based
direct=1
thread
group_reporting
randrepeat=0
norandommap
numjobs=1
runtime=30
filename=/mnt/test1

[randwrite-4k-io32]
bs=4k
iodepth=64
rw=randwrite
stonewall

qdepth 64, num threads 1, random write IOPS=19.4k

fio ../fio.conf
randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=75.7MiB/s][r=0,w=19.4k IOPS][eta 00m:00s]
randwrite-4k-io32: (groupid=0, jobs=1): err= 0: pid=5735: Wed Sep 28 11:06:20 2022
  write: IOPS=19.4k, BW=75.9MiB/s (79.6MB/s)(2277MiB/30003msec)
    slat (usec): min=3, max=1923, avg=15.11, stdev=15.40
    clat (usec): min=57, max=27392, avg=3276.65, stdev=2496.61
     lat (usec): min=70, max=27410, avg=3292.05, stdev=2495.38
    clat percentiles (usec):
     |  1.00th=[  285],  5.00th=[  457], 10.00th=[  701], 20.00th=[ 1172],
     | 30.00th=[ 1663], 40.00th=[ 2212], 50.00th=[ 2769], 60.00th=[ 3228],
     | 70.00th=[ 3949], 80.00th=[ 4948], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[11600], 99.50th=[12911], 99.90th=[16188], 99.95th=[17695],
     | 99.99th=[21890]
   bw (  KiB/s): min=70712, max=87792, per=100.00%, avg=77710.13, stdev=2718.97, samples=60
   iops        : min=17678, max=21948, avg=19427.53, stdev=679.74, samples=60
  lat (usec)   : 100=0.01%, 250=0.52%, 500=5.41%, 750=5.14%, 1000=5.45%
  lat (msec)   : 2=19.73%, 4=34.23%, 10=27.36%, 20=2.13%, 50=0.02%
  cpu          : usr=9.49%, sys=34.86%, ctx=482744, majf=0, minf=143
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,582889,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs): WRITE: bw=75.9MiB/s (79.6MB/s), 75.9MiB/s-75.9MiB/s (79.6MB/s-79.6MB/s), io=2277MiB (2388MB), run=30003-30003msec

Disk stats (read/write): nvme0n1: ios=0/751200, merge=0/0, ticks=0/12639, in_queue=12639, util=99.36%

But performance looks a little better with more jobs at iodepth 1 (the run below uses 4 threads):

randwrite-4k-io32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 4 threads
Jobs: 4 (f=4): [w(4)][100.0%][r=0KiB/s,w=158MiB/s][r=0,w=40.4k IOPS][eta 00m:00s]
randwrite-4k-io32: (groupid=0, jobs=4): err= 0: pid=5905: Wed Sep 28 11:10:37 2022
  write: IOPS=41.3k, BW=161MiB/s (169MB/s)(4834MiB/30001msec)
    slat (usec): min=6, max=1256, avg= 9.31, stdev= 4.17
    clat (usec): min=2, max=3808, avg=85.95, stdev=88.31
     lat (usec): min=39, max=3826, avg=95.47, stdev=88.48
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   47], 10.00th=[   52], 20.00th=[   58],
     | 30.00th=[   63], 40.00th=[   67], 50.00th=[   70], 60.00th=[   74],
     | 70.00th=[   78], 80.00th=[   85], 90.00th=[  106], 95.00th=[  192],
     | 99.00th=[  347], 99.50th=[  396], 99.90th=[ 1516], 99.95th=[ 1696],
     | 99.99th=[ 2147]
   bw (  KiB/s): min=32232, max=51456, per=25.01%, avg=41270.00, stdev=2209.74, samples=236
   iops        : min= 8058, max=12864, avg=10317.47, stdev=552.43, samples=236
  lat (usec)   : 4=0.01%, 50=7.99%, 100=80.63%, 250=7.50%, 500=3.55%
  lat (usec)   : 750=0.10%, 1000=0.03%
  lat (msec)   : 2=0.18%, 4=0.02%
  cpu          : usr=3.56%, sys=11.77%, ctx=1238056, majf=0, minf=170
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1237578,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs): WRITE: bw=161MiB/s (169MB/s), 161MiB/s-161MiB/s (169MB/s-169MB/s), io=4834MiB (5069MB), run=30001-30001msec

Disk stats (read/write): nvme0n1: ios=0/1425536, merge=0/0, ticks=0/22824, in_queue=22823, util=99.59%

koverstreet commented 1 year ago

I missed that you were testing with iodepth=1 before. That explains the low performance, but not the write amplification.

There are various counters and time stats; in particular, the counters will tell you the amount of user data writes, btree writes, and journal writes. Check those.

I'd also suspect something is flushing the journal - your fio config doesn't specify sync. There should be a counter for the number of journal writes; it should be on the order of one every 10-100 ms for a workload like this. If it's higher, that's the source of the write amplification.
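
One way to watch those stats during a run might be the following (a sketch - the sysfs layout is an assumption: this version exposes a per-filesystem directory under /sys/fs/bcachefs/<external UUID>/, but the exact counters/time_stats file names may differ):

# event counters: user data writes, btree writes, journal writes, ...
grep . /sys/fs/bcachefs/*/counters/* 2>/dev/null
# journal write/flush time statistics, if present
grep . /sys/fs/bcachefs/*/time_stats/journal_* 2>/dev/null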

alkkkj commented 1 year ago

I did test with qdepth 64 and numjobs 1: the IOPS is 19.4k, avg latency 3292.05us.

What is the best fio direct random 4K write performance you have measured?

Thanks.

alkkkj commented 1 year ago

Using kernel version 6.0.0, I did performance testing on the same HW platform (direct I/O, queue depth 256, numjobs 4, two freshly truncated 50G files for random and sequential read/write):

                 bcachefs    btrfs
4K rand write      37.6k     74.9k
4K rand read       264k      313k
64K seq write      25.0k     30.9k
64K seq read       51.5k     51.4k

btrfs performance looks a little better, especially for 4K random write.

koverstreet commented 1 year ago

Yeah, backpointers were a huge performance regression - I may have to make them optional. Right now I'm working on the btree write buffer code to see if I can get acceptable performance with that.

YellowOnion commented 1 year ago

@koverstreet Backpointers solve a performance/scaling issue in copygc that would cripple a large HDD-based FS without them, so I don't think making them optional is the best idea. Perhaps with the new persistent work tree we could solve the issue another way and drop backpointers, or make them async, since we only really need them for copygc.

koverstreet commented 1 year ago

The rebalance work btree is completely unrelated.

Making it optional might be the best approach: as you noted, it's needed primarily for large HDD-based filesystems, which are slow enough that the performance impact won't affect them. We can let users enable it for those filesystems without it affecting smaller SSD filesystems, where the performance impact is, as this bug report notes, a real issue.

alkkkj commented 1 year ago

I can do some experimental testing on an NVMe device without backpointers, if you're happy to point out how to disable this in the code.

koverstreet commented 1 year ago

You'd have to go to the first pre-backpointers commit - 0a10aa12ef
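
Trying that would mean building the kernel from that commit in the bcachefs tree; a rough sketch, assuming the bcachefs kernel source is already cloned and a working .config is in place:

git checkout -b pre-backpointers 0a10aa12ef   # the commit referenced above
make olddefconfig
make -j"$(nproc)"
sudo make modules_install install
# reboot into the new kernel, then rerun the fio job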

alkkkj commented 1 year ago

~45K IOPS for random 4K writes with direct I/O on 5.17. I guess you optimized direct I/O after version 5.17, so I will test again if you make backpointers optional.

koverstreet commented 1 year ago

Btree write buffer has been merged - go ahead and retest

alkkkj commented 1 year ago

(image attached in the original comment, not reproduced here)

koverstreet commented 1 year ago

Post your fio config

alkkkj commented 1 year ago

The above testing used the following command:

fio -direct=1 -iodepth=64 -rw=randwrite -ioengine=libaio -bs=4k -thread -numjobs=4 -group_reporting -filename=/mnt/bfs-1/f1-50g -name=RandWrite_4K_Testing

With numjobs set to 2, performance is better (image attached in the original comment, not reproduced here).

alkkkj commented 1 year ago

More testing results with the btree write buffer code merged:

btrfs:
fio -direct=1 -iodepth=1 -rw=randwrite -ioengine=libaio -bs=4k -thread -numjobs=1 -group_reporting -filename=/mnt/btrfs-1/f2-50g -name=RandWrite_4K_Testing -timeout=300
IOPS=36.1k, lat (usec): min=20, max=1399, avg=26.21, stdev=8.04

bcachefs:
fio -direct=1 -iodepth=1 -rw=randwrite -ioengine=libaio -bs=4k -thread -numjobs=1 -group_reporting -filename=/mnt/bfs-1/f2-50g -name=RandWrite_4K_Testing -timeout=300
IOPS=14.9k, lat (usec): min=36, max=14937, avg=65.09, stdev=187.14

bhzhu203 commented 1 year ago

Mounting a large HDD is fast when the FS is clean. But when the FS is not clean, after an unexpected shutdown, mounting is still slow - often 15 minutes on a 6T HDD. It is faster now than before (it used to take 30 minutes to an hour), but still not fast enough.

I know HDDs are poor at random reads...

Btrfs basically does not need fsck - could bcachefs work like that? Thanks @koverstreet

koverstreet commented 1 year ago

mount with -o verbose and post the log
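
For example (a sketch; the device and mount point are placeholders for the reporter's setup):

mount -t bcachefs -o verbose /dev/sda /mnt
dmesg | tail -n 200    # the verbose recovery steps show up in the kernel log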

bhzhu203 commented 1 year ago

> mount with -o verbose and post the log

Mounting hangs for a long time at "checking for deleted inodes" when mounting the large HDD:

[ 8676.448183] bcachefs: loading out-of-tree module taints kernel.
[ 8676.541734] bcachefs: bch2_fs_open() 
[ 8676.541742] bcachefs: bch2_read_super() 
[ 8676.542168] bcachefs: bch2_read_super() ret 0
[ 8676.542173] bcachefs: bch2_fs_alloc() 
[ 8676.573610] bcachefs: bch2_fs_journal_init() 
[ 8676.573654] bcachefs: bch2_fs_journal_init() ret 0
[ 8676.573784] bcachefs: bch2_fs_btree_cache_init() 
[ 8676.574229] bcachefs: bch2_fs_btree_cache_init() ret 0
[ 8676.576252] bcachefs: bch2_fs_encryption_init() 
[ 8676.576276] bcachefs: bch2_fs_encryption_init() ret 0
[ 8676.576283] bcachefs: __bch2_fs_compress_init() 
[ 8676.576309] bcachefs: __bch2_fs_compress_init() ret 0
[ 8676.576504] bcachefs: bch2_fs_fsio_init() 
[ 8676.577286] bcachefs: bch2_fs_fsio_init() ret 0
[ 8676.577290] bcachefs: bch2_dev_alloc() 
[ 8676.578187] bcachefs: bch2_dev_alloc() ret 0
[ 8676.581995] bcachefs: bch2_fs_alloc() ret 0
[ 8676.582501] bcachefs (sda): recovering from unclean shutdown
[ 8676.582529] bcachefs (sda): starting journal read
[ 8693.689501] bcachefs (sda): journal read done on device sda, ret 0
[ 8693.690201] bcachefs (sda): journal read done, replaying entries 95556109-95558024
[ 8693.862730] bcachefs (sda): starting alloc read
[ 8694.985790] bcachefs (sda): alloc read done
[ 8694.985798] bcachefs (sda): starting stripes_read
[ 8694.985817] bcachefs (sda): stripes_read done
[ 8694.985819] bcachefs (sda): reading snapshots table
[ 8694.985944] bcachefs (sda): reading snapshots done
[ 8694.985947] bcachefs (sda): starting journal replay, 9327 keys
[ 8694.987256] bcachefs (sda): going read-write
[ 8694.996949] ------------[ cut here ]------------
[ 8694.996952] Looking for class "b->c.lock" with key bch2_btree_node_lock_key [bcachefs], but found a different class "b->c.lock" with the same key
[ 8694.996987] WARNING: CPU: 7 PID: 23732 at kernel/locking/lockdep.c:937 look_up_lock_class+0x9c/0x110
[ 8694.996995] Modules linked in: bcachefs(O) mean_and_variance intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rfkill kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_rcore snd_hwdep aesni_intel crypto_simd snd_seq iTCO_wdt cryptd iTCO_vendor_support mxm_wmi snd_seq_device snd_pcm efi_pstore pcspkr i2c_i801 snd_timer i2c_smbus snd lpc_ich sg i2c_core soundcore mfd_core evdev wmi button ip_tables x_tables r8169 realtek mdio_devres libphy efivarfs autofs4
[ 8694.997047] CPU: 7 PID: 23732 Comm: kworker/7:0 Tainted: G           O       6.1.0-uksm+ #50 346e44c1ce9f02bb71dda964ff5209e1818a5f27
[ 8694.997050] Hardware name: INTEL X99/X99, BIOS 5.11 12/30/2020
[ 8694.997052] Workqueue: events_long bch2_do_discards_work [bcachefs]
[ 8694.997079] RIP: 0010:look_up_lock_class+0x9c/0x110
[ 8694.997083] Code: 17 48 81 fa 80 2c e5 82 74 52 80 3d ca 26 d5 00 00 75 49 48 c7 c7 40 a5 54 82 48 89 45 f0 c6 05 b6 26 d5 00 01 e8 e4 cf f9 ff <0f> 0b 48 8b 45 f0 eb 2a 89 75 f0 e8 b4 b3 a0 ff 8b 75 f0 48 c7 c7
[ 8694.997085] RSP: 0018:ffffc9000b2a35b8 EFLAGS: 00010096
[ 8694.997087] RAX: 0000000000000085 RBX: 0000000000000000 RCX: 0000000000000000
[ 8694.997089] RDX: 0000000000000001 RSI: ffffffff825b1ec8 RDI: 00000000ffffffff
[ 8694.997090] RBP: ffffc9000b2a35c8 R08: 0000000000013ffb R09: 00000000ffffbfff
[ 8694.997091] R10: ffff88903fca0000 R11: ffff88903fca0000 R12: 0000000000000000
[ 8694.997093] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 8694.997094] FS:  0000000000000000(0000) GS:ffff889033a00000(0000) knlGS:0000000000000000
[ 8694.997096] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8694.997097] CR2: 00007f48850a2000 CR3: 00000001300f0002 CR4: 00000000003706e0
[ 8694.997098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8694.997100] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8694.997101] Call Trace:
[ 8694.997102]  <TASK>
[ 8694.997105]  register_lock_class+0x36/0x7d0
[ 8694.997109]  ? __this_cpu_preempt_check+0x13/0x20
[ 8694.997115]  __lock_acquire+0x60/0x1cd0
[ 8694.997120]  lock_acquire+0xc0/0x2b0
[ 8694.997123]  ? bch2_btree_path_traverse_cached_slowpath+0xd37/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997157]  ? bch2_btree_path_traverse_cached_slowpath+0xd37/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997187]  six_trylock_ip_intent+0x5d/0x160
[ 8694.997190]  ? bch2_btree_path_traverse_cached_slowpath+0xd37/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997220]  bch2_btree_path_traverse_cached_slowpath+0xd4d/0x15a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997249]  ? bch2_btree_key_cache_find+0x226/0x3a0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997281]  bch2_btree_path_traverse_cached+0x1c0/0x240 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997310]  bch2_btree_path_traverse_one+0x5cc/0xd60 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997339]  ? bch2_do_discards_work+0x90a/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997366]  ? bch2_btree_path_verify+0x87/0xc0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997395]  bch2_btree_iter_peek_slot+0x12a/0x760 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997423]  ? bch2_printbuf_exit+0x1c/0x30 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997463]  ? bch2_trans_update_max_paths+0xd3/0xf2 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997507]  ? bch2_path_get+0x2b3/0x4f0 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997539]  ? bch2_do_discards_work+0x810/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997570]  bch2_do_discards_work+0x90a/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997608]  ? bch2_do_discards_work+0x42/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997640]  ? bch2_do_discards_work+0x810/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997675]  ? bch2_do_discards_work+0x15a/0xe50 [bcachefs ca89c72e16aef781524bbba62a3551a36a5bf6e2]
[ 8694.997710]  process_one_work+0x27a/0x570
[ 8694.997717]  worker_thread+0x52/0x3e0
[ 8694.997720]  ? process_one_work+0x570/0x570
[ 8694.997722]  kthread+0xf5/0x120
[ 8694.997725]  ? kthread_complete_and_exit+0x20/0x20
[ 8694.997730]  ret_from_fork+0x1f/0x30
[ 8694.997738]  </TASK>
[ 8694.997739] irq event stamp: 26167
[ 8694.997741] hardirqs last  enabled at (26167): [<ffffffff8131479a>] ___slab_alloc.constprop.0+0x91a/0x1090
[ 8694.997746] hardirqs last disabled at (26166): [<ffffffff81314c2f>] ___slab_alloc.constprop.0+0xdaf/0x1090
[ 8694.997749] softirqs last  enabled at (26132): [<ffffffffa074b645>] __bch2_btree_node_hash_insert+0x3a5/0x6e0 [bcachefs]
[ 8694.997779] softirqs last disabled at (26130): [<ffffffffa074b3e6>] __bch2_btree_node_hash_insert+0x146/0x6e0 [bcachefs]
[ 8694.997809] ---[ end trace 0000000000000000 ]---
[ 8701.223951] bcachefs (sda): journal replay done
[ 8701.223969] bcachefs (sda): checking for deleted inodes
koverstreet commented 1 year ago

Ok that's fixable - we need to add an index for deleted inodes.