erikgrinaker opened this issue 2 days ago

Ingestion of logical messages (noops) should be able to saturate the hardware, especially when fsync is disabled. It doesn't. Why? See Slack thread and waltest.txt.
Notes from initial waltest tests (local, with fsync disabled):

- fio can do sustained 5 GB/s writes with large enough block sizes (1M).
- Increasing MAX_SEND_SIZE from 128 KB to 1 MB did not significantly improve throughput.

Interesting findings:

- Copying AppendRequest::wal_data before dispatching writes significantly reduced throughput (below 100 MB/s).
- I wonder if the delayed flushes due to the NoFlushAppendRequest batching cause delays in commit_lsn advancement, and whether this somehow affects the pipelining (similarly to the bandwidth-delay product). Will investigate; a rough sketch of the effect follows below.
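To put rough numbers on the bandwidth-delay-product analogy: if commit_lsn only advances when a flush happens, the amount of unacknowledged WAL in flight bounds the achievable throughput. A minimal sketch, with an assumed flush delay and target rate (both values are made up for illustration, not measurements):

```rust
/// Back-of-the-envelope bandwidth-delay-product check: if acknowledgement
/// (commit_lsn advancement) lags by `flush_delay`, the sender can keep at
/// most `window` bytes in flight, capping throughput at window / delay.
fn main() {
    // Assumed values for illustration only.
    let target_rate: f64 = 1_000_000_000.0; // want 1 GB/s of WAL ingest
    let flush_delay: f64 = 0.010; // commit_lsn advances 10 ms after the write

    // In-flight window needed to sustain the target rate despite the delay.
    let required_window = target_rate * flush_delay;
    println!("required in-flight window: {:.0} MB", required_window / 1e6);

    // Conversely, a fixed 1 MB window (one large send) with the same delay
    // caps throughput far below the target.
    let window: f64 = 1_000_000.0;
    println!("throughput with 1 MB window: {:.0} MB/s", window / flush_delay / 1e6);
}
```

If something like this is going on, raising MAX_SEND_SIZE alone would not help much unless flushes keep commit_lsn advancing, which would be consistent with the MAX_SEND_SIZE observation above.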
Results of the process_msg and wal_acceptor_throughput benchmarks from #9531, with fsync=false, on my local MacBook. This only measures payload throughput, and considers segment/page/record/message headers to be overhead.
process_msg/fsync=false/size=8
time: [8.4812 µs 8.5148 µs 8.5460 µs]
thrpt: [914.17 KiB/s 917.52 KiB/s 921.16 KiB/s]
process_msg/fsync=false/size=1024
time: [9.4852 µs 9.5408 µs 9.5931 µs]
thrpt: [101.80 MiB/s 102.36 MiB/s 102.96 MiB/s]
process_msg/fsync=false/size=8192
time: [15.714 µs 16.296 µs 16.824 µs]
thrpt: [464.36 MiB/s 479.41 MiB/s 497.17 MiB/s]
process_msg/fsync=false/size=131072
time: [147.03 µs 153.72 µs 160.07 µs]
thrpt: [780.93 MiB/s 813.19 MiB/s 850.15 MiB/s]
process_msg/fsync=false/size=1048576
time: [1.3300 ms 1.3565 ms 1.3812 ms]
thrpt: [723.99 MiB/s 737.21 MiB/s 751.90 MiB/s]
wal_acceptor_throughput/fsync=false/size=1024
time: [14.618 s 14.714 s 14.828 s]
thrpt: [69.058 MiB/s 69.591 MiB/s 70.049 MiB/s]
wal_acceptor_throughput/fsync=false/size=8192
time: [3.3334 s 3.3466 s 3.3611 s]
thrpt: [304.66 MiB/s 305.98 MiB/s 307.19 MiB/s]
wal_acceptor_throughput/fsync=false/size=131072
time: [1.9624 s 1.9828 s 2.0100 s]
thrpt: [509.46 MiB/s 516.44 MiB/s 521.82 MiB/s]
wal_acceptor_throughput/fsync=false/size=1048576
time: [1.9293 s 1.9515 s 1.9787 s]
thrpt: [517.51 MiB/s 524.74 MiB/s 530.75 MiB/s]
Some findings:

- process_msg caps out at ~800 MB/s. The disk can do 6 GB/s.
- There is a fixed cost of roughly 8 µs per message (see the size=8 results). Going from size=1024 to size=8192, the additional IO cost is also about 8 µs.
- Adding another variant that also commits records doesn't show any significant difference. We only flush the control file after each WAL segment, so that checks out.
process_msg/fsync=false/commit=false/size=8
time: [8.8677 µs 8.9545 µs 9.0712 µs]
thrpt: [861.24 KiB/s 872.46 KiB/s 881.01 KiB/s]
process_msg/fsync=false/commit=false/size=1024
time: [9.3791 µs 9.4232 µs 9.4647 µs]
thrpt: [103.18 MiB/s 103.63 MiB/s 104.12 MiB/s]
process_msg/fsync=false/commit=false/size=8192
time: [16.107 µs 16.785 µs 17.405 µs]
thrpt: [448.86 MiB/s 465.45 MiB/s 485.02 MiB/s]
process_msg/fsync=false/commit=false/size=131072
time: [153.33 µs 159.96 µs 166.16 µs]
thrpt: [752.29 MiB/s 781.47 MiB/s 815.22 MiB/s]
process_msg/fsync=false/commit=false/size=1048576
time: [1.3596 ms 1.3936 ms 1.4289 ms]
thrpt: [699.83 MiB/s 717.57 MiB/s 735.51 MiB/s]
process_msg/fsync=false/commit=true/size=8
time: [8.7410 µs 8.8093 µs 8.8824 µs]
thrpt: [879.55 KiB/s 886.85 KiB/s 893.78 KiB/s]
process_msg/fsync=false/commit=true/size=1024
time: [9.5369 µs 9.8461 µs 10.228 µs]
thrpt: [95.479 MiB/s 99.183 MiB/s 102.40 MiB/s]
process_msg/fsync=false/commit=true/size=8192
time: [15.810 µs 16.456 µs 17.041 µs]
thrpt: [458.44 MiB/s 474.75 MiB/s 494.14 MiB/s]
process_msg/fsync=false/commit=true/size=131072
time: [143.30 µs 154.28 µs 164.11 µs]
thrpt: [761.67 MiB/s 810.21 MiB/s 872.30 MiB/s]
process_msg/fsync=false/commit=true/size=1048576
time: [1.4108 ms 1.4383 ms 1.4624 ms]
thrpt: [683.79 MiB/s 695.29 MiB/s 708.80 MiB/s]
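As a rough sanity check on the shape of these numbers, a fixed-cost-plus-streaming model reproduces the small-write results fairly well. This is only a sketch; the 8 µs overhead and ~800 MB/s streaming rate are read off the measurements above, not constants in the code:

```rust
/// Rough model of process_msg: a fixed per-message overhead plus a
/// size-proportional streaming cost. Both constants are approximations
/// taken from the benchmark results above.
fn main() {
    let fixed_overhead_s = 8e-6; // ~8 µs per message
    let stream_rate_bps = 800e6; // ~800 MB/s sustained write rate

    for &size in &[8.0_f64, 1024.0, 8192.0, 131072.0, 1048576.0] {
        let time_s = fixed_overhead_s + size / stream_rate_bps;
        let mib_s = size / time_s / (1024.0 * 1024.0);
        println!("size={:>9} -> ~{:.0} MiB/s", size, mib_s);
    }
    // Predicts ~105 MiB/s at size=1024 and ~430 MiB/s at size=8192, close to
    // the measurements. At size=1048576 the fixed cost is under 1%, so the
    // ~750 MiB/s ceiling there must come from something else (see below).
}
```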
The fixed 8 µs cost is likely Tokio scheduling overhead; see flamegraph.svg. Tokio's file IO is not efficient, since it dispatches blocking writes to a separate thread pool. One option is to explore tokio-epoll-uring. Another is to increase batching, along with tokio::fs::File::set_max_buf_size, to amortize the scheduling cost across larger writes.
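A minimal sketch of the batching option, assuming appends are coalesced into one buffer before writing; the batching policy and the 1 MiB cap are illustrative assumptions, not the safekeeper's actual logic:

```rust
use std::io::Result;
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

/// Coalesce several append payloads into one buffer and write it with a
/// single large tokio write, raising the per-write buffer cap so the write
/// isn't chunked into small pieces on its way to the blocking pool.
async fn write_batched(file: &mut File, payloads: &[&[u8]]) -> Result<()> {
    // Let tokio pass up to 1 MiB per write to the blocking pool, so one
    // spawn_blocking round-trip covers the whole batch.
    file.set_max_buf_size(1024 * 1024);

    let mut batch = Vec::with_capacity(1024 * 1024);
    for p in payloads {
        batch.extend_from_slice(p);
    }
    file.write_all(&batch).await?;
    // Flush/fsync policy is unchanged; the point is only to amortize the
    // fixed per-write scheduling cost over more bytes.
    Ok(())
}
```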
Just to get a baseline, I added benchmarks for simple stdlib and Tokio writes. This confirms that the majority of the fixed 8 µs cost is just Tokio. However, it also shows that we should be able to saturate disks with both write paths at large enough write sizes, although the throughput measurements become highly variable beyond 1 GB/s. This is unlike the process_msg path, where we cap out at around 800 MB/s -- worth exploring. Could be the segment size.
file_write/stdlib/fsync=false/size=8
time: [1.1057 µs 1.1126 µs 1.1211 µs]
thrpt: [6.8055 MiB/s 6.8572 MiB/s 6.9002 MiB/s]
file_write/stdlib/fsync=false/size=1024
time: [1.6529 µs 1.6643 µs 1.6787 µs]
thrpt: [581.72 MiB/s 586.75 MiB/s 590.83 MiB/s]
file_write/stdlib/fsync=false/size=8192
time: [3.6048 µs 3.7885 µs 4.0441 µs]
thrpt: [1.8866 GiB/s 2.0139 GiB/s 2.1165 GiB/s]
file_write/stdlib/fsync=false/size=131072
time: [33.349 µs 40.166 µs 46.905 µs]
thrpt: [2.6025 GiB/s 3.0391 GiB/s 3.6604 GiB/s]
file_write/stdlib/fsync=false/size=1048576
time: [180.84 µs 206.91 µs 234.39 µs]
thrpt: [4.1665 GiB/s 4.7198 GiB/s 5.4003 GiB/s]
file_write/tokio/fsync=false/size=8
time: [7.4669 µs 7.5745 µs 7.6736 µs]
thrpt: [1018.1 KiB/s 1.0072 MiB/s 1.0218 MiB/s]
file_write/tokio/fsync=false/size=1024
time: [8.2489 µs 8.2887 µs 8.3289 µs]
thrpt: [117.25 MiB/s 117.82 MiB/s 118.39 MiB/s]
file_write/tokio/fsync=false/size=8192
time: [8.7655 µs 9.3938 µs 10.119 µs]
thrpt: [772.06 MiB/s 831.66 MiB/s 891.28 MiB/s]
file_write/tokio/fsync=false/size=131072
time: [23.198 µs 23.475 µs 23.824 µs]
thrpt: [5.1239 GiB/s 5.2001 GiB/s 5.2622 GiB/s]
file_write/tokio/fsync=false/size=1048576
time: [287.18 µs 340.62 µs 402.53 µs]
thrpt: [2.4260 GiB/s 2.8671 GiB/s 3.4005 GiB/s]
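For reference, a minimal version of the kind of stdlib-vs-Tokio write baseline described above; this is an illustrative stand-in using plain timing, not the actual benchmark code from #9531:

```rust
use std::io::Write;
use std::time::Instant;
use tokio::io::AsyncWriteExt;

/// Time N sequential writes of SIZE bytes through the stdlib and through
/// tokio::fs, without fsync, to separate Tokio overhead from disk cost.
fn main() -> std::io::Result<()> {
    const N: u32 = 10_000;
    const SIZE: usize = 8192;
    let buf = vec![0u8; SIZE];

    // stdlib path: direct blocking writes.
    let mut f = std::fs::File::create("/tmp/waltest-stdlib")?;
    let start = Instant::now();
    for _ in 0..N {
        f.write_all(&buf)?;
    }
    println!("stdlib: {:?} per {SIZE} B write", start.elapsed() / N);

    // Tokio path: the same writes through the async File wrapper.
    let rt = tokio::runtime::Runtime::new()?;
    rt.block_on(async {
        let mut f = tokio::fs::File::create("/tmp/waltest-tokio").await?;
        let start = Instant::now();
        for _ in 0..N {
            f.write_all(&buf).await?;
        }
        println!("tokio:  {:?} per {SIZE} B write", start.elapsed() / N);
        std::io::Result::Ok(())
    })?;
    Ok(())
}
```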
> Could be the segment size.

It's segments all right: