Investigate WAL ingest performance for logical messages - Githubissues

neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

https://neon.tech

Apache License 2.0

14.98k stars 436 forks source link

Investigate WAL ingest performance for logical messages #9642

Open erikgrinaker opened 2 days ago

erikgrinaker commented 2 days ago

Ingestion of logical messages (noops) should be able to saturate the hardware, especially when fsync is disabled. It doesn't. Why?

See Slack thread and waltest.txt.

erikgrinaker commented 2 days ago

Notes from initial waltest tests (local with fsync disabled):

WAL throughput caps out at ~300 MB/s, with ~1.2 GB/s disk throughput.
fio can do sustained 5 GB/s writes with large enough block sizes (1M).
Each append request results in an individual Safekeeper disk write.
Each commit advancement results in a control file write and rename.
Increasing logical messages from 10 KB to 1 MB did not significantly improve throughput.
Increasing MAX_SEND_SIZE from 128 KB to 1 MB did not significantly improve throughput.

Interesting findings:

Buffering/batching of AppendRequest::wal_data before dispatching writes significantly reduced throughput (below 100 MB/s).
Disabling flush batching in WAL receiver significantly improved throughput to ~400 MB/s (i.e. don't use NoFlushAppendRequest).

I wonder if the delayed flushes due to the NoFlushAppendRequest batching causes delays in commit_lsn advancement, and this somehow affects the pipelining (similarly to the bandwidth-delay product). Will investigate.

erikgrinaker commented 1 day ago

Results of the process_msg and wal_acceptor_throughput benchmarks from #9531 with fsync=false on my local MacBook. This only measures payload throughput, and considers segment/page/record/message headers to be overhead.

process_msg/fsync=false/size=8
                        time:   [8.4812 µs 8.5148 µs 8.5460 µs]
                        thrpt:  [914.17 KiB/s 917.52 KiB/s 921.16 KiB/s]
process_msg/fsync=false/size=1024
                        time:   [9.4852 µs 9.5408 µs 9.5931 µs]
                        thrpt:  [101.80 MiB/s 102.36 MiB/s 102.96 MiB/s]
process_msg/fsync=false/size=8192
                        time:   [15.714 µs 16.296 µs 16.824 µs]
                        thrpt:  [464.36 MiB/s 479.41 MiB/s 497.17 MiB/s]
process_msg/fsync=false/size=131072
                        time:   [147.03 µs 153.72 µs 160.07 µs]
                        thrpt:  [780.93 MiB/s 813.19 MiB/s 850.15 MiB/s]
process_msg/fsync=false/size=1048576
                        time:   [1.3300 ms 1.3565 ms 1.3812 ms]
                        thrpt:  [723.99 MiB/s 737.21 MiB/s 751.90 MiB/s]

wal_acceptor_throughput/fsync=false/size=1024
                        time:   [14.618 s 14.714 s 14.828 s]
                        thrpt:  [69.058 MiB/s 69.591 MiB/s 70.049 MiB/s]
wal_acceptor_throughput/fsync=false/size=8192
                        time:   [3.3334 s 3.3466 s 3.3611 s]
                        thrpt:  [304.66 MiB/s 305.98 MiB/s 307.19 MiB/s]
wal_acceptor_throughput/fsync=false/size=131072
                        time:   [1.9624 s 1.9828 s 2.0100 s]
                        thrpt:  [509.46 MiB/s 516.44 MiB/s 521.82 MiB/s]
wal_acceptor_throughput/fsync=false/size=1048576
                        time:   [1.9293 s 1.9515 s 1.9787 s]
                        thrpt:  [517.51 MiB/s 524.74 MiB/s 530.75 MiB/s]

Some findings:

Raw process_msg caps out at ~800 MB/s. The disk can do 6 GB/s.
There is a fixed cost of ~8 µs, or 125,000 messages/s.
- At 8 KB/message, that's about 1 GB/s.
- Comparing size=1024 to size=8192, the additional IO cost is also about 8 µs.
- The IO cost appears to scale linearly at about 1.2 µs/KB.
- At 8 KB writes, optimizing or batching away the fixed cost would provide at most a 2x throughput improvement.
The WAL acceptor loop reduces throughput from ~800 MB/s to ~500 MB/s, or ~40%, even at large message sizes. Why?

erikgrinaker commented 1 day ago

Adding another variant that also commits records doesn't show any significant difference. We only flush the control file after each WAL segment, so that checks out.

https://github.com/neondatabase/neon/blob/e287f36a058221b7c804b4b0f440933962eb3deb/safekeeper/src/safekeeper.rs#L931-L938

process_msg/fsync=false/commit=false/size=8
                        time:   [8.8677 µs 8.9545 µs 9.0712 µs]
                        thrpt:  [861.24 KiB/s 872.46 KiB/s 881.01 KiB/s]
process_msg/fsync=false/commit=false/size=1024
                        time:   [9.3791 µs 9.4232 µs 9.4647 µs]
                        thrpt:  [103.18 MiB/s 103.63 MiB/s 104.12 MiB/s]
process_msg/fsync=false/commit=false/size=8192
                        time:   [16.107 µs 16.785 µs 17.405 µs]
                        thrpt:  [448.86 MiB/s 465.45 MiB/s 485.02 MiB/s]
process_msg/fsync=false/commit=false/size=131072
                        time:   [153.33 µs 159.96 µs 166.16 µs]
                        thrpt:  [752.29 MiB/s 781.47 MiB/s 815.22 MiB/s]
process_msg/fsync=false/commit=false/size=1048576
                        time:   [1.3596 ms 1.3936 ms 1.4289 ms]
                        thrpt:  [699.83 MiB/s 717.57 MiB/s 735.51 MiB/s]

process_msg/fsync=false/commit=true/size=8
                        time:   [8.7410 µs 8.8093 µs 8.8824 µs]
                        thrpt:  [879.55 KiB/s 886.85 KiB/s 893.78 KiB/s]
process_msg/fsync=false/commit=true/size=1024
                        time:   [9.5369 µs 9.8461 µs 10.228 µs]
                        thrpt:  [95.479 MiB/s 99.183 MiB/s 102.40 MiB/s]
process_msg/fsync=false/commit=true/size=8192
                        time:   [15.810 µs 16.456 µs 17.041 µs]
                        thrpt:  [458.44 MiB/s 474.75 MiB/s 494.14 MiB/s]
process_msg/fsync=false/commit=true/size=131072
                        time:   [143.30 µs 154.28 µs 164.11 µs]
                        thrpt:  [761.67 MiB/s 810.21 MiB/s 872.30 MiB/s]
process_msg/fsync=false/commit=true/size=1048576
                        time:   [1.4108 ms 1.4383 ms 1.4624 ms]
                        thrpt:  [683.79 MiB/s 695.29 MiB/s 708.80 MiB/s]

erikgrinaker commented 1 day ago

The fixed 8 µs cost is likely Tokio scheduling overhead, see flamegraph.svg. Tokio's file IO is not efficient. One option is to explore tokio-epoll-uring. Another is to increase batching along with tokio::fs::File::set_max_buf_size to amortize scheduling costs.

erikgrinaker commented 1 day ago

Just to get a baseline, I added benchmarks for simple stdlib and Tokio writes. This confirms that the majority of the fixed 8 µs cost is just Tokio. However, it also shows that we should be able to saturate disks with both write paths at large enough write sizes, although the throughput is highly unreliable beyond 1 GB/s. This is unlike the process_msg path, where we cap out at around 800 MB/s -- worth exploring. Could be the segment size.

file_write/stdlib/fsync=false/size=8
                        time:   [1.1057 µs 1.1126 µs 1.1211 µs]
                        thrpt:  [6.8055 MiB/s 6.8572 MiB/s 6.9002 MiB/s]
file_write/stdlib/fsync=false/size=1024
                        time:   [1.6529 µs 1.6643 µs 1.6787 µs]
                        thrpt:  [581.72 MiB/s 586.75 MiB/s 590.83 MiB/s]
file_write/stdlib/fsync=false/size=8192
                        time:   [3.6048 µs 3.7885 µs 4.0441 µs]
                        thrpt:  [1.8866 GiB/s 2.0139 GiB/s 2.1165 GiB/s]
file_write/stdlib/fsync=false/size=131072
                        time:   [33.349 µs 40.166 µs 46.905 µs]
                        thrpt:  [2.6025 GiB/s 3.0391 GiB/s 3.6604 GiB/s]
file_write/stdlib/fsync=false/size=1048576
                        time:   [180.84 µs 206.91 µs 234.39 µs]
                        thrpt:  [4.1665 GiB/s 4.7198 GiB/s 5.4003 GiB/s]

file_write/tokio/fsync=false/size=8
                        time:   [7.4669 µs 7.5745 µs 7.6736 µs]
                        thrpt:  [1018.1 KiB/s 1.0072 MiB/s 1.0218 MiB/s]
file_write/tokio/fsync=false/size=1024
                        time:   [8.2489 µs 8.2887 µs 8.3289 µs]
                        thrpt:  [117.25 MiB/s 117.82 MiB/s 118.39 MiB/s]
file_write/tokio/fsync=false/size=8192
                        time:   [8.7655 µs 9.3938 µs 10.119 µs]
                        thrpt:  [772.06 MiB/s 831.66 MiB/s 891.28 MiB/s]
file_write/tokio/fsync=false/size=131072
                        time:   [23.198 µs 23.475 µs 23.824 µs]
                        thrpt:  [5.1239 GiB/s 5.2001 GiB/s 5.2622 GiB/s]
file_write/tokio/fsync=false/size=1048576
                        time:   [287.18 µs 340.62 µs 402.53 µs]
                        thrpt:  [2.4260 GiB/s 2.8671 GiB/s 3.4005 GiB/s]

For completeness, here are `fsync=true` results (expand) -- as expected, Tokio and stdlib are roughly equal here.

``` file_write/stdlib/fsync=true/size=8 time: [4.2782 ms 4.3225 ms 4.3734 ms] thrpt: [1.7864 KiB/s 1.8074 KiB/s 1.8261 KiB/s] file_write/stdlib/fsync=true/size=1024 time: [4.3500 ms 4.3826 ms 4.4172 ms] thrpt: [226.39 KiB/s 228.18 KiB/s 229.89 KiB/s] file_write/stdlib/fsync=true/size=8192 time: [4.7115 ms 4.7774 ms 4.8438 ms] thrpt: [1.6129 MiB/s 1.6353 MiB/s 1.6582 MiB/s] file_write/stdlib/fsync=true/size=131072 time: [4.8032 ms 4.8609 ms 4.9180 ms] thrpt: [25.417 MiB/s 25.715 MiB/s 26.024 MiB/s] file_write/stdlib/fsync=true/size=1048576 time: [5.1204 ms 5.1468 ms 5.1752 ms] thrpt: [193.23 MiB/s 194.30 MiB/s 195.30 MiB/s] file_write/tokio/fsync=true/size=8 time: [4.1335 ms 4.1950 ms 4.2764 ms] thrpt: [1.8269 KiB/s 1.8623 KiB/s 1.8901 KiB/s] file_write/tokio/fsync=true/size=1024 time: [4.2537 ms 4.3002 ms 4.3485 ms] thrpt: [229.97 KiB/s 232.55 KiB/s 235.09 KiB/s] file_write/tokio/fsync=true/size=8192 time: [4.9080 ms 4.9594 ms 5.0100 ms] thrpt: [1.5594 MiB/s 1.5753 MiB/s 1.5918 MiB/s] file_write/tokio/fsync=true/size=131072 time: [4.9253 ms 4.9811 ms 5.0378 ms] thrpt: [24.812 MiB/s 25.095 MiB/s 25.379 MiB/s] file_write/tokio/fsync=true/size=1048576 time: [5.2555 ms 5.2914 ms 5.3267 ms] thrpt: [187.73 MiB/s 188.99 MiB/s 190.28 MiB/s] ```

erikgrinaker commented 1 day ago

Could be the segment size.

It's segments all right: