safekeeper: increase segment size #9687

Open erikgrinaker opened 2 weeks ago

Fsync costs when closing and initializing segments significantly affect WAL ingestion throughput. Increasing the segment size would amortize these costs.

Local experiments on my MacBook show that increasing the segment size from 16 MB to 128 MB yields a 200% improvement in throughput for large appends.
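
To make the mechanism concrete, here is a minimal sketch (not the actual safekeeper code; all names are illustrative) of a segment writer that pays two fixed fsync costs per rollover: one to durably close the old segment, and one after pre-allocating the new one. Larger segments mean fewer rollovers per byte ingested, which is where the amortization comes from.

use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::{Path, PathBuf};

struct SegmentWriter {
    dir: PathBuf,
    segment_size: u64, // 16 MB today; 128 MB proposed
    segno: u64,
    written: u64,
    file: File,
}

impl SegmentWriter {
    fn open_segment(dir: &Path, segno: u64, size: u64) -> std::io::Result<File> {
        let file = OpenOptions::new()
            .create(true)
            .write(true)
            .open(dir.join(format!("{segno:016X}.wal")))?;
        file.set_len(size)?; // pre-allocate the full segment
        file.sync_all()?; // fixed cost 1: fsync when initializing a segment
        Ok(file)
    }

    fn append(&mut self, buf: &[u8]) -> std::io::Result<()> {
        if self.written + buf.len() as u64 > self.segment_size {
            self.file.sync_all()?; // fixed cost 2: fsync when closing a segment
            self.segno += 1;
            self.written = 0;
            self.file = Self::open_segment(&self.dir, self.segno, self.segment_size)?;
        }
        self.file.write_all(buf)?;
        self.written += buf.len() as u64;
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("wal-demo");
    std::fs::create_dir_all(&dir)?;
    let segment_size = 16 * 1024 * 1024;
    let mut w = SegmentWriter {
        segment_size,
        segno: 0,
        written: 0,
        file: SegmentWriter::open_segment(&dir, 0, segment_size)?,
        dir,
    };
    // The fixed rollover costs are paid once per segment_size bytes appended.
    w.append(&[0u8; 8192])
}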

erikgrinaker commented 2 weeks ago

Benchmarks with 128 MB segments on a MacBook (compared to 16 MB segments):

wal_acceptor_throughput/fsync=true/commit=false/size=1024
                        time:   [12.519 s 12.546 s 12.572 s]
                        thrpt:  [81.448 MiB/s 81.618 MiB/s 81.796 MiB/s]
                 change:
                        time:   [-8.5756% -7.9787% -7.5078%] (p = 0.00 < 0.05)
                        thrpt:  [+8.1173% +8.6705% +9.3800%]
wal_acceptor_throughput/fsync=true/commit=false/size=8192
                        time:   [2.0104 s 2.0257 s 2.0420 s]
                        thrpt:  [501.48 MiB/s 505.50 MiB/s 509.34 MiB/s]
                 change:
                        time:   [-37.287% -36.477% -35.645%] (p = 0.00 < 0.05)
                        thrpt:  [+55.388% +57.423% +59.456%]
wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [580.04 ms 592.50 ms 606.46 ms]
                        thrpt:  [1.6489 GiB/s 1.6878 GiB/s 1.7240 GiB/s]
                 change:
                        time:   [-68.217% -67.331% -66.340%] (p = 0.00 < 0.05)
                        thrpt:  [+197.08% +206.10% +214.63%]
wal_acceptor_throughput/fsync=true/commit=false/size=1048576
                        time:   [636.88 ms 651.40 ms 668.31 ms]
                        thrpt:  [1.4963 GiB/s 1.5352 GiB/s 1.5702 GiB/s]
                 change:
                        time:   [-69.122% -68.119% -66.988%] (p = 0.00 < 0.05)
                        thrpt:  [+202.92% +213.66% +223.86%]
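
These are Criterion results; a run along the following lines should reproduce them, with the caveat that the exact --bench target name below is a guess at the repo layout rather than something confirmed here:

# Sketch: filter the Criterion benchmarks by the IDs shown above.
# The --bench target name is an assumption.
cargo bench --bench wal_acceptor_throughput -- 'fsync=true/commit=false'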
erikgrinaker commented 1 week ago

Ran the benchmark on an i4i.2xlarge instance (the current Safekeeper instance type). The local NVMe disk has a max throughput of 1.1 GB/s according to dd (including fsync).
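
Roughly the kind of dd measurement referenced above (the exact invocation isn't shown here; the output path and sizes are illustrative):

# conv=fsync makes dd flush the output file before reporting throughput,
# so the final fsync is included in the measurement.
dd if=/dev/zero of=/mnt/nvme/ddtest bs=1M count=4096 conv=fsync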

When increasing the segment size from 16 MB to 128 MB, we only see a marginal 8% improvement for large writes with fsync enabled:

wal_acceptor_throughput/fsync=true/commit=false/size=1024
                        time:   [25.258 s 25.348 s 25.433 s]
                        thrpt:  [40.262 MiB/s 40.398 MiB/s 40.542 MiB/s]
                 change:
                        time:   [-2.4044% -1.8825% -1.3797%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3990% +1.9186% +2.4637%]
wal_acceptor_throughput/fsync=true/commit=false/size=8192
                        time:   [4.2398 s 4.2792 s 4.3206 s]
                        thrpt:  [237.01 MiB/s 239.30 MiB/s 241.52 MiB/s]
                 change:
                        time:   [-4.4160% -3.0942% -1.7938%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8266% +3.1930% +4.6200%]
wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [1.3707 s 1.3994 s 1.4365 s]
                        thrpt:  [712.85 MiB/s 731.76 MiB/s 747.08 MiB/s]
                 change:
                        time:   [-6.3426% -4.0051% -1.0257%] (p = 0.01 < 0.05)
                        thrpt:  [+1.0363% +4.1722% +6.7721%]
wal_acceptor_throughput/fsync=true/commit=false/size=1048576
                        time:   [1.3187 s 1.3252 s 1.3319 s]
                        thrpt:  [768.85 MiB/s 772.73 MiB/s 776.51 MiB/s]
                 change:
                        time:   [-8.4178% -7.2095% -6.1468%] (p = 0.00 < 0.05)
                        thrpt:  [+6.5494% +7.7696% +9.1915%]

The run with fsync disabled also saw minor improvements, but it's already saturating the hardware:

wal_acceptor_throughput/fsync=false/commit=false/size=1024
                        time:   [24.855 s 24.958 s 25.061 s]
                        thrpt:  [40.861 MiB/s 41.029 MiB/s 41.199 MiB/s]
                 change:
                        time:   [-1.8436% -1.3008% -0.6921%] (p = 0.00 < 0.05)
                        thrpt:  [+0.6969% +1.3180% +1.8783%]
wal_acceptor_throughput/fsync=false/commit=false/size=8192
                        time:   [3.7415 s 3.7863 s 3.8349 s]
                        thrpt:  [267.02 MiB/s 270.45 MiB/s 273.69 MiB/s]
                 change:
                        time:   [-4.4309% -2.8678% -1.1727%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1866% +2.9524% +4.6363%]
wal_acceptor_throughput/fsync=false/commit=false/size=131072
                        time:   [903.61 ms 920.39 ms 941.14 ms]
                        thrpt:  [1.0625 GiB/s 1.0865 GiB/s 1.1067 GiB/s]
                 change:
                        time:   [-2.4959% +0.4753% +3.9114%] (p = 0.78 > 0.05)
                        thrpt:  [-3.7642% -0.4731% +2.5598%]
wal_acceptor_throughput/fsync=false/commit=false/size=1048576
                        time:   [847.05 ms 852.62 ms 859.55 ms]
                        thrpt:  [1.1634 GiB/s 1.1729 GiB/s 1.1806 GiB/s]
                 change:
                        time:   [-7.3003% -5.4448% -3.8696%] (p = 0.00 < 0.05)
                        thrpt:  [+4.0253% +5.7583% +7.8752%]

The large discrepancy compared with the macOS results (200% improvement) is mostly down to fsync latencies. On my MacBook running macOS, an 8-byte write+fsync takes 4.1 ms; on the i4i.2xlarge running Linux, it takes 0.1 ms.
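
Those latency numbers come from a trivial write+fsync loop; a minimal sketch of that kind of measurement (not the exact harness used, and the file path is arbitrary):

use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("fsync-latency-test.dat")?; // illustrative path
    let iters: u32 = 100;
    let start = Instant::now();
    for _ in 0..iters {
        file.write_all(&[0u8; 8])?; // 8-byte append
        file.sync_all()?; // fsync after every write
    }
    println!("{:?} per write+fsync", start.elapsed() / iters);
    Ok(())
}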

However, even with 256 MB segments and 8 MB append sizes, we're still capping out at 700 MB/s, so we may be hitting some other bottleneck. Maybe the segment size will matter more once we resolve that bottleneck, or on a faster disk. Worth exploring further.

I ran these benchmarks on a Hetzner node with a local SSD as well; the results were similar.