Fsync costs when closing and initializing segments significantly affect WAL ingestion throughput. Increasing the segment size would amortize these costs.
Local experiments on my MacBook show that increasing the segment size from 16 MB to 128 MB yields a 200% improvement in throughput for large appends.
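Back-of-envelope (assuming, on my part, one fsync to close the old segment and one to initialize the new one): at the ~4 ms per write+fsync measured on the MacBook below, a rollover costs ~8 ms per segment, which amortizes to ~0.5 ms/MB over a 16 MB segment but only ~0.06 ms/MB over a 128 MB one, an 8x reduction in per-byte overhead.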
Benchmarks with 128 MB segments on a MacBook (compared to 16 MB segments):
wal_acceptor_throughput/fsync=true/commit=false/size=1024
time: [12.519 s 12.546 s 12.572 s]
thrpt: [81.448 MiB/s 81.618 MiB/s 81.796 MiB/s]
change:
time: [-8.5756% -7.9787% -7.5078%] (p = 0.00 < 0.05)
thrpt: [+8.1173% +8.6705% +9.3800%]
wal_acceptor_throughput/fsync=true/commit=false/size=8192
time: [2.0104 s 2.0257 s 2.0420 s]
thrpt: [501.48 MiB/s 505.50 MiB/s 509.34 MiB/s]
change:
time: [-37.287% -36.477% -35.645%] (p = 0.00 < 0.05)
thrpt: [+55.388% +57.423% +59.456%]
wal_acceptor_throughput/fsync=true/commit=false/size=131072
time: [580.04 ms 592.50 ms 606.46 ms]
thrpt: [1.6489 GiB/s 1.6878 GiB/s 1.7240 GiB/s]
change:
time: [-68.217% -67.331% -66.340%] (p = 0.00 < 0.05)
thrpt: [+197.08% +206.10% +214.63%]
wal_acceptor_throughput/fsync=true/commit=false/size=1048576
time: [636.88 ms 651.40 ms 668.31 ms]
thrpt: [1.4963 GiB/s 1.5352 GiB/s 1.5702 GiB/s]
change:
time: [-69.122% -68.119% -66.988%] (p = 0.00 < 0.05)
thrpt: [+202.92% +213.66% +223.86%]
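For context, these runs have the shape of a Criterion throughput benchmark roughly like the sketch below. This is my own minimal reconstruction, not the actual wal_acceptor_throughput harness: the write_wal helper, the temp-file path, the 64 MiB per-iteration volume, and the rotate-by-recreating-one-path simplification are all assumptions.

```rust
use std::fs::File;
use std::io::Write;

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

const SEGMENT_SIZE: u64 = 128 * 1024 * 1024; // 128 MB segments

/// Hypothetical stand-in for the safekeeper's WAL writer: appends fixed-size
/// records, fsyncing each append, and rotates to a fresh segment file every
/// `segment_size` bytes, with an extra fsync when a segment is closed.
fn write_wal(total: u64, append_size: u64, segment_size: u64, fsync: bool) {
    let path = std::env::temp_dir().join("wal_bench_segment");
    let buf = vec![0u8; append_size as usize];
    let mut file = File::create(&path).unwrap();
    let (mut written, mut in_segment) = (0u64, 0u64);
    while written < total {
        file.write_all(&buf).unwrap();
        if fsync {
            file.sync_all().unwrap(); // durable append
        }
        written += append_size;
        in_segment += append_size;
        if in_segment >= segment_size {
            file.sync_all().unwrap(); // close-time fsync we want to amortize
            file = File::create(&path).unwrap(); // simplification: reuse one path
            in_segment = 0;
        }
    }
}

fn wal_throughput(c: &mut Criterion) {
    let total: u64 = 64 * 1024 * 1024; // bytes written per iteration
    let mut group = c.benchmark_group("wal_acceptor_throughput");
    group.throughput(Throughput::Bytes(total)); // lets Criterion report MiB/s
    for append_size in [1024u64, 8192, 131072, 1048576] {
        group.bench_with_input(
            BenchmarkId::from_parameter(format!("fsync=true/commit=false/size={append_size}")),
            &append_size,
            |b, &size| b.iter(|| write_wal(total, size, SEGMENT_SIZE, true)),
        );
    }
    group.finish();
}

criterion_group!(benches, wal_throughput);
criterion_main!(benches);
```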
I ran the benchmark on an i4i.2xlarge instance (the current Safekeeper instance type). The local NVMe disk has a max throughput of 1.1 GB/s according to dd (including fsync).
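For reproducibility, here is a rough Rust equivalent of that dd measurement (sequential 1 MiB writes with a final fsync; the path and sizes are my choices, not what was actually run):

```rust
use std::fs::File;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let chunk = vec![0u8; 1 << 20]; // 1 MiB buffer
    let path = std::env::temp_dir().join("throughput_probe");
    let mut file = File::create(&path)?;
    let start = Instant::now();
    for _ in 0..1024 {
        file.write_all(&chunk)?; // 1 GiB of sequential writes
    }
    file.sync_all()?; // include the final flush in the measurement, like dd's conv=fsync
    println!("{:.0} MiB/s", 1024.0 / start.elapsed().as_secs_f64());
    Ok(())
}
```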
When increasing the segment size from 16 MB to 128 MB, we see only a marginal 8% improvement for large appends with fsync enabled:
wal_acceptor_throughput/fsync=true/commit=false/size=1024
time: [25.258 s 25.348 s 25.433 s]
thrpt: [40.262 MiB/s 40.398 MiB/s 40.542 MiB/s]
change:
time: [-2.4044% -1.8825% -1.3797%] (p = 0.00 < 0.05)
thrpt: [+1.3990% +1.9186% +2.4637%]
wal_acceptor_throughput/fsync=true/commit=false/size=8192
time: [4.2398 s 4.2792 s 4.3206 s]
thrpt: [237.01 MiB/s 239.30 MiB/s 241.52 MiB/s]
change:
time: [-4.4160% -3.0942% -1.7938%] (p = 0.00 < 0.05)
thrpt: [+1.8266% +3.1930% +4.6200%]
wal_acceptor_throughput/fsync=true/commit=false/size=131072
time: [1.3707 s 1.3994 s 1.4365 s]
thrpt: [712.85 MiB/s 731.76 MiB/s 747.08 MiB/s]
change:
time: [-6.3426% -4.0051% -1.0257%] (p = 0.01 < 0.05)
thrpt: [+1.0363% +4.1722% +6.7721%]
wal_acceptor_throughput/fsync=true/commit=false/size=1048576
time: [1.3187 s 1.3252 s 1.3319 s]
thrpt: [768.85 MiB/s 772.73 MiB/s 776.51 MiB/s]
change:
time: [-8.4178% -7.2095% -6.1468%] (p = 0.00 < 0.05)
thrpt: [+6.5494% +7.7696% +9.1915%]
The run with fsync disabled also saw minor improvements, but it's already saturating the hardware:
wal_acceptor_throughput/fsync=false/commit=false/size=1024
time: [24.855 s 24.958 s 25.061 s]
thrpt: [40.861 MiB/s 41.029 MiB/s 41.199 MiB/s]
change:
time: [-1.8436% -1.3008% -0.6921%] (p = 0.00 < 0.05)
thrpt: [+0.6969% +1.3180% +1.8783%]
wal_acceptor_throughput/fsync=false/commit=false/size=8192
time: [3.7415 s 3.7863 s 3.8349 s]
thrpt: [267.02 MiB/s 270.45 MiB/s 273.69 MiB/s]
change:
time: [-4.4309% -2.8678% -1.1727%] (p = 0.00 < 0.05)
thrpt: [+1.1866% +2.9524% +4.6363%]
wal_acceptor_throughput/fsync=false/commit=false/size=131072
time: [903.61 ms 920.39 ms 941.14 ms]
thrpt: [1.0625 GiB/s 1.0865 GiB/s 1.1067 GiB/s]
change:
time: [-2.4959% +0.4753% +3.9114%] (p = 0.78 > 0.05)
thrpt: [-3.7642% -0.4731% +2.5598%]
wal_acceptor_throughput/fsync=false/commit=false/size=1048576
time: [847.05 ms 852.62 ms 859.55 ms]
thrpt: [1.1634 GiB/s 1.1729 GiB/s 1.1806 GiB/s]
change:
time: [-7.3003% -5.4448% -3.8696%] (p = 0.00 < 0.05)
thrpt: [+4.0253% +5.7583% +7.8752%]
The large discrepancy compared with the macOS results (200% improvement) is mostly down to fsync latencies: on my MacBook running macOS, an 8-byte write+fsync takes 4.1 ms, while on the i4i.2xlarge running Linux it takes 0.1 ms.
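A minimal probe for that kind of latency measurement (my sketch, not necessarily the tool used for the numbers above) looks like:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("fsync_probe");
    let mut file = OpenOptions::new().create(true).append(true).open(&path)?;
    let iters = 1_000u32;
    let start = Instant::now();
    for _ in 0..iters {
        file.write_all(&[0u8; 8])?; // 8-byte append
        file.sync_all()?; // flush data and metadata to stable storage
    }
    println!(
        "avg write+fsync: {:.3} ms",
        start.elapsed().as_secs_f64() * 1000.0 / iters as f64
    );
    Ok(())
}
```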
However, even with 256 MB segments and 8 MB append sizes, we're still capping out at about 700 MB/s, well below the disk's 1.1 GB/s, so we may be hitting some other bottleneck. The segment size may matter more once that bottleneck is resolved, or on faster disks. Worth exploring further.
I ran these benchmarks on a Hetzner node with a local SSD as well; the results were similar.