Significant performance regression going from Postgres 13 → 14+

finestructure commented 2 weeks ago

I've been investigating a big run time regression in our test suite after switching the Postgres version in our docker container from 13 to 16, which I've subsequently narrowed down to happening when going from 13 to 14.

It's been suggested that this might be due to how PostgresNIO is adopting changes in Postgres 14, perhaps "pipelining support".

There are more details in the tracking issue on our end (the one linked above) which I'll briefly summarise here:

we're seeing a x3 run time regression running out test suite when using a Postgres 14+ container vs Postgres 13
running a pure SQL test script against different Postgres versions with a copy of our production database shows that Postgres 14+ are in fact faster than 13
this leads me to believe the problem isn't the Postgres version per se but something higher up the stack

Here are the figures from the issue:

SPI test suite, running locally on a Mac Studio:

Pg 13: 52s
Pg 14: 162s
Pg 16: 169s

There's a clear regression going from Postgres 13 → 14.

This is not the case when running plain SQL against those db versions (with copies of our production data, so it's a real world scenario):

Pg 13: 1.79s
Pg 14: 1.65s
Pg 16: 1.66s

I don't know if or how this could be an issue in PostgresNIO but it doesn't seem to be a db issue. Are there any other tests I could be running to help figure out where the slowdown could be occurring?

fabianfett commented 2 weeks ago

@finestructure Can you help me investigate this? Can you provide me a SQL query and a matching database that becomes slower with Postgres 14?

finestructure commented 2 weeks ago

I'll try and come up with something that involves less of our stack (and who knows, maybe I'll find the cause isn't PostgresNIO in the process 😅).

finestructure commented 1 week ago

So the good news is that a perf test I created here: https://github.com/finestructure/pg-perf-regression.git does not show any performance regression in PostgresNIO.

I've got more details in the original issue.

As I conclude there, I think that leaves us with two possible explanations for the x3 performance regression:

The SQL test load is significantly different from what we do in our tests, or at the very least we do something else that is not captured by my SQL load test.
There's an issue higher up the stack in Vapor or in our code that behaves dramatically differently depending on what Postgres version is being targeted.

I'll close this issue for now while I try to figure out which it is - or if it's something else entirely 😅

Any ideas what else I could try greatly appreciated!

finestructure commented 1 week ago

I believe I know why I couldn't reproduce the issue at first. I've got some details here and will try to create a reproducer that better reflects what we're doing.

Some details here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/3360#issuecomment-2348434200

finestructure commented 5 days ago

I'll start by saying that this issue is quite bizarre 😅 There seem to be multiple factors at play. For one, the issue seems to happen when running in an XCTest. That's why I couldn't reproduce it standalone at first.

There are also varying degrees of slowdown in an SQL query depending on how many times another SQL query ran before it.

Best I can tell this is not a measurement effect. I see this same slowdown in our actual tests.

I've set up a pure PostgresNIO project here with my tests that resemble our unit tests: https://github.com/finestructure/pg-perf-regression.git

Simply running

./run-perf-test.sh 13 test1
./run-perf-test.sh 13 test2
./run-perf-test.sh 13 test3
./run-perf-test.sh 14 test1
./run-perf-test.sh 14 test2
./run-perf-test.sh 14 test3

should yield all the figures. I've plotted my results here https://www.icloud.com/numbers/0945R88oKsGdVM5kY89hT7eMA#Issue_3360_perf_regression and am attaching screenshots below for easier viewing.

The core bit of the test is this bit here:

        // test1
        try await recreateDatabase(testDbName)
        try await recreateDatabase(testDbName)
        try await createSnapshot(original: testDbName, snapshot: snapshotName)

        // test2
        try await recreateDatabase(testDbName)
        try await createSnapshot(original: testDbName, snapshot: snapshotName)

        // test3
        try await createSnapshot(original: testDbName, snapshot: snapshotName)

The timing is only done within createSnapshot.

Observations are:

test1 runs in ~9ms on Pg13 vs ~60ms on Pg14 -> 6.6x slower
test2 runs in ~16ms on Pg13 vs ~60ms on Pg14 -> 3.9x slower
test3 runs in ~27ms on Pg13 vs ~67ms on Pg14 -> 2.5x slower

Please double check my work here! While I've run these tests and variants hundreds of times now, and have found them reproducible, it's easy to get stuck in a groove and overlook something obvious. I think I'm testing the right thing and in a sensible fashion, and it does reflect our real-world observations, but it could still well be that I'm timing the wrong thing or something like that.

It's important to note that this is not a micro-benchmark regressing. We're a seeing large (3x) real world increase in our test runtime to the degree that we keep testing with Pg 13 locally for now.

CleanShot 2024-09-16 at 13 02 02@2x

finestructure commented 4 days ago

The numbers change a little but show generally the same trends when running in release mode:

test1 2.4x slower 13 → 14
test2 1.8x slower 13 → 14
test3 1.3x slower 13 → 14

I've added the figures and graphs in new tabs in the numbers document: https://www.icloud.com/numbers/0945R88oKsGdVM5kY89hT7eMA#Issue_3360_perf_regression

JawadHF commented 2 days ago

There was a default change made to the vacuum default for cleaning up indexes that changed from postgres 13 to 14 (from ON to AUTO) detailed here which may be affecting your results. Check to see if the prescribed solution of adding INDEX_CLEANUP ON to the vacuum command helps.

finestructure commented 1 day ago

The latest test is essentially just running

DROP DATABASE IF EXISTS \(databaseName) WITH (FORCE)
CREATE DATABASE \(databaseName)

followed by

DROP DATABASE IF EXISTS \(snapshot) WITH (FORCE)
CREATE DATABASE \(snapshot) TEMPLATE \(original)

I wouldn't expect changes to VACUUM to have any effect here?

vapor / postgres-nio

Significant performance regression going from Postgres 13 → 14+ #506