neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: proper vacuuming for databases that are not always on #6228

Open kelvich opened 9 months ago

kelvich commented 9 months ago

Motivation

With scale to zero it is easy to end up in a situation where autovacuum never runs at all. Postgres autovacuum is designed to spread its work over a long period of time to avoid impacting the database, so for endpoints that are not always on we may never complete a vacuum. That results in slower performance and longer startup times (due to clog/multixact bloat).

DoD

We ensure that, one way or another, vacuum runs periodically on all branches. We should monitor that and define a metric that shows how severely a given database needs vacuuming. One example of such a metric is lastTransactionId - min(relfrozenxid). Or maybe we can pick something better.
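For reference, Postgres exposes essentially this metric as `age(relfrozenxid)` (e.g. `SELECT max(age(relfrozenxid)) FROM pg_class WHERE relfrozenxid <> 0`). A minimal Python sketch of the proposed metric, using modular arithmetic as a simplification of Postgres' signed 32-bit XID comparison (the `vacuum_urgency` helper and its threshold are illustrative, not existing code; `autovacuum_freeze_max_age` defaults to 200 million):

```python
# Sketch (not Neon's actual code): transaction ID "age" with 32-bit
# wraparound, approximating Postgres' age(relfrozenxid).
XID_MODULUS = 2**32

def xid_age(current_xid: int, relfrozenxid: int) -> int:
    """Number of transactions since relfrozenxid, modulo 32-bit wraparound."""
    return (current_xid - relfrozenxid) % XID_MODULUS

def vacuum_urgency(current_xid: int, min_relfrozenxid: int,
                   autovacuum_freeze_max_age: int = 200_000_000) -> float:
    """0.0 = freshly frozen; >= 1.0 = anti-wraparound vacuum is overdue."""
    return xid_age(current_xid, min_relfrozenxid) / autovacuum_freeze_max_age
```

A single scalar like this is easy to export per branch and to alert on as it approaches 1.0.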

Implementation ideas

The most straightforward approach is to treat the autovacuum process the same way as user connections in compute_ctl and keep updating the last_active_at metric while vacuum is running. That would prevent the control plane from suspending the endpoint, and we would be sure that vacuum completes. However, due to Postgres autovacuum settings, we would keep the compute running for an unnecessarily long time with autovacuum making very slow progress. We could change those settings, but then we would create problems for users who run their endpoints 24/7.

I can see a few options here. Let's imagine compute_ctl detected that there are no user queries right now and that there is a need to run vacuum (either by checking for the presence of an active autovacuum or by looking at the same metrics that autovacuum consults when it decides to start). Now we could:

The first approach looks better since it allows a swifter transition between slow-burning vacuum and all-in vacuum. But there are a few open questions:
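As a rough sketch of the mode switch compute_ctl could make (all names here are hypothetical, not the real compute_ctl API; the urgency threshold is an arbitrary placeholder):

```python
# Hypothetical decision logic for compute_ctl: pick a vacuum regime
# based on user activity and how badly the database needs vacuuming.
from dataclasses import dataclass

@dataclass
class EndpointState:
    active_user_queries: int
    autovacuum_running: bool
    vacuum_urgency: float  # e.g. max xid age / autovacuum_freeze_max_age

def choose_vacuum_mode(state: EndpointState,
                       urgency_threshold: float = 0.1) -> str:
    """Return which regime to apply.

    - "default": user load present, keep Postgres' slow-burning settings.
    - "aggressive": endpoint is idle but needs vacuum; drop the cost
      delay so vacuum finishes quickly and the endpoint can suspend.
    - "suspend": idle and no vacuum needed; allow scale to zero.
    """
    if state.active_user_queries > 0:
        return "default"
    if state.autovacuum_running or state.vacuum_urgency >= urgency_threshold:
        return "aggressive"
    return "suspend"
```

A user connection arriving mid-vacuum would flip the mode back to "default", i.e. restore the throttled settings.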

UI

It would be good to show the user the reason their endpoint is active right now. We could expose some time_spend_in_vacuum field in compute_ctl, and the UI could show that vacuum is in progress next to the latest start operation. When vacuum finishes, we can show the cumulative time it was running. That should also work in cases where our vacuum was interrupted by a user connection and the vacuum settings were rolled back to the defaults. It makes sense to count only excessive vacuuming time where we were using aggressive settings. In other words, we show the amount of compute time that we have burnt on vacuum without other user activity (and in theory we can subtract auto-suspend time).
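The accounting rule described above can be sketched as follows (an illustration of the proposed time_spend_in_vacuum semantics, not an implementation): only count wall-clock intervals where aggressive vacuum ran with no concurrent user activity.

```python
# Illustrative accounting for a proposed time_spend_in_vacuum metric:
# charge only intervals of aggressive vacuuming without user activity.
def excess_vacuum_seconds(intervals):
    """intervals: iterable of (start, end, aggressive, user_active) tuples,
    times in seconds. Returns total chargeable vacuum-only time."""
    total = 0.0
    for start, end, aggressive, user_active in intervals:
        if aggressive and not user_active:
            total += end - start
    return total
```

With this rule, a vacuum that overlaps a user session contributes nothing for the overlapping span, matching the "don't double-charge user activity" intent.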

### Tasks
- [ ] https://github.com/neondatabase/neon/issues/6684

Other related info

jcsp commented 9 months ago

Idea: we could combine this with handling of idle tenants in the rest of the stack, e.g. for an idle tenant pageserver would like to be woken up to GC the tenant after 7 days. We might want to schedule data integrity scrubs in future too: maybe an overall "scrub" control plane operation could fire up the tenant and then call into all layers of the stack with a general "please do whatever housekeeping you need to do" API hook.

MMeent commented 9 months ago

One issue with our scale-to-zero is that we drop some statistics used by the autovacuum daemon to determine whether a table should be vacuumed (specifically, all stats in the cumulative statistics system, as accessed through pgstat.h's definitions). This means that even if we shut down the node while the autovacuum daemon is not running, it will impact the performance of the system due to a loss of statistics-triggered automatic vacuums after the restart.

IMO, we should build a better solution for non-critical files than our current solution of "drop everything": a clean shutdown of a primary compute node could upload known statistics files, like the stats file at pg_stat/pgstat.stat and pg_stat_statements' data file at pg_stat/pg_stat_statements.stat, to S3. Additionally, this could also allow us to use pg_prewarm more aggressively by dumping the contents of shared_buffers and Neon's file cache and re-loading (most of) that data on startup (by dumping buffer descriptors, not the actual content, of course).
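A minimal sketch of the shutdown-side half of this idea, assuming the two file locations named above (the function and file list are illustrative; the real candidate set, and the S3 upload itself, are open design questions):

```python
# Sketch: on clean primary shutdown, collect the non-critical state
# files that would be worth uploading to S3 and restoring on startup.
from pathlib import Path

# Paths relative to the data directory; the exact set is undecided.
CANDIDATE_STATE_FILES = [
    "pg_stat/pgstat.stat",              # cumulative statistics snapshot
    "pg_stat/pg_stat_statements.stat",  # pg_stat_statements data
]

def collect_state_files(datadir: str) -> list:
    """Return the candidate files that actually exist in this data dir."""
    root = Path(datadir)
    return [root / rel for rel in CANDIDATE_STATE_FILES
            if (root / rel).is_file()]
```

The startup side would download the same files back into place before Postgres starts, so autovacuum and pg_stat_statements resume with their pre-suspend state.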

Example files which would be beneficial to keep around for a compute:

MMeent commented 9 months ago

C.f. an internal discussion about an issue where autovacuum didn't trigger which resulted in measurably different performance.

vadim2404 commented 8 months ago

@knizhnik merged the patch that disables compute suspension while vacuum is running

this week: to investigate the situation with stats file

kelvich commented 8 months ago

@vadim2404 @knizhnik did you read what is written in the issue? Please revert https://github.com/neondatabase/neon/commit/31a4eb40b2163b3f90439f820af541a7a5e4e970 asap, it will blow up production epically: by default autovacuum is configured to run slowly and sleep most of the time to avoid impacting the main load. In our case it will effectively disable scale to zero.

vadim2404 commented 8 months ago

I didn't notice any spikes in the number of active computes in different regions (we started with staging), and @Bodobolero also looked at this PR.

https://neonprod.grafana.net/d/e6971550-2dc3-4e31-aeb0-849ed06841f6/active-computes-by-region?orgId=1&from=now-24h&to=now&viewPanel=1

kelvich commented 8 months ago

Hm, I missed that it is on prod. OK, let's keep an eye on the number of active computes instead of rolling back now. But the point still holds -- vacuum may run for dozens of hours, there is no visibility into why it is running, etc. So all of the things mentioned in the issue are still needed.

vadim2404 commented 8 months ago

therefore we didn't close the issue

knizhnik commented 8 months ago

Autovacuum is definitely configured to run slowly. But as far as I know there are no artificial delays (i.e. sleeps...). It just gives up if it fails to acquire some lock.

I have never seen a Postgres instance where autovacuum was always active, especially when there is no database activity. The database suspend timeout is 5 minutes. If autovacuum is not able to complete in 5 minutes for a small database (and most of our databases are actually small), then something is wrong.

hlinnaka commented 8 months ago

But as far as I know there are no artificial delays (i.e. sleeps...).

There is, see autovacuum_vacuum_cost_delay and https://www.postgresql.org/docs/current/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-VACUUM-COST.

With a small database, it will still finish reasonably quickly with the default settings though, despite the throttling.
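The throttle those settings impose can be estimated with simple arithmetic. The defaults below are from recent Postgres (autovacuum_vacuum_cost_delay = 2ms since PG12, effective vacuum_cost_limit = 200, vacuum_cost_page_dirty = 20); verify against your server's settings. This is a ceiling from the sleeps alone, ignoring actual I/O time:

```python
# Back-of-the-envelope: pages/second vacuum can process under
# cost-based delay. Vacuum sleeps cost_delay_ms each time accumulated
# cost reaches cost_limit; cost_per_page=20 models all-dirtied pages
# (vacuum_cost_page_dirty), the worst case.
def vacuum_pages_per_second(cost_limit: int = 200,
                            cost_delay_ms: float = 2.0,
                            cost_per_page: int = 20) -> float:
    pages_per_cycle = cost_limit / cost_per_page
    cycles_per_second = 1000.0 / cost_delay_ms
    return pages_per_cycle * cycles_per_second
```

With these defaults that is 5000 dirtied pages/s, roughly 40 MB/s at 8 kB per page; under the pre-PG12 default delay of 20ms it drops to 500 pages/s (~4 MB/s), which is why a large table could keep an endpoint awake for hours.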

MMeent commented 8 months ago

Autovacuum is definitely configured to run slowly. But as far as I know there are no artificial delays (i.e. sleeps...).

It does have delays, see vacuum_delay_point() in src/backend/commands/vacuum.c, which is called from all the main scan loops of all access methods.

vadim2404 commented 8 months ago

@knizhnik to look into this issue

Bodobolero commented 8 months ago

@kelvich see update in https://github.com/neondatabase/neon/issues/6684

I verified that autovacuum is no longer interrupted by compute suspend once it has started, and thus I verified that https://github.com/neondatabase/neon/pull/6322 is working correctly. However, without preserving pg_stat cumulative statistics this is only half solved, and https://github.com/neondatabase/neon/pull/6613 is important in this context.

Bodobolero commented 8 months ago

We also have a problem with monitoring auto-vacuum in Neon https://github.com/neondatabase/neon/issues/6686

vadim2404 commented 7 months ago

It was partly released, and the other items are not so urgent.