neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.6k stars 423 forks source link

Cumulative stats lost during suspend of compute - thus auto vacuum and analyze thresholds are lost, too #6684

Open Bodobolero opened 8 months ago

Bodobolero commented 8 months ago

TLDR:

Autovacuum and analyze (creating statistics) database maintenance operations are important for performance. Without correct statistics we have suboptimal PostgreSQL plans. Without auto vacuum we have table bloat and dead tuples.

Postgres decides when to run those DBMS maintenance operations based on cumulative statistics. When Neon stateless suspends and restarts a compute the cumulative statistics are not preserved across the suspension. This causes missing runs of auto vacuum and analyze with negative impact on performance.

There is a separate RFC in progress https://github.com/neondatabase/neon/pull/6613 to preserve aux files including pg_stat (cumulative statistics).

I verified that auto vacuum is no longer interrupted by compute suspend once it is started and thus I verified https://github.com/neondatabase/neon/pull/6322 is working correctly, however without preserving pg_stat cumulative statistics this is only half solved.

Details

Auto-vacuum experiments

Concepts

Relevant documentation

https://www.postgresql.org/docs/16/routine-vacuuming.html https://www.postgresql.org/docs/16/routine-vacuuming.html#AUTOVACUUM

PostgreSQL has an optional but highly recommended feature called autovacuum, whose purpose is to automate the execution of VACUUM and ANALYZE commands. When enabled, autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples. These checks use the statistics collection facility; therefore, autovacuum cannot be used unless track_counts is set to true. In the default configuration, autovacuuming is enabled and the related configuration parameters are appropriately set.

https://www.postgresql.org/docs/16/monitoring-stats.html

https://www.postgresql.org/docs/16/runtime-config-statistics.html#GUC-TRACK-COUNTS

`track_counts (boolean)``

Enables collection of statistics on database activity. This parameter is on by default, because the autovacuum daemon needs the collected information. Only superusers and users with the appropriate SET privilege can change this setting.

Cumulative Query and Index Statistics:

When enabled, the data that is collected can be accessed via the pg_stat and pg_statio family of system views

https://www.postgresql.org/docs/16/monitoring.html

Reporting vacuum progress

https://www.postgresql.org/docs/16/progress-reporting.html#VACUUM-PROGRESS-REPORTING

The pg_stat_progress_vacuum view will contain one row for each backend (including autovacuum worker processes) that is currently vacuuming.

In Neon only superusers can see most of the columns and the users derived from role neon_superuser can not.

Logging vacuum

https://www.postgresql.org/docs/16/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT

log_autovacuum_min_duration (integer)

Causes each action executed by autovacuum to be logged if it ran for at least the specified amount of time. Set to 10 minutes by default in Neon.

Criteria when vacuum is started

Tables whose relfrozenxid value is more than autovacuum_freeze_max_age transactions old are always vacuumed (this also applies to those tables whose freeze max age has been modified via storage parameters; see below).

Otherwise, if the number of tuples obsoleted since the last VACUUM exceeds the “vacuum threshold”, the table is vacuumed. The vacuum threshold is defined as:

vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples

where the vacuum base threshold is autovacuum_vacuum_threshold, the vacuum scale factor is autovacuum_vacuum_scale_factor, and the number of tuples is pg_class.reltuples.

Neon defaults: vacuum base threshold: autovacuum_vacuum_threshold = 50 vacuum scale factor: autovacuum_vacuum_scale_factor = 0.1

In other words if 50 + 10 % of the tuples are obsoleted since the last vacuum, the table is vacuumed in Neon.

The table is also vacuumed if the number of tuples inserted since the last vacuum has exceeded the defined insert threshold, which is defined as:

vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples

Neon defaults: vacuum base insert threshold: autovacuum_vacuum_insert_threshold = 1000 vacuum insert scale factor: autovacuum_vacuum_insert_scale_factor = 0.2

In other words if the table grows by 1000 tuples plus 20 % since the last vacuum, the table is vacuumed in Neon.

This is important to mark blocks as all visible and tuples to be frozen. (Tuple can no longer be rolled back and is permanently visible to all transactions.)

The number of obsolete tuples and the number of inserted tuples are obtained from the cumulative statistics system; it is a semi-accurate count updated by each UPDATE, DELETE and INSERT operation. (It is only semi-accurate because some information might be lost under heavy load.)

If Neon loses the cumulative statistics during shutdown, the criteria for the next autovacuum run will be missing and autovacuum will not run unless there are enough obsoleted or inserted tuples since the compute was activated or the other criteria (Tables whose relfrozenxid value is more than autovacuum_freeze_max_age ) kicks in.

Analyze step

For analyze, a similar condition is used: the threshold, defined as:

analyze threshold = analyze base threshold + analyze scale factor * number of tuples

Neon defaults: analyze base threshold: autovacuum_analyze_threshold = 50 analyze scale factor: autovacuum_analyze_scale_factor = 0.1

is compared to the total number of tuples inserted, updated, or deleted since the last ANALYZE.

Which means if we lose cumulative stats also our regular stats will suffer because we don't run ANALYZE frequently enough.

Todos

Update https://github.com/neondatabase/neon/issues/6228 with these findings

Automate a test that verifies autovacuum is not suspended.

DDL

DROP TABLE pgbench_accounts;
CREATE TABLE pgbench_accounts (
    aid bigint NOT NULL,
    bid integer,
    abalance integer,
    filler character(84),
    -- create more blocks more quickly to make sure auto-vacuum has something to do
    text_column_plain TEXT  DEFAULT repeat('NeonIsCool', 5),
    jsonb_column_extended JSONB  DEFAULT ('{ "tell everyone": [' || repeat('{"Neon": "IsCool"},',9) || ' {"Neon": "IsCool"}]}')::jsonb
)
WITH (fillfactor='100');

DML

\! date
insert into pgbench_accounts(aid,bid,abalance,filler) select aid, (aid - 1) / 100000 + 1, 0, '' from generate_series(1, 20000000) as aid;
\! date

Sample project

psql "postgresql://peter:<secret>@ep-withered-block-62154334.us-east-2.aws.neon.build/neondb?sslmode=require"

Experiment

Measure auto-vacuum duration

Next step:

open two psql sessions in the one record time and run steps above in the other one monitor statement activity and auto_vacuum activity.

But we can not monitor for too long otherwise this will inhibit suspension during the real verifiation. In this first experiment we only want to see if autovacuum is running and for how long.

\! date
SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
       index_vacuum_count, max_dead_tuples, num_dead_tuples
FROM pg_stat_progress_vacuum;

This doesn't work on Neon because our users don't have permission to see all fields.

SELECT
  pg_stat_progress_vacuum.pid as pid,
  phase,
  heap_blks_total,
  heap_blks_scanned,
  heap_blks_vacuumed,
  index_vacuum_count,
  max_dead_tuples,
  num_dead_tuples,
  age(clock_timestamp(), xact_start) AS duration
FROM pg_stat_progress_vacuum
JOIN pg_stat_activity ON pg_stat_progress_vacuum.pid = pg_stat_activity.pid;

After the run we can check the last run of autovacuum

\! date
SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
       index_vacuum_count, max_dead_tuples, num_dead_tuples
FROM pg_stat_progress_vacuum;
SELECT
    relname AS table_name,
    last_autovacuum,
    last_autoanalyze
FROM
    pg_stat_user_tables
ORDER BY
    last_autovacuum DESC, last_autoanalyze DESC;
neondb=> \! date
Thu Feb  8 11:05:48 CET 2024
neondb=> insert into pgbench_accounts(aid,bid,abalance,filler) select aid, (aid - 1) / 100000 + 1, 0, '' from generate_series(1, 20000000) as aid;
\! date
INSERT 0 20000000
neondb=> \! date
Thu Feb  8 11:07:06 CET 2024

neondb=> \! date
Thu Feb  8 11:07:39 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid  | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 2984 |       |                 |                   |                    |                    |                 |                
(1 row)

Insert was running for 78 seconds.

Vacuum was running for - I could not determine because after suspension the pg_stat_user_tables stats are lost!

neondb->     last_autovacuum DESC, last_autoanalyze DESC;
    table_name    | last_autovacuum | last_autoanalyze 
------------------+-----------------+------------------
 pgbench_accounts |                 | 
(1 row)

So I have to repeat the experiment

neondb=> DROP TABLE pgbench_accounts;
DROP TABLE
neondb=> CREATE TABLE pgbench_accounts (
neondb(>     aid bigint NOT NULL,
neondb(>     bid integer,
neondb(>     abalance integer,
neondb(>     filler character(84),
neondb(>     -- create more blocks more quickly to make sure auto-vacuum has something to do
neondb(>     text_column_plain TEXT  DEFAULT repeat('NeonIsCool', 5),
neondb(>     jsonb_column_extended JSONB  DEFAULT ('{ "tell everyone": [' || repeat('{"Neon": "IsCool"},',9) || ' {"Neon": "IsCool"}]}')::jsonb
neondb(> )
neondb-> WITH (fillfactor='100');
CREATE TABLE
neondb=> \! date
Thu Feb  8 11:34:56 CET 2024
neondb=> insert into pgbench_accounts(aid,bid,abalance,filler) select aid, (aid - 1) / 100000 + 1, 0, '' from generate_series(1, 20000000) as aid;
\! date
INSERT 0 20000000
neondb=> \! date
Thu Feb  8 11:36:20 CET 2024

neondb=> \! date
Thu Feb  8 11:37:13 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid  | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 1885 |       |                 |                   |                    |                    |                 |                
(1 row)

neondb=> \! date
Thu Feb  8 11:50:37 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid  | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 1885 |       |                 |                   |                    |                    |                 |                
(1 row)

Time: 183.036 ms
neondb=> SELECT
neondb->     relname AS table_name,
neondb->     last_autovacuum,
neondb->     last_autoanalyze
neondb-> FROM
neondb->     pg_stat_user_tables
neondb-> ORDER BY
neondb->     last_autovacuum DESC, last_autoanalyze DESC;
    table_name    | last_autovacuum | last_autoanalyze 
------------------+-----------------+------------------
 pgbench_accounts |                 | 
(1 row)

Time: 126.325 ms
neondb=> \! date
Thu Feb  8 11:50:42 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
-----+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
(0 rows)

Time: 125.616 ms
neondb=> SELECT
neondb->     relname AS table_name,
neondb->     last_autovacuum,
neondb->     last_autoanalyze
neondb-> FROM
neondb->     pg_stat_user_tables
neondb-> ORDER BY
neondb->     last_autovacuum DESC, last_autoanalyze DESC;
    table_name    |        last_autovacuum        | last_autoanalyze 
------------------+-------------------------------+------------------
 pgbench_accounts | 2024-02-08 10:50:40.592862+00 | 
(1 row)

Time: 267.608 ms
neondb=> \! date
Thu Feb  8 11:50:53 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
-----+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
(0 rows)

Time: 124.550 ms
neondb=> SELECT
neondb->     relname AS table_name,
neondb->     last_autovacuum,
neondb->     last_autoanalyze
neondb-> FROM
neondb->     pg_stat_user_tables
neondb-> ORDER BY
neondb->     last_autovacuum DESC, last_autoanalyze DESC;
    table_name    |        last_autovacuum        | last_autoanalyze 
------------------+-------------------------------+------------------
 pgbench_accounts | 2024-02-08 10:50:40.592862+00 | 
(1 row)

Time: 210.696 ms
neondb=> \! date
Thu Feb  8 11:50:59 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
-----+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
(0 rows)

Time: 266.726 ms
neondb=> SELECT
neondb->     relname AS table_name,
neondb->     last_autovacuum,
neondb->     last_autoanalyze
neondb-> FROM
neondb->     pg_stat_user_tables
neondb-> ORDER BY
neondb->     last_autovacuum DESC, last_autoanalyze DESC;
    table_name    |        last_autovacuum        |       last_autoanalyze        
------------------+-------------------------------+-------------------------------
 pgbench_accounts | 2024-02-08 10:50:40.592862+00 | 2024-02-08 10:50:57.589918+00
(1 row)

Insert: 84 seconds

Auto-vacuum from 11:37:13 to 11:50:40 - 13 minutes and 27 seconds (807 seconds)

Which is definitively longer than the 300 seconds suspension time.

Now let's check what we still have in the catalog tables about this auto-vacuum run after suspension.

12:00:00:

neondb=> \! date
Thu Feb  8 12:06:49 CET 2024
neondb=> SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed, 
neondb->        index_vacuum_count, max_dead_tuples, num_dead_tuples
neondb-> FROM pg_stat_progress_vacuum;
 pid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
-----+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
(0 rows)

Time: 127.633 ms
neondb=> SELECT
neondb->     relname AS table_name,
neondb->     last_autovacuum,
neondb->     last_autoanalyze
neondb-> FROM
neondb->     pg_stat_user_tables
neondb-> ORDER BY
neondb->     last_autovacuum DESC, last_autoanalyze DESC;
    table_name    | last_autovacuum | last_autoanalyze 
------------------+-----------------+------------------
 pgbench_accounts |                 | 
(1 row)

Time: 239.205 ms

So we can no longer see the auto-vacuum run that we saw before!.

PostgreSQL log:

https://neonprod.grafana.net/d/000000039/neon-compute-metrics-by-endpoint-id?var-endpoint_id=ep-withered-block-62154334&var-env=dev&orgId=1&var-env_dev_to_staging=staging&var-cluster=HUNg6jvVk&var-console_db_source=be0eba8b-443e-40b0-971f-c0e236461248&var-logs_source=xHHYY0dVz&var-compute_id=All&var-mode=accessexclusivelock%7Caccesssharelock%7Cexclusivelock%7Crowexclusivelock%7Crowsharelock%7Csharelock%7Csharerowexclusivelock%7Cshareupdateexclusivelock&var-datname_exclude=postgres%7Ctemplate.*&var-interval=30s&var-search=vacuum&var-exclude=%28trying+to+connect%3A+tcp+connect%29%7C%28serving+%2Fstatus+GET+request%29%7C%28login+attempt%29%7C%28closing+because%29&from=1707388200000&to=1707390659000

PG:2024-02-08 10:50:40.592 GMT ttid=09cf17562444e0aa5ec5e0db9f8c2095/3cfd8e3cf8a3de0ed5bf422df9d4d9b4 [1885] LOG:  automatic vacuum of table "neondb.public.pgbench_accounts": index scans: 0​  pages: 0 removed, 1250000 remain, 1250000 scanned (100.00% of total)​   tuples: 0 removed, 20000000 remain, 0 are dead but not yet removable​   removable cutoff: 6150, which was 0 XIDs old when operation ended​  new relfrozenxid: 6149, which is 1 XIDs ahead of previous value​    frozen: 0 pages from table (0.00% of total) had 0 tuples frozen​    index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed​ avg read rate: 11.914 MB/s, avg write rate: 12.064 MB/s​    buffer usage: 1265954 hits, 1234450 misses, 1250043 dirtied​    WAL usage: 1250040 records, 81 full page images, 74093661 bytes​    system usage: CPU: user: 16.46 s, system: 46.48 s, elapsed: 809.50 s

Conclusions:

To verifiy that auto-vacuum is not suspended we can not rely on pg_stat_user_tables relation as it loses the information after suspension.

Instead we have to monitor the endpoint status in "operations" and check that suspension is not 5 minutes after our last SQL but around 800 seconds later.

Let's do that now: We run the experiment again, this time we disconnect after the insert and then watch the endpoint state in the admin console

Experiment 2

neondb=> \! date
Thu Feb  8 12:10:20 CET 2024
neondb=> insert into pgbench_accounts(aid,bid,abalance,filler) select aid, (aid - 1) / 100000 + 1, 0, '' from generate_series(1, 20000000) as aid;
\! date
\q
INSERT 0 20000000
neondb=> \! date
Thu Feb  8 12:11:41 CET 2024
neondb=> \q

At this time we have no other connections to the database.

https://console.stage.neon.tech/admin/regions/console/operations?endpoint_id=ep-withered-block-62154334&sort=updated_at&order=desc

2024-02-08T11:06:19.632Z start_compute

if suspend compute is around 12:16:14 CET then auto-vacuum has been suspended -> bug if suspend compute is around or later than 12:24:41 CET + 5 minutes(suspension wait time) then auto-vacuum has NOT been suspended -> success

2024-02-08T11:30:33.340Z suspend_compute (which is 12:30:33 CET)

==> success

We still need to understand the impact on the missing entries in pg_stat_user_tables and other missing cumulative statistics.

Verify default autovacuum settings

show autovacuum;

 autovacuum 
------------
 on
(1 row)

neondb=> show track_counts;
 track_counts 
--------------
 on
(1 row)

we have no relation specific vacuum settings

SELECT relname, reloptions FROM pg_class WHERE reloptions IS NOT NULL;
      relname       |       reloptions        
--------------------+-------------------------
 pg_stats           | {security_barrier=true}
 pg_stats_ext       | {security_barrier=true}
 pg_stats_ext_exprs | {security_barrier=true}
 pgbench_accounts   | {fillfactor=100}
(4 rows)
SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';
                 name                  |  setting  
---------------------------------------+-----------
 autovacuum                            | on
 autovacuum_analyze_scale_factor       | 0.1
 autovacuum_analyze_threshold          | 50
 autovacuum_freeze_max_age             | 200000000
 autovacuum_max_workers                | 3
 autovacuum_multixact_freeze_max_age   | 400000000
 autovacuum_naptime                    | 60
 autovacuum_vacuum_cost_delay          | 2
 autovacuum_vacuum_cost_limit          | -1
 autovacuum_vacuum_insert_scale_factor | 0.2
 autovacuum_vacuum_insert_threshold    | 1000
 autovacuum_vacuum_scale_factor        | 0.2
 autovacuum_vacuum_threshold           | 50
 autovacuum_work_mem                   | -1
(14 rows)

neondb=> SELECT name, setting FROM pg_settings WHERE name LIKE 'track%';
           name            | setting 
---------------------------+---------
 track_activities          | on
 track_activity_query_size | 1024
 track_commit_timestamp    | off
 track_counts              | on
 track_functions           | none
 track_io_timing           | off
 track_wal_io_timing       | off
(7 rows)
knizhnik commented 8 months ago

Persisting pgstat information is addressed in this PR: https://github.com/neondatabase/neon/pull/6560