timescale / promscale_extension

[DEPRECATED] Tables, types and functions supporting Promscale
Other
37 stars 17 forks source link

Understand slow upgrade process #306

Open JamesGuthrie opened 2 years ago

JamesGuthrie commented 2 years ago

In #274 a user provided a report of the upgrade from Promscale 0.10.0 to 0.11.0 taking ~60 minutes. This is longer than expected.

In this issue I will gather findings in trying to understand why it is slow.

JamesGuthrie commented 2 years ago

In analysing the logs provided to us by the user, we determined that it is unclear exactly why the upgrade process is slow. One hypothesis is that the background jobs which were running simultaneously to the upgrade are slowing down the upgrade process. The theory is that the background jobs take locks on tables which the upgrade process will also need to lock, so lock contention makes everything slow.

We know that the more objects the database contains, the longer the upgrade script will take.

Benchmarking setup

We set up a test system on an AWS m5.2xlarge with a 100GB gp3 volume with IOPS and throughput set to the max (16'000 IOPS and 1000MB/s).

On this system we installed the latest timescaledb (2.7.0), set up a database in it (test with superuser test), installed the timescaledb extension (no promscale extension), and started promscale 0.10.0 against it in -startup.only mode to get all SQL objects installed.

We then created 1000 metrics, each with 1 series, and for each series inserted 250 time series values with timestamps in the far future, such that each value ends up in a different chunk. The timestamp in the far future is to ensure that compression is not applied to the values. This results in 250'000 chunks, and ~500'000 objects in the database in total.

-- SQL script to create 250'000 chunks, each with 1 value
begin;
select _prom_catalog.get_or_create_metric_table_name(format('my_metric_%s', m))
from generate_series(1, 1000) m;
commit;

call _prom_catalog.finalize_metric_creation();

do $block$
    declare
        _metric text;
        _series_id bigint;
    begin
        for _metric in
            (
                select format('my_metric_%s', m)
                from generate_series(1, 1000) m
            )
            loop
                -- create 1 series per metric
                RAISE LOG 'creating series for %', _metric;
                select _prom_catalog.get_or_create_series_id(
                               format('{"__name__": "%s", "namespace":"dev", "node": "brain"}', _metric)::jsonb
                           ) into strict _series_id
                ;

                RAISE LOG 'inserting future data for %', _metric;

                -- in the future - not compressed
                execute format(
                        $$
        insert into prom_data.%I
        select
            '2035-01-01'::timestamptz + (interval '9 hour' * x),
            x + 0.1,
            %s
        from generate_series(1, 250) x
        $$, _metric, _series_id
                    );
                commit;
            end loop;
    end;
$block$;

Simple upgrade

We ran Promscale 0.11.0 against the database (this time with the promscale extension installed), and observed that the promscale background workers were not running. The following are the Promscale connector logs:

level=info ts=2022-05-25T07:53:34.780Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=info ts=2022-05-25T07:56:47.152Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
level=info ts=2022-05-25T07:59:51.759Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0
level=info ts=2022-05-25T07:59:51.896Z caller=client.go:85 msg="Migration successful, exiting"

We see that the "takeover" process takes ~3 minutes, and the 0.0.0 -> 0.5.0 upgrade process takes ~3 minutes, for a total of ~6 minutes to upgrade.

Upgrade with background job contention, but no compression

We reset the database to the previous "fresh" state (by making a file-system level copy of the data directory before running any benchmarks, and restoring this - literally:

We then restarted postgres, let the Promscale maintenance jobs run, and triggered the promscale 0.11.0 upgrade process. The resulting Promscale Connector logs can be seen here:

level=info ts=2022-05-25T08:18:41.255Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=info ts=2022-05-25T08:21:56.980Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
level=info ts=2022-05-25T08:25:03.766Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0
level=info ts=2022-05-25T08:25:03.898Z caller=client.go:85 msg="Migration successful, exiting"

Not much different to before.

These maintenance jobs were not doing any actual work, because all chunks are in the distant future.

JamesGuthrie commented 2 years ago

The next benchmark would like to determine the impact of having maintenance jobs running with actual compression work while the upgrade process runs.

In order to more accurately measure this, we require a dataset with more data per chunk, so that the compression process has some work to do.

This benchmark is on the same machine as before.

In order to produce the test data, we first disabled promscale maintenance jobs with:

SELECT alter_job(job_id, scheduled => false) FROM timescaledb_informations.jobs WHERE 'proc_name' = 'execute_maintenance_job';

and then ran the following script:

do $block$
    declare
        _metric text;
        _series_id bigint;
    begin
        for _metric in
            (
                select format('my_metric_%s', m)
                from generate_series(1, 1000) m
            )
            loop
                -- create 1 series per metric
                RAISE LOG 'creating series for %', _metric;
                select _prom_catalog.get_or_create_series_id(
                               format('{"__name__": "%s", "namespace":"dev", "node": "brain"}', _metric)::jsonb
                           ) into strict _series_id
                ;

                RAISE LOG 'inserting data for %', _metric;

                -- in the past - will be compressed by background job
                execute format(
                        $$
                        insert into prom_data.%I
                        select
                            '2022-01-01'::timestamptz + (interval '30 seconds' * x),
                            x + 0.1,
                            %s
                        from generate_series(1, 250000) x
                        $$, _metric, _series_id
                    );
                commit;
            end loop;
    end;
$block$;

This results in the following chunk and object counts:

test=# SELECT count(*) FROM timescaledb_information.chunks;
 count
--------
 261413
(1 row)
test=# SELECT count(*) AS locked_object_count
    FROM pg_class
    JOIN pg_namespace n ON n.oid = pg_class.relnamespace
    WHERE
          n.nspname <> 'pg_catalog'
      AND n.nspname !~ '^pg_toast'
      AND n.nspname <> 'information_schema'
;
 locked_object_count
---------------------
              534206
(1 row)

Baseline

As a baseline, we would like to understand how long the upgrade takes with this data set, and without benchmark contention.

The following logs show how long the upgrade process took:

level=info ts=2022-05-25T15:57:45.466Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=info ts=2022-05-25T16:01:09.550Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
level=info ts=2022-05-25T16:04:26.631Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0

With maintenance job contention

If we start the maintenance jobs before triggering the upgrade, we see the following in the promscale connector logs:

level=info ts=2022-05-25T16:46:20.713Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=error ts=2022-05-25T16:46:22.124Z caller=runner.go:116 msg="aborting startup due to error" err="migration error: error while installing promscale extension version 0.0.0: error installing Promscale extension at version 0.0.0: ERROR: deadlock detected (SQLSTATE 40P01)"

If we start the upgrade first, just before the maintenance jobs are going to trigger, we see the following:

level=info ts=2022-05-25T16:51:47.246Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=info ts=2022-05-25T16:54:24.630Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
level=info ts=2022-05-25T16:57:08.744Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0
JamesGuthrie commented 2 years ago

Upgrade during different phases of maintenance jobs

The promscale maintenance jobs go through two main phases:

  1. Applying retention policies
  2. Compressing data

It is possible that when triggering the upgrade in different phases, the contention behaviour could be different.

Upgrade triggered during retention

level=info ts=2022-05-26T07:34:02.556Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=info ts=2022-05-26T07:37:26.541Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
level=info ts=2022-05-26T07:40:41.023Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0

Upgrade triggered during compression

level=info ts=2022-05-26T08:40:51.582Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
level=info ts=2022-05-26T08:42:08.639Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
level=info ts=2022-05-26T08:43:53.484Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0