timescale / promscale_extension

[DEPRECATED] Tables, types and functions supporting Promscale

Missing upgrade script for 0.3.2 to 0.5.0 #274

Open · tomhughes opened 2 years ago

tomhughes commented 2 years ago

Promscale 0.11.0 is refusing to start, claiming the extension is too old even though I have 0.5.0 installed. Trying to update it manually fails:

promscale=# alter extension promscale update;
ERROR:  extension "promscale" has no update path from version "0.3.2" to version "0.5.0"

and the package does indeed appear to be missing an SQL script to do the update, as does this repository.
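
For reference, PostgreSQL itself can list which upgrade routes it knows for an extension. This standard catalog query (not Promscale-specific) confirms whether any path from the installed version exists:

-- List every upgrade path PostgreSQL can derive from the shipped SQL scripts;
-- a NULL path for a source/target pair means no route exists.
SELECT source, target, path
FROM pg_extension_update_paths('promscale')
WHERE source = '0.3.2'
ORDER BY target;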

sumerman commented 2 years ago

Hi Tom, which one of the packages did you use to install the extension?

The SQL script is not present in the repository, because it’s generated in the build process.

tomhughes commented 2 years ago

The deb from the timescaledb repository at https://packagecloud.io/timescale/timescaledb/ubuntu.

tomhughes commented 2 years ago

The package from GitHub doesn't seem to have it either:

% dpkg --contents promscale-extension-0.5.0.pg14.debian11.x86_64.deb
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/lib/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/lib/postgresql/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/lib/postgresql/14/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/lib/postgresql/14/lib/
-rwxr-xr-x 0/0         1267968 2022-05-12 01:21 ./usr/lib/postgresql/14/lib/promscale-0.5.0.so
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/share/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/14/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/
-rw-r--r-- 0/0           11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.2.sql
-rw-r--r-- 0/0           48512 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.0.0.sql
-rw-r--r-- 0/0           11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.3-beta.sql
-rw-r--r-- 0/0           11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.1.sql
-rw-r--r-- 0/0             535 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1--0.1.1.sql
-rw-r--r-- 0/0            9256 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.sql
-rw-r--r-- 0/0            9316 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.2.sql
-rw-r--r-- 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.0--0.3.1.sql
-rw-r--r-- 0/0          486924 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.5.0.sql
-rw-r--r-- 0/0            2891 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.1--0.3.2.sql
-rw-r--r-- 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.2.0--0.3.0.sql
-rw-r--r-- 0/0          486924 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.0.0--0.5.0.sql
-rw-r--r-- 0/0           11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.2.0.sql
-rw-r--r-- 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.3-beta--0.2.0.sql
-rw-r--r-- 0/0               0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.1--0.1.2.sql
-rw-r--r-- 0/0            2950 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.2--0.1.3-beta.sql
-rw-r--r-- 0/0             375 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale.control
-rw-r--r-- 0/0            9316 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.1.sql
-rw-r--r-- 0/0           11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.0.sql
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/share/doc/
drwxr-xr-x 0/0               0 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/
-rw-r--r-- 0/0             163 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/changelog.Debian.gz
-rw-r--r-- 0/0             394 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/changelog.gz
-rw-r--r-- 0/0             494 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/copyright
sumerman commented 2 years ago

The file layout is correct. There's no direct path between 0.3.x and 0.5.0 because it is a special case: the promscale binary needs to execute the upgrade procedure 0.3.2 -> 0.0.0 -> 0.5.0, but it only does so when the -startup.upgrade-extensions flag is not set to false. Could you check the flags you start promscale with?
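
As a sanity check before retrying, the installed extension version can be compared against the packaged default with a standard catalog query:

-- installed_version is what the database currently runs;
-- default_version comes from the promscale.control file on disk.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'promscale';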

tomhughes commented 2 years ago

Ah, I only looked at the last line of the log, where it said the version was wrong, and then went and tried to upgrade it manually. But I see that there was a different error on the line before:

May 19 20:52:48 stormfly-03.openstreetmap.org systemd[1]: Started Promscale Connector.
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: 2022/05/19 20:52:48 maxprocs: Leaving GOMAXPROCS=56: CPU quota undefined
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=info ts=2022-05-19T20:52:48.600Z caller=runner.go:75 msg="Version: 0.11.0, Commit Hash: f49197aac5f58755942fcbeb673271b7def122b2, Branch: HEAD"
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=info ts=2022-05-19T20:52:48.600Z caller=runner.go:80 config="{ListenAddr::9201 ThanosStoreAPIListenAddr: OTLPGRPCListenAddr::9202 PgmodelCfg:{CacheConfig:{SeriesCacheInitialSize:250000 seriesCacheMemoryMaxFlag:{kind:0 value:50} SeriesCacheMemoryMaxBytes:81079404134 MetricsCacheSize:10000 LabelsCacheSize:10000 ExemplarKeyPosCacheSize:10000} AppName:promscale@0.11.0 Host:localhost Port:5432 User:postgres Password:**** Database:timescale SslMode:require DbConnectionTimeout:1m0s IgnoreCompressedChunks:false AsyncAcks:false WriteConnectionsPerProc:1 MaxConnections:400 UsesHA:false DbUri:**** EnableStatementsCache:true} LogCfg:{Level:info Format:logfmt} TracerCfg:{OtelCollectorEndpoint: OtelCollectorTLSCertFile: OtelCollectorTLSKeyFile: JaegerCollectorEndpoint: SamplingRatio:1} APICfg:{AllowedOrigin:^(?:.*)$ ReadOnly:false HighAvailability:false AdminAPIEnabled:false TelemetryPath:/metrics Auth:0xc0009d6c30 MultiTenancy:<nil> Rules:<nil>} LimitsCfg:{targetMemoryFlag:{kind:0 value:80} TargetMemoryBytes:162158808268} TenancyCfg:{SkipTenantValidation:false EnableMultiTenancy:false AllowNonMTWrites:false ValidTenantsStr:allow-all ValidTenantsList:[]} PromQLCfg:{EnabledFeatureMap:map[] PromscaleEnabledFeatureList:[] MaxQueryTimeout:2m0s SubQueryStepInterval:1m0s LookBackDelta:5m0s MaxSamples:50000000 MaxPointsPerTs:11000} RulesCfg:{NotificationQueueCapacity:10000 OutageTolerance:1h0m0s ForGracePeriod:10m0s ResendDelay:1m0s PrometheusConfigAddress: PrometheusConfig:<nil>} TracingCfg:{MaxTraceDuration:1h0m0s} ConfigFile:config.yml DatasetConfig: TLSCertFile: TLSKeyFile: ThroughputInterval:1s Migrate:true StopAfterMigrate:false UseVersionLease:true InstallExtensions:true UpgradeExtensions:true UpgradePrereleaseExtensions:false StartupOnly:false}"
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=warn ts=2022-05-19T20:52:48.883Z caller=extension.go:154 msg="Unable to install/update the Promscale extension; failed to acquire the lock. Ensure there are no other connectors running and try again."
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=error ts=2022-05-19T20:52:48.883Z caller=client.go:119 msg="Unable to run migrations; failed to acquire the lock. If the database is on an incorrect version, ensure there are no other connectors running and try again." err="dependency error: problem checking extension version: problem checking promscale extension version: promscale extension is installed but on an unsupported version: 0.3.2. Expected: >=0.5.0-alpha <0.5.99"
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=error ts=2022-05-19T20:52:48.883Z caller=runner.go:116 msg="aborting startup due to error" err="dependency error: problem checking extension version: problem checking promscale extension version: promscale extension is installed but on an unsupported version: 0.3.2. Expected: >=0.5.0-alpha <0.5.99"
May 19 20:52:48 stormfly-03.openstreetmap.org systemd[1]: promscale.service: Main process exited, code=exited, status=1/FAILURE
May 19 20:52:48 stormfly-03.openstreetmap.org systemd[1]: promscale.service: Failed with result 'exit-code'.

so it seems the real problem was a lock:

Unable to install/update the Promscale extension; failed to acquire the lock. Ensure there are no other connectors running and try again.
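
The connector guards its migrations with a lock, and the error's wording about other connectors points to a session-level advisory lock. A sketch for finding whichever session holds one, using only standard catalog views:

-- Show sessions holding advisory locks, with the application that opened them;
-- a second connector (or a stale session) holding the migration lock shows up here.
SELECT l.pid, a.application_name, l.classid, l.objid, l.granted
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.locktype = 'advisory';
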
JamesGuthrie commented 2 years ago

@tomhughes that would explain it. Was the error message clear enough for you to continue with the upgrade? Was it successful?

tomhughes commented 2 years ago

I had to significantly increase the value of max_locks_per_transaction because it kept failing with either a deadlock or shared memory exhaustion. An initial increase from 512 to 1024 didn't work, so I just raised it to 16384, which allowed it to work, though it took about an hour to upgrade.
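
For anyone hitting the same failure: max_locks_per_transaction only takes effect at server start, so raising it looks like this (using the value that worked here):

-- Writes the setting to postgresql.auto.conf; a server restart is still required.
ALTER SYSTEM SET max_locks_per_transaction = 16384;
-- After restarting, confirm the running value:
SHOW max_locks_per_transaction;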

JamesGuthrie commented 2 years ago

> I had to significantly increase the value of max_locks_per_transaction because it kept failing with either a deadlock or shared memory exhaustion. An initial increase from 512 to 1024 didn't work, so I just raised it to 16384, which allowed it to work, though it took about an hour to upgrade.

Thank you for the feedback! Do you have any more visibility into which part of the upgrade took an hour? It would be very valuable to better understand this.

tomhughes commented 2 years ago

I'm not sure there's much to go on... This was the log:

May 20 09:12:37 stormfly-03.openstreetmap.org systemd[1]: Started Promscale Connector.
May 20 09:12:37 stormfly-03.openstreetmap.org promscale[3296066]: 2022/05/20 09:12:37 maxprocs: Leaving GOMAXPROCS=56: CPU quota undefined
May 20 09:12:37 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:12:37.986Z caller=runner.go:75 msg="Version: 0.11.0, Commit Hash: f49197aac5f58755942fcbeb673271b7def122b2, Branch: HEAD"
May 20 09:12:37 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:12:37.987Z caller=runner.go:80 config="{ListenAddr::9201 ThanosStoreAPIListenAddr: OTLPGRPCListenAddr::9202 PgmodelCfg:{CacheConfig:{SeriesCacheInitialSize:250000 seriesCacheMemoryMaxFlag:{kind:0 value:50} SeriesCac>
May 20 09:12:38 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:12:38.745Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
May 20 09:39:41 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:39:41.279Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
May 20 10:11:57 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T10:11:57.923Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0

So it first installed the base version, which took about half an hour (restarting the server to increase the locks seemed to make the extension disappear completely, possibly because the old .so was no longer present), and then it took about another half hour to upgrade it.

As far as I know, that second part at least was just it executing an ALTER to update the extension, but the CPUs were fairly flat out for the whole hour.

JamesGuthrie commented 2 years ago

Thank you, that is incredibly helpful. Could you perhaps also provide the postgres logs for the same time interval?

tomhughes commented 2 years ago

I think this is everything from when I restarted the server after increasing the lock count the second time until the upgrade completed, but with the routine autovacuum noise stripped out:

2022-05-20 09:12:29 GMT LOG:  starting PostgreSQL 14.3 (Ubuntu 14.3-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
2022-05-20 09:12:29 GMT LOG:  listening on IPv6 address "::1", port 5432
2022-05-20 09:12:29 GMT LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-05-20 09:12:29 GMT LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-20 09:12:30 GMT LOG:  database system was shut down at 2022-05-20 09:12:25 GMT
2022-05-20 09:12:30 GMT LOG:  database system is ready to accept connections
2022-05-20 09:12:30 GMT LOG:  TimescaleDB background worker launcher connected to shared catalogs
2022-05-20 09:20:58 GMT ERROR:  canceling autovacuum task
2022-05-20 09:20:58 GMT CONTEXT:  while scanning block 1677529 of relation "_timescaledb_internal._hyper_2_1756579_chunk"
    automatic vacuum of table "promscale._timescaledb_internal._hyper_2_1756579_chunk"
2022-05-20 09:39:42 GMT FATAL:  terminating background worker "User-Defined Action [1000]" due to administrator command
2022-05-20 09:39:42 GMT CONTEXT:  SQL statement "SELECT m.*
            FROM _prom_catalog.metric m
            WHERE EXISTS (
                SELECT 1 FROM
                _prom_catalog.get_storage_hypertable_info(m.table_schema, m.table_name, m.is_view) hi
                INNER JOIN public.show_chunks(hi.hypertable_relation,
                             older_than=>NOW() - _prom_catalog.get_metric_retention_period(m.table_schema, m.metric_name)) sc ON TRUE)
            --random order also to prevent starvation
            ORDER BY random()"
    PL/pgSQL function get_metrics_that_need_drop_chunk() line 13 at RETURN QUERY
    PL/pgSQL function execute_data_retention_policy(boolean) line 11 at FOR over SELECT rows
    SQL statement "CALL _prom_catalog.execute_data_retention_policy(log_verbose=>log_verbose)"
    PL/pgSQL function execute_maintenance(boolean) line 11 at CALL
    SQL statement "CALL prom_api.execute_maintenance(log_verbose=>log_verbose)"
    PL/pgSQL function execute_maintenance_job(integer,jsonb) line 28 at CALL
2022-05-20 09:39:42 GMT LOG:  background worker "User-Defined Action [1000]" (PID 3298695) exited with exit code 1
2022-05-20 09:39:44 GMT LOG:  Schema "_ps_catalog" is already present, skipping creation
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 26 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Table "_ps_catalog.migration" is already present, skipping creation
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 43 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "001-create-ext-schema.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 14 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "001-extension.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 45 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "002-utils.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 70 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "003-users.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 93 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "004-schemas.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 58 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "005-tag-operators.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 22 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "006-tables.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 163 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "007-matcher-operators.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 216 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "008-install-uda.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 86 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "009-tables-ha.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 57 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "010-tables-metadata.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 29 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "011-tables-exemplar.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 32 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "012-tracing.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 344 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "013-tracing-well-known-tags.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 195 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG:  Migration "014-telemetry.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT:  PL/pgSQL function inline_code_block line 36 at RAISE
2022-05-20 09:39:44 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG:  Applied migration 015-tracing-redesign.sql
2022-05-20 09:40:12 GMT CONTEXT:  PL/pgSQL function inline_code_block line 1271 at RAISE
2022-05-20 09:40:12 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG:  Applied migration 016-remove-ee-schemas.sql
2022-05-20 09:40:12 GMT CONTEXT:  PL/pgSQL function inline_code_block line 29 at RAISE
2022-05-20 09:40:12 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG:  Applied migration 017-set-search-path.sql
2022-05-20 09:40:12 GMT CONTEXT:  PL/pgSQL function inline_code_block line 97 at RAISE
2022-05-20 09:40:12 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG:  Applied migration 018-grant-prom-roles.sql
2022-05-20 09:40:12 GMT CONTEXT:  PL/pgSQL function inline_code_block line 41 at RAISE
2022-05-20 09:40:12 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG:  Applied migration 019-prom-installation-info.sql
2022-05-20 09:40:12 GMT CONTEXT:  PL/pgSQL function inline_code_block line 33 at RAISE
2022-05-20 09:40:12 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 10:04:27 GMT LOG:  Applied migration 020-series-partitions.sql
2022-05-20 10:04:27 GMT CONTEXT:  PL/pgSQL function inline_code_block line 279 at RAISE
2022-05-20 10:04:27 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:27 GMT LOG:  Applied migration 021-initial-default.sql
2022-05-20 10:04:27 GMT CONTEXT:  PL/pgSQL function inline_code_block line 69 at RAISE
2022-05-20 10:04:27 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:27 GMT LOG:  Applied migration 022-jit-off.sql
2022-05-20 10:04:27 GMT CONTEXT:  PL/pgSQL function inline_code_block line 45 at RAISE
2022-05-20 10:04:27 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:27 GMT LOG:  Applied migration 023-privileges.sql
2022-05-20 10:04:27 GMT CONTEXT:  PL/pgSQL function inline_code_block line 61 at RAISE
2022-05-20 10:04:27 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:30 GMT ERROR:  canceling autovacuum task
2022-05-20 10:04:30 GMT CONTEXT:  while scanning block 1678078 of relation "_timescaledb_internal._hyper_2_1756579_chunk"
    automatic vacuum of table "promscale._timescaledb_internal._hyper_2_1756579_chunk"
2022-05-20 10:10:50 GMT LOG:  Applied migration 024-adjust_autovacuum.sql
2022-05-20 10:10:50 GMT CONTEXT:  PL/pgSQL function inline_code_block line 147 at RAISE
2022-05-20 10:10:50 GMT STATEMENT:  ALTER EXTENSION promscale UPDATE TO '0.5.0'
terminating TimescaleDB background job "User-Defined Action [1000]" due to administrator command
2022-05-20 10:11:56 GMT FATAL:  terminating background worker "User-Defined Action [1000]" due to administrator command at character 785
2022-05-20 10:11:56 GMT QUERY:  
            WITH potentially_drop_series AS (
                SELECT distinct series_id
                FROM prom_data.apache_version
                WHERE time < '2022-02-19 07:42:40.528746+00'
                EXCEPT
                SELECT distinct series_id
                FROM prom_data.apache_version
                WHERE time >= '2022-02-19 07:42:40.528746+00' AND time < '2022-02-19 10:53:25.317794+00'
            ), confirmed_drop_series AS (
                SELECT _prom_catalog.get_confirmed_unused_series('prom_data','apache_version','apache_version', array_agg(series_id), '2022-02-19 10:53:25.317794+00') as ids
                FROM potentially_drop_series
            ) -- we want this next statement to be the last one in the txn since it could block series fetch (both of them update delete_epoch)
            UPDATE prom_data_series.apache_version SET delete_epoch = current_epoch+1
            FROM _prom_catalog.ids_epoch
            WHERE delete_epoch IS NULL
                AND id IN (SELECT unnest(ids) FROM confirmed_drop_series)

2022-05-20 10:11:56 GMT CONTEXT:  PL/pgSQL function _prom_catalog.mark_unused_series(text,text,text,timestamp with time zone,timestamp with time zone) line 7 at EXECUTE
    SQL statement "SELECT _prom_catalog.mark_unused_series(metric_schema, metric_table, metric_series_table, older_than, check_time)"
    PL/pgSQL function drop_metric_chunks(text,text,timestamp with time zone,timestamp with time zone,boolean) line 73 at PERFORM
    SQL statement "CALL _prom_catalog.drop_metric_chunks(r.table_schema, r.metric_name, NOW() - _prom_catalog.get_metric_retention_period(r.table_schema, r.metric_name), log_verbose=>log_verbose)"
    PL/pgSQL function execute_data_retention_policy(boolean) line 19 at CALL
    SQL statement "CALL _prom_catalog.execute_data_retention_policy(log_verbose=>log_verbose)"
    PL/pgSQL function execute_maintenance(boolean) line 11 at CALL
    SQL statement "CALL prom_api.execute_maintenance(log_verbose=>log_verbose)"
    PL/pgSQL function execute_maintenance_job(integer,jsonb) line 28 at CALL
2022-05-20 10:11:56 GMT LOG:  background worker "User-Defined Action [1000]" (PID 3305132) exited with exit code 1
terminating TimescaleDB background job "User-Defined Action [1001]" due to administrator command
2022-05-20 10:11:56 GMT FATAL:  terminating background worker "User-Defined Action [1001]" due to administrator command at character 847
2022-05-20 10:11:56 GMT QUERY:  
            WITH potentially_drop_series AS (
                SELECT distinct series_id
                FROM prom_data.pg_stat_user_tables_n_dead_tup
                WHERE time < '2022-02-19 06:29:45.79166+00'
                EXCEPT
                SELECT distinct series_id
                FROM prom_data.pg_stat_user_tables_n_dead_tup
                WHERE time >= '2022-02-19 06:29:45.79166+00' AND time < '2022-02-19 10:40:41.026879+00'
            ), confirmed_drop_series AS (
                SELECT _prom_catalog.get_confirmed_unused_series('prom_data','pg_stat_user_tables_n_dead_tup','pg_stat_user_tables_n_dead_tup', array_agg(series_id), '2022-02-19 10:40:41.026879+00') as ids
                FROM potentially_drop_series
            ) -- we want this next statement to be the last one in the txn since it could block series fetch (both of them update delete_epoch)
            UPDATE prom_data_series.pg_stat_user_tables_n_dead_tup SET delete_epoch = current_epoch+1
            FROM _prom_catalog.ids_epoch
            WHERE delete_epoch IS NULL
                AND id IN (SELECT unnest(ids) FROM confirmed_drop_series)

2022-05-20 10:11:56 GMT CONTEXT:  PL/pgSQL function _prom_catalog.mark_unused_series(text,text,text,timestamp with time zone,timestamp with time zone) line 7 at EXECUTE
    SQL statement "SELECT _prom_catalog.mark_unused_series(metric_schema, metric_table, metric_series_table, older_than, check_time)"
    PL/pgSQL function drop_metric_chunks(text,text,timestamp with time zone,timestamp with time zone,boolean) line 73 at PERFORM
    SQL statement "CALL _prom_catalog.drop_metric_chunks(r.table_schema, r.metric_name, NOW() - _prom_catalog.get_metric_retention_period(r.table_schema, r.metric_name), log_verbose=>log_verbose)"
    PL/pgSQL function execute_data_retention_policy(boolean) line 19 at CALL
    SQL statement "CALL _prom_catalog.execute_data_retention_policy(log_verbose=>log_verbose)"
    PL/pgSQL function execute_maintenance(boolean) line 11 at CALL
    SQL statement "CALL prom_api.execute_maintenance(log_verbose=>log_verbose)"
    PL/pgSQL function execute_maintenance_job(integer,jsonb) line 28 at CALL
2022-05-20 10:11:56 GMT LOG:  background worker "User-Defined Action [1001]" (PID 3298657) exited with exit code 1
JamesGuthrie commented 2 years ago

That is amazing, thank you again for providing us with this!

JamesGuthrie commented 2 years ago

@tomhughes we've had a chance to take a closer look at the logs that you've provided and have some follow-up questions.

The first question relates to this line:

2022-05-20 09:39:42 GMT FATAL:  terminating background worker "User-Defined Action [1000]" due to administrator command

Did you manually cancel the background worker here using something like pg_cancel_backend or pg_terminate_backend?

The second question relates to these lines:

2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG:  could not receive data from client: Connection reset by peer

Do you know which application this was? I assume that there are applications other than Promscale connected to your DB, perhaps Grafana? Any others which might be relevant? This is mostly so that we can understand what the overall environment looks like.

Thanks in advance!

tomhughes commented 2 years ago

I have no idea what caused the first message, I'm afraid. I got rid of our scheduled running of the maintenance task a while ago, as timescale now takes care of that, so I assumed it had started a background job which got cancelled.

As to the others, I guess it was likely grafana or alertmanager, though I don't think we have anything there which routinely hits promscale rather than hitting prometheus directly.

The setup is prometheus, feeding into promscale (which is still experimental as we haven't got it to run reliably yet) plus grafana, alertmanager and karma to provide a web interface to alerts.

It could maybe be the postgres collector hitting the database to collect statistics? I also ran some selects in a console session to try and see what it was doing, and had pg_top running at times for the same reason.
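
For future readers, pg_stat_activity can attribute connections like these: clients that set application_name (psql does by default, and this connector reports itself as promscale@0.11.0 per the config dump above) are directly identifiable; anything with an empty name has to be traced by client_addr or usename instead:

-- Group current sessions by the identifying fields pg_stat_activity exposes.
SELECT application_name, usename, client_addr, state, count(*) AS sessions
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr, state
ORDER BY sessions DESC;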

JamesGuthrie commented 2 years ago

@tomhughes thanks again for the invaluable feedback.

> which is still experimental as we haven't got it to run reliably yet

The focus of our team over the next few months will be on improving reliability. Could you perhaps provide us with some insight into the types of problems that you've been having?

tomhughes commented 2 years ago

The principal problem we've been struggling with over the last year is getting autovacuum to keep up with the sheer number of tables, and to be able to analyse everything in time to prevent things freezing up due to transaction ID wraparound.
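
For context, wraparound pressure of this kind can be tracked per database with a standard query; autovacuum has to freeze old tuples before the age approaches autovacuum_freeze_max_age (200 million by default), well short of the ~2 billion hard limit:

-- Databases closest to transaction ID wraparound, oldest first.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;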

It has improved recently, to the extent that the time we can run before things collapse is now a matter of weeks rather than days. Most recently we were able to run from mid-April until last week, when we hit the wall again; that was what led to me trying to update to the latest promscale and extension to see if there were any further improvements.

That said, upgrading from 0.10.0/0.3.2 to 0.11.0/0.5.0 seems to have its own issues, in that there has been a significant increase in the CPU used by the ingestor, or rather by postgres on the ingestor's behalf I think, although it's all a bit confusing as you can see here:

[image: CPU usage graph]

When I first started 0.11.0 successfully on the 20th, it ran flat out for a few hours. I then reduced the number of autovacuum workers from 56 to 28 (it's a 28-core/56-thread machine), and that improved things for a few days until it hit transaction wraparound issues again on the 22nd, at which point I stopped the ingestor while it caught up with autovacuum.

Since I restarted the ingestor on the 23rd it's just been going nuts, with the CPUs all flat out... The good news is that the I/O utilisation has been way down since that restart, and it has been keeping the unfrozen IDs under control:

[image: I/O utilisation and unfrozen transaction ID graph]

Whether that is good overall is unclear, because I don't really understand why the behaviour changed so dramatically after that restart...

JamesGuthrie commented 2 years ago

Thank you for these insights, we will take a look at them in detail and get back to you with further questions.

In the meantime, could you tell us what the volume of metrics is that you're currently ingesting?

The Promscale connector has log outputs like:

level=info ts=2022-05-25T08:32:06.103Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=500 metrics-max-sent-ts=2022-05-25T08:31:46.915Z

which can help to identify this.

Could you perhaps also provide the output of the following queries:

SELECT count(*) FROM timescaledb_information.chunks;
SELECT count(*)
        FROM pg_class
                 JOIN pg_namespace n ON n.oid = pg_class.relnamespace
        WHERE n.nspname <> 'pg_catalog'
          AND n.nspname !~ '^pg_toast'
          AND n.nspname <> 'information_schema';
tomhughes commented 2 years ago

Sure, here's a snapshot of the throughput:

May 25 08:52:28 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:28.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=44000 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:29 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:29.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=11500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:30 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:30.448Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=4500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:31 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:31.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=5500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:32 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:32.449Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=1000 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:33 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:33.445Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=3500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:34 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:34.443Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=9000 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:35 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:35.443Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=6500 metrics-max-sent-ts=2022-05-25T07:20:45.07Z
May 25 08:52:36 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:36.443Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=8500 metrics-max-sent-ts=2022-05-25T07:20:45.305Z
May 25 08:52:37 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:37.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=6000 metrics-max-sent-ts=2022-05-25T07:20:45.305Z

and the first query:

promscale=# SELECT count(*) FROM timescaledb_information.chunks;
 count  
--------
 392867
(1 row)

and the second:

promscale=# SELECT count(*)
promscale-#         FROM pg_class
promscale-#                  JOIN pg_namespace n ON n.oid = pg_class.relnamespace
promscale-#         WHERE n.nspname <> 'pg_catalog'
promscale-#           AND n.nspname !~ '^pg_toast'
promscale-#           AND n.nspname <> 'information_schema';
  count  
---------
 1595381
(1 row)
jgpruitt commented 2 years ago

This is incredibly valuable feedback! Thank you @tomhughes!

JamesGuthrie commented 2 years ago

@tomhughes sorry to keep bothering you for more information. Could you provide us with some information about the server that you ran the upgrade on?

I'm primarily interested in understanding the performance characteristics of the hardware.

If you're using a cloud hosting provider (AWS/GCP/hetzner etc.) you can just mention the type of instance that you're using.

tomhughes commented 2 years ago

It's this machine, which is bare metal with 2 x 14-core Xeons, 192GB of RAM and 4 x 1TB SSDs in RAID5.

That wasn't necessarily intended as the final home for this; part of what we were trying to figure out was what resources (mostly how much disk) we would need before we bought a machine for it, and this was a machine we had available to trial it.