tomhughes opened 2 years ago

Promscale 0.11.0 is refusing to start, claiming the extension is too old even though I have 0.5.0 installed. Trying to update it manually fails, and the package does indeed appear to be missing an SQL script to do the update, as does this repository.
Hi Tom, which one of the packages did you use to install the extension?
The SQL script is not present in the repository, because it’s generated in the build process.
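If it's useful for checking, the upgrade paths that the installed scripts actually provide can be listed with a standard Postgres catalog function (just a sketch, using the extension name from this issue):

SELECT source, target, path
FROM pg_extension_update_paths('promscale')
WHERE path IS NOT NULL
ORDER BY source, target;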
The deb from the timescaledb repository at https://packagecloud.io/timescale/timescaledb/ubuntu.
The deb from GitHub doesn't seem to have it either:
% dpkg --contents promscale-extension-0.5.0.pg14.debian11.x86_64.deb
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/lib/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/lib/postgresql/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/lib/postgresql/14/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/lib/postgresql/14/lib/
-rwxr-xr-x 0/0 1267968 2022-05-12 01:21 ./usr/lib/postgresql/14/lib/promscale-0.5.0.so
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/share/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/14/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/
-rw-r--r-- 0/0 11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.2.sql
-rw-r--r-- 0/0 48512 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.0.0.sql
-rw-r--r-- 0/0 11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.3-beta.sql
-rw-r--r-- 0/0 11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.1.sql
-rw-r--r-- 0/0 535 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1--0.1.1.sql
-rw-r--r-- 0/0 9256 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.sql
-rw-r--r-- 0/0 9316 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.2.sql
-rw-r--r-- 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.0--0.3.1.sql
-rw-r--r-- 0/0 486924 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.5.0.sql
-rw-r--r-- 0/0 2891 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.1--0.3.2.sql
-rw-r--r-- 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.2.0--0.3.0.sql
-rw-r--r-- 0/0 486924 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.0.0--0.5.0.sql
-rw-r--r-- 0/0 11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.2.0.sql
-rw-r--r-- 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.3-beta--0.2.0.sql
-rw-r--r-- 0/0 0 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.1--0.1.2.sql
-rw-r--r-- 0/0 2950 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.2--0.1.3-beta.sql
-rw-r--r-- 0/0 375 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale.control
-rw-r--r-- 0/0 9316 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.1.1.sql
-rw-r--r-- 0/0 11063 2022-05-12 01:21 ./usr/share/postgresql/14/extension/promscale--0.3.0.sql
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/share/doc/
drwxr-xr-x 0/0 0 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/
-rw-r--r-- 0/0 163 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/changelog.Debian.gz
-rw-r--r-- 0/0 394 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/changelog.gz
-rw-r--r-- 0/0 494 2022-05-12 01:21 ./usr/share/doc/promscale-extension-postgresql-14/copyright
The file layout is correct. There's no direct path between 0.3.x and 0.5.0 because it is a special case: the promscale binary needs to execute the upgrade procedure 0.3.2 -> 0.0.0 -> 0.5.0 itself. But it only does so when the -startup.upgrade-extensions flag is not set to false. Could you check the flags you start promscale with?
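While you're looking, the installed and packaged versions can be confirmed from SQL; this is only a sketch of the checks, the upgrade itself should be left to the connector:

-- version currently installed in the database
SELECT extname, extversion FROM pg_extension WHERE extname = 'promscale';
-- versions the packaged scripts make available
SELECT name, default_version, installed_version
FROM pg_available_extensions WHERE name = 'promscale';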
Ah, I only looked at the last line of the log, where it said the version was wrong, and then went and tried to upgrade it manually, but I see that there was a different error on the line before:
May 19 20:52:48 stormfly-03.openstreetmap.org systemd[1]: Started Promscale Connector.
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: 2022/05/19 20:52:48 maxprocs: Leaving GOMAXPROCS=56: CPU quota undefined
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=info ts=2022-05-19T20:52:48.600Z caller=runner.go:75 msg="Version: 0.11.0, Commit Hash: f49197aac5f58755942fcbeb673271b7def122b2, Branch: HEAD"
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=info ts=2022-05-19T20:52:48.600Z caller=runner.go:80 config="{ListenAddr::9201 ThanosStoreAPIListenAddr: OTLPGRPCListenAddr::9202 PgmodelCfg:{CacheConfig:{SeriesCacheInitialSize:250000 seriesCacheMemoryMaxFlag:{kind:0 value:50} SeriesCacheMemoryMaxBytes:81079404134 MetricsCacheSize:10000 LabelsCacheSize:10000 ExemplarKeyPosCacheSize:10000} AppName:promscale@0.11.0 Host:localhost Port:5432 User:postgres Password:**** Database:timescale SslMode:require DbConnectionTimeout:1m0s IgnoreCompressedChunks:false AsyncAcks:false WriteConnectionsPerProc:1 MaxConnections:400 UsesHA:false DbUri:**** EnableStatementsCache:true} LogCfg:{Level:info Format:logfmt} TracerCfg:{OtelCollectorEndpoint: OtelCollectorTLSCertFile: OtelCollectorTLSKeyFile: JaegerCollectorEndpoint: SamplingRatio:1} APICfg:{AllowedOrigin:^(?:.*)$ ReadOnly:false HighAvailability:false AdminAPIEnabled:false TelemetryPath:/metrics Auth:0xc0009d6c30 MultiTenancy:<nil> Rules:<nil>} LimitsCfg:{targetMemoryFlag:{kind:0 value:80} TargetMemoryBytes:162158808268} TenancyCfg:{SkipTenantValidation:false EnableMultiTenancy:false AllowNonMTWrites:false ValidTenantsStr:allow-all ValidTenantsList:[]} PromQLCfg:{EnabledFeatureMap:map[] PromscaleEnabledFeatureList:[] MaxQueryTimeout:2m0s SubQueryStepInterval:1m0s LookBackDelta:5m0s MaxSamples:50000000 MaxPointsPerTs:11000} RulesCfg:{NotificationQueueCapacity:10000 OutageTolerance:1h0m0s ForGracePeriod:10m0s ResendDelay:1m0s PrometheusConfigAddress: PrometheusConfig:<nil>} TracingCfg:{MaxTraceDuration:1h0m0s} ConfigFile:config.yml DatasetConfig: TLSCertFile: TLSKeyFile: ThroughputInterval:1s Migrate:true StopAfterMigrate:false UseVersionLease:true InstallExtensions:true UpgradeExtensions:true UpgradePrereleaseExtensions:false StartupOnly:false}"
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=warn ts=2022-05-19T20:52:48.883Z caller=extension.go:154 msg="Unable to install/update the Promscale extension; failed to acquire the lock. Ensure there are no other connectors running and try again."
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=error ts=2022-05-19T20:52:48.883Z caller=client.go:119 msg="Unable to run migrations; failed to acquire the lock. If the database is on an incorrect version, ensure there are no other connectors running and try again." err="dependency error: problem checking extension version: problem checking promscale extension version: promscale extension is installed but on an unsupported version: 0.3.2. Expected: >=0.5.0-alpha <0.5.99"
May 19 20:52:48 stormfly-03.openstreetmap.org promscale[3070779]: level=error ts=2022-05-19T20:52:48.883Z caller=runner.go:116 msg="aborting startup due to error" err="dependency error: problem checking extension version: problem checking promscale extension version: promscale extension is installed but on an unsupported version: 0.3.2. Expected: >=0.5.0-alpha <0.5.99"
May 19 20:52:48 stormfly-03.openstreetmap.org systemd[1]: promscale.service: Main process exited, code=exited, status=1/FAILURE
May 19 20:52:48 stormfly-03.openstreetmap.org systemd[1]: promscale.service: Failed with result 'exit-code'.
so it seems the real problem was a lock:
Unable to install/update the Promscale extension; failed to acquire the lock. Ensure there are no other connectors running and try again.
@tomhughes that would explain it. Was the error message clear enough for you to continue with the upgrade? Was it successful?
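In case it happens again, something along these lines will show whether another connector (or anything else) is still connected or holding locks; treat it as a sketch:

-- sessions from other Promscale connectors (they set application_name to promscale@<version>)
SELECT pid, application_name, state, query_start
FROM pg_stat_activity
WHERE application_name LIKE 'promscale%';
-- advisory locks currently held
SELECT locktype, pid, granted
FROM pg_locks
WHERE locktype = 'advisory';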
I had to significantly increase the value of max_locks_per_transaction because it kept failing with either a deadlock or shared memory exhaustion. An initial increase from 512 to 1024 didn't work, so I just raised it to 16384, which allowed it to work, though it took about an hour to upgrade.
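The change itself was just the usual GUC bump plus a restart, along these lines (ALTER SYSTEM shown for illustration; editing postgresql.conf works the same way):

ALTER SYSTEM SET max_locks_per_transaction = 16384;
-- needs a full server restart to take effect
SHOW max_locks_per_transaction;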
> I had to significantly increase the value of max_locks_per_transaction because it kept failing with either a deadlock or shared memory exhaustion. An initial increase from 512 to 1024 didn't work, so I just raised it to 16384, which allowed it to work, though it took about an hour to upgrade.
Thank you for the feedback! Do you have any more visibility into which part of the upgrade took an hour? It would be very valuable to better understand this.
I'm not sure there's much to go on... This was the log:
May 20 09:12:37 stormfly-03.openstreetmap.org systemd[1]: Started Promscale Connector.
May 20 09:12:37 stormfly-03.openstreetmap.org promscale[3296066]: 2022/05/20 09:12:37 maxprocs: Leaving GOMAXPROCS=56: CPU quota undefined
May 20 09:12:37 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:12:37.986Z caller=runner.go:75 msg="Version: 0.11.0, Commit Hash: f49197aac5f58755942fcbeb673271b7def122b2, Branch: HEAD"
May 20 09:12:37 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:12:37.987Z caller=runner.go:80 config="{ListenAddr::9201 ThanosStoreAPIListenAddr: OTLPGRPCListenAddr::9202 PgmodelCfg:{CacheConfig:{SeriesCacheInitialSize:250000 seriesCacheMemoryMaxFlag:{kind:0 value:50} SeriesCac>
May 20 09:12:38 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:12:38.745Z caller=new_migrate.go:62 msg="Installing extension at version '0.0.0'"
May 20 09:39:41 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T09:39:41.279Z caller=new_migrate.go:135 msg="upgrading old version of Promscale to version 0.11.0, the Promscale database extension is required moving forward"
May 20 10:11:57 stormfly-03.openstreetmap.org promscale[3296066]: level=info ts=2022-05-20T10:11:57.923Z caller=extension.go:276 msg="successfully updated extension" extension_name=promscale old_version=0.0.0 new_version=0.5.0
So it first installed the base version, which took about half an hour (restarting the server to increase the locks seemed to make the extension disappear completely, possibly because the old .so was no longer present), and then about another half hour to upgrade it.
As far as I know that second part, at least, was just it executing an ALTER to update the extension, but the CPUs were fairly flat out the whole hour.
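i.e. as far as I can tell it boils down to something like this (just a sketch, and the pg_stat_activity query is only how I'd peek at what it's busy with):

ALTER EXTENSION promscale UPDATE TO '0.5.0';
-- while it runs, this shows what the backends are busy with
SELECT pid, now() - xact_start AS xact_age, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle';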
Thank you, that is incredibly helpful. Could you perhaps also provide the postgres logs for the same time interval?
I think this is everything from when I restarted the server after increasing the lock count the second time until the upgrade completed, but with the routine autovacuum noise stripped out:
2022-05-20 09:12:29 GMT LOG: starting PostgreSQL 14.3 (Ubuntu 14.3-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
2022-05-20 09:12:29 GMT LOG: listening on IPv6 address "::1", port 5432
2022-05-20 09:12:29 GMT LOG: listening on IPv4 address "127.0.0.1", port 5432
2022-05-20 09:12:29 GMT LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-20 09:12:30 GMT LOG: database system was shut down at 2022-05-20 09:12:25 GMT
2022-05-20 09:12:30 GMT LOG: database system is ready to accept connections
2022-05-20 09:12:30 GMT LOG: TimescaleDB background worker launcher connected to shared catalogs
2022-05-20 09:20:58 GMT ERROR: canceling autovacuum task
2022-05-20 09:20:58 GMT CONTEXT: while scanning block 1677529 of relation "_timescaledb_internal._hyper_2_1756579_chunk"
automatic vacuum of table "promscale._timescaledb_internal._hyper_2_1756579_chunk"
2022-05-20 09:39:42 GMT FATAL: terminating background worker "User-Defined Action [1000]" due to administrator command
2022-05-20 09:39:42 GMT CONTEXT: SQL statement "SELECT m.*
FROM _prom_catalog.metric m
WHERE EXISTS (
SELECT 1 FROM
_prom_catalog.get_storage_hypertable_info(m.table_schema, m.table_name, m.is_view) hi
INNER JOIN public.show_chunks(hi.hypertable_relation,
older_than=>NOW() - _prom_catalog.get_metric_retention_period(m.table_schema, m.metric_name)) sc ON TRUE)
--random order also to prevent starvation
ORDER BY random()"
PL/pgSQL function get_metrics_that_need_drop_chunk() line 13 at RETURN QUERY
PL/pgSQL function execute_data_retention_policy(boolean) line 11 at FOR over SELECT rows
SQL statement "CALL _prom_catalog.execute_data_retention_policy(log_verbose=>log_verbose)"
PL/pgSQL function execute_maintenance(boolean) line 11 at CALL
SQL statement "CALL prom_api.execute_maintenance(log_verbose=>log_verbose)"
PL/pgSQL function execute_maintenance_job(integer,jsonb) line 28 at CALL
2022-05-20 09:39:42 GMT LOG: background worker "User-Defined Action [1000]" (PID 3298695) exited with exit code 1
2022-05-20 09:39:44 GMT LOG: Schema "_ps_catalog" is already present, skipping creation
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 26 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Table "_ps_catalog.migration" is already present, skipping creation
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 43 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "001-create-ext-schema.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 14 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "001-extension.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 45 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "002-utils.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 70 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "003-users.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 93 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "004-schemas.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 58 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "005-tag-operators.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 22 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "006-tables.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 163 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "007-matcher-operators.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 216 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "008-install-uda.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 86 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "009-tables-ha.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 57 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "010-tables-metadata.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 29 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "011-tables-exemplar.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 32 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "012-tracing.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 344 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "013-tracing-well-known-tags.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 195 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:39:44 GMT LOG: Migration "014-telemetry.sql" already applied, skipping
2022-05-20 09:39:44 GMT CONTEXT: PL/pgSQL function inline_code_block line 36 at RAISE
2022-05-20 09:39:44 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG: Applied migration 015-tracing-redesign.sql
2022-05-20 09:40:12 GMT CONTEXT: PL/pgSQL function inline_code_block line 1271 at RAISE
2022-05-20 09:40:12 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG: Applied migration 016-remove-ee-schemas.sql
2022-05-20 09:40:12 GMT CONTEXT: PL/pgSQL function inline_code_block line 29 at RAISE
2022-05-20 09:40:12 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG: Applied migration 017-set-search-path.sql
2022-05-20 09:40:12 GMT CONTEXT: PL/pgSQL function inline_code_block line 97 at RAISE
2022-05-20 09:40:12 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG: Applied migration 018-grant-prom-roles.sql
2022-05-20 09:40:12 GMT CONTEXT: PL/pgSQL function inline_code_block line 41 at RAISE
2022-05-20 09:40:12 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:40:12 GMT LOG: Applied migration 019-prom-installation-info.sql
2022-05-20 09:40:12 GMT CONTEXT: PL/pgSQL function inline_code_block line 33 at RAISE
2022-05-20 09:40:12 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 10:04:27 GMT LOG: Applied migration 020-series-partitions.sql
2022-05-20 10:04:27 GMT CONTEXT: PL/pgSQL function inline_code_block line 279 at RAISE
2022-05-20 10:04:27 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:27 GMT LOG: Applied migration 021-initial-default.sql
2022-05-20 10:04:27 GMT CONTEXT: PL/pgSQL function inline_code_block line 69 at RAISE
2022-05-20 10:04:27 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:27 GMT LOG: Applied migration 022-jit-off.sql
2022-05-20 10:04:27 GMT CONTEXT: PL/pgSQL function inline_code_block line 45 at RAISE
2022-05-20 10:04:27 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:27 GMT LOG: Applied migration 023-privileges.sql
2022-05-20 10:04:27 GMT CONTEXT: PL/pgSQL function inline_code_block line 61 at RAISE
2022-05-20 10:04:27 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
2022-05-20 10:04:30 GMT ERROR: canceling autovacuum task
2022-05-20 10:04:30 GMT CONTEXT: while scanning block 1678078 of relation "_timescaledb_internal._hyper_2_1756579_chunk"
automatic vacuum of table "promscale._timescaledb_internal._hyper_2_1756579_chunk"
2022-05-20 10:10:50 GMT LOG: Applied migration 024-adjust_autovacuum.sql
2022-05-20 10:10:50 GMT CONTEXT: PL/pgSQL function inline_code_block line 147 at RAISE
2022-05-20 10:10:50 GMT STATEMENT: ALTER EXTENSION promscale UPDATE TO '0.5.0'
terminating TimescaleDB background job "User-Defined Action [1000]" due to administrator command
2022-05-20 10:11:56 GMT FATAL: terminating background worker "User-Defined Action [1000]" due to administrator command at character 785
2022-05-20 10:11:56 GMT QUERY:
WITH potentially_drop_series AS (
SELECT distinct series_id
FROM prom_data.apache_version
WHERE time < '2022-02-19 07:42:40.528746+00'
EXCEPT
SELECT distinct series_id
FROM prom_data.apache_version
WHERE time >= '2022-02-19 07:42:40.528746+00' AND time < '2022-02-19 10:53:25.317794+00'
), confirmed_drop_series AS (
SELECT _prom_catalog.get_confirmed_unused_series('prom_data','apache_version','apache_version', array_agg(series_id), '2022-02-19 10:53:25.317794+00') as ids
FROM potentially_drop_series
) -- we want this next statement to be the last one in the txn since it could block series fetch (both of them update delete_epoch)
UPDATE prom_data_series.apache_version SET delete_epoch = current_epoch+1
FROM _prom_catalog.ids_epoch
WHERE delete_epoch IS NULL
AND id IN (SELECT unnest(ids) FROM confirmed_drop_series)
2022-05-20 10:11:56 GMT CONTEXT: PL/pgSQL function _prom_catalog.mark_unused_series(text,text,text,timestamp with time zone,timestamp with time zone) line 7 at EXECUTE
SQL statement "SELECT _prom_catalog.mark_unused_series(metric_schema, metric_table, metric_series_table, older_than, check_time)"
PL/pgSQL function drop_metric_chunks(text,text,timestamp with time zone,timestamp with time zone,boolean) line 73 at PERFORM
SQL statement "CALL _prom_catalog.drop_metric_chunks(r.table_schema, r.metric_name, NOW() - _prom_catalog.get_metric_retention_period(r.table_schema, r.metric_name), log_verbose=>log_verbose)"
PL/pgSQL function execute_data_retention_policy(boolean) line 19 at CALL
SQL statement "CALL _prom_catalog.execute_data_retention_policy(log_verbose=>log_verbose)"
PL/pgSQL function execute_maintenance(boolean) line 11 at CALL
SQL statement "CALL prom_api.execute_maintenance(log_verbose=>log_verbose)"
PL/pgSQL function execute_maintenance_job(integer,jsonb) line 28 at CALL
2022-05-20 10:11:56 GMT LOG: background worker "User-Defined Action [1000]" (PID 3305132) exited with exit code 1
terminating TimescaleDB background job "User-Defined Action [1001]" due to administrator command
2022-05-20 10:11:56 GMT FATAL: terminating background worker "User-Defined Action [1001]" due to administrator command at character 847
2022-05-20 10:11:56 GMT QUERY:
WITH potentially_drop_series AS (
SELECT distinct series_id
FROM prom_data.pg_stat_user_tables_n_dead_tup
WHERE time < '2022-02-19 06:29:45.79166+00'
EXCEPT
SELECT distinct series_id
FROM prom_data.pg_stat_user_tables_n_dead_tup
WHERE time >= '2022-02-19 06:29:45.79166+00' AND time < '2022-02-19 10:40:41.026879+00'
), confirmed_drop_series AS (
SELECT _prom_catalog.get_confirmed_unused_series('prom_data','pg_stat_user_tables_n_dead_tup','pg_stat_user_tables_n_dead_tup', array_agg(series_id), '2022-02-19 10:40:41.026879+00') as ids
FROM potentially_drop_series
) -- we want this next statement to be the last one in the txn since it could block series fetch (both of them update delete_epoch)
UPDATE prom_data_series.pg_stat_user_tables_n_dead_tup SET delete_epoch = current_epoch+1
FROM _prom_catalog.ids_epoch
WHERE delete_epoch IS NULL
AND id IN (SELECT unnest(ids) FROM confirmed_drop_series)
2022-05-20 10:11:56 GMT CONTEXT: PL/pgSQL function _prom_catalog.mark_unused_series(text,text,text,timestamp with time zone,timestamp with time zone) line 7 at EXECUTE
SQL statement "SELECT _prom_catalog.mark_unused_series(metric_schema, metric_table, metric_series_table, older_than, check_time)"
PL/pgSQL function drop_metric_chunks(text,text,timestamp with time zone,timestamp with time zone,boolean) line 73 at PERFORM
SQL statement "CALL _prom_catalog.drop_metric_chunks(r.table_schema, r.metric_name, NOW() - _prom_catalog.get_metric_retention_period(r.table_schema, r.metric_name), log_verbose=>log_verbose)"
PL/pgSQL function execute_data_retention_policy(boolean) line 19 at CALL
SQL statement "CALL _prom_catalog.execute_data_retention_policy(log_verbose=>log_verbose)"
PL/pgSQL function execute_maintenance(boolean) line 11 at CALL
SQL statement "CALL prom_api.execute_maintenance(log_verbose=>log_verbose)"
PL/pgSQL function execute_maintenance_job(integer,jsonb) line 28 at CALL
2022-05-20 10:11:56 GMT LOG: background worker "User-Defined Action [1001]" (PID 3298657) exited with exit code 1
That is amazing, thank you again for providing us with this!
@tomhughes we've had a chance to take a closer look at the logs that you've provided and have some follow-up questions.
The first question relates to this line:
2022-05-20 09:39:42 GMT FATAL: terminating background worker "User-Defined Action [1000]" due to administrator command
Did you manually cancel the background worker here, using something like pg_cancel_backend or pg_terminate_backend?
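i.e. something along the lines of the following, with the pid taken from the log purely as an example:

SELECT pg_cancel_backend(3298695);    -- asks the backend to cancel its current query
SELECT pg_terminate_backend(3298695); -- forcibly terminates the backend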
The second question relates to these lines:
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: unexpected EOF on client connection with an open transaction
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
2022-05-20 09:59:23 GMT LOG: could not receive data from client: Connection reset by peer
Do you know which application this was? I assume that there are applications other than Promscale connected to your DB, perhaps Grafana? Any others which might be relevant? This is mostly so that we can understand what the overall environment looks like.
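If it helps for next time, something like this gives a rough picture of what is connected and as what (just a sketch):

SELECT application_name, client_addr, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, client_addr
ORDER BY connections DESC;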
Thanks in advance!
I have no idea what caused the first message, I'm afraid. I got rid of our scheduled running of the maintenance task a while ago, as TimescaleDB now takes care of that, so I assumed it had started a background job which got cancelled.
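(The scheduled task we used to run was just the documented maintenance call, roughly the following, before the built-in jobs took over:)

CALL prom_api.execute_maintenance();  -- roughly; I don't remember the exact arguments we passed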
As to the others, I guess it was likely grafana or alertmanager, though I don't think we have anything there which routinely hits promscale rather than hitting prometheus directly.
The setup is prometheus, feeding into promscale (which is still experimental as we haven't got it to run reliably yet) plus grafana, alertmanager and karma to provide a web interface to alerts.
It could maybe be the postgres collector hitting the database to collect statistics? I also ran some selects in a console session to try and see what it was doing, and had pg_top running at times for the same reason.
@tomhughes thanks again for the invaluable feedback.
> which is still experimental as we haven't got it to run reliably yet
The focus of our team over the next few months will be on improving reliability. Could you perhaps provide us with some insight into the types of problems that you've been having?
The principal problem we've been struggling with over the last year is getting autovacuum to keep up with the sheer number of tables and to analyse everything in time to prevent things freezing up due to transaction ID wraparound.
It has improved recently, to the extent that the time we can run before things collapse is now a matter of weeks rather than days. Most recently we were able to run from mid-April until last week, when we hit the wall again; that was what led to me trying to update to the latest promscale and extension to see if there were any further improvements.
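For what it's worth, the thing we end up watching is the usual transaction-age query, roughly:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;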
That said, upgrading from 0.10.0/0.3.2 to 0.11.0/0.5.0 seems to have its own issues, in that there has been a significant increase in the CPU used by the ingestor, or rather by postgres on the ingestor's behalf I think, although it's all a bit confusing as you can see here:
When I first started 0.11.0 successfully on the 20th it ran flat out for a few hours; I then reduced the number of autovacuum workers from 56 to 28 (it's a 28-core/56-thread machine) and that improved things for a few days, until it hit transaction wraparound issues again on the 22nd, at which point I stopped the ingestor while it caught up with autovacuum.
Since I restarted the ingestor on the 23rd it's just been going nuts, with the CPUs all flat out... The good news is that the I/O utilisation has been way down since that restart and it has been keeping the unfrozen IDs under control:
Whether that is good overall is unclear because I don't really understand why the behaviour changed so dramatically after that restart...
Thank you for these insights, we will take a look at them in detail and get back to you with further questions.
In the meantime, could you tell us what the volume of metrics is that you're currently ingesting?
The Promscale connector has log outputs like:
level=info ts=2022-05-25T08:32:06.103Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=500 metrics-max-sent-ts=2022-05-25T08:31:46.915Z
which can help to identify this.
Could you perhaps also provide the output of the following queries:
SELECT count(*) FROM timescaledb_information.chunks;
SELECT count(*)
FROM pg_class
JOIN pg_namespace n ON n.oid = pg_class.relnamespace
WHERE n.nspname <> 'pg_catalog'
AND n.nspname !~ '^pg_toast'
AND n.nspname <> 'information_schema';
Sure, here's a snapshot of the throughput:
May 25 08:52:28 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:28.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=44000 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:29 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:29.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=11500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:30 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:30.448Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=4500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:31 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:31.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=5500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:32 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:32.449Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=1000 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:33 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:33.445Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=3500 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:34 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:34.443Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=9000 metrics-max-sent-ts=2022-05-25T07:20:45.031Z
May 25 08:52:35 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:35.443Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=6500 metrics-max-sent-ts=2022-05-25T07:20:45.07Z
May 25 08:52:36 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:36.443Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=8500 metrics-max-sent-ts=2022-05-25T07:20:45.305Z
May 25 08:52:37 stormfly-03.openstreetmap.org promscale[1031073]: level=info ts=2022-05-25T08:52:37.442Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=6000 metrics-max-sent-ts=2022-05-25T07:20:45.305Z
and the first query:
promscale=# SELECT count(*) FROM timescaledb_information.chunks;
count
--------
392867
(1 row)
and the second:
promscale=# SELECT count(*)
promscale-# FROM pg_class
promscale-# JOIN pg_namespace n ON n.oid = pg_class.relnamespace
promscale-# WHERE n.nspname <> 'pg_catalog'
promscale-# AND n.nspname !~ '^pg_toast'
promscale-# AND n.nspname <> 'information_schema';
count
---------
1595381
(1 row)
This is incredibly valuable feedback! Thank you @tomhughes!
@tomhughes sorry to keep bothering you for more information. Could you provide us with some information about the server that you ran the upgrade on?
I'm primarily interested in understanding the performance characteristics of the hardware.
If you're using a cloud hosting provider (AWS/GCP/hetzner etc.) you can just mention the type of instance that you're using.
It's this machine, which is bare metal with 2 x 14-core Xeons, 192GB of RAM and 4 x 1TB SSDs in RAID5.
That wasn't necessarily intended as the final home for this, but part of what we were trying to figure out was what resources (mostly how much disk) we would need before we bought a machine for it and this was a machine we had available to trial it.