timescale / promscale

[DEPRECATED] Promscale is a unified metric and trace observability backend for Prometheus, Jaeger and OpenTelemetry built on PostgreSQL and TimescaleDB.
https://www.timescale.com/promscale
Apache License 2.0

Promscale won't start up after PG13->PG14 upgrade of Timescale docker image #1150

Closed aabrodskiy closed 2 years ago

aabrodskiy commented 2 years ago

I upgraded our server from PG13 to PG14 and TimescaleDB from 2.5.0 to 2.5.1, using the docker image timescale/timescaledb-ha:pg14.1-ts2.5.1-latest. After the upgrade, I can't get the Promscale container (timescale/promscale:0.9.0) to start up. Previously, we ran Promscale 0.8 against PG13. It throws the following error, any clues?

Here is the error message:

caller=runner.go:74 config="{ListenAddr::9201 ThanosStoreAPIListenAddr: OTLPGRPCListenAddr:0.0.0.0:9202 PgmodelCfg:{CacheConfig:{SeriesCacheInitialSize:250000 seriesCacheMemoryMaxFlag:{kind:0 value:50} SeriesCacheMemoryMaxBytes:13151515443 MetricsCacheSize:10000 LabelsCacheSize:10000 ExemplarKeyPosCacheSize:10000} AppName:promscale@0.9.0 Host:localhost Port:5432 User:postgres Password:**** Database:timescale SslMode:require DbConnectionTimeout:1m0s IgnoreCompressedChunks:false AsyncAcks:false WriteConnectionsPerProc:1 MaxConnections:-1 UsesHA:false DbUri:**** EnableStatementsCache:true} LogCfg:{Level:info Format:logfmt} TracerCfg:{JaegerCollectorEndpoint: SamplingRatio:0} APICfg:{AllowedOrigin:^(?:.*)$ ReadOnly:false HighAvailability:false AdminAPIEnabled:false TelemetryPath:/metrics-text Auth:0xc00019f450 MultiTenancy:<nil> EnabledFeatureMap:map[tracing:{}] PromscaleEnabledFeatureList:[tracing] MaxQueryTimeout:2m0s SubQueryStepInterval:1m0s LookBackDelta:5m0s MaxSamples:50000000 MaxPointsPerTs:11000} LimitsCfg:{targetMemoryFlag:{kind:0 value:80} TargetMemoryBytes:26303030886} TenancyCfg:{SkipTenantValidation:false EnableMultiTenancy:false AllowNonMTWrites:false ValidTenantsStr:allow-all ValidTenantsList:[]} ConfigFile:config.yml DatasetConfig: TLSCertFile: TLSKeyFile: ThroughputInterval:1s Migrate:true StopAfterMigrate:false UseVersionLease:true InstallExtensions:true UpgradeExtensions:true UpgradePrereleaseExtensions:false StartupOnly:false}"
level=error ts=2022-02-14T15:50:31.914Z caller=runner.go:110 msg="aborting startup due to error" err="migration error: Error while trying to migrate DB: Error encountered during migration: error executing migration script: name idempotent/tracing-tags.sql, err ERROR: features in toolkit_experimental are unstable, and objects depending on them will be deleted on extension update (there will be a DROP SCHEMA toolkit_experimental CASCADE), which on Forge can happen at any time. (SQLSTATE P0001)"

Upgrade of timescale DB was done in a separate temporary container. Dockerfile:

FROM timescale/timescaledb-ha:pg14.1-ts2.5.1-latest
COPY entry.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/entry.sh"]

Upgrade script:

#!/bin/bash
echo ============================= Starting the upgrade procedure from PG13 to PG14.1
echo ============================= 1. Initializing the new data folder with PG14.
/usr/lib/postgresql/14/bin/pg_ctl init -D /var/lib/postgresql/data

echo ============================= 2. Starting briefly PG13 to upgrade timescaledb and toolkit extensions.
/usr/lib/postgresql/13/bin/pg_ctl start -D /var/lib/postgresql_old/data  
psql --user postgres -d db1 -qxc "ALTER EXTENSION timescaledb UPDATE;"
psql --user postgres -d db1 -qxc "ALTER EXTENSION timescaledb_toolkit UPDATE;"
psql --user postgres -d template1 -qxc "ALTER EXTENSION timescaledb UPDATE;"
/usr/lib/postgresql/13/bin/pg_ctl stop -D /var/lib/postgresql_old/data

echo ============================= 3. Performing actual in-place upgrade.
pg_upgrade --old-datadir /var/lib/postgresql_old/data \
        --new-datadir /var/lib/postgresql/data \
        --old-bindir /usr/lib/postgresql/13/bin \
        --new-bindir /usr/lib/postgresql/14/bin \
        -O "-c timescaledb.restoring='on'"

echo 4. Starting briefly PG14 to perform vacuum cleanup
/usr/lib/postgresql/14/bin/pg_ctl start -D /var/lib/postgresql/data 
/usr/lib/postgresql/14/bin/vacuumdb --all --analyze-in-stages
/usr/lib/postgresql/14/bin/pg_ctl stop -D /var/lib/postgresql/data

echo "============================= 5. Fixing allowed hosts in pg_hba.conf"
echo "host all all all md5" >> /var/lib/postgresql/data/pg_hba.conf

echo ============================== Upgrade is completed

cevian commented 2 years ago

@aabrodskiy Hi. Thank you for the detailed report. I have two follow-up questions: 1) Can you provide the docker logs of the database container from when the error occurs? 2) Do you use the timescaledb_toolkit extension for anything, as far as you know? If so, how do you use it?

Thanks, Mat

cevian commented 2 years ago

One more question: what is your search_path (e.g. run SHOW search_path; in SQL)?

aabrodskiy commented 2 years ago

Hi Mat, Thanks for looking into this! I've uploaded the logs from the database container here, there is quite a lot going on in there: timescale.txt

  1. We use unnest and lttb functions from toolkit_experimental heavily.
  2. Search path is the following:

    show search_path;
    search_path

    "$user", public, ps_tag, _prom_ext, prom_api, prom_metric, _prom_catalog, ps_trace

Thank you, Alex

cevian commented 2 years ago

Getting back to this now and trying to reproduce it again. Apologies for the delay.

cevian commented 2 years ago

Ok I believe I found the issue and it's in the toolkit extension. I am going to verify with the toolkit authors and then get back to you with a solution. Thanks for your patience.

cevian commented 2 years ago

@aabrodskiy Ok, the problem is with the disallow_experimental_deps event trigger in toolkit. It will be dropped in the next toolkit release, so I suggest running the following as a workaround right before starting Promscale 0.9.0.

ALTER EXTENSION timescaledb_toolkit DROP EVENT TRIGGER disallow_experimental_deps;
DROP EVENT TRIGGER IF EXISTS disallow_experimental_deps;
ALTER EXTENSION timescaledb_toolkit DROP EVENT TRIGGER disallow_experimental_dependencies_on_views;
DROP EVENT TRIGGER IF EXISTS disallow_experimental_dependencies_on_views;

cc @JLockerman

aabrodskiy commented 2 years ago

Awesome, thanks a lot Mat! That worked for our server and promscale is up and running again. I'll add these steps to our upgrade script for now, until it's fixed in the next version.
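
A minimal sketch of how the workaround above might be folded into the upgrade script, assuming it is called while the PG14 server is running (for example between the pg_ctl start and pg_ctl stop calls of step 4). The function names and the PROMSCALE_DB variable are hypothetical; the default of timescale is taken from the Database field in the Promscale config dump, so adjust it to whichever database Promscale actually connects to.

```shell
#!/bin/bash
# Hypothetical extra step for the upgrade script: detach and drop the two
# toolkit event triggers before Promscale 0.9.0 runs its migrations.

workaround_sql() {
    # Print the four statements from the workaround, two per trigger.
    local trigger
    for trigger in disallow_experimental_deps \
                   disallow_experimental_dependencies_on_views; do
        echo "ALTER EXTENSION timescaledb_toolkit DROP EVENT TRIGGER ${trigger};"
        echo "DROP EVENT TRIGGER IF EXISTS ${trigger};"
    done
}

apply_workaround() {
    # PROMSCALE_DB is an assumed variable, not something the original script defines.
    # psql keeps going after a failed statement by default, so a trigger that
    # is already gone does not block the remaining statements.
    workaround_sql | psql -U postgres -d "${PROMSCALE_DB:-timescale}" -q
}
```

Calling apply_workaround after the vacuumdb line of step 4 would apply the statements once per upgrade. Once a toolkit release without these triggers is installed, the ALTER EXTENSION statements will report errors and the call can be removed.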