Adds monitoring and routing bitte-cell Tempo chart integration as well as a number of fixes and improvements.
See the companion bitte-cells PR for the Tempo job: https://github.com/input-output-hk/bitte-cells/pull/38
Options changes:
services.traefik.acmeDnsCertMgr now defaults to false, so the ACME HTTP-01 challenge is the default rather than DNS-01.
services.traefik.useVaultBackend now defaults to false, as the majority of clusters have migrated to git TF state.
services.traefik.enableTracing is a new option. It is true by default, as set in the routing profile, and sends traces to Tempo once a bitte-cell tempo nomadChart is running.
services.monitoring.useTempo is a new option. It is true by default, as set in the monitoring module, and integrates grafana with tempo via datasource config and scrape config.
services.victoria.metrics.maxLabelsPerTimeseries is a new option. It is 30 by default, as set in the victoriametrics module, and is intended for clusters whose Consul services carry a large number of tags, which would otherwise hit the victoriametrics label ingestion limit per time series with no way to customize it.
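For clusters that need to override the new defaults, the option changes above could be expressed as a module fragment like the following. This is a minimal sketch only: the option paths are copied verbatim from the list above, the value 60 is purely illustrative, and where the fragment lives in a given cluster repository is an assumption rather than something this PR prescribes.

```nix
# Sketch of a NixOS-style module fragment (placement in the repo is assumed).
{
  # Opt back in to DNS-01 certificate management; the default is now false,
  # which selects the ACME HTTP-01 challenge.
  services.traefik.acmeDnsCertMgr = true;

  # Keep the Vault backend on a cluster that has not yet migrated
  # to git TF state; the default is now false.
  services.traefik.useVaultBackend = true;

  # Raise the per-time-series label limit above the default of 30 for
  # Consul services with many tags. 60 is an illustrative value only.
  services.victoria.metrics.maxLabelsPerTimeseries = 60;
}
```

Clusters that are happy with the new defaults do not need any of these lines.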
Package updates:
Consul has been bumped from 1.11.2 to the latest version, 1.13.1, and the consul connect idle_timeout parameter patch has been modified for this new version.
Improvements:
Consul deprecated config has been updated
Refactored the now very large monitoring profile into a proper monitoring module and a much smaller monitoring profile
Monitoring now uses caddy instead of nginx as its reverse proxy. Caddy supports simple HTTP dynamic SRV backend pools, which the grafana integration with tempo requires, and this avoids adding more dependencies on routing as a potential central point of failure.
Added bitte CI tests via flake check, which auto-commit to cache.iog.io on a post hook
Fixes:
Fixed: slow traefik systemd restarts due to consul connect cert timeout (requires new consul version to be deployed)
Fixed: IMDSv2 metadata hop limit causing failures and timeouts in containers on EC2 utilizing AWS metadata
Legacy clean up:
Trim deprecated docker and hydra code
Migrate hydra vault role to cache vault role (I believe this only affected ci-world)
Migration:
In the world hydration profile, an s3Tempo = $BUCKET_NAME declaration needs to be added under the cluster scope
The declared $BUCKET_NAME bucket will need to be created manually as a private bucket in the default region for the cluster, and Tempo will use this bucket exclusively
Since the consul version has been bumped to 1.13.1, a metal rollout across the cluster will need to be performed to upgrade consul, using consul deployment best practices (see the hashicorp upgrade docs)
A TF plan/apply for the hydrate-cluster workspace is needed for a tempo vault and consul policy set
A TF plan/apply for the clients workspace is needed for the IMDSv2 hop fix
A TF plan/apply for the hydrate-monitoring workspace is needed if your world cluster adds new Tempo dashboards and alerts from bitte-cells.
A metal deploy to core-1, routing and monitoring is required for the updated systemd services, although deployment to these machines will already happen during the rollout of consul 1.13.1 above.
Until a Tempo job is deployed from bitte cells, expect to see some errors being logged in grafana on monitoring and traefik on routing services.
If you plan to delay deployment of a Tempo job, you can set the services.traefik.enableTracing and services.monitoring.useTempo module options to false to disable those errors on the next monitoring and routing deployment.
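The two Nix-level changes in the migration steps above could look roughly like this. It is a hedged sketch, not the exact layout of any particular world repository: the bucket name `example-cluster-tempo` and the binding names `hydrationFragment` and `delayTempoFragment` are hypothetical, and the two fragments are wrapped in a `let` only so the snippet evaluates on its own. In practice each fragment would be merged into the relevant existing file.

```nix
let
  # Fragment for the world hydration profile: declare the Tempo bucket
  # under the cluster scope. The bucket itself must already exist as a
  # private bucket in the cluster's default region.
  hydrationFragment = {
    cluster = {
      s3Tempo = "example-cluster-tempo"; # hypothetical bucket name
    };
  };

  # Optional fragment for the routing and monitoring machines: only needed
  # if deployment of the Tempo job is being delayed, to silence the
  # traefik and grafana errors until Tempo is running.
  delayTempoFragment = {
    services.traefik.enableTracing = false;
    services.monitoring.useTempo = false;
  };
in {
  inherit hydrationFragment delayTempoFragment;
}
```

Once a Tempo job is deployed from bitte cells, the second fragment can be dropped so that both options return to their default of true.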
Important: When following these migration steps, ensure your bitte commit pin is https://github.com/input-output-hk/bitte/commit/3d8c3d7cd6a743d527adf1479ae70032c855858b or newer to get all updated fixes.