Open mvgijssel opened 1 year ago
For the Teleport health check we can use the Telegraf http_response input!
Set up a basic health dashboard for Teleport:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 333761,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "Fn0r6zw4z"
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 4,
"options": {
"alertInstanceLabelFilter": "",
"alertName": "",
"dashboardAlerts": false,
"groupBy": [],
"groupMode": "default",
"maxItems": 20,
"sortOrder": 1,
"stateFilter": {
"error": true,
"firing": true,
"inactive": true,
"noData": true,
"normal": true,
"pending": true
}
},
"title": "Alerts",
"type": "alertlist"
},
{
"datasource": {
"type": "prometheus",
"uid": "Fn0r6zw4z"
},
"fieldConfig": {
"defaults": {
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 1
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 2,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "auto"
},
"pluginVersion": "8.5.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "Fn0r6zw4z"
},
"editorMode": "builder",
"expr": "min(http_response_result_code{host=\"provisioner\", server=\"http://localhost:3000/healthz\"})",
"range": true,
"refId": "A"
}
],
"title": "Teleport Health Code",
"type": "stat"
}
],
"schemaVersion": 36,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-30m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "System Health",
"uid": "EmoPWGwVk",
"version": 4,
"weekStart": ""
}
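The "Teleport Health Code" stat panel above colors the minimum http_response_result_code with two absolute threshold steps (green as the base, red from 1). A minimal sketch of that lookup, assuming the panel picks the last step whose value is at or below the reading. The function name `threshold_color` is made up for illustration and is not a Grafana API:

```python
from typing import Optional

# Threshold steps copied from the "Teleport Health Code" panel.
# A value of None marks the base step, which matches any reading.
STEPS = [
    {"color": "green", "value": None},
    {"color": "red", "value": 1},
]


def threshold_color(reading: float, steps=STEPS) -> Optional[str]:
    """Return the color of the last step whose value is <= reading."""
    color = None
    for step in steps:
        if step["value"] is None or reading >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    for code in (0, 1, 3):
        print(code, threshold_color(code))
```

Telegraf reports result code 0 for a successful check, so any non-zero minimum turns the panel red.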
Asking in the InfluxDB Community why there is overlap in some of the metrics https://influxcommunity.slack.com/archives/CH99HUH8V/p1684753005874559
Also filed a support ticket with Logz.io https://support.logz.io/hc/en-us/requests/60657
From https://groups.google.com/g/prometheus-users/c/JcV51GNnXNM
The current staleness handling means that the time series will still be returned by instant vectors for 5 minutes. I'd suggest putting the run number as the value of a single timeseries.
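To make the suggestion in the quoted reply concrete, here is a small sketch that renders both shapes in the Prometheus text exposition format. The metric names `ci_workflow_run` and `ci_workflow_last_run_number` are hypothetical and are not emitted by any exporter mentioned in this thread:

```python
def per_run_series(run_numbers):
    """One series per run: every new run adds a new label value."""
    return [f'ci_workflow_run{{run="{n}"}} 1' for n in run_numbers]


def single_series(run_numbers):
    """One series in total: the latest run number is the sample value."""
    return [f"ci_workflow_last_run_number {max(run_numbers)}"]


if __name__ == "__main__":
    runs = [41, 42]
    print("\n".join(per_run_series(runs)))
    print("\n".join(single_series(runs)))
```

With the per-run labels, the series of an old run can still show up in instant queries until it goes stale. With the single series there is only ever one value to look at.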
GitHub actions metrics https://promhippie.github.io/github_exporter/#getting-started
New Relic seems to have a good offering as well and supports the New Relic Gate https://docs.newrelic.com/whats-new/2023/04/whats-new-04-20-github-integration/ to protect deployments.
Migration:
For tracking GitHub related metrics https://github.com/infinityworks/github-exporter/blob/master/exporter/metrics.go
First need to delete the existing New Relic accounts https://forum.newrelic.com/s/hubtopic/aAX8W0000015AkTWAU/how-to-delete-my-new-relic-user-and-associated-accounts.
Asking in https://github.com/promhippie/github_exporter/issues/213 how to interpret the data from the GitHub exporter to set up an alert in New Relic when a workflow fails.
Since you are mentioning promhippie/github_exporter just for actions metrics, this exporter can also provide various metrics generally for your GitHub orgs and repos :)
Created SLOs in New Relic for provisioner deployment and validation
Uninstalled microk8s in the provisioner and checking if this is picked up by New Relic and PagerDuty!
Works! Got a page from PagerDuty once the invalid provisioner state was detected
Trying to set up a sub-account for dev/test doesn't work, because New Relic is on the free tier.
Seems snap is broken at the moment https://status.snapcraft.io/ so unable to finish deploy and restore monitors 😅
Tuned SLAs to be (a lot more) lenient. Because the traffic for the SLIs is low, setting the target to 99% means there is a very small error budget: a handful of failed events is enough to miss the target. Trying with these new numbers.
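As a rough illustration of why a low-traffic SLI leaves almost no error budget: the event counts below are hypothetical, not the actual provisioner numbers, and `allowed_failures` is just a helper name for this sketch.

```python
import math


def allowed_failures(total_events: int, target_percent: float) -> int:
    """Whole number of events that may fail before the target is missed."""
    budget = total_events * (100 - target_percent) / 100
    return math.floor(budget)


if __name__ == "__main__":
    # Hypothetical traffic: 50 events in the SLO window.
    print(allowed_failures(50, 99))     # budget at a 99% target
    print(allowed_failures(50, 90))     # budget at a more lenient 90% target
    print(allowed_failures(10000, 99))  # same 99% target with much more traffic
```

At 50 events a 99% target allows no failures at all, so a single bad deployment already burns the whole budget.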
Maybe update the testinfra and pyinfra Teleport clients to use ssh directly and proxy through Teleport. Generating the config doesn’t work, but maybe can generate the connection string as well?
Can use SSH multiplexing if using SSH directly to speed up subsequent connections for both pyinfra and testinfra!
devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=auto' -F tmp/ssh_config ubuntu@provisioner.provisioner exit 0
real 0m0.396s
user 0m0.006s
sys 0m0.013s
devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=no' -F tmp/ssh_config ubuntu@provisioner.provisioner exit 0
real 0m0.734s
user 0m0.016s
sys 0m0.017s
devcontainer@d42ed4873fd2:/workspaces/setup$ time tsh ssh ubuntu@provisioner exit 0
real 0m0.950s
user 0m0.257s
sys 0m0.224s
So using the SSH client with multiplexing enabled is about 2.4x faster than using the tsh ssh command (0.396s vs 0.950s).
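A sketch of how the pyinfra/testinfra clients could build the plain ssh invocation with multiplexing switched on. The `ssh_command` and `speedup` helpers are hypothetical, and the ControlPath and ControlPersist values are assumptions (the timings above only pass ControlMaster, so the real values may live in tmp/ssh_config):

```python
def ssh_command(host, remote_command, ssh_config="tmp/ssh_config", multiplex=True):
    """Build the argv for an ssh call, optionally reusing a master connection."""
    args = ["ssh", "-F", ssh_config]
    if multiplex:
        args += [
            "-o", "ControlMaster=auto",
            # Assumed values, not taken from the thread:
            "-o", "ControlPath=~/.ssh/cm-%C",
            "-o", "ControlPersist=60s",
        ]
    else:
        args += ["-o", "ControlMaster=no"]
    return args + [host, remote_command]


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast timing is than the slow one."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    print(ssh_command("ubuntu@provisioner.provisioner", "exit 0"))
    # Wall-clock timings measured above: tsh ssh vs multiplexed ssh.
    print(round(speedup(0.950, 0.396), 1))
```

The returned list can be handed to subprocess.run as-is. The first call still pays for the full connection setup, only the subsequent ones reuse the master socket.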
Can use Paramiko to parse the ssh_config file from Teleport https://snyk.io/advisor/python/paramiko/functions/paramiko.SSHConfig
Unsure how to generate the SSH config file for the identity file though https://github.com/gravitational/teleport/issues/27659
Update the secrets macro to:
secrets({
    "FOO": "bar",
    "/tmp/secret": "filesecret",
    "./rel/secret": "relative file secret",
})
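The thread does not spell out how the macro should interpret its keys. One possible reading, and this is purely an assumption, is that keys that look like paths become file secrets and everything else becomes an environment variable. A Python sketch of that split, with `split_secrets` as a made-up helper name:

```python
def split_secrets(secrets):
    """Split a secrets mapping into (env_vars, files) by key shape.

    Assumption: keys starting with "/", "./" or "../" are file paths,
    all other keys are environment variable names.
    """
    env_vars, files = {}, {}
    for key, value in secrets.items():
        if key.startswith(("/", "./", "../")):
            files[key] = value
        else:
            env_vars[key] = value
    return env_vars, files


if __name__ == "__main__":
    env_vars, files = split_secrets({
        "FOO": "bar",
        "/tmp/secret": "filesecret",
        "./rel/secret": "relative file secret",
    })
    print(env_vars)
    print(files)
```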
Use a SaaS offering to notify when the provisioner is not behaving as expected. Try
Set up notifications with PagerDuty.
Interesting setup using Datadog synthetic tests in the CI https://www.datadoghq.com/blog/run-synthetic-tests-in-circeci-pipelines-with-datadog/
Try to use OpenTelemetry inside of pytest so it’s easy to switch vendors. Teleport has support for this https://goteleport.com/docs/management/diagnostics/tracing/.
This is a great list of vendors: https://github.com/magsther/awesome-opentelemetry#ui
TODO