vgijssel / setup

Workstation and server setup
MIT License

Setup monitoring #275

Open mvgijssel opened 1 year ago

mvgijssel commented 1 year ago

Use a SaaS offering to notify when the provisioner is not behaving as expected. Try

Set up notifications with PagerDuty.

Interesting setup using Datadog synthetic tests in CI: https://www.datadoghq.com/blog/run-synthetic-tests-in-circeci-pipelines-with-datadog/

Try to use OpenTelemetry inside pytest so it’s easy to switch vendors. Teleport has support for this: https://goteleport.com/docs/management/diagnostics/tracing/

This is a great list of vendors: https://github.com/magsther/awesome-opentelemetry#ui

TODO

mvgijssel commented 1 year ago

For the Teleport health check, the Telegraf http_response input can be used!

mvgijssel commented 1 year ago

Set up a basic health dashboard for Teleport:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 333761,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Fn0r6zw4z"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 4,
      "options": {
        "alertInstanceLabelFilter": "",
        "alertName": "",
        "dashboardAlerts": false,
        "groupBy": [],
        "groupMode": "default",
        "maxItems": 20,
        "sortOrder": 1,
        "stateFilter": {
          "error": true,
          "firing": true,
          "inactive": true,
          "noData": true,
          "normal": true,
          "pending": true
        }
      },
      "title": "Alerts",
      "type": "alertlist"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Fn0r6zw4z"
      },
      "fieldConfig": {
        "defaults": {
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 1
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "8.5.1",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "Fn0r6zw4z"
          },
          "editorMode": "builder",
          "expr": "min(http_response_result_code{host=\"provisioner\", server=\"http://localhost:3000/healthz\"})",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Teleport Health Code",
      "type": "stat"
    }
  ],
  "schemaVersion": 36,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-30m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "System Health",
  "uid": "EmoPWGwVk",
  "version": 4,
  "weekStart": ""
}

mvgijssel commented 1 year ago

Asking in the InfluxDB Community why there is overlap in some of the metrics https://influxcommunity.slack.com/archives/CH99HUH8V/p1684753005874559

mvgijssel commented 1 year ago

Also filed a support ticket with Logz.io: https://support.logz.io/hc/en-us/requests/60657

mvgijssel commented 1 year ago

From https://groups.google.com/g/prometheus-users/c/JcV51GNnXNM

The current staleness handling means that the time series will still be returned by instant vectors for 5 minutes. I'd suggest putting the run number as the value of a single timeseries.
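
A minimal sketch of that suggestion, rendering samples in the Prometheus text exposition format. The metric names `ci_run_info` and `ci_last_run_number` and the `workflow` label are hypothetical, chosen only for illustration. The point is that the run number becomes the sample value, so each new run updates the same series instead of creating a new one that lingers for 5 minutes.

```python
def render_per_run_series(workflow: str, run_number: int) -> str:
    """Problematic shape: the run number is a label, so every run is a new series."""
    return f'ci_run_info{{workflow="{workflow}",run="{run_number}"}} 1\n'


def render_single_series(workflow: str, run_number: int) -> str:
    """Suggested shape: one series per workflow, with the run number as its value."""
    return f'ci_last_run_number{{workflow="{workflow}"}} {run_number}\n'


def series_identity(sample_line: str) -> str:
    """Return the metric name plus labels, i.e. everything before the value."""
    return sample_line.rsplit(" ", 1)[0]


if __name__ == "__main__":
    for run in (41, 42):
        print(render_per_run_series("deploy", run), end="")
        print(render_single_series("deploy", run), end="")
```

With the second shape, the series identity stays the same from run to run, so an instant vector never returns stale entries for older runs alongside the current one.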

mvgijssel commented 1 year ago

GitHub Actions metrics: https://promhippie.github.io/github_exporter/#getting-started

mvgijssel commented 1 year ago

New Relic seems to have a good offering as well, and it supports the New Relic Gate https://docs.newrelic.com/whats-new/2023/04/whats-new-04-20-github-integration/ to protect deployments.

Migration:

mvgijssel commented 1 year ago

For tracking GitHub related metrics https://github.com/infinityworks/github-exporter/blob/master/exporter/metrics.go

mvgijssel commented 1 year ago

First I need to delete the existing New Relic accounts: https://forum.newrelic.com/s/hubtopic/aAX8W0000015AkTWAU/how-to-delete-my-new-relic-user-and-associated-accounts

mvgijssel commented 1 year ago

Asking in https://github.com/promhippie/github_exporter/issues/213 how to interpret the data from the GitHub exporter in order to set up an alert in New Relic when a workflow fails.

tboerger commented 1 year ago

Since you are mentioning promhippie/github_exporter just for actions metrics, this exporter can also provide various metrics generally for your GitHub orgs and repos :)

mvgijssel commented 1 year ago

Created SLOs in New Relic for provisioner deployment and validation.

image

mvgijssel commented 1 year ago

Uninstalled microk8s in the provisioner and checking if this is picked up by New Relic and PagerDuty!

mvgijssel commented 1 year ago

Works! Got a page from PagerDuty once the invalid provisioner state was detected.

image

mvgijssel commented 1 year ago

Trying to set up a sub-account for dev/test doesn't work, because the New Relic account is on the free tier:

image

mvgijssel commented 1 year ago

Seems snap is broken at the moment https://status.snapcraft.io/ so unable to finish deploy and restore monitors 😅

mvgijssel commented 1 year ago

Tuned the SLOs to be (a lot more) lenient. Because the traffic for the SLIs is low, setting the target to 99% leaves a very small error budget. Trying with these new numbers:

image
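
To see why a 99% target is tight at low traffic, the error budget can be worked out directly. This is a rough sketch: the event count of 50 per window is an assumed number for illustration, not a figure from the actual SLIs, and the function names are hypothetical.

```python
def allowed_failures(total_events: int, slo_target: float) -> float:
    """Error budget expressed as the number of events that may fail in the window."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return total_events * (1.0 - slo_target)


def survives_one_failure(total_events: int, slo_target: float) -> bool:
    """True if a single failed event still fits inside the error budget."""
    return allowed_failures(total_events, slo_target) >= 1.0


if __name__ == "__main__":
    events = 50  # assumed traffic per SLO window, for illustration only
    for target in (0.99, 0.95, 0.90):
        budget = allowed_failures(events, target)
        print(f"target={target:.0%} budget={budget:.1f} failures "
              f"one_failure_ok={survives_one_failure(events, target)}")
```

With only 50 events in a window, a 99% target allows half a failure, so a single failed run already exhausts the budget. A more lenient target leaves room for at least one failure.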

mvgijssel commented 1 year ago

Maybe update the testinfra and pyinfra Teleport clients to use ssh directly and proxy through Teleport. Generating the config doesn’t work, but maybe the connection string can be generated as well?

mvgijssel commented 1 year ago

When using SSH directly, SSH multiplexing can speed up subsequent connections for both pyinfra and testinfra!
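
A minimal sketch of how a client could build such a multiplexed ssh invocation. `ControlMaster`, `ControlPath` and `ControlPersist` are standard OpenSSH options. The function name, the control socket path and the 60s persist time are assumptions for illustration, and the real ssh_config may already set some of these options.

```python
import shlex
from typing import List, Sequence


def multiplexed_ssh_command(
    host: str,
    remote_command: Sequence[str],
    config_file: str = "tmp/ssh_config",
    control_path: str = "~/.ssh/cm-%C",  # assumed socket location; %C is expanded by ssh
    persist: str = "60s",                # assumed lifetime of the idle master connection
) -> List[str]:
    """Build an ssh argv that reuses one master connection for later sessions."""
    return [
        "ssh",
        "-F", config_file,
        "-o", "ControlMaster=auto",
        "-o", f"ControlPath={control_path}",
        "-o", f"ControlPersist={persist}",
        host,
        *remote_command,
    ]


if __name__ == "__main__":
    argv = multiplexed_ssh_command("ubuntu@provisioner.provisioner", ["exit", "0"])
    # Print the command instead of running it; pass argv to subprocess.run to execute.
    print(shlex.join(argv))
```

The first invocation pays the full connection cost and leaves a master connection behind. Later invocations with the same `ControlPath` reuse it, which is where the speedup comes from.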

mvgijssel commented 1 year ago

devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=auto' -F tmp/ssh_config  ubuntu@provisioner.provisioner exit 0

real    0m0.396s
user    0m0.006s
sys     0m0.013s
devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=no' -F tmp/ssh_config  ubuntu@provisioner.provisioner exit 0

real    0m0.734s
user    0m0.016s
sys     0m0.017s
devcontainer@d42ed4873fd2:/workspaces/setup$ time tsh ssh ubuntu@provisioner exit 0

real    0m0.950s
user    0m0.257s
sys     0m0.224s

So using the SSH client with multiplexing enabled (0.396s) is about 2.4x faster than the tsh ssh command (0.950s), and about 1.9x faster than SSH without multiplexing (0.734s).

mvgijssel commented 1 year ago

Paramiko can be used to parse the ssh_config file from Teleport: https://snyk.io/advisor/python/paramiko/functions/paramiko.SSHConfig

mvgijssel commented 1 year ago

Unsure how to generate the SSH config file for the identity file though: https://github.com/gravitational/teleport/issues/27659

mvgijssel commented 1 year ago

Update the secrets macro to:

secrets({
  "FOO": "bar",
  "/tmp/secret": "filesecret",
  "./rel/secret": "relative file secret",
})
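
One possible reading of the keys above, sketched in Python. It is an assumption that keys starting with `/` or `./` are file paths to write the secret to and that any other key is an environment variable name. The function name and the returned labels are hypothetical and not part of the actual macro.

```python
from typing import Dict


def classify_secret_target(key: str) -> str:
    """Guess where a secret should go, based only on the shape of its key."""
    if key.startswith("/"):
        return "absolute_file"
    if key.startswith("./") or key.startswith("../"):
        return "relative_file"
    return "env"


def plan_secrets(secrets: Dict[str, str]) -> Dict[str, str]:
    """Map every key of a secrets dict to its assumed target kind."""
    return {key: classify_secret_target(key) for key in secrets}


if __name__ == "__main__":
    example = {
        "FOO": "bar",
        "/tmp/secret": "filesecret",
        "./rel/secret": "relative file secret",
    }
    for key, kind in plan_secrets(example).items():
        print(f"{key} -> {kind}")
```

Keeping the dispatch in one small function would let a single dict describe both environment and file secrets, which seems to be the intent of the proposed macro signature.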