Setup monitoring

mvgijssel commented 1 year ago

Use a SaaS offering to send a notification when the provisioner is not behaving as expected. Candidates to try:

datadog
new relic
honeycomb
https://gethelios.dev/pricing/
logz.io
aspecto
lightstep

Setup notifications with PagerDuty.

Interesting setup using Datadog synthetic tests in the CI https://www.datadoghq.com/blog/run-synthetic-tests-in-circeci-pipelines-with-datadog/

Try to use OpenTelemetry inside of pytest so it’s easy to switch vendors. Teleport also supports OpenTelemetry tracing: https://goteleport.com/docs/management/diagnostics/tracing/.

This is a great list of vendors: https://github.com/magsther/awesome-opentelemetry#ui

TODO

[x] Setup SaaS for monitoring, tracing and logs
[x] Setup provisioner system monitoring
[x] Setup teleport health check
[x] Setup teleport alert when health check fails
[x] Resolve alert automatically
[x] Send notification when alert fires
[x] Setup New Relic agent
[x] Use Renovate bazel modules
[x] Remove logz io telegraf setup
[x] Setup github exporter to track deployment metrics
[x] Connect github exporter to New Relic metrics
[x] Fix broken master
[x] Update renovate to also capture docker-compose.yml.j2
[x] Replace teleport connection test with deploy_test
[x] Use 1Password for deploy identity
[x] Add tests for Docker
[x] Add tests for new relic agent
[x] Add tests for new relic container
[x] Add test for teleport health
[x] Remove provisioner telegraf code
[x] Remove secrets from 1Password Vault
[x] Remove logz io account
[x] Use 1Password on GitHub actions instead of BuildBuddy secret
[x] Create BuildBuddy protobuf client
[x] Run //provisioner:validate against production
[x] Setup testinfra tracing
[x] Run //provisioner:validate on a schedule
[x] Setup alert when provisioner validation fails
[x] Setup alert when master branch fails
[x] Add workspace.bzlmod to renovatebot
[x] Setup schedule for renovatebot
[x] Setup deployment marker for New Relic
[x] Setup Teleport tracing in New Relic
[x] Setup resource constraints docker compose files
[x] track cpu temperature raspberry pi (https://github.com/lukasmalkmus/rpi_exporter)
[x] Remove buildbuddy grpc client
[x] Reply to BuildBuddy community about invocation api (https://buildbuddy.slack.com/archives/CUY16GNK1/p1686226616704259?thread_ts=1685995570.849599&cid=CUY16GNK1)
[x] Setup cronjob for regular provisioner reboots when necessary (stop this one during deployment?)
[x] Setup cronjob for regular docker system prune
[x] Forward all docker container logs to New Relic
[x] Forward all system logs from provisioner to New Relic
[ ] Add tests for New Relic log forwarding
[ ] Fix cron setup by fixing teleport pyinfra connector
[ ] Connect teleport service to teleport logs
[ ] Setup latency/success SLI's for Teleport service
[ ] Setup cpu temperature alert for Raspberry Pi
[ ] Update bootstrap doc
[ ] Ensure provisioner-validate ci timeout is higher
[ ] Ensure provisioner-validate has a test timeout per test so total test suite does not timeout and prevent metrics from submission
[ ] Deal with case when provisioner is offline, error when data/metrics don’t come in?
[ ] Teleport health issue not triggered to PagerDuty?

mvgijssel commented 1 year ago

For the Teleport health check we can use the Telegraf http_response input!
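To make the idea concrete, here is a minimal Python sketch of what such a probe does: request the health endpoint once and reduce the outcome to a numeric code where 0 means healthy. The function name `check_health` and the `opener` parameter are hypothetical, and the numeric codes are only loosely modelled on the Telegraf http_response input, so verify them against the Telegraf documentation before relying on them.

```python
import socket
import urllib.error
import urllib.request

# Assumed numbering, loosely modelled on Telegraf's http_response result_code
# field (0 = success). Check the Telegraf docs for the authoritative values.
SUCCESS = 0
CONNECTION_FAILED = 3
TIMEOUT = 4
STATUS_CODE_MISMATCH = 6

# socket.timeout only became an alias of TimeoutError in Python 3.10.
TIMEOUT_ERRORS = (TimeoutError, socket.timeout)


def check_health(url, expected_status=200, timeout=5.0,
                 opener=urllib.request.urlopen):
    """Probe `url` once and return a numeric result code (0 = healthy)."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code lives on the exception.
        status = error.code
    except urllib.error.URLError as error:
        # Connect-phase timeouts arrive wrapped inside URLError.
        if isinstance(error.reason, TIMEOUT_ERRORS):
            return TIMEOUT
        return CONNECTION_FAILED
    except TIMEOUT_ERRORS:
        return TIMEOUT
    except OSError:
        return CONNECTION_FAILED
    return SUCCESS if status == expected_status else STATUS_CODE_MISMATCH
```

Calling `check_health("http://localhost:3000/healthz")` on the provisioner would return 0 while Teleport is healthy. An alert on "value is 1 or higher" then covers every failure mode at once.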

mvgijssel commented 1 year ago

Setup basic health dashboard for Teleport

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 333761,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Fn0r6zw4z"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 4,
      "options": {
        "alertInstanceLabelFilter": "",
        "alertName": "",
        "dashboardAlerts": false,
        "groupBy": [],
        "groupMode": "default",
        "maxItems": 20,
        "sortOrder": 1,
        "stateFilter": {
          "error": true,
          "firing": true,
          "inactive": true,
          "noData": true,
          "normal": true,
          "pending": true
        }
      },
      "title": "Alerts",
      "type": "alertlist"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Fn0r6zw4z"
      },
      "fieldConfig": {
        "defaults": {
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 1
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "8.5.1",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "Fn0r6zw4z"
          },
          "editorMode": "builder",
          "expr": "min(http_response_result_code{host=\"provisioner\", server=\"http://localhost:3000/healthz\"})",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Teleport Health Code",
      "type": "stat"
    }
  ],
  "schemaVersion": 36,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-30m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "System Health",
  "uid": "EmoPWGwVk",
  "version": 4,
  "weekStart": ""
}

mvgijssel commented 1 year ago

Asking in the InfluxDB Community why there is overlap in some of the metrics https://influxcommunity.slack.com/archives/CH99HUH8V/p1684753005874559

mvgijssel commented 1 year ago

Also filed support ticket with Logz.io https://support.logz.io/hc/en-us/requests/60657

mvgijssel commented 1 year ago

From https://groups.google.com/g/prometheus-users/c/JcV51GNnXNM

The current staleness handling means that the time series will still be returned by instant vectors for 5 minutes. I'd suggest putting the run number as the value of a single timeseries.
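A small Python sketch of the difference, rendering lines in the Prometheus text exposition format. The metric names `ci_run_info` and `ci_last_run_number` are made up for illustration and are not taken from any exporter mentioned in this thread.

```python
def render_per_run_series(run_numbers):
    """One series per run: every old run stays visible until it goes stale."""
    return [f'ci_run_info{{run="{number}"}} 1' for number in run_numbers]


def render_single_series(run_numbers):
    """Suggested shape: a single series whose value is the latest run number."""
    return [f"ci_last_run_number {max(run_numbers)}"]
```

With runs 101 to 103 the first function emits three separate series, while the second emits only `ci_last_run_number 103`, so a query never sees leftovers from earlier runs.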

mvgijssel commented 1 year ago

GitHub actions metrics https://promhippie.github.io/github_exporter/#getting-started

mvgijssel commented 1 year ago

New Relic seems to have a good offering as well and supports the New Relic Gate https://docs.newrelic.com/whats-new/2023/04/whats-new-04-20-github-integration/ to protect deployments.

Migration:

Telegraf cpu/memory etc -> new relic agent
Telegraf http response -> Docker status/state infrastructure monitoring https://docs.newrelic.com/attribute-dictionary/?event=ContainerSample (or Prometheus black box exporter https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md)
OpenTelemetry tracing -> OpenTelemetry OTLP New Relic endpoint https://docs.newrelic.com/docs/more-integrations/open-source-telemetry-integrations/opentelemetry/get-started/opentelemetry-set-up-your-app/

mvgijssel commented 1 year ago

For tracking GitHub related metrics https://github.com/infinityworks/github-exporter/blob/master/exporter/metrics.go

mvgijssel commented 1 year ago

First need to delete the existing New Relic accounts https://forum.newrelic.com/s/hubtopic/aAX8W0000015AkTWAU/how-to-delete-my-new-relic-user-and-associated-accounts.

mvgijssel commented 1 year ago

Asking in https://github.com/promhippie/github_exporter/issues/213 how to interpret the data from the GitHub exporter to set up an alert in New Relic when a workflow fails.

tboerger commented 1 year ago

Since you are mentioning promhippie/github_exporter just for actions metrics, this exporter can also provide various metrics generally for your GitHub orgs and repos :)

mvgijssel commented 1 year ago

Created SLOs in New Relic for provisioner deployment and validation.

mvgijssel commented 1 year ago

Uninstalled microk8s in the provisioner and checking if this is picked up by New Relic and PagerDuty!

mvgijssel commented 1 year ago

Works! Got a page from PagerDuty once the invalid provisioner state was detected

mvgijssel commented 1 year ago

Trying to set up a sub-account for dev/test doesn't work, because New Relic is on the free tier.

mvgijssel commented 1 year ago

Seems snap is broken at the moment https://status.snapcraft.io/ so unable to finish deploy and restore monitors 😅

mvgijssel commented 1 year ago

Tuned the SLOs to be (a lot more) lenient. Because the traffic behind the SLIs is low, a 99% target leaves a very small error budget: with few events in the window, one or two failures can exhaust it. Trying with these new numbers.
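The arithmetic behind this, as a small Python sketch. The event counts are hypothetical and not the real traffic of the provisioner SLIs, and `allowed_failures` is an illustrative helper, not something New Relic provides.

```python
from fractions import Fraction
from math import floor


def allowed_failures(total_events, target):
    """Number of bad events an SLO tolerates over one window.

    `target` is passed as a string such as "0.99" so that it converts to an
    exact fraction instead of a rounded float.
    """
    return floor(total_events * (1 - Fraction(target)))
```

With 50 events in the window, `allowed_failures(50, "0.99")` is 0, so a single failed validation already breaches the SLO. Lowering the target to 90% gives `allowed_failures(50, "0.9")`, which is 5.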

mvgijssel commented 1 year ago

Maybe update the testinfra and pyinfra Teleport clients to use ssh directly and proxy through Teleport. Generating the config doesn’t work, but maybe the connection string can be generated instead?

mvgijssel commented 1 year ago

Can use SSH multiplexing if using SSH directly to speed up subsequent connections for both pyinfra and testinfra!

mvgijssel commented 1 year ago

devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=auto' -F tmp/ssh_config  ubuntu@provisioner.provisioner exit 0

real    0m0.396s
user    0m0.006s
sys     0m0.013s

devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=no' -F tmp/ssh_config  ubuntu@provisioner.provisioner exit 0

real    0m0.734s
user    0m0.016s
sys     0m0.017s

devcontainer@d42ed4873fd2:/workspaces/setup$ time tsh ssh ubuntu@provisioner exit 0

real    0m0.950s
user    0m0.257s
sys     0m0.224s

So using the SSH client with multiplexing enabled is roughly 2.4x faster than the tsh ssh command (0.396s vs 0.950s), and about 1.9x faster than plain SSH without multiplexing (0.734s).
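A minimal Python sketch of how a client could build such a multiplexed SSH invocation. The `-F tmp/ssh_config` path and the `ControlMaster=auto` option come from the commands above. The `ControlPath` and `ControlPersist` values, and the helper names `build_ssh_command` and `run_remote`, are assumptions for illustration; the config file used above may already set them.

```python
import subprocess


def build_ssh_command(host, remote_command, ssh_config="tmp/ssh_config",
                      control_path="~/.ssh/cm-%r@%h:%p", persist="60s"):
    """Build an ssh argv that reuses one master connection per host."""
    return [
        "ssh",
        "-F", ssh_config,
        # First call opens the master connection, later calls reuse it.
        "-o", "ControlMaster=auto",
        # Socket path for the shared connection (assumed location).
        "-o", f"ControlPath={control_path}",
        # Keep the master alive in the background after the last session.
        "-o", f"ControlPersist={persist}",
        host,
        *remote_command,
    ]


def run_remote(host, remote_command, **options):
    """Run a command over the shared connection and return its exit code."""
    command = build_ssh_command(host, remote_command, **options)
    return subprocess.run(command).returncode
```

For example, `run_remote("ubuntu@provisioner.provisioner", ["exit", "0"])` mirrors the timed command above. Only the first call pays the full connection setup cost.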

mvgijssel commented 1 year ago

Can use Paramiko to parse the ssh_config file from Teleport https://snyk.io/advisor/python/paramiko/functions/paramiko.SSHConfig

mvgijssel commented 1 year ago

Unsure how to generate the SSH config file for the identity file though https://github.com/gravitational/teleport/issues/27659

mvgijssel commented 1 year ago

Update secrets macro to

secrets({
  "FOO": "bar",
  "/tmp/secret": "filesecret",
  "./rel/secret": "relative file secret",
})
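One possible reading of this mapping, sketched in Python: keys that look like paths become files and every other key becomes an environment variable. This interpretation and the helper `split_secrets` are assumptions based only on the example values above, not the actual macro implementation.

```python
def split_secrets(secrets):
    """Split a secrets mapping into environment variables and file targets.

    Assumption: keys starting with "/" or "./" name files to write, and all
    other keys name environment variables.
    """
    env, files = {}, {}
    for key, value in secrets.items():
        if key.startswith(("/", "./")):
            files[key] = value
        else:
            env[key] = value
    return env, files
```

Applied to the example, `FOO` ends up in the environment mapping, while `/tmp/secret` and `./rel/secret` end up in the file mapping.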

vgijssel / setup

Setup monitoring #275
