timescale / tobs

tobs - The Observability Stack for Kubernetes. Easy install of a full observability stack into a k8s cluster with Helm charts.
Apache License 2.0

Add tests to check up status for all components #564

Closed nhudson closed 2 years ago

nhudson commented 2 years ago

What this PR does / why we need it

Adds a Helm test that checks whether all components of the stack are actually up and running.

nhudson commented 2 years ago

After a ton of debugging, it looks like kind sometimes doesn't allow scrapes to the following pods once everything is installed:

kube-proxy
kube-controller-manager
kube-scheduler
etcd

This causes the test to fail because the up metric is 0 for these 4 pods. The pods themselves look healthy, and as far as I can tell the cluster is running with no issues.
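To make the failure mode concrete, here is a minimal Go sketch of the kind of check described above. It assumes the test has already fetched the JSON body of a Prometheus instant query for `up` (the `/api/v1/query` response format) and only needs to list the targets that are not reporting 1. The names `queryResponse` and `downTargets` are hypothetical and are not taken from the tobs code base.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse mirrors only the subset of the Prometheus instant-query
// response (/api/v1/query) that this sketch needs. Hypothetical type name.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			// Value is [unix timestamp, sample value encoded as a string].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// downTargets returns "job/instance" for every series in an `up` query
// result whose sample value is not "1". Hypothetical helper.
func downTargets(body []byte) ([]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query status %q", resp.Status)
	}
	var down []string
	for _, r := range resp.Data.Result {
		// A missing or non-string sample is treated as "not up".
		v, _ := r.Value[1].(string)
		if v != "1" {
			down = append(down, r.Metric["job"]+"/"+r.Metric["instance"])
		}
	}
	return down, nil
}
```

With a check shaped like this, a healthy pod that kind does not allow Prometheus to scrape still shows up in the returned list, which is consistent with the failures reported for the four control-plane components.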

nhudson commented 2 years ago

A possible short-term solution might be to just disable the default rules and scrape targets for these endpoints:

kube-prometheus-stack:
  defaultRules:
    rules:
      etcd: false
      kubeControllerManager: false
      kubeProxy: false
      kubeScheduler: false
  kubeControllerManager:
    enabled: false
  kubeProxy:
    enabled: false
  kubeScheduler:
    enabled: false
  kubeEtcd:
    enabled: false

nhudson commented 2 years ago

Also, since Promscale takes some time to come online, the up metric has not been updated by the time the test runs. This causes the test to fail as well.

paulfantom commented 2 years ago

~Let's disable scrape configs and rules for endpoints listed in https://github.com/timescale/tobs/pull/564#issuecomment-1231725609.~

~As for Promscale, I was under the impression that Helm tests are run after the release is deployed. In our case we wait until Promscale is in the Ready state. This means the up metric should have a value of 1 in the next scrape after Promscale is Ready. For now, let's just put a sleep 30 before running the query. In the future, I think we should move all our tests into some golang application and run those queries with a backoff mechanism to reduce test flakiness.~

nvm, I see you already did all of that :smile: