kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites
https://git.k8s.io/community/sig-k8s-infra
Apache License 2.0

Migrate existing Google Cloud alerts from click-ops to git-ops model #1624

Open spiffxp opened 3 years ago

spiffxp commented 3 years ago

Discussed in k8s-infra meeting 2020-02-03

We have some Slack alerting set up today, but it was configured by humans clicking around in the Google Cloud console (aka "click-ops"). Ideally we would drive that configuration automatically from files checked into git (aka "git-ops").

This is likely similar to or overlaps with making a gitops-driven workflow for Google Cloud Monitoring dashboards (https://github.com/kubernetes/k8s.io/issues/1376)

/wg k8s-infra
/sig release
/area release-eng
FYI @kubernetes/release-engineering since #k8s-infra-alerts contains container image promoter alerts
/priority important-longterm

spiffxp commented 3 years ago

/help

k8s-ci-robot commented 3 years ago

@spiffxp: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

rikatz commented 3 years ago

I have low bandwidth right now, but if we have some time (not urgent) I can look into how to manage the alerts and dashboards with GitOps :)

rikatz commented 3 years ago

/assign

rikatz commented 3 years ago

So far:

My thoughts on this specific part: I really like the idea of using Crossplane (k8s objects) to manage our cloud environment, but I guess a lot of folks are already familiar with Terraform (although I agree with Justin that migrating between versions is sometimes... annoying...).

I will create some simple .tf files tomorrow with the same approach, trying to create notification channels and alert policies, and see how this is reflected in Stackdriver.

rikatz commented 3 years ago

@ameukam will work on this, using @thockin's tests that monitor certificate renewal and expiration as an example.

rikatz commented 3 years ago

https://github.com/kubernetes/k8s.io/pull/1877 <- Created a PR with a really simple Terraform config that adds an uptime check and the current alert policy.

We can improve on this, for example by adding latency/uptime alerting for cs.k8s.io and other sites.
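
For illustration only (this is not the content of the PR above, which uses Terraform): a minimal Python sketch of what an uptime alert policy could look like as a file checked into git. It builds a dict shaped like a Cloud Monitoring `AlertPolicy` and serializes it to JSON. The function name, the check ID, the alignment period, and the notification channel path are all hypothetical; the filter and aggregation values follow the pattern commonly used for the `uptime_check/check_passed` metric and should be verified against the project before use.

```python
import json


def uptime_alert_policy(display_name, check_id, channels):
    """Return an AlertPolicy-shaped dict that fires when an uptime check fails.

    All arguments are illustrative; `channels` is a list of notification
    channel resource names (hypothetical values in the usage below).
    """
    metric_filter = (
        'metric.type="monitoring.googleapis.com/uptime_check/check_passed" '
        'AND resource.type="uptime_url" '
        f'AND metric.label.check_id="{check_id}"'
    )
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"Uptime check failing: {check_id}",
                "conditionThreshold": {
                    "filter": metric_filter,
                    # Count the checker locations reporting a failed check.
                    "aggregations": [
                        {
                            "alignmentPeriod": "1200s",
                            "perSeriesAligner": "ALIGN_NEXT_OLDER",
                            "crossSeriesReducer": "REDUCE_COUNT_FALSE",
                            "groupByFields": ["resource.label.host"],
                        }
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 1,
                    "duration": "60s",
                    "trigger": {"count": 1},
                },
            }
        ],
        "notificationChannels": list(channels),
    }


if __name__ == "__main__":
    policy = uptime_alert_policy(
        "cs.k8s.io uptime",  # hypothetical display name
        "cs-k8s-io",  # hypothetical uptime check ID
        ["projects/example-project/notificationChannels/123"],  # hypothetical
    )
    # Sorted keys keep the file stable across runs, so git diffs stay small.
    print(json.dumps(policy, indent=2, sort_keys=True))
```

A file produced this way could be reviewed in a PR like any other config; how it then gets applied (Terraform, Crossplane, or a script) is the open question in this thread.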

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam commented 3 years ago

/remove-lifecycle stale

spiffxp commented 3 years ago

A good first step would be understanding how to export whatever alerts we have today as part of audit/audit-gcp.sh.
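
As a rough sketch of that export step (an assumption about how it might work, not what audit/audit-gcp.sh does today): given the JSON array that a listing command such as `gcloud alpha monitoring policies list --format=json` prints, the Python below writes one file per alert policy with sorted keys so the output diffs cleanly in git. The helper names `slugify` and `write_policies`, the output directory, and the file naming scheme are hypothetical.

```python
import json
import re
from pathlib import Path


def slugify(name):
    """Turn a policy display name into a filesystem-friendly file stem."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-") or "unnamed"


def write_policies(policies_json, out_dir):
    """Write each policy from a JSON array to <out_dir>/<slug>.json.

    Returns the list of paths written. Policies whose display names map to
    the same slug overwrite each other; a real audit script would need to
    disambiguate them, for example by using the policy's `name` field.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for policy in json.loads(policies_json):
        path = out / f"{slugify(policy.get('displayName', ''))}.json"
        # Sorted keys and a trailing newline keep repeated exports stable.
        path.write_text(json.dumps(policy, indent=2, sort_keys=True) + "\n")
        written.append(path)
    return written
```

In an audit run, the listing command's output would be piped or read into `write_policies`, and the resulting directory committed alongside the other audit files.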

spiffxp commented 3 years ago

https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform/modules/alerts is good prior art to start from.

spiffxp commented 3 years ago

/milestone v1.23

I think it would be really handy to use this at a bare minimum for uptime checks on the apps we run on aaa

ameukam commented 2 years ago

/milestone clear

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam commented 2 years ago

/remove-lifecycle stale
/lifecycle frozen
/milestone clear