operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

[SPIKE] Building off of SRE-P practices #822

Closed: HumairAK closed this issue 3 years ago

HumairAK commented 3 years ago

There are a few areas where we can improve our practices and/or learn from SRE-P's experience and procedures. After our meeting we identified a few areas of interest:

tumido commented 3 years ago

Related: https://github.com/operator-framework/community-operators/issues/3557

durandom commented 3 years ago

cc @RiRa12621 JFYI

The alertmanager YAML is:

alert-manager-config

```yaml
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/NNNNNNN/alert/
route:
  receiver: "null"
  group_by:
  - job
  routes:
  - receiver: "null"
    group_by:
    - alertname
    - severity
    continue: true
    routes:
    - receiver: "null"
      match:
        alertname: KubeQuotaExceeded
    - receiver: "null"
      match:
        alertname: KubeQuotaFullyUsed
    - receiver: "null"
      match:
        alertname: CPUThrottlingHigh
    - receiver: "null"
      match:
        alertname: NodeFilesystemSpaceFillingUp
        severity: warning
    - receiver: "null"
      match:
        namespace: openshift-customer-monitoring
    - receiver: "null"
      match:
        namespace: openshift-operators
    - receiver: "null"
      match:
        exported_namespace: openshift-operators
    - receiver: "null"
      match:
        alertname: CustomResourceDetected
    - receiver: "null"
      match:
        alertname: ImagePruningDisabled
    - receiver: "null"
      match:
        severity: info
    - receiver: "null"
      match:
        severity: warning
      match_re:
        alertname: ^etcd.*
    - receiver: "null"
      match:
        alertname: PodDisruptionBudgetLimit
      match_re:
        namespace: ^redhat-.*
    - receiver: "null"
      match:
        alertname: PodDisruptionBudgetAtLimit
      match_re:
        namespace: ^redhat-.*
    - receiver: "null"
      match:
        alertname: TargetDown
      match_re:
        namespace: ^redhat-.*
    - receiver: "null"
      match:
        alertname: KubeJobFailed
        namespace: openshift-logging
      match_re:
        job_name: ^elasticsearch.*
    - receiver: "null"
      match:
        alertname: HAProxyReloadFail
        severity: critical
    - receiver: "null"
      match:
        alertname: PrometheusRuleFailures
    - receiver: "null"
      match:
        alertname: ClusterOperatorDegraded
        name: authentication
        reason: IdentityProviderConfig_Error
    - receiver: "null"
      match:
        alertname: ClusterOperatorDegraded
        name: authentication
        reason: OAuthServerConfigObservation_Error
    - receiver: "null"
      match:
        alertname: CannotRetrieveUpdates
    - receiver: "null"
      match:
        alertname: PrometheusNotIngestingSamples
        namespace: openshift-user-workload-monitoring
    - receiver: "null"
      match:
        alertname: PrometheusRemoteStorageFailures
        namespace: openshift-monitoring
    - receiver: "null"
      match:
        alertname: PrometheusRemoteWriteDesiredShards
        namespace: openshift-monitoring
    - receiver: "null"
      match:
        alertname: PrometheusRemoteWriteBehind
        namespace: openshift-monitoring
  - receiver: make-it-warning
    match:
      alertname: KubeAPILatencyHigh
      severity: critical
  - receiver: pagerduty
    match:
      prometheus: openshift-monitoring/k8s
    match_re:
      exported_namespace: ^kube.*|^openshift.*|^redhat-.*
  - receiver: pagerduty
    match:
      exported_namespace: ""
      prometheus: openshift-monitoring/k8s
    match_re:
      namespace: ^kube.*|^openshift.*|^redhat-.*
  - receiver: pagerduty
    match:
      job: fluentd
      prometheus: openshift-monitoring/k8s
  - receiver: pagerduty
    match:
      alertname: FluentdNodeDown
      prometheus: openshift-monitoring/k8s
  - receiver: pagerduty
    match:
      cluster: elasticsearch
      prometheus: openshift-monitoring/k8s
  - receiver: watchdog
    match:
      alertname: Watchdog
    repeat_interval: 5m
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
inhibit_rules:
- source_match:
    severity: critical
  target_match_re:
    severity: warning|info
  equal:
  - namespace
  - alertname
- source_match:
    severity: warning
  target_match_re:
    severity: info
  equal:
  - namespace
  - alertname
- source_match:
    alertname: ClusterOperatorDegraded
  target_match_re:
    alertname: ClusterOperatorDown
  equal:
  - namespace
  - name
- source_match:
    alertname: KubeNodeNotReady
  target_match_re:
    alertname: KubeNodeUnreachable
  equal:
  - node
  - instance
- source_match:
    alertname: KubeNodeUnreachable
  target_match_re:
    alertname: SDNPodNotReady|TargetDown
- source_match:
    alertname: KubeNodeNotReady
  target_match_re:
    alertname: KubeDaemonSetRolloutStuck|KubeDaemonSetMisScheduled|KubeDeploymentReplicasMismatch|KubeStatefulSetReplicasMismatch|KubePodNotReady
  equal:
  - instance
- source_match:
    alertname: KubeDeploymentReplicasMismatch
  target_match_re:
    alertname: KubePodNotReady|KubePodCrashLooping
  equal:
  - namespace
- source_match:
    alertname: ElasticsearchOperatorCSVNotSuccessful
  target_match_re:
    alertname: ElasticsearchClusterNotHealthy
  equal:
  - dummylabel
receivers:
- name: pagerduty
  pagerduty_configs:
  - send_resolved: true
    http_config: {}
    routing_key:
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonLabels.alertname }} {{ .CommonLabels.severity | toUpper }} ({{ len .Alerts }})'
    details:
      cluster_id: ......-.....-.....-......-......................
      component: '{{ .CommonLabels.alertname }}'
      console: https://console-openshift-console.apps.openshift-web.xxxxxxx.openshiftapps.com
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      group: '{{ .CommonLabels.alertname }}'
      link: '{{ if .CommonAnnotations.link }}{{ .CommonAnnotations.link }}{{ else }}https://github.com/openshift/ops-sop/tree/master/v4/alerts/{{ .CommonLabels.alertname }}.md{{ end }}'
      link2: '{{ if .CommonAnnotations.runbook }}{{ .CommonAnnotations.runbook }}{{ else }}{{ end }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower }}{{ else }}critical{{ end }}'
- name: make-it-warning
  pagerduty_configs:
  - send_resolved: true
    http_config: {}
    routing_key:
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonLabels.alertname }} {{ .CommonLabels.severity | toUpper }} ({{ len .Alerts }})'
    details:
      cluster_id: ......-.....-.....-......-......................
      component: '{{ .CommonLabels.alertname }}'
      console: https://console-openshift-console.apps.openshift-web.xxxxx.openshiftapps.com
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      group: '{{ .CommonLabels.alertname }}'
      link: '{{ if .CommonAnnotations.link }}{{ .CommonAnnotations.link }}{{ else }}https://github.com/openshift/ops-sop/tree/master/v4/alerts/{{ .CommonLabels.alertname }}.md{{ end }}'
      link2: '{{ if .CommonAnnotations.runbook }}{{ .CommonAnnotations.runbook }}{{ else }}{{ end }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      severity: warning
- name: watchdog
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: https://nosnch.in/NNNNNNNNN
- name: "null"
templates: []
```
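For anyone reading the routing tree above: Alertmanager walks the child routes in order, an alert stops at the first matching route unless that route sets `continue: true`, and `match` / `match_re` require exact equality / a regex match on the alert's labels. This can be sketched roughly as follows (an illustrative Python sketch of those semantics, not Alertmanager's actual Go implementation; the small `tree` literal is a hypothetical excerpt in the spirit of the config, not the full tree):

```python
import re

def route_matches(route, labels):
    """A route matches when every `match` label is equal and every
    `match_re` pattern matches the corresponding label value."""
    ok_eq = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    ok_re = all(re.match(p, labels.get(k, ""))
                for k, p in route.get("match_re", {}).items())
    return ok_eq and ok_re

def resolve(route, labels, out=None):
    """Collect the receivers an alert would be delivered to.

    First matching child wins; `continue: true` lets evaluation fall
    through to later siblings; if no child matches, the current node's
    receiver handles the alert."""
    out = [] if out is None else out
    matched_child = False
    for child in route.get("routes", []):
        if route_matches(child, labels):
            matched_child = True
            resolve(child, labels, out)
            if not child.get("continue", False):
                break
    if not matched_child:
        out.append(route["receiver"])
    return out

# Hypothetical mini routing tree in the spirit of the config above.
tree = {
    "receiver": "null",
    "routes": [
        {"receiver": "null", "match": {"alertname": "CPUThrottlingHigh"}},
        {"receiver": "pagerduty",
         "match_re": {"namespace": "^kube.*|^openshift.*|^redhat-.*"}},
    ],
}
```

With this tree, a `CPUThrottlingHigh` alert is silenced into the `"null"` receiver, while anything in an `openshift-*` namespace falls through to `pagerduty`.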
RiRa12621 commented 3 years ago

/cc @jeremyeder

HumairAK commented 3 years ago

There are other burn-rate issues within this org; @hemajv, can you link those?

jeremyeder commented 3 years ago

Forking the runbooks repo feels weird? I'm not sure why you'd do that. The template within that repo is from upstream, so maybe just take that? The rest of the bullets make sense to review and generalize.

RiRa12621 commented 3 years ago

From what I understood, the fork only serves the purpose of being able to submit PRs back.

HumairAK commented 3 years ago

@jeremyeder yes, I believe the suggestion was to fork the repo into this org, so that we can keep adding runbooks that we find useful and that may also benefit upstream, submit them back in the form of PRs, and rebase to consume any updates from upstream when they occur.
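The fork-and-rebase loop described above is plain git. The sketch below simulates it with two local repos standing in for GitHub remotes (the real `origin` would be the org's fork and `upstream` the runbooks repo; names and file contents are made up for illustration):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the upstream runbooks repo.
git init -q -b main upstream
cd upstream
echo "upstream runbook" > a.md
git add a.md
git -c user.email=ops@example.com -c user.name=ops commit -q -m "add a.md"
cd ..

# The org's fork: a clone that accumulates its own runbooks.
git clone -q upstream fork
cd fork
echo "org-specific runbook" > b.md
git add b.md
git -c user.email=ops@example.com -c user.name=ops commit -q -m "add b.md"
cd ..

# Upstream moves ahead in the meantime.
cd upstream
echo "new upstream runbook" > c.md
git add c.md
git -c user.email=ops@example.com -c user.name=ops commit -q -m "add c.md"
cd ..

# Consume upstream updates: fetch and rebase local work on top, which
# keeps the fork's own commits sitting cleanly above upstream, ready to
# be submitted back as PRs.
cd fork
git fetch -q origin
git -c user.email=ops@example.com -c user.name=ops rebase -q origin/main
```

After the rebase the fork contains all three runbooks, with the org's commit on top of the upstream history.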

Definitely open to other workflows.

tumido commented 3 years ago

FTR Prometheus operator upgrade PR is pending now: https://github.com/operator-framework/community-operators/pull/3599

HumairAK commented 3 years ago

Closing this as progress is being tracked in separate issues.