operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

[SPIKE] Building off of SRE-P practices #822

Closed: HumairAK closed this issue 3 years ago

HumairAK commented 3 years ago

There are a few areas where we can improve our practices and/or learn from SRE-P's experience and procedures. After our meeting we identified a few areas of interest:

tumido commented 3 years ago

Related: https://github.com/operator-framework/community-operators/issues/3557

durandom commented 3 years ago

cc @RiRa12621 JFYI

The alertmanager YAML is:

alert-manager-config

```yaml
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/NNNNNNN/alert/
route:
  receiver: "null"
  group_by:
  - job
  routes:
  - receiver: "null"
    group_by:
    - alertname
    - severity
    continue: true
    routes:
    - receiver: "null"
      match:
        alertname: KubeQuotaExceeded
    - receiver: "null"
      match:
        alertname: KubeQuotaFullyUsed
    - receiver: "null"
      match:
        alertname: CPUThrottlingHigh
    - receiver: "null"
      match:
        alertname: NodeFilesystemSpaceFillingUp
        severity: warning
    - receiver: "null"
      match:
        namespace: openshift-customer-monitoring
    - receiver: "null"
      match:
        namespace: openshift-operators
    - receiver: "null"
      match:
        exported_namespace: openshift-operators
    - receiver: "null"
      match:
        alertname: CustomResourceDetected
    - receiver: "null"
      match:
        alertname: ImagePruningDisabled
    - receiver: "null"
      match:
        severity: info
    - receiver: "null"
      match:
        severity: warning
      match_re:
        alertname: ^etcd.*
    - receiver: "null"
      match:
        alertname: PodDisruptionBudgetLimit
      match_re:
        namespace: ^redhat-.*
    - receiver: "null"
      match:
        alertname: PodDisruptionBudgetAtLimit
      match_re:
        namespace: ^redhat-.*
    - receiver: "null"
      match:
        alertname: TargetDown
      match_re:
        namespace: ^redhat-.*
    - receiver: "null"
      match:
        alertname: KubeJobFailed
        namespace: openshift-logging
      match_re:
        job_name: ^elasticsearch.*
    - receiver: "null"
      match:
        alertname: HAProxyReloadFail
        severity: critical
    - receiver: "null"
      match:
        alertname: PrometheusRuleFailures
    - receiver: "null"
      match:
        alertname: ClusterOperatorDegraded
        name: authentication
        reason: IdentityProviderConfig_Error
    - receiver: "null"
      match:
        alertname: ClusterOperatorDegraded
        name: authentication
        reason: OAuthServerConfigObservation_Error
    - receiver: "null"
      match:
        alertname: CannotRetrieveUpdates
    - receiver: "null"
      match:
        alertname: PrometheusNotIngestingSamples
        namespace: openshift-user-workload-monitoring
    - receiver: "null"
      match:
        alertname: PrometheusRemoteStorageFailures
        namespace: openshift-monitoring
    - receiver: "null"
      match:
        alertname: PrometheusRemoteWriteDesiredShards
        namespace: openshift-monitoring
    - receiver: "null"
      match:
        alertname: PrometheusRemoteWriteBehind
        namespace: openshift-monitoring
  - receiver: make-it-warning
    match:
      alertname: KubeAPILatencyHigh
      severity: critical
  - receiver: pagerduty
    match:
      prometheus: openshift-monitoring/k8s
    match_re:
      exported_namespace: ^kube.*|^openshift.*|^redhat-.*
  - receiver: pagerduty
    match:
      exported_namespace: ""
      prometheus: openshift-monitoring/k8s
    match_re:
      namespace: ^kube.*|^openshift.*|^redhat-.*
  - receiver: pagerduty
    match:
      job: fluentd
      prometheus: openshift-monitoring/k8s
  - receiver: pagerduty
    match:
      alertname: FluentdNodeDown
      prometheus: openshift-monitoring/k8s
  - receiver: pagerduty
    match:
      cluster: elasticsearch
      prometheus: openshift-monitoring/k8s
  - receiver: watchdog
    match:
      alertname: Watchdog
    repeat_interval: 5m
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
inhibit_rules:
- source_match:
    severity: critical
  target_match_re:
    severity: warning|info
  equal:
  - namespace
  - alertname
- source_match:
    severity: warning
  target_match_re:
    severity: info
  equal:
  - namespace
  - alertname
- source_match:
    alertname: ClusterOperatorDegraded
  target_match_re:
    alertname: ClusterOperatorDown
  equal:
  - namespace
  - name
- source_match:
    alertname: KubeNodeNotReady
  target_match_re:
    alertname: KubeNodeUnreachable
  equal:
  - node
  - instance
- source_match:
    alertname: KubeNodeUnreachable
  target_match_re:
    alertname: SDNPodNotReady|TargetDown
- source_match:
    alertname: KubeNodeNotReady
  target_match_re:
    alertname: KubeDaemonSetRolloutStuck|KubeDaemonSetMisScheduled|KubeDeploymentReplicasMismatch|KubeStatefulSetReplicasMismatch|KubePodNotReady
  equal:
  - instance
- source_match:
    alertname: KubeDeploymentReplicasMismatch
  target_match_re:
    alertname: KubePodNotReady|KubePodCrashLooping
  equal:
  - namespace
- source_match:
    alertname: ElasticsearchOperatorCSVNotSuccessful
  target_match_re:
    alertname: ElasticsearchClusterNotHealthy
  equal:
  - dummylabel
receivers:
- name: pagerduty
  pagerduty_configs:
  - send_resolved: true
    http_config: {}
    routing_key:
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonLabels.alertname }} {{ .CommonLabels.severity | toUpper }} ({{ len .Alerts }})'
    details:
      cluster_id: ......-.....-.....-......-......................
      component: '{{ .CommonLabels.alertname }}'
      console: https://console-openshift-console.apps.openshift-web.xxxxxxx.openshiftapps.com
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      group: '{{ .CommonLabels.alertname }}'
      link: '{{ if .CommonAnnotations.link }}{{ .CommonAnnotations.link }}{{ else }}https://github.com/openshift/ops-sop/tree/master/v4/alerts/{{ .CommonLabels.alertname }}.md{{ end }}'
      link2: '{{ if .CommonAnnotations.runbook }}{{ .CommonAnnotations.runbook }}{{ else }}{{ end }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower }}{{ else }}critical{{ end }}'
- name: make-it-warning
  pagerduty_configs:
  - send_resolved: true
    http_config: {}
    routing_key:
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonLabels.alertname }} {{ .CommonLabels.severity | toUpper }} ({{ len .Alerts }})'
    details:
      cluster_id: ......-.....-.....-......-......................
      component: '{{ .CommonLabels.alertname }}'
      console: https://console-openshift-console.apps.openshift-web.xxxxx.openshiftapps.com
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      group: '{{ .CommonLabels.alertname }}'
      link: '{{ if .CommonAnnotations.link }}{{ .CommonAnnotations.link }}{{ else }}https://github.com/openshift/ops-sop/tree/master/v4/alerts/{{ .CommonLabels.alertname }}.md{{ end }}'
      link2: '{{ if .CommonAnnotations.runbook }}{{ .CommonAnnotations.runbook }}{{ else }}{{ end }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      severity: warning
- name: watchdog
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: https://nosnch.in/NNNNNNNNN
- name: "null"
templates: []
```
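For anyone reading the routing tree above: Alertmanager walks the child routes in order, an alert stops at the first matching route unless that route sets `continue: true`, and `match` / `match_re` require exact equality / a regex match on the alert's labels. This can be sketched roughly as follows (an illustrative Python sketch of those semantics, not Alertmanager's actual Go implementation; the small `tree` literal is a hypothetical excerpt in the spirit of the config, not the full tree):

```python
import re

def route_matches(route, labels):
    """A route matches when every `match` label is equal and every
    `match_re` pattern matches the corresponding label value."""
    ok_eq = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    ok_re = all(re.match(p, labels.get(k, ""))
                for k, p in route.get("match_re", {}).items())
    return ok_eq and ok_re

def resolve(route, labels, out=None):
    """Collect the receivers an alert would be delivered to.

    First matching child wins; `continue: true` lets evaluation fall
    through to later siblings; if no child matches, the current node's
    receiver handles the alert."""
    out = [] if out is None else out
    matched_child = False
    for child in route.get("routes", []):
        if route_matches(child, labels):
            matched_child = True
            resolve(child, labels, out)
            if not child.get("continue", False):
                break
    if not matched_child:
        out.append(route["receiver"])
    return out

# Hypothetical mini routing tree in the spirit of the config above.
tree = {
    "receiver": "null",
    "routes": [
        {"receiver": "null", "match": {"alertname": "CPUThrottlingHigh"}},
        {"receiver": "pagerduty",
         "match_re": {"namespace": "^kube.*|^openshift.*|^redhat-.*"}},
    ],
}
```

With this tree, a `CPUThrottlingHigh` alert is silenced into the `"null"` receiver, while anything in an `openshift-*` namespace falls through to `pagerduty`.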
RiRa12621 commented 3 years ago

/cc @jeremyeder

HumairAK commented 3 years ago

There are other burn-rate issues within this org; @hemajv, can you link those?

jeremyeder commented 3 years ago

Forking the runbooks repo feels weird? I'm not sure why you'd do that. The template within that repo is from upstream, so maybe just take that? The rest of the bullets make sense to review and generalize.

RiRa12621 commented 3 years ago

From what I understood, the fork only serves the purpose of being able to submit PRs back.

HumairAK commented 3 years ago

@jeremyeder yes, I believe the suggestion was to fork the repo into this org, so that we can keep adding runbooks that we find useful and that may also benefit upstream, submit them back in the form of PRs, and rebase to consume any updates from upstream when they occur.
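The fork-and-rebase loop described above is plain git. The sketch below simulates it with two local repos standing in for GitHub remotes (the real `origin` would be the org's fork and `upstream` the runbooks repo; names and file contents are made up for illustration):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the upstream runbooks repo.
git init -q -b main upstream
cd upstream
echo "upstream runbook" > a.md
git add a.md
git -c user.email=ops@example.com -c user.name=ops commit -q -m "add a.md"
cd ..

# The org's fork: a clone that accumulates its own runbooks.
git clone -q upstream fork
cd fork
echo "org-specific runbook" > b.md
git add b.md
git -c user.email=ops@example.com -c user.name=ops commit -q -m "add b.md"
cd ..

# Upstream moves ahead in the meantime.
cd upstream
echo "new upstream runbook" > c.md
git add c.md
git -c user.email=ops@example.com -c user.name=ops commit -q -m "add c.md"
cd ..

# Consume upstream updates: fetch and rebase local work on top, which
# keeps the fork's own commits sitting cleanly above upstream, ready to
# be submitted back as PRs.
cd fork
git fetch -q origin
git -c user.email=ops@example.com -c user.name=ops rebase -q origin/main
```

After the rebase the fork contains all three runbooks, with the org's commit on top of the upstream history.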

Definitely open to other workflows.

tumido commented 3 years ago

FTR Prometheus operator upgrade PR is pending now: https://github.com/operator-framework/community-operators/pull/3599

HumairAK commented 3 years ago

Closing this as progress is being tracked in separate issues.