stormshift / support

This repo should serve as a central source for reporting issues with stormshift

TargetDown alert #190

Open · rbo opened this issue 1 month ago

rbo commented 1 month ago

100% of the alertmanager-metrics/alertmanager-metrics targets in open-cluster-management-observability namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

/cc @DanielFroehlich
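A hedged first-pass triage sketch for a TargetDown alert like this (the commands are assumptions, not taken from the report): confirm the scraped service still has endpoints, that the backing pods are up, and which ServiceMonitor generates the target.

# service and endpoints the target points at
oc -n open-cluster-management-observability get svc alertmanager-metrics
oc -n open-cluster-management-observability get endpoints alertmanager-metrics
# backing pods
oc -n open-cluster-management-observability get pods -l app=multicluster-observability-alertmanager
# scrape definitions in the namespace (if ServiceMonitors are used here)
oc -n open-cluster-management-observability get servicemonitor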

rbo commented 1 month ago
$ oc describe -n open-cluster-management-observability  svc alertmanager-metrics
Name:              alertmanager-metrics
Namespace:         open-cluster-management-observability
Labels:            app=multicluster-observability-alertmanager-metrics
Annotations:       service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1700045229
                   service.beta.openshift.io/serving-cert-secret-name: alertmanager-tls-metrics
                   service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1700045229
Selector:          alertmanager=observability,app=multicluster-observability-alertmanager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.30.136.253
IPs:               172.30.136.253
Port:              metrics  9096/TCP
TargetPort:        metrics/TCP
Endpoints:         10.128.11.126:9096,10.130.12.5:9096,10.131.15.7:9096
Session Affinity:  None
Events:            <none>
$ oc get pods -l alertmanager=observability,app=multicluster-observability-alertmanager
NAME                           READY   STATUS    RESTARTS   AGE
observability-alertmanager-0   4/4     Running   0          16d
observability-alertmanager-1   4/4     Running   0          16d
observability-alertmanager-2   4/4     Running   0          16d
$ oc logs observability-alertmanager-0
Defaulted container "alertmanager" out of: alertmanager, config-reloader, alertmanager-proxy, kube-rbac-proxy
ts=2024-06-26T08:34:51.585Z caller=main.go:240 level=info msg="Starting Alertmanager" version="(version=0.25.0, branch=non-git, revision=non-git)"
ts=2024-06-26T08:34:51.585Z caller=main.go:241 level=info build_context="(go=go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime, platform=linux/amd64, user=root@cd720cdc1cd3, date=20240502-09:12:20, tags=netgo)"
ts=2024-06-26T08:34:51.630Z caller=cluster.go:261 level=warn component=cluster msg="failed to join cluster" err="3 errors occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:34:51.630Z caller=cluster.go:263 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2024-06-26T08:34:51.630Z caller=main.go:338 level=warn msg="unable to join gossip mesh" err="3 errors occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:34:51.630Z caller=cluster.go:681 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2024-06-26T08:34:51.678Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
ts=2024-06-26T08:34:51.678Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
ts=2024-06-26T08:34:51.682Z caller=tls_config.go:274 level=info msg="Listening on" address=127.0.0.1:9093
ts=2024-06-26T08:34:51.682Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9093
ts=2024-06-26T08:34:53.631Z caller=cluster.go:706 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000810406s
ts=2024-06-26T08:35:01.634Z caller=cluster.go:698 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.003464146s
ts=2024-06-26T08:35:06.648Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:06.651Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:06.654Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:21.659Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:21.663Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:21.667Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:36.648Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:36.651Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:51.653Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:51.661Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:36:06.649Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:36:21.649Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
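Two hedged observations on the log above: the "no such host" failures taper off over the first ~90 seconds of the dump (first all three peers fail to resolve, then two, then one), which looks like normal StatefulSet rolling-start noise rather than the cause of the alert; and the alertmanager container itself only listens on 127.0.0.1:9093, so the 9096 metrics port that Prometheus scrapes is presumably served by one of the sidecars. Their logs may show TLS or authorization failures (the container names come from the "Defaulted container" line above; the rest is a sketch):

oc -n open-cluster-management-observability logs observability-alertmanager-0 -c kube-rbac-proxy
oc -n open-cluster-management-observability logs observability-alertmanager-0 -c alertmanager-proxy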
rbo commented 1 month ago
$ oc get svc -n open-cluster-management-observability alertmanager-operated
NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
alertmanager-operated   ClusterIP   None         <none>        9094/TCP,9094/UDP   16d
$ oc describe svc -n open-cluster-management-observability alertmanager-operated
Name:              alertmanager-operated
Namespace:         open-cluster-management-observability
Labels:            <none>
Annotations:       <none>
Selector:          alertmanager=observability,app=multicluster-observability-alertmanager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Port:              tcp-mesh  9094/TCP
TargetPort:        9094/TCP
Endpoints:         10.128.11.126:9094,10.130.12.5:9094,10.131.15.7:9094
Port:              udp-mesh  9094/UDP
TargetPort:        9094/UDP
Endpoints:         10.128.11.126:9094,10.130.12.5:9094,10.131.15.7:9094
Session Affinity:  None
Events:            <none>
$ oc get pods -l alertmanager=observability,app=multicluster-observability-alertmanager
NAME                           READY   STATUS    RESTARTS   AGE
observability-alertmanager-0   4/4     Running   0          16d
observability-alertmanager-1   4/4     Running   0          16d
observability-alertmanager-2   4/4     Running   0          16d
$ 
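For completeness, a quick cross-check that the endpoint addresses in both services still match the current pod IPs (a sketch, not from the report; stale endpoints after pod restarts would be one explanation for unreachable targets):

oc -n open-cluster-management-observability get pods -l app=multicluster-observability-alertmanager -o wide
oc -n open-cluster-management-observability get endpoints alertmanager-metrics alertmanager-operated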
rbo commented 1 month ago

DNS looks good:

$ oc rsh observability-alertmanager-0
Defaulted container "alertmanager" out of: alertmanager, config-reloader, alertmanager-proxy, kube-rbac-proxy
sh-5.1$ getent hosts observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc
10.128.11.126   observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ getent hosts observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc
10.131.15.7     observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ getent hosts observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc
10.130.12.5     observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ getent hosts alertmanager-operated.open-cluster-management-observability.svc
10.130.12.5     alertmanager-operated.open-cluster-management-observability.svc.cluster.local
10.131.15.7     alertmanager-operated.open-cluster-management-observability.svc.cluster.local
10.128.11.126   alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ 
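Since name resolution works inside the pod, a next step could be to probe the scraped port itself from inside the cluster. Everything below is an assumption (the probe image, curl being available in it, and that whatever proxy serves 9096 answers 401/403 without a token); the point is only to confirm TCP/TLS reachability of the metrics endpoint:

# throwaway pod in the same namespace, probing the scraped service port
oc -n open-cluster-management-observability run target-probe --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi9/ubi -- \
  curl -vk https://alertmanager-metrics.open-cluster-management-observability.svc:9096/metrics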
DanielFroehlich commented 3 weeks ago

Restarting the pod (oc delete pod observability-alertmanager-0) does not help. Restarting the DNS pod on the node does not help either. Restarting all the pods in the namespace also does not help. Deleting the service (oc delete service alertmanager-operated) and letting it be re-created does not help either.

I get the feeling this is a bug. Looking at the service, it does not get a cluster IP assigned to it (compare to e.g. alertmanager-metrics, which targets the same pods). WDYT?
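One hedged note for context: alertmanager-operated is the headless peer-discovery service for the StatefulSet, so ClusterIP: None is expected there; the alert itself is about alertmanager-metrics, which does get a cluster IP (172.30.136.253 above). A possible next check would be whether a NetworkPolicy in the namespace blocks the scraping Prometheus from reaching port 9096 (commands are an assumption, not from the thread):

# "None" here is normal for a headless service backing a StatefulSet
oc -n open-cluster-management-observability get svc alertmanager-operated -o jsonpath='{.spec.clusterIP}{"\n"}'
# look for policies that could block the scrape traffic
oc -n open-cluster-management-observability get networkpolicy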