rancher / opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0
337 stars · 53 forks

Panic in Alertmanager API #1719

Open dbason opened 1 year ago

dbason commented 1 year ago
2023/09/19 02:46:43 http: panic serving 10.42.53.174:45502: write tcp 10.42.53.185:9093->10.42.53.174:45502: write: broken pipe
goroutine 60488 [running]:
net/http.(*conn).serve.func1()
    net/http/server.go:1868 +0xb9
panic({0x4584640?, 0xc0046e3e50?})
    runtime/panic.go:920 +0x270
github.com/prometheus/alertmanager/api/v2/restapi/operations/alertgroup.(*GetAlertGroupsOK).WriteResponse(0xc003486270, {0x63ecd00, 0xc0029c2000}, {0x63d0f20, 0x5c645b0})
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/restapi/operations/alertgroup/get_alert_groups_responses.go:74 +0xc5
github.com/go-openapi/runtime/middleware.(*Context).Respond(0xc001957b30, {0x63ecd00?, 0xc0029c2000}, 0xc002984e00, {0xc001f7fd70?, 0x1, 0x1}, 0xc002984d00, {0x4504760, 0xc003486270})
    github.com/go-openapi/runtime@v0.25.0/middleware/context.go:523 +0x6f6
github.com/prometheus/alertmanager/api/v2/restapi/operations/alertgroup.(*GetAlertGroups).ServeHTTP(0xc001914e88, {0x63ecd00, 0xc0029c2000}, 0xc002984e00)
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/restapi/operations/alertgroup/get_alert_groups.go:68 +0x286
github.com/go-openapi/runtime/middleware.(*Context).RoutesHandler.NewOperationExecutor.func1({0x63ecd00, 0xc0029c2000}, 0xc002984e00)
    github.com/go-openapi/runtime@v0.25.0/middleware/operation.go:28 +0x53
net/http.HandlerFunc.ServeHTTP(0x15?, {0x63ecd00?, 0xc0029c2000?}, 0x4ececf?)
    net/http/server.go:2136 +0x29
github.com/go-openapi/runtime/middleware.NewRouter.func1({0x63ecd00, 0xc0029c2000}, 0xc002984c00)
    github.com/go-openapi/runtime@v0.25.0/middleware/router.go:78 +0x257
net/http.HandlerFunc.ServeHTTP(0x7f87a5667f18?, {0x63ecd00?, 0xc0029c2000?}, 0x6786f53877ffefe4?)
    net/http/server.go:2136 +0x29
github.com/go-openapi/runtime/middleware.Spec.func1({0x63ecd00, 0xc0029c2000}, 0xc001116240?)
    github.com/go-openapi/runtime@v0.25.0/middleware/spec.go:46 +0x182
net/http.HandlerFunc.ServeHTTP(0xc003e3f938?, {0x63ecd00?, 0xc0029c2000?}, 0x4c3ae5f?)
    net/http/server.go:2136 +0x29
github.com/prometheus/alertmanager/api/v2.NewAPI.setResponseHeaders.func2({0x63ecd00, 0xc0029c2000}, 0xc00471f688?)
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/api.go:147 +0x11e
net/http.HandlerFunc.ServeHTTP(0xc0012b7680?, {0x63ecd00?, 0xc0029c2000?}, 0xc002984c00?)
    net/http/server.go:2136 +0x29
github.com/rs/cors.(*Cors).Handler-fm.(*Cors).Handler.func1({0x63ecd00, 0xc0029c2000}, 0xc002984c00)
    github.com/rs/cors@v1.9.0/cors.go:236 +0x184
net/http.HandlerFunc.ServeHTTP(0xc000166940?, {0x63ecd00?, 0xc0029c2000?}, 0x0?)
    net/http/server.go:2136 +0x29
github.com/prometheus/alertmanager/api.(*API).limitHandler.func1({0x63ecd00?, 0xc0029c2000?}, 0xc002984c00?)
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/api.go:221 +0x1d2
net/http.HandlerFunc.ServeHTTP(0x447300?, {0x63ecd00?, 0xc0029c2000?}, 0x70917a?)
    net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0x92266c0?, {0x63ecd00, 0xc0029c2000}, 0xc002984c00)
    net/http/server.go:2514 +0x142
net/http.serverHandler.ServeHTTP({0xc003e12810?}, {0x63ecd00?, 0xc0029c2000?}, 0x6?)
    net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc001da5680, {0x640a1c0, 0xc00086dc50})
    net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 401
    net/http/server.go:3086 +0x5cb
2023/09/19 02:46:43 http: panic serving 10.42.53.174:38706: write tcp 10.42.53.185:9093->10.42.53.174:38706: write: broken pipe
goroutine 60993 [running]:
net/http.(*conn).serve.func1()
    net/http/server.go:1868 +0xb9
panic({0x4584640?, 0xc002b177c0?})
    runtime/panic.go:920 +0x270
github.com/prometheus/alertmanager/api/v2/restapi/operations/alertgroup.(*GetAlertGroupsOK).WriteResponse(0xc00344dfe0, {0x63ecd00, 0xc00452ac40}, {0x63d0f20, 0x5c645b0})
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/restapi/operations/alertgroup/get_alert_groups_responses.go:74 +0xc5
github.com/go-openapi/runtime/middleware.(*Context).Respond(0xc001957b30, {0x63ecd00?, 0xc00452ac40}, 0xc0046e1500, {0xc001f7fd70?, 0x1, 0x1}, 0xc0046e1400, {0x4504760, 0xc00344dfe0})
    github.com/go-openapi/runtime@v0.25.0/middleware/context.go:523 +0x6f6
github.com/prometheus/alertmanager/api/v2/restapi/operations/alertgroup.(*GetAlertGroups).ServeHTTP(0xc001914e88, {0x63ecd00, 0xc00452ac40}, 0xc0046e1500)
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/restapi/operations/alertgroup/get_alert_groups.go:68 +0x286
github.com/go-openapi/runtime/middleware.(*Context).RoutesHandler.NewOperationExecutor.func1({0x63ecd00, 0xc00452ac40}, 0xc0046e1500)
    github.com/go-openapi/runtime@v0.25.0/middleware/operation.go:28 +0x53
net/http.HandlerFunc.ServeHTTP(0x15?, {0x63ecd00?, 0xc00452ac40?}, 0x4ececf?)
    net/http/server.go:2136 +0x29
github.com/go-openapi/runtime/middleware.NewRouter.func1({0x63ecd00, 0xc00452ac40}, 0xc0046e1300)
    github.com/go-openapi/runtime@v0.25.0/middleware/router.go:78 +0x257
net/http.HandlerFunc.ServeHTTP(0x7f87a5669688?, {0x63ecd00?, 0xc00452ac40?}, 0xb6e604d5b17578e?)
    net/http/server.go:2136 +0x29
github.com/go-openapi/runtime/middleware.Spec.func1({0x63ecd00, 0xc00452ac40}, 0xc001116240?)
    github.com/go-openapi/runtime@v0.25.0/middleware/spec.go:46 +0x182
net/http.HandlerFunc.ServeHTTP(0xc002891938?, {0x63ecd00?, 0xc00452ac40?}, 0x4c3ae5f?)
    net/http/server.go:2136 +0x29
github.com/prometheus/alertmanager/api/v2.NewAPI.setResponseHeaders.func2({0x63ecd00, 0xc00452ac40}, 0xc004519108?)
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/api.go:147 +0x11e
net/http.HandlerFunc.ServeHTTP(0xc0012b7680?, {0x63ecd00?, 0xc00452ac40?}, 0xc0046e1300?)
    net/http/server.go:2136 +0x29
github.com/rs/cors.(*Cors).Handler-fm.(*Cors).Handler.func1({0x63ecd00, 0xc00452ac40}, 0xc0046e1300)
    github.com/rs/cors@v1.9.0/cors.go:236 +0x184
net/http.HandlerFunc.ServeHTTP(0xc000166940?, {0x63ecd00?, 0xc00452ac40?}, 0x0?)
    net/http/server.go:2136 +0x29
github.com/prometheus/alertmanager/api.(*API).limitHandler.func1({0x63ecd00?, 0xc00452ac40?}, 0xc0046e1300?)
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/api.go:221 +0x1d2
net/http.HandlerFunc.ServeHTTP(0x447300?, {0x63ecd00?, 0xc00452ac40?}, 0x70917a?)
    net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0x92266c0?, {0x63ecd00, 0xc00452ac40}, 0xc0046e1300)
    net/http/server.go:2514 +0x142
net/http.serverHandler.ServeHTTP({0xc003b11ad0?}, {0x63ecd00?, 0xc00452ac40?}, 0x6?)
    net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc0018dcab0, {0x640a1c0, 0xc00086dc50})
    net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 401
    net/http/server.go:3086 +0x5cb
{"caller":"coordinator.go:113","component":"configuration","file":"/var/lib/alertmanager.yaml","level":"info","msg":"Loading configuration file","ts":"2023-09-19T02:47:43.292Z"}
{"caller":"coordinator.go:126","component":"configuration","file":"/var/lib/alertmanager.yaml","level":"info","msg":"Completed loading of configuration file","ts":"2023-09-19T02:47:43.293Z"}
ron1 commented 1 year ago

A panic like the one above occurs for me after these steps:

  1. Add Webhook Endpoint alertmanager-webhook-logger (https://github.com/tomtom-international/alertmanager-webhook-logger) Deployment (see below)
  2. Add a Monitoring Backend Alarm with a reference to the above Endpoint
  3. Wait for the Alarm to transition from Pending to Ok
  4. Panic occurs shortly thereafter

The above panic seems to result in all Alerting state being lost, including Endpoints, Alarms, and even the record that Alerting was installed at all.

Should restarting opni-manager, opni-gateway, and opni-alertmanager-alerting be sufficient to resume use of Alerting? When I attempt to re-install after initiating the restarts, I receive the following error:

connection error: desc = "transport: error while dialing: dial unix /tmp/plugin655497085: connect: connection refused"

How do I recover from this state without completely re-installing opni?

alertmanager-webhook-logger deployment manifests:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: alertmanager-webhook-logger
  namespace: opni
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: system:alertmanager-webhook-logger
  namespace: opni
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: system:alertmanager-webhook-logger
  namespace: opni
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: system:alertmanager-webhook-logger
subjects:
- kind: ServiceAccount
  name: alertmanager-webhook-logger
  namespace: opni
- kind: User
  name: alertmanager-webhook-logger
---
apiVersion: v1
kind: Service
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: alertmanager-webhook-logger
  namespace: opni
spec:
  ports:
  - port: 6725
    protocol: TCP
    targetPort: 6725
  selector:
    component: alertmanager-webhook-logger
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: alertmanager-webhook-logger
  namespace: opni
spec:
  replicas: 1
  selector:
    matchLabels:
      name: alertmanager-webhook-logger
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        component: alertmanager-webhook-logger
        name: alertmanager-webhook-logger
    spec:
      containers:
      - image: ghcr.io/tomtom-international/alertmanager-webhook-logger:1.0
        name: alertmanager-webhook-logger
        ports:
        - containerPort: 6725
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 200m
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 65534
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      serviceAccountName: alertmanager-webhook-logger
alexandreLamarre commented 1 year ago

Thanks, I'll take a look. This seems like a fatal panic in the OpenAPI layer of the Alertmanager API, which should never happen.

In terms of recovering, I hope to get back to you soon, once I have a reproduction.

Note that the functionality of alertmanager-webhook-logger is also implemented in opni; under normal circumstances you would be able to see the received alerts in the timeline of the Alerting overview.

alexandreLamarre commented 1 year ago

Should restarting opni-manager, opni-gateway, and opni-alertmanager-alerting be sufficient to resume use of Alerting?

Yes, restarting the opni-gateway should always be sufficient to resume the use of Alerting, so this seems like a critical bug.

ron1 commented 1 year ago

Hi @alexandreLamarre. Thanks for getting back to me.

Our intention is to use alertmanager-webhook-logger to forward alerts via syslog to an external system that globally manages alerts for us and others. Does opni include a simpler solution for satisfying this requirement?

alexandreLamarre commented 1 year ago

Ah, I see. Not at the moment, but we would be open to a feature request for requirements of that nature, if you feel that would make things easier.

ron1 commented 1 year ago

I think it is reasonable for opni users to deploy their own custom webhook receivers like alertmanager-webhook-logger in our case. However, it would be nice if opni could help manage the TLS certs required to securely connect alertmanager to custom webhook receivers. Is this something that opni could simplify via some cert-manager/alertmanager automation? It would be great if a custom webhook receiver Deployment could simply mount an opni-generated Secret to secure the connection from alertmanager. WDYT?

alexandreLamarre commented 1 year ago

@ron1 100%, this is something I've been wanting to get around to

ron1 commented 1 year ago

I'm happy to open such a feature request if you agree.

Also, once you have a proposed fix for this issue, feel free to push a patched image to the registry for me to verify.

ron1 commented 1 year ago

Feature Request opened.

alexandreLamarre commented 1 year ago

@ron1 Would you mind providing some additional info here, while taking care to redact any secrets, if applicable?

In a fresh install, I managed to successfully set up the alertmanager-webhook-logger using a webhook endpoint.

alexandreLamarre commented 1 year ago

If you restart the gateway pod, you should be able to tail the logs for the panic caused in the gateway code. That would also be helpful, because I'm wondering whether that crash is related to the OpenAPI spec as well.

alexandreLamarre commented 1 year ago

The OpenAPI stubs delegate errors that occur while writing payloads to middleware that is expected to recover from the resulting panics:

    if err := producer.Produce(rw, payload); err != nil {
        panic(err) // let the recovery middleware deal with this
    }

For the time being, to remediate the issue (without necessarily addressing the root cause), I'm going to perform the Alertmanager v0.26.0 upgrade, but with a forked Alertmanager where I can inject some recovery and debug middleware into the OpenAPI APIs.
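
To make "recovery middleware" concrete, here is a minimal, hypothetical sketch of the kind of wrapper that could be injected. The package name, the Recover function, and the logging choices below are illustrative assumptions, not opni's or Alertmanager's actual code: it recovers the panic raised by the generated stubs and downgrades the expected broken-pipe case (a client disconnecting mid-response) to a log message.

    package middleware

    import (
        "errors"
        "log"
        "net/http"
        "syscall"
    )

    // Recover wraps an http.Handler and converts panics raised while writing a
    // response (e.g. the generated go-openapi stubs panicking on a broken pipe)
    // into log messages instead of per-goroutine stack dumps.
    func Recover(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            defer func() {
                rec := recover()
                if rec == nil {
                    return
                }
                // Client went away mid-response: benign, just log it.
                if err, ok := rec.(error); ok && errors.Is(err, syscall.EPIPE) {
                    log.Printf("client disconnected while writing response for %s: %v", r.URL.Path, err)
                    return
                }
                // Anything else is unexpected: log the panic value.
                log.Printf("panic serving %s: %v", r.URL.Path, rec)
            }()
            next.ServeHTTP(w, r)
        })
    }

Wrapping the API handler with something like Recover(apiHandler) would keep a routine client disconnect from surfacing as a goroutine stack dump, while still logging genuinely unexpected panics.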

alexandreLamarre commented 1 year ago

I'm taking a shot in the dark and hoping https://github.com/rancher/opni/pull/1726 may solve this issue. It will be hard to know without the gateway panic logs. Also, I think the gateway pod was never restarted after the panic, so I can't tell yet whether this issue is a one-off panic or a more persistent problem.

alexandreLamarre commented 1 year ago

2023-09-22T01:11:44Z INFO plugin.alerting v1/streams.go:577 evaluation context is exiting, exiting evaluation loop {"component": "alarms", "onCortexClusterStatusCreate": "ZSvms69v8wUwGWPLTxZYgH"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x12be1f1]

goroutine 7577 [running]:
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.reduceCortexAdminStates({0xc0010fc800, 0x4, 0xc001fd3078?}, 0xc001b25a40)
    github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:279 +0x251
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*AlarmServerComponent).onCortexClusterStatusCreate.func1(0xc0004a4000?)
    github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:409 +0x45
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*InternalConditionEvaluator[...]).SubscriberLoop(0x2b68460)
    github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:554 +0x486
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*AlarmServerComponent).onCortexClusterStatusCreate.func4()
    github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:436 +0x38
created by github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*AlarmServerComponent).onCortexClusterStatusCreate in goroutine 5799
    github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:434 +0x913

This is the offending panic in the gateway.
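
For illustration only, here is a hedged sketch of the kind of defensive guard that avoids this class of nil pointer dereference when folding per-node statuses. The nodeStatus type and reduceStates function are hypothetical stand-ins, not the actual reduceCortexAdminStates implementation.

    package alarms

    // nodeStatus is a hypothetical stand-in for the per-node status values that
    // get aggregated; the real opni types differ.
    type nodeStatus struct {
        Healthy bool
    }

    // reduceStates folds a slice of per-node statuses into a single verdict,
    // skipping nil entries so that a missing status (e.g. a failed status fetch)
    // cannot cause a nil pointer dereference.
    func reduceStates(states []*nodeStatus) bool {
        healthy := true
        for _, s := range states {
            if s == nil {
                // Treat a missing status as unhealthy rather than dereferencing nil.
                healthy = false
                continue
            }
            healthy = healthy && s.Healthy
        }
        return healthy
    }

The real fix depends on exactly which value is nil at streams.go:279, but skipping or flagging nil entries is the general pattern.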

alexandreLamarre commented 1 year ago

The original symptom/error in the Alertmanager code:

2023/09/19 02:46:43 http: panic serving 10.42.53.174:45502: write tcp 10.42.53.185:9093->10.42.53.174:45502: write: broken pipe
goroutine 60488 [running]:
net/http.(*conn).serve.func1()
    net/http/server.go:1868 +0xb9
panic({0x4584640?, 0xc0046e3e50?})
    runtime/panic.go:920 +0x270
github.com/prometheus/alertmanager/api/v2/restapi/operations/alertgroup.(*GetAlertGroupsOK).WriteResponse(0xc003486270, {0x63ecd00, 0xc0029c2000}, {0x63d0f20, 0x5c645b0})
    github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/restapi/operations/alertgroup/get_alert_groups_responses.go:74 +0xc5
github.com/go-openapi/runtime/middleware.(*Context).Respond(0xc001957b30, {0x63ecd00?, 0xc0029c2000}, 0xc002984e00, {0xc001f7fd70?, 0x1, 0x1}, 0xc002984d00, {0x4504760, 0xc003486270})
    github.com/go-openapi/runtime@v0.25.0/middleware/context.go:523 +0x6f6

is still present with the Alertmanager v0.26.0 / cortex status fix.