dbason opened 1 year ago
A panic like the one above occurs for me after these steps:
The above panic seems to result in all Alerting state being lost, including Endpoints, Alarms, and even the fact that Alerting was installed at all.
Should restarting opni-manager, opni-gateway, and opni-alertmanager-alerting be sufficient to resume use of Alerting? When I attempt to re-install after initiating the restarts, I receive the following error:
connection error: desc = "transport: error while dialing: dial unix /tmp/plugin655497085: connect: connection refused"
How do I recover from this state without completely re-installing opni?
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: alertmanager-webhook-logger
  namespace: opni
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: system:alertmanager-webhook-logger
  namespace: opni
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: system:alertmanager-webhook-logger
  namespace: opni
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: system:alertmanager-webhook-logger
subjects:
- kind: ServiceAccount
  name: alertmanager-webhook-logger
  namespace: opni
- kind: User
  name: alertmanager-webhook-logger
---
apiVersion: v1
kind: Service
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: alertmanager-webhook-logger
  namespace: opni
spec:
  ports:
  - port: 6725
    protocol: TCP
    targetPort: 6725
  selector:
    component: alertmanager-webhook-logger
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: alertmanager-webhook-logger
  name: alertmanager-webhook-logger
  namespace: opni
spec:
  replicas: 1
  selector:
    matchLabels:
      name: alertmanager-webhook-logger
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        component: alertmanager-webhook-logger
        name: alertmanager-webhook-logger
    spec:
      containers:
      - image: ghcr.io/tomtom-international/alertmanager-webhook-logger:1.0
        name: alertmanager-webhook-logger
        ports:
        - containerPort: 6725
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 200m
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 65534
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      serviceAccountName: alertmanager-webhook-logger
Thanks, I'll take a look. This seems like a fatal panic in the openapi schema of the alertmanager API, which should never happen.
In terms of recovering, I'm hoping to get back to you soon when I have a reproduction.
Note that we also have the functionality of the alertmanager-webhook-logger implemented in opni; under normal circumstances you would be able to see the received alerts in the timeline of the alerting overview.
Should restarting opni-manager, opni-gateway, and opni-alertmanager-alerting be sufficient to resume use of Alerting?
Yes, restarting the opni-gateway should always be sufficient to resume the use of Alerting, so this seems like a critical bug.
Hi @alexandreLamarre. Thanks for getting back to me.
Our intention is to use alertmanager-webhook-logger
to forward alerts via syslog to an external system that globally manages alerts for us and others. Does opni include a simpler solution for satisfying this requirement?
Ah, I see. Not at the moment, but we would be open to a feature request to satisfy requirements of that nature, if you feel that would make things easier.
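(For context, the pattern being discussed is roughly the following: a small HTTP receiver that accepts Alertmanager's webhook payload and forwards each alert to a remote syslog collector. This is a minimal sketch, not the actual alertmanager-webhook-logger code; the listen port and syslog address are placeholders.)

```go
package main

import (
	"encoding/json"
	"log"
	"log/syslog"
	"net/http"
)

// webhookPayload models only the fields of the Alertmanager webhook body used
// here; the real payload carries more.
type webhookPayload struct {
	Status string `json:"status"`
	Alerts []struct {
		Labels      map[string]string `json:"labels"`
		Annotations map[string]string `json:"annotations"`
	} `json:"alerts"`
}

func main() {
	// Forward to an external syslog collector (placeholder address).
	sl, err := syslog.Dial("udp", "syslog.example.com:514", syslog.LOG_INFO, "alertmanager")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Emit one syslog message per alert.
		for _, a := range p.Alerts {
			line, _ := json.Marshal(a)
			sl.Info(string(line))
		}
		w.WriteHeader(http.StatusOK)
	})

	// Same port as the webhook-logger Service above, purely for illustration.
	log.Fatal(http.ListenAndServe(":6725", nil))
}
```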
I think it is reasonable for opni users to deploy their own custom webhook receivers like alertmanager-webhook-logger
in our case. However, it would be nice if opni could help manage the TLS certs required to securely connect alertmanager to custom webhook receivers. Is this something that opni could simplify via some cert-manager/alertmanager automation? It would be great if a custom webhook receiver Deployment could simply mount an opni-generated Secret to secure the connection from alertmanager. WDYT?
@ron1 100%, this is something I've been wanting to get around to
I'm happy to open such a feature request if you agree.
Also, once you have a proposed fix for this issue, feel free to push a patched image to the registry for me to verify.
Feature Request opened.
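(As a concrete illustration of the idea in the feature request: if opni generated a TLS Secret that the custom receiver's Deployment mounts, the receiver could serve HTTPS directly from the mounted key pair. A minimal sketch under those assumptions; the mount path /etc/webhook-tls is hypothetical and this is not an existing opni feature.)

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		// ... handle the Alertmanager webhook payload here ...
		w.WriteHeader(http.StatusOK)
	})

	// tls.crt / tls.key are the standard keys of a kubernetes.io/tls Secret;
	// the mount path is whatever the Deployment chooses for the volume mount.
	log.Fatal(http.ListenAndServeTLS(
		":6725",
		"/etc/webhook-tls/tls.crt",
		"/etc/webhook-tls/tls.key",
		mux,
	))
}
```

On the Alertmanager side, the matching CA bundle would then be referenced in the receiver's http_config.tls_config so Alertmanager trusts the receiver's certificate.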
@ron1 Would you mind providing some additional info here, taking care to redact any secrets where applicable?
Your alerting cluster CRD:
k describe alertingclusters.core.opni.io opni-alerting -n <opni-install-namespace>
The webhook configuration you remember setting up.
In a fresh install, I managed to successfully set up the alertmanager-webhook-logger using a webhook endpoint.
If you restart the gateway pod, you should be able to tail the logs for the panic raised in the gateway code. That would also be helpful, because I'm wondering whether that crash is also related to the openapi spec.
The generated openapi stubs delegate error handling when writing payloads to middleware that is supposed to recover from panics:
if err := producer.Produce(rw, payload); err != nil {
panic(err) // let the recovery middleware deal with this
}
For the time being, to remediate the issue (without necessarily addressing the root cause), I'm going to perform the Alertmanager v0.26.0 upgrade, but with a forked alertmanager where I can inject some recovery and debug middleware into the openapi APIs.
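(For illustration only, not the actual forked code: recovery middleware of the kind described above could look roughly like this, wrapping a handler so a panic raised while writing a response is logged with its stack instead of propagating.)

```go
package middleware

import (
	"log"
	"net/http"
	"runtime/debug"
)

// Recover wraps an http.Handler and converts a panic raised while handling a
// request (e.g. a write to a closed connection inside a generated openapi
// WriteResponse) into a logged error with a stack trace, plus a best-effort
// 500 response.
func Recover(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("recovered panic serving %s: %v\n%s", r.URL.Path, rec, debug.Stack())
				// This write may fail too if the client is already gone;
				// at that point there is nothing left to send.
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```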
I'm taking a shot in the dark and hoping https://github.com/rancher/opni/pull/1726 may solve this issue. It will be hard to know without the gateway panic logs. Also, I think the gateway pod was never restarted after the panic, so I can't tell yet whether this issue is a one-off panic or a more persistent problem.
2023-09-22T01:11:44Z INFO plugin.alerting v1/streams.go:577 evaluation context is exiting, exiting evaluation loop {"component": "alarms", "onCortexClusterStatusCreate": "ZSvms69v8wUwGWPLTxZYgH"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x12be1f1]
goroutine 7577 [running]:
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.reduceCortexAdminStates({0xc0010fc800, 0x4, 0xc001fd3078?}, 0xc001b25a40)
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:279 +0x251
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*AlarmServerComponent).onCortexClusterStatusCreate.func1(0xc0004a4000?)
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:409 +0x45
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*InternalConditionEvaluator[...]).SubscriberLoop(0x2b68460)
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:554 +0x486
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*AlarmServerComponent).onCortexClusterStatusCreate.func4()
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:436 +0x38
created by github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1.(*AlarmServerComponent).onCortexClusterStatusCreate in goroutine 5799
github.com/rancher/opni/plugins/alerting/pkg/alerting/alarms/v1/streams.go:434 +0x913
The above is the offending panic in the gateway.
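(The trace points at a nil dereference while reducing cortex admin states in reduceCortexAdminStates. Purely as a hypothetical illustration of that failure mode, not the actual opni code, the missing guard is of this shape:)

```go
package alarms

// nodeStatus is a stand-in for the per-node cortex admin status; the real
// type in opni is different.
type nodeStatus struct {
	Healthy bool
}

// reduceStates folds a slice of per-node statuses into a single
// healthy/unhealthy answer. Without the nil check, a missing or failed status
// fetch would be dereferenced and panic, which is the kind of crash shown in
// the trace above.
func reduceStates(states []*nodeStatus) bool {
	for _, s := range states {
		if s == nil {
			// Treat an absent status as unhealthy rather than crashing.
			return false
		}
		if !s.Healthy {
			return false
		}
	}
	return true
}
```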
The original symptom/error (from the alertmanager code):
2023/09/19 02:46:43 http: panic serving 10.42.53.174:45502: write tcp 10.42.53.185:9093->10.42.53.174:45502: write: broken pipe
goroutine 60488 [running]:
net/http.(*conn).serve.func1()
net/http/server.go:1868 +0xb9
panic({0x4584640?, 0xc0046e3e50?})
runtime/panic.go:920 +0x270
github.com/prometheus/alertmanager/api/v2/restapi/operations/alertgroup.(*GetAlertGroupsOK).WriteResponse(0xc003486270, {0x63ecd00, 0xc0029c2000}, {0x63d0f20, 0x5c645b0})
github.com/prometheus/alertmanager@v0.25.1-0.20230505130626-263ca5c9438e/api/v2/restapi/operations/alertgroup/get_alert_groups_responses.go:74 +0xc5
github.com/go-openapi/runtime/middleware.(*Context).Respond(0xc001957b30, {0x63ecd00?, 0xc0029c2000}, 0xc002984e00, {0xc001f7fd70?, 0x1, 0x1}, 0xc002984d00, {0x4504760, 0xc003486270})
github.com/go-openapi/runtime@v0.25.0/middleware/context.go:523 +0x6f6
This error is still present in the alertmanager v0.26.0 / cortex status fix.
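(For completeness, a hypothetical variant of the generated write path quoted earlier that treats a client disconnect as non-fatal instead of panicking; this is a sketch of the kind of change the recovery/debug fork could carry, not code that exists in alertmanager today.)

```go
package restapi

import (
	"errors"
	"log"
	"net/http"
	"syscall"

	"github.com/go-openapi/runtime"
)

// writeResponse writes the payload, downgrading broken-pipe / connection-reset
// errors (the client going away mid-response, as in the log above) to a log
// line, while still panicking on anything unexpected so recovery middleware
// can report it.
func writeResponse(rw http.ResponseWriter, producer runtime.Producer, payload interface{}) {
	rw.WriteHeader(http.StatusOK)
	if err := producer.Produce(rw, payload); err != nil {
		if errors.Is(err, syscall.EPIPE) || errors.Is(err, syscall.ECONNRESET) {
			log.Printf("client disconnected before the response was fully written: %v", err)
			return
		}
		panic(err) // unexpected encode/write failure; let the recovery middleware deal with it
	}
}
```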