Closed: snikch closed this issue 7 months ago
We do see some errors in the Pyrra pods.
(from pod pyrra-56f8db5b5-tpdcc on node ip-10-0-35-81.ec2.internal, newest lines first)

2024-04-17T21:44:32.603386975Z stderr F 2024-04-17T21:44:32Z ERROR Reconciler error {"controller": "servicelevelobjective", "controllerGroup": "pyrra.dev", "controllerKind": "ServiceLevelObjective", "ServiceLevelObjective": {"name":"inmusicprofile-authorised-devices","namespace":"monitoring"}, "namespace": "monitoring", "name": "inmusicprofile-authorised-devices", "reconcileID": "71068fe1-4d2e-4c25-a4c0-68569c4f60c3", "error": "failed to update prometheus rule: prometheusrules.monitoring.coreos.com \"inmusicprofile-authorised-devices\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update"}
2024-04-17T21:44:32.586588632Z stderr F level=info ts=2024-04-17T21:44:32.584682542Z caller=servicelevelobjective.go:89 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-authorised-devices msg="updating prometheus rule" namespace= name=
2024-04-17T21:44:32.486439391Z stderr F level=info ts=2024-04-17T21:44:32.486105827Z caller=servicelevelobjective.go:78 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-authorised-devices msg="creating prometheus rule" namespace= name=
2024-04-17T21:44:32.3001328Z stderr F level=info ts=2024-04-17T21:44:32.298281629Z caller=servicelevelobjective.go:89 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-device-auth-rest-api msg="updating prometheus rule" namespace=monitoring name=inmusicprofile-device-auth-rest-api
2024-04-17T21:44:32.251899805Z stderr F level=info ts=2024-04-17T21:44:32.251727212Z caller=servicelevelobjective.go:89 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-device-auth-rest-api msg="updating prometheus rule" namespace=monitoring name=inmusicprofile-device-auth-rest-api
2024-04-17T21:44:32.246897311Z stderr F sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:227
2024-04-17T21:44:32.246892669Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-04-17T21:44:32.246887735Z stderr F sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:266
2024-04-17T21:44:32.24688296Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-04-17T21:44:32.246877871Z stderr F sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:329
2024-04-17T21:44:32.246870407Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
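If it helps with triage: that particular message is what the API server returns when an Update is sent for an object whose metadata.resourceVersion is empty, i.e. an object that was never read back from the cluster. It seems consistent with the "creating prometheus rule" line being followed immediately by "updating prometheus rule" above. A minimal controller-runtime sketch of the pattern, purely illustrative and not Pyrra's actual code (upsertRule is a made-up helper):

// Package example sketches why "metadata.resourceVersion ... must be specified
// for an update" can appear: calling Update with a freshly-built object whose
// resourceVersion was never populated is rejected by the API server.
package example

import (
	"context"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// upsertRule is a hypothetical helper, not Pyrra's code.
func upsertRule(ctx context.Context, c client.Client, desired *monitoringv1.PrometheusRule) error {
	// Try to create the rule first.
	err := c.Create(ctx, desired)
	if err == nil || !apierrors.IsAlreadyExists(err) {
		return err
	}

	// The rule already exists. Calling c.Update(ctx, desired) at this point
	// would reproduce the error above, because desired.ResourceVersion is
	// still empty. Read the live object first so the update carries its
	// resourceVersion, then copy the desired state over.
	var existing monitoringv1.PrometheusRule
	key := types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}
	if err := c.Get(ctx, key, &existing); err != nil {
		return err
	}
	existing.Labels = desired.Labels
	existing.Spec = desired.Spec
	return c.Update(ctx, &existing)
}

For what it's worth, controllerutil.CreateOrUpdate in sigs.k8s.io/controller-runtime/pkg/controller/controllerutil wraps the same get/mutate/update sequence.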
The SLO is defined like so:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: inmusicprofile-authorised-devices
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
    pyrra.dev/team: webservices
    pyrra.dev/ns: inmusicprofile
    pyrra.dev/service: AuthorisedDevicesService
    pyrra.dev/tier: "4"
spec:
  target: "99"
  window: 4w
  description: AuthorisedDevicesService public endpoints.
  indicator:
    ratio:
      errors:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*", status_code="STATUS_CODE_ERROR"}
      total:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}
Hi there, and thank you for Pyrra!
I'm aware this is going to be a really vague issue report, but we've been plagued with Prometheus stability issues for the last month and have come to realise that Pyrra is causing this.
We see our Prometheus pod being killed by Kubernetes and logging that it received a SIGTERM. There are no OOM messages and no probe failures on the container. This happens roughly every 10-40 minutes.
You can see a graph here where we removed the entire Pyrra Helm chart for a few days and then turned it back on today.
I'd like to be able to dig into why this might be, but I'm not really sure where to start. It took me several days of digging to even realise it was Pyrra at "fault". Perhaps you could point me in the right direction?
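For reference, one way to double-check what is killing the pod is to read each container's last termination state from the pod status: OOM kills show reason OOMKilled with exit code 137, while a plain SIGTERM usually surfaces as exit code 143. A small client-go sketch (the monitoring namespace and the app.kubernetes.io/name=prometheus label selector are assumptions about our setup; adjust as needed):

// last_termination.go: print why the Prometheus containers last restarted.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; use clientcmd for a local kubeconfig.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Namespace and label selector are assumptions; adjust to your deployment.
	pods, err := clientset.CoreV1().Pods("monitoring").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/name=prometheus"})
	if err != nil {
		log.Fatal(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.LastTerminationState.Terminated
			if term == nil {
				continue
			}
			// Reason is "OOMKilled" for OOM kills; exit code 143 usually means SIGTERM.
			fmt.Printf("%s/%s: reason=%s exitCode=%d finishedAt=%s\n",
				pod.Name, cs.Name, term.Reason, term.ExitCode, term.FinishedAt)
		}
	}
}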