pyrra-dev / pyrra

Making SLOs with Prometheus manageable, accessible, and easy to use for everyone!
https://demo.pyrra.dev
Apache License 2.0
1.21k stars, 108 forks

Pyrra causes instability of Prometheus #1149

Closed snikch closed 5 months ago

snikch commented 5 months ago

Hi there, and thank you for Pyrra!

I'm aware this is going to be a really vague issue report, but we've been plagued with Prometheus stability issues for the last month and have come to realise that Pyrra is causing this.

We see our Prometheus pod being killed by Kubernetes, and it logs that it received a SIGTERM. There are no OOM messages, nor any probe issues on the container. This happens about every 10-40 minutes.
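One possible starting point for narrowing down a SIGTERM with no OOM message is the termination state Kubernetes records for each container. The sketch below assumes the pod object has been saved as JSON (for example from `kubectl get pod <name> -o json`) and that the container restarted in place, so `status.containerStatuses[].lastState.terminated` is populated. The function name and the sample data are hypothetical and not taken from this issue.

```python
import json


def last_terminations(pod):
    """Return the last recorded termination of each container in a Pod object."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            results.append({
                "container": status["name"],
                "reason": terminated.get("reason"),      # e.g. "OOMKilled", "Error", "Completed"
                "exit_code": terminated.get("exitCode"),
                "signal": terminated.get("signal"),
                "finished_at": terminated.get("finishedAt"),
                "restarts": status.get("restartCount"),
            })
    return results


# Hypothetical sample data, NOT from the cluster described in this issue.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "prometheus",
        "restartCount": 3,
        "lastState": {
          "terminated": {
            "reason": "Completed",
            "exitCode": 0,
            "finishedAt": "2024-01-01T00:00:00Z"
          }
        }
      },
      {"name": "config-reloader", "restartCount": 0, "lastState": {}}
    ]
  }
}
""")

for entry in last_terminations(sample_pod):
    print(entry)
```

If the whole pod is deleted and recreated rather than the container restarting, this field will be empty on the new pod, and the cluster events for the pod are probably the better place to look.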

You can see a graph here where we removed the entire Pyrra helm chart for a few days and then turned it back on today.

image

I'd like to be able to dig into why this might be, but I'm not really sure where to start. It took me several days of digging to even realise it was Pyrra at "fault". Perhaps you could point me in the right direction?

snikch commented 5 months ago

We do see some errors in the Pyrra pods.


Host: ip-10-0-35-81.ec2.internal, pod: pyrra-56f8db5b5-tpdcc, log viewer time: 2024-04-18 09:44:32 (identical for every line below)

2024-04-17T21:44:32.603386975Z stderr F 2024-04-17T21:44:32Z    ERROR   Reconciler error    {"controller": "servicelevelobjective", "controllerGroup": "pyrra.dev", "controllerKind": "ServiceLevelObjective", "ServiceLevelObjective": {"name":"inmusicprofile-authorised-devices","namespace":"monitoring"}, "namespace": "monitoring", "name": "inmusicprofile-authorised-devices", "reconcileID": "71068fe1-4d2e-4c25-a4c0-68569c4f60c3", "error": "failed to update prometheus rule: prometheusrules.monitoring.coreos.com \"inmusicprofile-authorised-devices\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update"}
2024-04-17T21:44:32.586588632Z stderr F level=info ts=2024-04-17T21:44:32.584682542Z caller=servicelevelobjective.go:89 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-authorised-devices msg="updating prometheus rule" namespace= name=
2024-04-17T21:44:32.486439391Z stderr F level=info ts=2024-04-17T21:44:32.486105827Z caller=servicelevelobjective.go:78 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-authorised-devices msg="creating prometheus rule" namespace= name=
2024-04-17T21:44:32.3001328Z stderr F level=info ts=2024-04-17T21:44:32.298281629Z caller=servicelevelobjective.go:89 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-device-auth-rest-api msg="updating prometheus rule" namespace=monitoring name=inmusicprofile-device-auth-rest-api
2024-04-17T21:44:32.251899805Z stderr F level=info ts=2024-04-17T21:44:32.251727212Z caller=servicelevelobjective.go:89 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=monitoring/inmusicprofile-device-auth-rest-api msg="updating prometheus rule" namespace=monitoring name=inmusicprofile-device-auth-rest-api
2024-04-17T21:44:32.246897311Z stderr F     sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:227
2024-04-17T21:44:32.246892669Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-04-17T21:44:32.246887735Z stderr F     sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:266
2024-04-17T21:44:32.24688296Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-04-17T21:44:32.246877871Z stderr F     sigs.k8s.io/controller-runtime@v0.16.1/pkg/internal/controller/controller.go:329
2024-04-17T21:44:32.246870407Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
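The `metadata.resourceVersion ... must be specified for an update` message in the log above means the API server rejected an update that did not carry the version of the object being replaced. The sketch below illustrates that check with a small in-memory stand-in. It is not Pyrra's code or the real Kubernetes client: `FakeAPIServer`, `create_or_update`, and the exception names are all hypothetical, and the sketch only shows the general read-then-update pattern that avoids this class of error.

```python
class Invalid(Exception):
    """Raised when an update carries no resourceVersion."""


class Conflict(Exception):
    """Raised when the resourceVersion is stale or the object already exists."""


class FakeAPIServer:
    """Hypothetical in-memory stand-in for an API server with version checks."""

    def __init__(self):
        self._objects = {}
        self._counter = 0

    def _store(self, name, spec):
        # Every successful write gets a new resourceVersion.
        self._counter += 1
        self._objects[name] = {
            "name": name,
            "resourceVersion": str(self._counter),
            "spec": dict(spec),
        }
        return dict(self._objects[name])

    def get(self, name):
        obj = self._objects.get(name)
        return dict(obj) if obj is not None else None

    def create(self, name, spec):
        if name in self._objects:
            raise Conflict(f"{name} already exists")
        return self._store(name, spec)

    def update(self, obj):
        if not obj.get("resourceVersion"):
            raise Invalid("metadata.resourceVersion: must be specified for an update")
        current = self._objects.get(obj["name"])
        if current is None or current["resourceVersion"] != obj["resourceVersion"]:
            raise Conflict(f"{obj['name']} was modified or does not exist")
        return self._store(obj["name"], obj["spec"])


def create_or_update(server, name, spec):
    """Read the existing object first so the update carries its resourceVersion."""
    existing = server.get(name)
    if existing is None:
        return server.create(name, spec)
    existing["spec"] = dict(spec)
    return server.update(existing)


server = FakeAPIServer()
print(create_or_update(server, "example-slo", {"target": "99"})["resourceVersion"])
print(create_or_update(server, "example-slo", {"target": "99.5"})["resourceVersion"])

# An update built from scratch, without the stored resourceVersion, is rejected.
try:
    server.update({"name": "example-slo", "spec": {}})
except Invalid as err:
    print("rejected:", err)
```

Whether the reconciler in this issue actually follows a path like the rejected one is an open question. The log only shows a "creating" message followed by an "updating" message with empty `namespace=` and `name=` fields before the error.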

The SLO is defined like so:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: inmusicprofile-authorised-devices
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
    pyrra.dev/team: webservices
    pyrra.dev/ns: inmusicprofile
    pyrra.dev/service: AuthorisedDevicesService
    pyrra.dev/tier: "4"
spec:
  target: "99"
  window: 4w
  description: AuthorisedDevicesService public endpoints.
  indicator:
    ratio:
      errors:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*", status_code="STATUS_CODE_ERROR"}
      total:
        metric: traces_spanmetrics_latency_count{span_name=~"inmusicapi\\.v1\\.AuthorisedDevicesService\\/.*"}
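For context on what `target: "99"` and `window: 4w` mean together, the arithmetic below works out the error budget implied by this SLO. It is a minimal sketch of the standard SLO definitions, not of the rules Pyrra generates. The function names are hypothetical, and the 2% error ratio in the usage lines is an invented figure for illustration only.

```python
def window_hours(weeks):
    """Length of an SLO window given in weeks, expressed in hours."""
    return weeks * 7 * 24


def error_budget_fraction(target_percent):
    """Fraction of requests allowed to fail under the target."""
    return (100 - target_percent) / 100


def allowed_error_hours(target_percent, weeks):
    """Hours of complete failure that would use up the whole budget."""
    return error_budget_fraction(target_percent) * window_hours(weeks)


def burn_rate(observed_error_ratio, target_percent):
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / error_budget_fraction(target_percent)


target = float("99")  # spec.target from the SLO above
weeks = 4             # spec.window: 4w

print(f"window: {window_hours(weeks)} hours")
print(f"budget: {allowed_error_hours(target, weeks):.2f} hours of full outage")
# Hypothetical observed error ratio of 2%, for illustration only.
print(f"burn rate at 2% errors: {burn_rate(0.02, target):.1f}x")
```

With a 99% target over 4 weeks (672 hours), the budget is 1% of the window, or about 6.72 hours of total failure.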