pyrra-dev / pyrra

Making SLOs with Prometheus manageable, accessible, and easy to use for everyone!
https://demo.pyrra.dev
Apache License 2.0
1.25k stars, 113 forks

Issue when showing some SLOs (white screen) #1130

Open juanjcsr opened 7 months ago

juanjcsr commented 7 months ago

Hello everyone,

I've found a possible bug when displaying some SLOs. When I navigate from the SLO list to the details page, or when I check the details of a multi-burn-rate alert, I get a white screen:

[Two screenshots of the white screen]

The JavaScript error is the following:

TypeError: Cannot read properties of undefined (reading 'push')
    at aligneddata.tsx:121:19
    at Array.forEach (<anonymous>)
    at aligneddata.tsx:120:18
    at Array.forEach (<anonymous>)
    at aligneddata.tsx:117:15
    at Pd (BurnrateGraph.tsx:115:7)
    at Ea (react-dom.production.min.js:167:137)
    at Ou (react-dom.production.min.js:197:258)
    at Sl (react-dom.production.min.js:292:88)
    at bs (react-dom.production.min.js:280:389)

The SLO that generates the error is the following:

spec:
  target: "99"
  window: 2w
  description: tarantool latency
  indicator:
    latency:
      success:
        metric: http_server_request_latency_bucket{job=~"tarantool-tnt-cluster",status="200", le="0.05", path="/accommodation/pointOfSale/:pointOfSale/seoUrl/*seoUrl"}
      total:
        metric: http_server_request_latency_count{job=~"tarantool-tnt-cluster",status="200",path="/accommodation/pointOfSale/:pointOfSale/seoUrl/*seoUrl"}

I launched the UI locally (master branch) against my deployed Pyrra server and encountered the same issue.

I'm running Pyrra v0.7.4.

Do you have any idea why this is happening, or how I can help debug it?

Thank you, I really enjoy the work everyone is doing with Pyrra 😄

msvechla commented 4 months ago

I'm experiencing the exact same issue with a latency SLO.

EDIT: Never mind, in my case I had used an incorrect metric (a _bucket metric for the total query as well).

EDIT2: Actually, this is still an issue; I just experienced it again with another metric.

msvechla commented 4 months ago

I'm getting the exact same error in the browser console:

TypeError: Cannot read properties of undefined (reading 'push')
    at aligneddata.tsx:121:19
    at Array.forEach (<anonymous>)
    at aligneddata.tsx:120:18
    at Array.forEach (<anonymous>)
    at aligneddata.tsx:117:15
    at Pd (BurnrateGraph.tsx:115:7)
    at Ea (react-dom.production.min.js:167:137)
    at Ou (react-dom.production.min.js:197:258)
    at Sl (react-dom.production.min.js:292:88)
    at bs (react-dom.production.min.js:280:389)

The SLO I used:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/app: tempo
  name: tempo-reads-errors-test
  namespace: default
spec:
  alerting:
    absent: true
    burnrates: true
  description: Reading traces from Tempo API endpoints should answer queries 99% successfully
    over 2w.
  indicator:
    ratio:
      errors:
        metric: tempo_request_duration_seconds_count{cluster=~"tempo", job=~"default/query-frontend",
          route=~"api_.*", status_code=~"5.*"}
      total:
        metric: tempo_request_duration_seconds_count{cluster=~"tempo", job=~"default/query-frontend",
          route=~"api_.*"}
  target: "99.5"
  window: 2w

msvechla commented 4 months ago

@metalmatze I added a debug log, and it looks like, in my case, timeValues is larger than the pre-initialized values array here: https://github.com/pyrra-dev/pyrra/blob/a5e3b4606daf843156111f791ff669d864163e7a/ui/src/components/graphs/aligneddata.tsx#L121

timeValues: 3 values: 1

Any idea how to fix this? The SLO is displayed correctly in Grafana.

EDIT: About 45 minutes later I no longer get the error, and I did not change anything. So it looks like an intermittent error that we should catch somehow.
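
The mismatch reported above (timeValues: 3 against values: 1) suggests that a write lands on a column that was never initialized, which would explain the "reading 'push'" error. Below is a minimal TypeScript sketch of one way to guard against that, assuming a column-oriented layout. `Column` and `appendColumns` are hypothetical names for illustration; this is not Pyrra's actual aligneddata.tsx code.

```typescript
// Hypothetical column type: one array of samples per series, null for "no data".
type Column = Array<number | null>

// Append the incoming columns to the target columns.
// Columns missing from the target are created on demand and back-filled with
// nulls, so pushing never happens on an undefined entry.
function appendColumns(target: Column[], incoming: Column[]): Column[] {
  // Number of rows already present, and number of rows being added.
  const before = target.reduce((max, c) => Math.max(max, c.length), 0)
  const added = incoming.reduce((max, c) => Math.max(max, c.length), 0)
  const width = Math.max(target.length, incoming.length)

  for (let i = 0; i < width; i++) {
    if (target[i] === undefined) {
      // This column was not pre-initialized: create it instead of crashing.
      target[i] = new Array<number | null>(before).fill(null)
    }
    const column = incoming[i] ?? []
    for (let j = 0; j < added; j++) {
      // Pad with null where the incoming data has no sample.
      target[i].push(column[j] ?? null)
    }
  }
  return target
}

// One pre-initialized column, three incoming columns (as in the debug log above).
console.log(JSON.stringify(appendColumns([[1]], [[2], [3], [4]])))
// [[1,2],[null,3],[null,4]]
```

Padding with null keeps every column the same length, which charting code that expects aligned arrays typically needs. Whether padding or skipping the extra data is the right behavior for the burnrate graph is a separate design question.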

keithherron commented 2 months ago

Hi, it looks like I ran into this as well. In my troubleshooting, it seemed related to an empty response for a burnrate metric query.

In my case, Pyrra was querying for:

istio_request_duration_milliseconds:burnrate12d{destination_canonical_service=\"enwiki-articlequality-predictor-default\", kubernetes_namespace=\"istio-system\",response_code=~\"2..\",site=\"codfw\",slo=\"liftwing-articlequality-latency\"}

However, this gave an empty response, as the istio_request_duration_milliseconds:burnrate12d recording rule metric didn't have a response_code label.

For the time being, I'm working around it by inverting the SLO definition, using response_code!~"[345].." instead of response_code=~"2..". That said, the response_code label not making it through to the burnrate recording rule metric may be a deeper issue. At any rate, after updating the query, the page renders again for me.
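
The reason the inverted matcher helps is that Prometheus treats a missing label as an empty string, and regex matchers are fully anchored. A series without response_code therefore fails `=~"2.."` but passes `!~"[345].."`. The TypeScript sketch below models that behavior in a simplified way. `Matcher` and `matches` are hypothetical names, and JavaScript regular expressions stand in for Prometheus's RE2 syntax, which is close enough for these patterns.

```typescript
// Simplified model of a Prometheus regex label matcher; illustrative only.
type Labels = Record<string, string>

interface Matcher {
  name: string
  pattern: string
  negate: boolean // false for =~, true for !~
}

function matches(labels: Labels, m: Matcher): boolean {
  // An absent label behaves like a label with an empty value.
  const value = labels[m.name] ?? ''
  // Regex matchers are anchored at both ends.
  const hit = new RegExp(`^(?:${m.pattern})$`).test(value)
  return m.negate ? !hit : hit
}

// Labels on the recording-rule series as described above: no response_code.
const burnrateSeries: Labels = {
  slo: 'liftwing-articlequality-latency',
  site: 'codfw',
}

const positive: Matcher = { name: 'response_code', pattern: '2..', negate: false }
const inverted: Matcher = { name: 'response_code', pattern: '[345]..', negate: true }

console.log(matches(burnrateSeries, positive)) // false: the query returns nothing
console.log(matches(burnrateSeries, inverted)) // true: the series is selected
```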