pyrra-dev / pyrra

Making SLOs with Prometheus manageable, accessible, and easy to use for everyone!
https://demo.pyrra.dev
Apache License 2.0

upgrade from v0.4.4 to v0.5.0 problem #515

Open KSPlatform opened 1 year ago

KSPlatform commented 1 year ago

Hi, I just switched to the new image, since the only change for OpenShift seems to be the image version in the YAML files. I see the new latency column, but the availability and error budget columns all show "Error". I see client error 403 in the API logs:

API logs

level=debug ts=2022-11-09T07:16:54.388679254Z caller=main.go:337 msg="running instant query" query="sum(coredns_dns_request_duration_seconds:increase2w{job=\"dns-default\",slo=\"coredns-request-latency\"})" ts=2022-11-09T07:16:54.388619884Z
level=warn ts=2022-11-09T07:16:54.38940379Z caller=main.go:509 msg="failed to query total" query="sum(apiserver_request:increase2w{job=\"apiserver\",slo=\"apiserver-write-response-errors\",verb=~\"POST|PUT|PATCH|DELETE\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:54.391933777Z caller=main.go:337 msg="running instant query" query="sum(etcd_disk_backend_commit_duration_seconds:increase2w{slo=\"etcd-disk-backend-commit-duration\"})" ts=2022-11-09T07:16:54.391872536Z
level=warn ts=2022-11-09T07:16:54.39443402Z caller=main.go:509 msg="failed to query total" query="sum(apiserver_request:increase2w{job=\"apiserver\",slo=\"apiserver-read-response-errors\",verb=~\"LIST|GET\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:54.485613481Z caller=main.go:337 msg="running instant query" query="sum(etcd_disk_wal_fsync_duration_seconds:increase2w{slo=\"etcd-disk-wal-fsync-duration\"})" ts=2022-11-09T07:16:54.48553364Z
level=debug ts=2022-11-09T07:16:54.385787246Z caller=main.go:337 msg="running instant query" query="sum(apiserver_request_duration_seconds:increase2w{job=\"apiserver\",scope=~\"resource|\",slo=\"apiserver-read-resource-request-latency\",verb=~\"LIST|GET\"})" ts=2022-11-09T07:16:54.297815847Z
level=warn ts=2022-11-09T07:16:54.487994127Z caller=main.go:509 msg="failed to query total" query="sum(coredns_dns_request_duration_seconds:increase2w{job=\"dns-default\",slo=\"coredns-request-latency\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:54.58605344Z caller=main.go:509 msg="failed to query total" query="sum(etcd_disk_wal_fsync_duration_seconds:increase2w{slo=\"etcd-disk-wal-fsync-duration\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:54.587555298Z caller=main.go:337 msg="running instant query" query="sum(haproxy_backend_connections:increase2w{slo=\"haproxy-backend-connection-errors\"})" ts=2022-11-09T07:16:54.587485411Z
level=debug ts=2022-11-09T07:16:54.587590772Z caller=main.go:337 msg="running instant query" query="sum(etcd_network_peer_round_trip_time_seconds:increase2w{job=\"etcd\",slo=\"etcd-network-peer-round-trip-time\"})" ts=2022-11-09T07:16:54.587504424Z
level=warn ts=2022-11-09T07:16:54.591427868Z caller=main.go:509 msg="failed to query total" query="sum(etcd_disk_backend_commit_duration_seconds:increase2w{slo=\"etcd-disk-backend-commit-duration\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:54.685597909Z caller=main.go:337 msg="running instant query" query="sum(prometheus_http_requests:increase2w{handler=\"/api/v1/query\",slo=\"prometheus-api-query\"})" ts=2022-11-09T07:16:54.685479464Z
level=warn ts=2022-11-09T07:16:54.685811101Z caller=main.go:509 msg="failed to query total" query="sum(apiserver_request_duration_seconds:increase2w{job=\"apiserver\",scope=~\"resource|\",slo=\"apiserver-read-resource-request-latency\",verb=~\"LIST|GET\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:54.685960645Z caller=main.go:509 msg="failed to query total" query="sum(haproxy_backend_connections:increase2w{slo=\"haproxy-backend-connection-errors\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:54.686321991Z caller=main.go:509 msg="failed to query total" query="sum(coredns_dns_responses:increase2w{job=\"dns-default\",slo=\"coredns-response-errors\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:54.785606802Z caller=main.go:509 msg="failed to query total" query="sum(prometheus_http_requests:increase2w{handler=\"/api/v1/query\",slo=\"prometheus-api-query\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:54.785780088Z caller=main.go:509 msg="failed to query total" query="sum(etcd_network_peer_round_trip_time_seconds:increase2w{job=\"etcd\",slo=\"etcd-network-peer-round-trip-time\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:56.030006926Z caller=main.go:337 msg="running instant query" query="sum(apiserver_request_duration_seconds:increase2w{job=\"apiserver\",scope=~\"resource|\",slo=\"apiserver-read-resource-request-latency\",verb=~\"LIST|GET\"})" ts=2022-11-09T07:16:56.029911012Z
level=debug ts=2022-11-09T07:16:56.03226426Z caller=main.go:337 msg="running instant query" query="ALERTS{slo=~\".+\"}" ts=2022-11-09T07:16:56.03225556Z
level=debug ts=2022-11-09T07:16:56.033663027Z caller=main.go:337 msg="running instant query" query="sum(apiserver_request:increase2w{job=\"apiserver\",slo=\"apiserver-write-response-errors\",verb=~\"POST|PUT|PATCH|DELETE\"})" ts=2022-11-09T07:16:56.033601893Z
level=debug ts=2022-11-09T07:16:56.034223307Z caller=main.go:337 msg="running instant query" query="sum(apiserver_request:increase2w{job=\"apiserver\",slo=\"apiserver-read-response-errors\",verb=~\"LIST|GET\"})" ts=2022-11-09T07:16:56.034170889Z
level=warn ts=2022-11-09T07:16:56.038058689Z caller=main.go:509 msg="failed to query total" query="sum(apiserver_request_duration_seconds:increase2w{job=\"apiserver\",scope=~\"resource|\",slo=\"apiserver-read-resource-request-latency\",verb=~\"LIST|GET\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:56.038731314Z caller=main.go:337 msg="running instant query" query="sum(coredns_dns_responses:increase2w{job=\"dns-default\",slo=\"coredns-response-errors\"})" ts=2022-11-09T07:16:56.038654976Z
level=warn ts=2022-11-09T07:16:56.039627642Z caller=main.go:728 msg="failed to query alerts" query="ALERTS{slo=~\".+\"}" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:56.040068187Z caller=main.go:509 msg="failed to query total" query="sum(apiserver_request:increase2w{job=\"apiserver\",slo=\"apiserver-write-response-errors\",verb=~\"POST|PUT|PATCH|DELETE\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:56.08501562Z caller=main.go:337 msg="running instant query" query="sum(coredns_dns_request_duration_seconds:increase2w{job=\"dns-default\",slo=\"coredns-request-latency\"})" ts=2022-11-09T07:16:56.039325905Z
level=debug ts=2022-11-09T07:16:56.087660398Z caller=main.go:337 msg="running instant query" query="sum(etcd_disk_wal_fsync_duration_seconds:increase2w{slo=\"etcd-disk-wal-fsync-duration\"})" ts=2022-11-09T07:16:56.08755965Z
level=warn ts=2022-11-09T07:16:56.089298901Z caller=main.go:509 msg="failed to query total" query="sum(coredns_dns_responses:increase2w{job=\"dns-default\",slo=\"coredns-response-errors\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:56.08976687Z caller=main.go:337 msg="running instant query" query="sum(etcd_disk_backend_commit_duration_seconds:increase2w{slo=\"etcd-disk-backend-commit-duration\"})" ts=2022-11-09T07:16:56.089713159Z
level=warn ts=2022-11-09T07:16:56.185175562Z caller=main.go:509 msg="failed to query total" query="sum(apiserver_request:increase2w{job=\"apiserver\",slo=\"apiserver-read-response-errors\",verb=~\"LIST|GET\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:56.090820756Z caller=main.go:509 msg="failed to query total" query="sum(coredns_dns_request_duration_seconds:increase2w{job=\"dns-default\",slo=\"coredns-request-latency\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:56.092541163Z caller=main.go:509 msg="failed to query total" query="sum(etcd_disk_wal_fsync_duration_seconds:increase2w{slo=\"etcd-disk-wal-fsync-duration\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:56.285693625Z caller=main.go:337 msg="running instant query" query="sum(etcd_network_peer_round_trip_time_seconds:increase2w{job=\"etcd\",slo=\"etcd-network-peer-round-trip-time\"})" ts=2022-11-09T07:16:56.285611341Z
level=debug ts=2022-11-09T07:16:56.285689635Z caller=main.go:337 msg="running instant query" query="sum(haproxy_backend_connections:increase2w{slo=\"haproxy-backend-connection-errors\"})" ts=2022-11-09T07:16:56.285568877Z
level=warn ts=2022-11-09T07:16:56.286181191Z caller=main.go:509 msg="failed to query total" query="sum(etcd_disk_backend_commit_duration_seconds:increase2w{slo=\"etcd-disk-backend-commit-duration\"})" err="prometheus query: client_error: client error: 403"
level=debug ts=2022-11-09T07:16:56.288124531Z caller=main.go:337 msg="running instant query" query="sum(prometheus_http_requests:increase2w{handler=\"/api/v1/query\",slo=\"prometheus-api-query\"})" ts=2022-11-09T07:16:56.288064494Z
level=warn ts=2022-11-09T07:16:56.289913473Z caller=main.go:509 msg="failed to query total" query="sum(etcd_network_peer_round_trip_time_seconds:increase2w{job=\"etcd\",slo=\"etcd-network-peer-round-trip-time\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:56.290273657Z caller=main.go:509 msg="failed to query total" query="sum(haproxy_backend_connections:increase2w{slo=\"haproxy-backend-connection-errors\"})" err="prometheus query: client_error: client error: 403"
level=warn ts=2022-11-09T07:16:56.385373328Z caller=main.go:509 msg="failed to query total" query="sum(prometheus_http_requests:increase2w{handler=\"/api/v1/query\",slo=\"prometheus-api-query\"})" err="prometheus query: client_error: client error: 403"

Kubernetes pod logs:

level=info ts=2022-11-09T07:04:55.114819645Z caller=main.go:118 msg="using Prometheus" url=http://localhost:9090
I1109 07:04:56.460617       1 request.go:682] Waited for 1.044036519s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/logging.openshift.io/v1?timeout=32s
1.6679774982156076e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.667977498216453e+09   INFO    setup   starting manager
1.6679774982168262e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6679774982170098e+09  INFO    Starting EventSource    {"controller": "servicelevelobjective", "controllerGroup": "pyrra.dev", "controllerKind": "ServiceLevelObjective", "source": "kind source: *v1alpha1.ServiceLevelObjective"}
1.6679774982170737e+09  INFO    Starting Controller     {"controller": "servicelevelobjective", "controllerGroup": "pyrra.dev", "controllerKind": "ServiceLevelObjective"}
1.6679774983185127e+09  INFO    Starting workers        {"controller": "servicelevelobjective", "controllerGroup": "pyrra.dev", "controllerKind": "ServiceLevelObjective", "worker count": 1}
level=debug ts=2022-11-09T07:04:58.318649029Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/etcd-disk-backend-commit-duration msg=reconciling
level=info ts=2022-11-09T07:04:58.520181585Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/etcd-disk-backend-commit-duration msg="updating prometheus rule" namespace=openshift-monitoring name=etcd-disk-backend-commit-duration
level=debug ts=2022-11-09T07:04:58.609525145Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/etcd-disk-wal-fsync-duration msg=reconciling
level=info ts=2022-11-09T07:04:58.611015182Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/etcd-disk-wal-fsync-duration msg="updating prometheus rule" namespace=openshift-monitoring name=etcd-disk-wal-fsync-duration
level=debug ts=2022-11-09T07:04:58.645481938Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/coredns-response-errors msg=reconciling
level=info ts=2022-11-09T07:04:58.646357806Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/coredns-response-errors msg="updating prometheus rule" namespace=openshift-monitoring name=coredns-response-errors
level=debug ts=2022-11-09T07:04:58.692054086Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/etcd-network-peer-round-trip-time msg=reconciling
level=info ts=2022-11-09T07:04:58.692904235Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/etcd-network-peer-round-trip-time msg="updating prometheus rule" namespace=openshift-monitoring name=etcd-network-peer-round-trip-time
level=debug ts=2022-11-09T07:04:58.733895357Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/haproxy-backend-connection-errors msg=reconciling
level=info ts=2022-11-09T07:04:58.734671711Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/haproxy-backend-connection-errors msg="updating prometheus rule" namespace=openshift-monitoring name=haproxy-backend-connection-errors
level=debug ts=2022-11-09T07:04:58.766997713Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/apiserver-write-response-errors msg=reconciling
level=info ts=2022-11-09T07:04:58.810784124Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/apiserver-write-response-errors msg="updating prometheus rule" namespace=openshift-monitoring name=apiserver-write-response-errors
level=debug ts=2022-11-09T07:04:58.847817885Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/coredns-request-latency msg=reconciling
level=info ts=2022-11-09T07:04:58.849192027Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/coredns-request-latency msg="updating prometheus rule" namespace=openshift-monitoring name=coredns-request-latency
level=debug ts=2022-11-09T07:04:58.963078366Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/apiserver-read-resource-request-latency msg=reconciling
level=info ts=2022-11-09T07:04:58.964537975Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/apiserver-read-resource-request-latency msg="updating prometheus rule" namespace=openshift-monitoring name=apiserver-read-resource-request-latency
level=debug ts=2022-11-09T07:04:59.028936459Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/apiserver-read-response-errors msg=reconciling
level=info ts=2022-11-09T07:04:59.02980182Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/apiserver-read-response-errors msg="updating prometheus rule" namespace=openshift-monitoring name=apiserver-read-response-errors
level=debug ts=2022-11-09T07:04:59.101252821Z caller=servicelevelobjective.go:55 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/prometheus-api-query msg=reconciling
level=info ts=2022-11-09T07:04:59.101995586Z caller=servicelevelobjective.go:90 controllers=ServiceLevelObjective reconciler=servicelevelobjective namespace=openshift-monitoring/prometheus-api-query msg="updating prometheus rule" namespace=openshift-monitoring name=prometheus-api-query

The Pyrra code that produces these warnings (main.go:509):

    queryTotal := objective.QueryTotal(objective.Window)
    value, _, err := s.promAPI.Query(contextSetPromCache(ctx, 15*time.Second), queryTotal, ts)
    if err != nil {
        // The 403 from Prometheus lands here and is surfaced to the UI as an internal error.
        level.Warn(s.logger).Log("msg", "failed to query total", "query", queryTotal, "err", err)
        return nil, connect.NewError(connect.CodeInternal, err)
    }

All service accounts and cluster role bindings are in place. Am I missing something for the new version? Thanks

metalmatze commented 1 year ago

Thanks for reporting. That is weird. The Prometheus API client used by Pyrra didn't change between releases. What happens if you roll back to v0.4.4? Does it still work with the same configuration?

KSPlatform commented 1 year ago

Yes, it works when I go back to v0.4.4, with what is essentially the same configuration. What I have (or don't have) extra in the config:

  1. An extra ingress certificate mount for the external URL in the API deployment (also used in v0.4.4)
  2. The --generic-rules flag for Kubernetes (not used in v0.4.4, used in v0.5.0)

Thanks

metalmatze commented 1 year ago

Do you have any updates? It's sadly really hard to judge this remotely.

KSPlatform commented 1 year ago

I tried with a fresh system on OpenShift v4.10 and it's the same situation, using the v0.5.0 image directly: errors in those columns. I looked for debug parameters but couldn't find any. Please point me in the right direction for debugging the issue.

KSPlatform commented 1 year ago

Hi, it would be great if you could point me toward a way to debug this. I'm stuck on migrating to the newer version. Thanks

metalmatze commented 1 year ago

I'm sadly not sure how to help you further; I don't have access to an OpenShift cluster. If you could dig some more and post your findings, that might help get to the bottom of this. Sorry.
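
One thing that might help narrow it down: a small, standalone probe (nothing Pyrra-specific) to check whether the mounted ServiceAccount token is accepted by the in-cluster query endpoint at all. The thanos-querier service name, port 9091, and token path below are assumptions; adjust them to whatever your Pyrra deployment actually queries.

package main

// Hypothetical debugging probe, not part of Pyrra: read the pod's ServiceAccount
// token and run a single instant query against the in-cluster endpoint. A 403
// here points at the token/RBAC, not at Pyrra itself.
import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest(http.MethodGet,
		"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=up", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	// Quick check only; for anything real, trust the mounted service-ca instead of skipping verification.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}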

luigiaparicio commented 1 year ago

Hi, I'm trying Pyrra for the first time. I'm also on OpenShift 4.10 and I'm facing the same issue.

metalmatze commented 1 year ago

I wish I still had the easy access to an OpenShift cluster that I had back when I worked on OpenShift at Red Hat; sadly that isn't the case right now. If you could help by providing more information, that'd be fantastic.

metalmatze commented 1 year ago

Someone was very kind, and I finally got access to OpenShift 4.12. Sadly I couldn't get it working before I had to leave again. In my case it looks like Pyrra v0.4 didn't work and v0.5 doesn't either, so I could reproduce this.

It's definitely an authorization issue. I'm a bit rusty on what exactly needs to be done for OpenShift, so help is appreciated.

The service-ca can be read from the mounted ConfigMap volume and I don't think we do anything with it.

https://github.com/pyrra-dev/pyrra/blob/e88234910527fe08f125f718319d1e9b93463ec5/main.go#L105-L109

+   TLSConfig: promconfig.TLSConfig{
+       CAFile: "/etc/ssl/certs/service-ca.crt",
+   },

It might be that we need to send an Authorization Bearer token along with each request? Even if that's the case, do we need to tell Prometheus to accept incoming requests from Pyrra?
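
If so, a rough sketch of what that could look like. This is an assumption rather than a confirmed change; the field names come from github.com/prometheus/common/config, and the file paths are the usual in-cluster mounts.

package main

// Hypothetical sketch (not the actual Pyrra patch): build the Prometheus API client
// with the OpenShift service CA and the pod's ServiceAccount token as Bearer auth.
import (
	prometheusapi "github.com/prometheus/client_golang/api"
	promconfig "github.com/prometheus/common/config"
)

func newOpenShiftPrometheusClient(prometheusURL string) (prometheusapi.Client, error) {
	roundTripper, err := promconfig.NewRoundTripperFromConfig(promconfig.HTTPClientConfig{
		TLSConfig: promconfig.TLSConfig{
			// CA for in-cluster serving certificates, mounted from the service-ca ConfigMap.
			CAFile: "/etc/ssl/certs/service-ca.crt",
		},
		Authorization: &promconfig.Authorization{
			// Attach the ServiceAccount token to every query.
			Type:            "Bearer",
			CredentialsFile: "/var/run/secrets/kubernetes.io/serviceaccount/token",
		},
	}, "pyrra")
	if err != nil {
		return nil, err
	}
	return prometheusapi.NewClient(prometheusapi.Config{
		Address:      prometheusURL,
		RoundTripper: roundTripper,
	})
}

On the Prometheus side, OpenShift's query endpoints sit behind kube-rbac-proxy/oauth-proxy, so the ServiceAccount presumably also needs something like the cluster-monitoring-view ClusterRole bound to it before a Bearer token is accepted.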

Side note: it probably makes more sense to query the Thanos Querier on the latest OpenShift versions?

RiRa12621 commented 1 year ago

We are sending the bearer token, as mounted here: https://github.com/pyrra-dev/pyrra/blob/e88234910527fe08f125f718319d1e9b93463ec5/examples/openshift/deploy/api.yaml#L32

Unless the syntax in Pyrra changed, that should still be the same.

Are you following the examples from here?

chetan-bbd commented 1 year ago

I have exactly the same problem. Pyrra v0.4.4 works fine, but if I upgrade to v0.5.0 the UI shows the same errors. I am using OpenShift 4.12. Note: I am following the example here, and I am connecting to the Thanos queriers.

KSPlatform commented 1 year ago

Hi, one addition for information: Alertmanager, Thanos Querier, and Thanos Ruler "web" access was removed as of OCP v4.10, and that data can only be accessed via the command line or the OCP console (https://docs.openshift.com/container-platform/4.10/release_notes/ocp-4-10-release-notes.html?extIdCarryOver=true&intcmp=7013a000002CtetAAC&sc_cid=7013a0000034hRXAAY#ocp-4-10-third-party-monitoring-component-uis-removal). So I'm not sure the Prometheus links in Pyrra are still useful. Things are getting harder on the OCP side.