SLO Status API & Cortex bug fixes e2e

rancher / opni

Multi Cluster Observability with AIOps

https://opni.io

Apache License 2.0

SLO Status API & Cortex bug fixes e2e #363

Closed alexandreLamarre closed 2 years ago

alexandreLamarre commented 2 years ago

Create an API that queries the status of an SLO and returns one of:

No data - grey (no incoming data for the SLO)
Ok - green (within budget)
Warning - orange (current burn rate suggests we are consuming too much of the remaining budget)
Breaching - red (no longer within budget for the current period)
InternalError - black (only in very rare bad cases)

The status is based on the recording rule data, rule metadata, and rule alerts created by the SLOs API.
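As a rough illustration of the five states above, here is a minimal Python sketch of how a status could be derived from two measurements. The function name, the `budget_remaining` and `burn_rate` inputs, and the default threshold are all assumptions made for this example; they are not taken from the opni code base.

```python
from enum import Enum
from typing import Optional


class SLOStatus(Enum):
    """The five states from the issue, paired with their display colour."""
    NO_DATA = "grey"
    OK = "green"
    WARNING = "orange"
    BREACHING = "red"
    INTERNAL_ERROR = "black"


def classify_slo(budget_remaining: Optional[float],
                 burn_rate: Optional[float],
                 burn_rate_threshold: float = 1.0) -> SLOStatus:
    """Map hypothetical SLO measurements to a status.

    budget_remaining: fraction of the error budget left for the period,
                      or None when the recording rules returned no samples.
    burn_rate: how fast the budget is being consumed; in this sketch 1.0
               means the budget would be used up exactly at period end.
    burn_rate_threshold: assumed cut-off above which we warn.
    """
    if budget_remaining is None or burn_rate is None:
        return SLOStatus.NO_DATA
    if burn_rate < 0:
        # A negative burn rate is not meaningful, so treat it as a bad case.
        return SLOStatus.INTERNAL_ERROR
    if budget_remaining <= 0:
        return SLOStatus.BREACHING
    if burn_rate > burn_rate_threshold:
        return SLOStatus.WARNING
    return SLOStatus.OK
```

For example, `classify_slo(0.5, 3.0)` yields `SLOStatus.WARNING` under these assumptions: budget is left, but it is burning faster than the assumed threshold.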

alexandreLamarre commented 2 years ago

After fixing the cortex query syntax and unmarshalling for the SLO Status API, there is no HTTP response/latency data to measure for the prometheus server (because cortex uses remotewrite).

Doesn't seem like a metric definition issue, so I am doing some local testing to see if I can add some configuration scrape targets for a dummy application that will export HTTP response/latency data.

alexandreLamarre commented 2 years ago

I added the uuid of the SLO objects as a prometheus label to all the generated recording, metadata & alerting rules, so we can better aggregate them individually & unify the way we track IDs for SLOs.
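The labelling idea can be sketched roughly as follows, in Python for brevity. The label name `slo_id`, the helper names, and the rule dictionaries are hypothetical; the rule shape only mirrors the usual Prometheus `record` / `alert` / `expr` / `labels` fields.

```python
import copy

SLO_ID_LABEL = "slo_id"  # hypothetical label name, chosen for this sketch


def label_rules(rules, slo_id):
    """Return a copy of the rules with the SLO's id attached as a label.

    Each rule is a dict shaped like a Prometheus recording or alerting rule.
    Existing labels are preserved; the input list is not modified.
    """
    labelled = copy.deepcopy(rules)
    for rule in labelled:
        rule.setdefault("labels", {})[SLO_ID_LABEL] = slo_id
    return labelled


def selector(metric, slo_id):
    """Build a PromQL selector that picks out one SLO's series."""
    return f'{metric}{{{SLO_ID_LABEL}="{slo_id}"}}'
```

With every generated rule carrying the same label, a single selector such as `selector("slo:sli", "abc")` is enough to address one SLO's recording, metadata, or alerting series.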

However, Cortex returns 0 data for all queries made against the recording rules, so I'm going to have to spend an afternoon or a day in the debugger, debugging Cortex itself. The recording rules show as loaded and active in the Cortex ruler and the metrics exist, but the result is somehow 0.
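When debugging this kind of result it can help to separate "the query returned no series at all" from "the query returned series whose value is 0". The sketch below does that for a Prometheus-style instant query response body; the function name and the four return labels are made up for this example.

```python
import json


def summarize_instant_query(body: str) -> str:
    """Classify a Prometheus-compatible instant query response.

    Returns one of (labels invented for this sketch):
      "error" - the request failed or did not return an instant vector
      "empty" - success, but the vector contains no series
      "zero"  - every returned sample has the value 0
      "data"  - at least one sample has a non-zero value
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        return "error"
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        return "error"
    samples = data.get("result", [])
    if not samples:
        return "empty"
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    values = [float(sample["value"][1]) for sample in samples]
    if all(v == 0.0 for v in values):
        return "zero"
    return "data"
```

Running the same check against both the recording rule's series and the raw expression behind it shows whether the two actually disagree.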

alexandreLamarre commented 2 years ago

Looks like the SLO status returning no data was a "bug" in the test environment, so I narrowed down my user story.

alexandreLamarre commented 2 years ago

Since SLO Status API seemed to be working fine in the kubernetes e2e setting, I've started working on e2e tests for SLOs

alexandreLamarre commented 2 years ago

e2e is blocked by agents failing to bootstrap:

error during bootstrap: auth request failed: rpc error: code = Unavailable desc = capability "metrics" cannot be installed: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin1214421444: connect: connection refused"

Edit: Joe found the issue and is working on a fix. Restarting the manager pod should fix it on my cluster.

alexandreLamarre commented 2 years ago

Upon further debugging in the test environment, there are two possibilities:

mock instrumentation server needs to explicitly export its own job name
cortex query for a rule is broken

alexandreLamarre commented 2 years ago

If a cluster gets disconnected, SLOs still query the cluster, because LoadRules isn't causing an error in the SLO create API.

alexandreLamarre commented 2 years ago

There are two things going wrong:

cortex is refusing all rule applications when deployed in its microservice form (any POST api call for /api/v1/rules/{namespace} returns a 404 page not found)
the service discovery backend is turning out to be nearly impossible to tune with any consistency, so it will probably not be very usable
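For reference, the rule application call mentioned in the first item looks roughly like the request built below. This is only a sketch: the path comes from the comment above and `X-Scope-OrgID` is Cortex's tenant header, but the helper name, host, tenant, and rule group body are placeholders. The request is constructed without being sent, so it can be inspected when a 404 needs to be reproduced.

```python
import urllib.request


def build_rule_group_request(base_url, namespace, tenant_id, group_yaml):
    """Build (but do not send) a POST that submits one rule group.

    base_url:   address of the Cortex component under test (placeholder).
    namespace:  rule namespace, the {namespace} part of the path.
    tenant_id:  value for the X-Scope-OrgID tenant header.
    group_yaml: YAML text of a single rule group.
    """
    url = f"{base_url.rstrip('/')}/api/v1/rules/{namespace}"
    return urllib.request.Request(
        url,
        data=group_yaml.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/yaml",
            "X-Scope-OrgID": tenant_id,
        },
    )
```

Passing the result to `urllib.request.urlopen` sends it; an `HTTPError` with code 404 would match the behaviour described above.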

alexandreLamarre commented 2 years ago

Fixed application of recording rules in opni kubernetes deployments

alexandreLamarre commented 2 years ago

Service discovery backend is completely busted

alexandreLamarre commented 2 years ago

Prometheus agent scrapers are totally busted and keep timing out

alexandreLamarre commented 2 years ago

Blocked on: waitctx, the testenv context, the Prometheus agent scraper timeout, the Cortex metadata API, and the Cortex recording rule returning an empty vector, as opposed to the result from the raw recording rule query.