rancher / opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0
SLO Status API & Cortex bug fixes e2e #363

Closed alexandreLamarre closed 2 years ago

alexandreLamarre commented 2 years ago

Create an API that queries the status of an SLO and returns one of:

based on the recording rule data / rule metadata / rule alert created from SLOs API

alexandreLamarre commented 2 years ago

After fixing the cortex query syntax and unmarshalling for the SLO Status API, there is no HTTP response/latency data to measure for the prometheus server (because cortex uses remotewrite).

Doesn't seem like a metric definition issue, so I am doing some local testing to see if I can add some configuration scrape targets for a dummy application that will export HTTP response/latency data.

alexandreLamarre commented 2 years ago

I added the uuid of the SLO objects as a prometheus label to all the generated recording, metadata & alerting rules, so we can better aggregate them individually & unify the way we track IDs for SLOs.
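A minimal sketch of the labelling idea, assuming the generated rules follow the standard Prometheus rule-file layout (groups containing rules with `record` or `alert`, `expr` and optional `labels`). The label name `slo_id` and both helper functions are hypothetical and are not taken from the opni code base:

```python
import copy

# Hypothetical label name, chosen for illustration only.
SLO_ID_LABEL = "slo_id"


def label_rule_groups(groups, slo_uuid, label=SLO_ID_LABEL):
    """Return a copy of `groups` with `label=slo_uuid` set on every rule.

    `groups` is a list of Prometheus-style rule groups:
    [{"name": ..., "rules": [{"record" or "alert": ..., "expr": ...}]}]
    """
    labeled = copy.deepcopy(groups)  # leave the caller's rules untouched
    for group in labeled:
        for rule in group.get("rules", []):
            # Recording, metadata and alerting rules all get the same ID label.
            rule.setdefault("labels", {})[label] = slo_uuid
    return labeled


def selector_for(slo_uuid, metric, label=SLO_ID_LABEL):
    """Build a PromQL selector that picks out the series of a single SLO."""
    return f'{metric}{{{label}="{slo_uuid}"}}'
```

With one label shared across all rule types, a single selector such as `selector_for(uuid, metric)` is enough to fetch everything belonging to one SLO.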

However cortex returns 0 data for all queries made against the recording rules, so I'm gonna have to spend a day/afternoon in the debugger, debugging cortex itself. (Since the recording rules show as loaded & active in the cortex ruler & the metrics exist but the result is somehow 0???)
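When debugging this, it helps to tell "no series returned" apart from "series returned with value 0". A rough sketch, assuming the response is the decoded JSON body of a Prometheus-compatible instant query (`/api/v1/query`); the function name and the returned category strings are made up for illustration:

```python
def classify_instant_query(response):
    """Classify a decoded Prometheus-style instant query response.

    Expects the usual shape:
    {"status": "success",
     "data": {"resultType": "vector",
              "result": [{"metric": {...}, "value": [ts, "1.5"]}]}}
    """
    if response.get("status") != "success":
        return "error"
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        return "unexpected_type"
    samples = data.get("result", [])
    if not samples:
        # Empty vector: the rule may be loaded, but nothing matched the query.
        return "no_data"
    # Sample values are encoded as strings in the HTTP API.
    values = [float(sample["value"][1]) for sample in samples]
    if all(value == 0.0 for value in values):
        return "all_zero"
    return "has_data"
```

An empty vector points at the query or the stored series (labels, tenant, time range), while an all-zero result points at the rule expression or the underlying metrics.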

alexandreLamarre commented 2 years ago

Looks like the SLO status returning no data was a "bug" in the test environment. I've also narrowed down my user story.

alexandreLamarre commented 2 years ago

Since SLO Status API seemed to be working fine in the kubernetes e2e setting, I've started working on e2e tests for SLOs

alexandreLamarre commented 2 years ago

e2e is blocked by agents failing to bootstrap:

error during bootstrap: auth request failed: rpc error: code = Unavailable desc = capability "metrics" cannot be installed: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin1214421444: connect: connection refused"

Edit: Joe found the issue and is working on a fix. Restarting the manager pod should fix it on my cluster.

alexandreLamarre commented 2 years ago

Upon further debugging in the test environment, there are two possibilities:

alexandreLamarre commented 2 years ago

If a cluster gets disconnected, SLOs still query the cluster, because LoadRules isn't causing an error in the SLO create API.

alexandreLamarre commented 2 years ago

There are two things going wrong:

alexandreLamarre commented 2 years ago

Fixed application of recording rules in opni kubernetes deployments

alexandreLamarre commented 2 years ago

Service discovery backend is completely busted

alexandreLamarre commented 2 years ago

prometheus agent scrapers are totally busted and keep timing out

alexandreLamarre commented 2 years ago

Blocked on: waitctx testenv context, prometheus agent scraper timeout, cortex metadata api, and the cortex recording rule returning an empty vector, as opposed to the result from the raw recording rule query.