Closed by alexandreLamarre 2 years ago
After fixing the cortex query syntax and unmarshalling for the SLO Status API, there is no HTTP response/latency data to measure for the prometheus server (because cortex uses remote write).
It doesn't seem like a metric definition issue, so I am doing some local testing to see if I can add scrape targets to the configuration for a dummy application that exports HTTP response/latency data.
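A minimal sketch of what such a dummy application could look like, using only the Python standard library and the Prometheus text exposition format. The metric names, bucket bounds, and port are assumptions picked for illustration; they are not taken from the actual test setup.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical histogram bucket bounds (seconds), for illustration only.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0]


class Metrics:
    """In-memory request counter and latency histogram."""

    def __init__(self):
        self.requests = {}  # status code -> request count
        self.bucket_counts = [0] * len(BUCKETS)
        self.latency_sum = 0.0
        self.latency_count = 0

    def observe(self, code, seconds):
        self.requests[code] = self.requests.get(code, 0) + 1
        # Histogram buckets are cumulative: bump every bucket the value fits in.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[i] += 1
        self.latency_sum += seconds
        self.latency_count += 1

    def render(self):
        """Render the metrics in the Prometheus text exposition format."""
        lines = ["# TYPE http_requests_total counter"]
        for code in sorted(self.requests):
            lines.append(f'http_requests_total{{code="{code}"}} {self.requests[code]}')
        lines.append("# TYPE http_request_duration_seconds histogram")
        for bound, count in zip(BUCKETS, self.bucket_counts):
            lines.append(f'http_request_duration_seconds_bucket{{le="{bound}"}} {count}')
        lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {self.latency_count}')
        lines.append(f"http_request_duration_seconds_sum {self.latency_sum}")
        lines.append(f"http_request_duration_seconds_count {self.latency_count}")
        return "\n".join(lines) + "\n"


METRICS = Metrics()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.render().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return
        # Any other path simulates a request with a random outcome and latency.
        code = random.choice([200, 200, 200, 500])
        METRICS.observe(code, random.uniform(0.01, 1.2))
        self.send_response(code)
        self.end_headers()


def serve(port=8080):
    """Start the dummy application (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` and pointing a scrape target at `/metrics` on that port would give the agent some HTTP response and latency series to pick up.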
I added the uuid of the SLO objects as a prometheus label to all the generated recording, metadata & alerting rules, so we can better aggregate them individually & unify the way we track IDs for SLOs.
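Roughly, the labelling step looks like the sketch below. The label name `slo_id`, the rule names, and the expressions are hypothetical stand-ins; the real generated rules and label key may differ.

```python
import uuid

# Hypothetical label key; the actual label name used by the SLO rules may differ.
SLO_ID_LABEL = "slo_id"


def with_slo_id(rules, slo_id):
    """Return copies of the rules with the SLO id attached as a label."""
    labeled = []
    for rule in rules:
        labels = dict(rule.get("labels", {}))  # copy, keep existing labels
        labels[SLO_ID_LABEL] = slo_id
        labeled.append({**rule, "labels": labels})
    return labeled


def selector(metric, slo_id):
    """Build a PromQL selector that picks out the series of one SLO."""
    return f'{metric}{{{SLO_ID_LABEL}="{slo_id}"}}'


# Illustrative recording and alerting rules (names and expressions are made up).
example_rules = [
    {"record": "slo:error_ratio:rate5m", "expr": "vector(0)"},
    {
        "alert": "SLOBurnRateHigh",
        "expr": "slo:error_ratio:rate5m > 0.01",
        "labels": {"severity": "page"},
    },
]

example_id = str(uuid.uuid4())
example_labeled = with_slo_id(example_rules, example_id)
```

With the same id on every generated rule, a single selector such as `selector("slo:error_ratio:rate5m", example_id)` is enough to find everything that belongs to one SLO.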
However, cortex returns no data for all queries made against the recording rules, so I'm going to have to spend an afternoon or a day in the debugger, debugging cortex itself. (The recording rules show as loaded & active in the cortex ruler and the metrics exist, but the result is somehow 0???)
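One way to narrow this down before reaching for the debugger is to run the recorded series and the raw rule expression side by side and compare the results. The sketch below only covers the comparison step: it assumes the two response bodies come from a Prometheus-compatible instant query endpoint (`/api/v1/query`) and does not perform the HTTP calls itself. The function names and the returned labels are made up for illustration.

```python
import json


def vector_samples(response_body):
    """Parse an instant query response and return (labels, value) pairs."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample carries its value as [unix_timestamp, "string_value"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def diagnose(recorded_body, raw_body):
    """Compare the recorded series against the raw rule expression."""
    recorded = vector_samples(recorded_body)
    raw = vector_samples(raw_body)
    if not raw:
        return "raw_empty"  # the underlying metrics are missing
    if not recorded:
        return "rule_empty_raw_has_data"  # the rule is not producing samples
    return "both_have_data"
```

A result of `rule_empty_raw_has_data` would point at rule evaluation or storage rather than at the metrics themselves.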
Looks like the SLO status returning no data was a "bug" in the test environment, so I narrowed down my user story.
Since SLO Status API seemed to be working fine in the kubernetes e2e setting, I've started working on e2e tests for SLOs
e2e is blocked by agents failing to bootstrap:
error during bootstrap: auth request failed: rpc error: code = Unavailable desc = capability "metrics" cannot be installed: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin1214421444: connect: connection refused"
Edit: Joe found the issue and is working on a fix. Restarting the manager pod should fix it on my cluster.
Upon further debugging in the test environment, there are two possibilities:
If a cluster gets disconnected, the SLOs still query the cluster, because LoadRules isn't causing an error in the SLO create API.
There are two things going wrong:
The POST API call for /api/v1/rules/{namespace} returns a 404 page not found
Fixed application of recording rules in opni kubernetes deployments
Service discovery backend is completely busted
prometheus agent scrapers are totally busted and keep timing out
Blocked on: waitctx, the testenv context, the prometheus agent scraper timeout, the cortex metadata API, and the cortex recording rule returning an empty vector, as opposed to the result from the raw recording rule query.
Create an API that queries the status of an SLO and returns one of:
based on the recording rule data, rule metadata, and rule alerts created from the SLOs API