krkn-chaos / cerberus

Guardian of Kubernetes clusters. A tool to monitor cluster health and signal/alert on failures.
Apache License 2.0

Measuring overhead of API calls #179

Open jtaleric opened 2 years ago

jtaleric commented 2 years ago

Do we have measurements of the load Cerberus puts on the cluster, since it continuously monitors aspects of the cloud?

Thoughts on using the cluster events to determine go/no-go? Maybe have an option for no sqlite/API calls?

I built a prototype called rlgl (Red Light, Green Light) that simply uses the events to determine whether things should stop or go. The only gotcha here is the TTL of events. https://github.com/jtaleric/rlgl -- this won't keep track of things the way Cerberus does (since it has no DB), but maybe for CI signaling we can reduce the scope of what we need to know for go/no-go?
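
A minimal sketch of the events-based idea, not the actual rlgl logic: it assumes events arrive as dicts carrying the core/v1 Event fields `type` and `lastTimestamp`, and treats any `Warning` event seen within a recent window as a red light. The function name `go_no_go` and the five-minute window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def go_no_go(events, now, window=timedelta(minutes=5)):
    """Return "no-go" if any Warning event was last seen within `window`, else "go"."""
    cutoff = now - window
    for event in events:
        # Only Warning events count against the signal; Normal events are ignored.
        if event.get("type") != "Warning":
            continue
        # Assumption: lastTimestamp is an RFC 3339 string ending in "Z".
        last_seen = datetime.fromisoformat(event["lastTimestamp"].replace("Z", "+00:00"))
        if last_seen >= cutoff:
            return "no-go"
    return "go"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"type": "Normal", "reason": "Scheduled", "lastTimestamp": "2024-01-01T11:59:00Z"},
    {"type": "Warning", "reason": "BackOff", "lastTimestamp": "2024-01-01T11:30:00Z"},
]
print(go_no_go(events, now))  # the only Warning is 30 minutes old, so this prints "go"
```

Because the decision only looks at events that still exist, anything older than the event TTL is invisible to it, which is the gotcha mentioned above.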

chaitanyaenr commented 2 years ago

We currently do not have the measurements, @jtaleric. We will look into measuring the number of API calls Cerberus makes in each iteration to understand its overhead. The polling duration is configurable if we do not want to hit the API aggressively.
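
One way the per-iteration count could be collected is by wrapping each function that issues an API request with a counter. This is only a sketch: `counted`, `list_nodes`, `list_pods`, and `run_iteration` are hypothetical stand-ins, not Cerberus functions, and the stubs return canned data instead of calling a cluster.

```python
from collections import Counter
from functools import wraps

api_calls = Counter()  # call counts for the current iteration


def counted(name):
    """Decorator that increments api_calls[name] each time the wrapped function runs."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            api_calls[name] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


@counted("list_nodes")
def list_nodes():
    return ["node-a", "node-b"]  # stand-in for a real API request


@counted("list_pods")
def list_pods(namespace):
    return []  # stand-in for a real API request


def run_iteration(namespaces):
    """Run one monitoring pass and return how many calls of each kind it made."""
    api_calls.clear()
    list_nodes()
    for namespace in namespaces:
        list_pods(namespace)
    return dict(api_calls)


print(run_iteration(["default", "kube-system"]))
```

Multiplying the per-iteration totals by the iteration rate gives a rough request rate against the API server for a given polling setting.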

For the chaos use case, some failures are expected, and we query the JSON stored in the sqlite DB for particular fields and failure counts to decide pass/fail.

Extending the TTL of the events will put load on etcd AFAIK - something to explore.
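
A rough sketch of that kind of query follows. The table name, columns, JSON fields, and thresholds here are all assumptions made for illustration, not the actual Cerberus schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per recorded failure, with a JSON payload.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE failures (ts REAL, payload TEXT)")
rows = [
    (1.0, json.dumps({"component": "etcd", "issue": "pod crash"})),
    (2.0, json.dumps({"component": "etcd", "issue": "pod crash"})),
    (3.0, json.dumps({"component": "node", "issue": "NotReady"})),
]
conn.executemany("INSERT INTO failures VALUES (?, ?)", rows)


def failure_count(conn, component, since=0.0):
    """Count recorded failures for one component at or after timestamp `since`."""
    cursor = conn.execute("SELECT payload FROM failures WHERE ts >= ?", (since,))
    return sum(1 for (payload,) in cursor if json.loads(payload).get("component") == component)


def verdict(conn, allowed):
    """Return "fail" if any listed component exceeds its allowed failure count."""
    for component, limit in allowed.items():
        if failure_count(conn, component) > limit:
            return "fail"
    return "pass"


print(verdict(conn, {"etcd": 2, "node": 1}))  # expected failures stay within limits
print(verdict(conn, {"etcd": 1}))             # two etcd failures exceed the limit of one
```

Only the components named in `allowed` are checked, so a chaos run can tolerate the failures it deliberately injects while still failing on unexpected ones.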

jtaleric commented 2 years ago

> Extending the TTL of the events will put load on etcd AFAIK - something to explore.

Ack. I am not sure we need to increase the TTL. It would be good to get some measurements of this.

My .02 here is that if the "bad" events are clearing during reconciliation, do we care? It could just be part of the lifecycle.