Open jtaleric opened 2 years ago
We currently do not have the measurements @jtaleric. We will look into measuring the number of API calls Cerberus makes per iteration to understand its overhead. The polling interval is configurable if we do not want to hit the API aggressively.
For the chaos use case, some failures are expected, so we query the JSON stored in the SQLite DB for particular fields and failure counts to decide pass/fail.
Extending the TTL of the events will put load on etcd, AFAIK. Something to explore.
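As a rough illustration of the pass/fail check described above, here is a minimal sketch in Python. The table name `failures`, the `payload` column, the `component` field, and the helper names are all assumptions made for this example; they are not taken from the Cerberus schema or code.

```python
import json
import sqlite3
from collections import Counter


def failure_counts(conn):
    """Count stored failure records per component.

    Assumes a hypothetical table `failures(payload TEXT)` whose rows hold
    JSON objects with a `component` field.
    """
    counts = Counter()
    for (payload,) in conn.execute("SELECT payload FROM failures"):
        record = json.loads(payload)
        counts[record["component"]] += 1
    return counts


def passes(conn, allowed):
    """Pass only if no component failed more often than the run allows.

    `allowed` maps component name to the number of failures the chaos
    scenario is expected to cause; unlisted components are allowed zero.
    """
    counts = failure_counts(conn)
    return all(n <= allowed.get(component, 0) for component, n in counts.items())


def demo_db():
    """Build an in-memory database with a few made-up failure records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE failures (payload TEXT)")
    records = [
        {"component": "etcd", "issue": "pod not ready"},
        {"component": "etcd", "issue": "pod not ready"},
        {"component": "kube-apiserver", "issue": "pod restarted"},
    ]
    conn.executemany(
        "INSERT INTO failures (payload) VALUES (?)",
        [(json.dumps(r),) for r in records],
    )
    return conn
```

With the demo data, a scenario that expects two etcd failures and one kube-apiserver failure would pass, while one that only expects the etcd failures would fail because of the unexpected kube-apiserver record.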
Ack. I am not sure we need to increase the TTL. It would be good to get some measurements of this.
My .02 here is that if the "bad" events are clearing during reconciliation, do we care? It could just be part of the life-cycle.
Do we have measurements of the load Cerberus puts on the cluster, since it continuously monitors aspects of the cloud?
Thoughts on using the cluster events to determine go/no-go? Maybe have an option for no SQLite/API calls?
I built a prototype called rlgl (Red Light, Green Light) that simply uses the events to determine if things should stop/go. The only gotcha here is the ttl of events. https://github.com/jtaleric/rlgl -- this won't keep track of things like Cerberus does (since it has no db), but maybe for CI signaling we can reduce the scope of what we need to know for go/no-go?
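To make the event-based idea concrete, here is a minimal sketch of a go/no-go decision over cluster events. It is not how rlgl is implemented; the function name `go_no_go` and the look-back `window` parameter are assumptions for illustration. It only relies on events carrying a `type` of `Normal` or `Warning` and a `lastTimestamp`, as Kubernetes Event objects do, and it takes plain dicts so it makes no API calls itself.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse an RFC 3339 timestamp such as '2021-03-01T11:58:00Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def go_no_go(events, now, window=timedelta(minutes=10)):
    """Return 'no-go' if any Warning event was last seen within `window`.

    Warnings older than the window are treated as already reconciled.
    The window cannot usefully exceed the cluster's event ttl, because
    expired events are no longer there to inspect.
    """
    cutoff = now - window
    for event in events:
        if event["type"] != "Warning":
            continue
        if parse_timestamp(event["lastTimestamp"]) >= cutoff:
            return "no-go"
    return "go"


# Made-up sample events for illustration only.
SAMPLE_EVENTS = [
    {"type": "Normal", "reason": "Scheduled", "lastTimestamp": "2021-03-01T11:58:00Z"},
    {"type": "Warning", "reason": "BackOff", "lastTimestamp": "2021-03-01T11:30:00Z"},
]
SAMPLE_NOW = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
```

With the sample data, the default 10-minute window ignores the 30-minute-old warning, while a one-hour window would still count it. That is one way to express the "do we care if it cleared during reconciliation" question as a tunable.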