[RFE] Bake in Performance and Scalability knowledge for Kraken to pass/fail #76

chaitanyaenr commented 3 years ago

Kraken can currently check whether the targeted component recovered from the injected failures, and it also checks the health of the cluster as a whole to make sure the chaos didn't impact it. Kraken also needs to take the performance and scalability of the component under test into account, to expose issues where the component/cluster doesn't perform well after recovery.

Kraken supports performance monitoring, which helps with tracking performance and scale metrics: https://github.com/cloud-bulldozer/kraken#performance-monitoring but we need a way for it to analyze the metrics automatically, without someone having to check on the cluster manually.

The proposal is for Kraken to accept a profile consisting of metrics to scrape from Prometheus, evaluate them against a gold set of values obtained from the OpenShift/Kubernetes performance and scale runs, and pass/fail the run based on the result. This is inspired by Kube-burner, which we use heavily for performance/scale testing of OpenShift: https://github.com/cloud-bulldozer/kube-burner/blob/master/docs/alerting.md.
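To make the idea concrete, here is a rough sketch of what the evaluation step could look like. Everything in it is an assumption for illustration, not an existing Kraken or Kube-burner interface: the profile layout, the `evaluate_profile` and `run_passed` names, and especially the gold values, which are placeholders rather than numbers from real performance and scale runs. The `query` argument stands in for whatever function runs a PromQL expression against Prometheus and returns a single number.

```python
import operator
from typing import Callable, Dict, List

# Comparison operators a profile entry may use (hypothetical set).
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Hypothetical profile: each entry pairs a PromQL expression with a gold value.
# The gold values are placeholders, NOT results from real perf/scale runs.
PROFILE: List[Dict] = [
    {
        "name": "etcd_wal_fsync_p99_seconds",
        "expr": "histogram_quantile(0.99, "
                "sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m])) by (le))",
        "op": "<",
        "gold": 0.01,
    },
    {
        "name": "apiserver_5xx_rate",
        "expr": 'sum(rate(apiserver_request_total{code=~"5.."}[2m]))',
        "op": "<=",
        "gold": 1.0,
    },
]


def evaluate_profile(profile: List[Dict], query: Callable[[str], float]) -> List[str]:
    """Return one message per check whose observed value violates its gold value."""
    failures = []
    for check in profile:
        observed = query(check["expr"])
        if not OPS[check["op"]](observed, check["gold"]):
            failures.append(
                f"{check['name']}: observed {observed}, "
                f"expected {check['op']} {check['gold']}"
            )
    return failures


def run_passed(profile: List[Dict], query: Callable[[str], float]) -> bool:
    """The run passes only if every check in the profile holds."""
    return not evaluate_profile(profile, query)
```

Keeping the Prometheus access behind a plain callable means the pass/fail logic can be exercised without a cluster, and the same function would work whether the profile is evaluated once per run or after each scenario.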

More on the importance of doing this can be found in the blog we published recently: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests.

chaitanyaenr commented 3 years ago

cc: @paigerube14 @mffiedler @rsevilla87 @smalleni

paigerube14 commented 3 years ago

This sounds like a great addition!! My only thought: would this make more sense in Cerberus?

If it's kept in Kraken, would we have a profile for each of the components we are testing (e.g. etcd, kube-apiserver) that we would run after testing each specific component? Or just an overall profile run at each iteration?
