Background
One of the requirements of any chaos engineering framework is an effective means of carrying out health checks (also referred to as steady-state checks, pre/post-chaos checks, entry/exit criteria, "liveness" checks, etc.) specific to the object under test. The common purpose of such checks is to effectively relay information about the availability and optimal performance of the application or infrastructure under test. An ideal liveness check would do so even while the chaos/fault is in progress (rather than give a one-time indication before and after the fault, which may not account for transient errors and recovery).
In Litmus, most checks fall in the former category (one-time, command-based verification before and after chaos) and a few in the latter (continuous checks during chaos). They run as jobs executing a bash/python script in perpetual mode (an infinite loop) that is launched and killed by the experiment logic. Typically these take service endpoints, polling intervals, and retry counts/timeouts as inputs to perform the checks.
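For illustration, a minimal sketch of the kind of loop such a job runs today; the endpoint, interval, and retry values below are hypothetical inputs, not Litmus-defined parameters:

```python
# Hypothetical sketch of a Litmus-style continuous liveness check:
# poll a service endpoint at a fixed interval and exit non-zero once
# the allowed number of consecutive failures is exhausted.
import socket
import sys
import time

ENDPOINT = ("mysql.app.svc.cluster.local", 3306)  # assumed service endpoint
POLL_INTERVAL = 2   # seconds between probes (assumed)
MAX_FAILURES = 5    # consecutive failures tolerated (assumed)

failures = 0
while True:
    try:
        # A bare TCP connect stands in for whatever command the real
        # checker runs (mysql status query, showmount, etc.).
        with socket.create_connection(ENDPOINT, timeout=3):
            failures = 0
            print("liveness: OK")
    except OSError as err:
        failures += 1
        print(f"liveness: probe failed ({err}), {failures}/{MAX_FAILURES}")
        if failures >= MAX_FAILURES:
            sys.exit(1)  # experiment logic treats a non-zero exit as 'not live'
    time.sleep(POLL_INTERVAL)
```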
Problem Statement
As is apparent, these liveness checks are very simplistic and probably do not provide an entirely accurate picture of availability, at least from the experiment's standpoint. As cases in point:
The MySQL checker performs a status query to check "health". However, from a user's standpoint, the Key Performance Indicators (KPIs) associated with a MySQL deployment may go well beyond this: a minimum transactions/queries-per-second rate, the count of entries in the slow query log, error counts, etc. When chaos is performed on, say, a MySQL statefulset, the liveness checker should look at these attributes and relay the "liveness" summary as, say, a metric which the experiment can consume to determine resiliency. Of course, such a checker would expose multiple "optional" indicators which the SRE/developer may choose to consume or ignore (a status check might be sufficient). A sketch of such a KPI-oriented probe is shown below, after the NFS case.
The NFS checker runs a showmount command that only verifies whether the NFS provisioner continues to export a certain path (one containing an openebs volume string). It says nothing about whether the NFS volumes (which internally use OpenEBS PVs) are writable, whether the provisioner continues to support creation and deletion of new PVs, whether the exposed NFS volumes continue to offer the desired level of QoS, or whether data integrity is maintained across chaos (the experiment logic does this today in a "manual" fashion by reading data/performing md5sums after chaos). A sketch of a richer NFS probe also follows below.
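For the MySQL case, a minimal sketch of a KPI-oriented probe, assuming pymysql for connectivity; the sampled status variables, endpoint, and thresholds are assumptions, not an agreed-upon KPI set:

```python
# Hypothetical MySQL KPI probe: sample a few global status counters and
# summarise them, instead of relying on a single "is it up" status query.
import time
import pymysql  # assumed client library; any MySQL driver would do

def sample_status(conn, names):
    """Fetch selected SHOW GLOBAL STATUS counters as integers."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS")
        rows = dict(cur.fetchall())
    return {name: int(rows[name]) for name in names}

def mysql_kpis(host, user, password, interval=10):
    conn = pymysql.connect(host=host, user=user, password=password)
    counters = ["Questions", "Slow_queries", "Aborted_connects"]
    before = sample_status(conn, counters)
    time.sleep(interval)
    after = sample_status(conn, counters)
    # Deltas over the sampling window act as crude KPIs.
    return {
        "queries_per_second": (after["Questions"] - before["Questions"]) / interval,
        "slow_queries": after["Slow_queries"] - before["Slow_queries"],
        "aborted_connects": after["Aborted_connects"] - before["Aborted_connects"],
    }

if __name__ == "__main__":
    kpis = mysql_kpis("mysql.app.svc.cluster.local", "root", "secret")  # assumed endpoint/creds
    # The experiment (or an SRE) decides which of these indicators matter;
    # e.g. a hypothetical minimum-throughput threshold:
    print(kpis, "live" if kpis["queries_per_second"] > 1 else "degraded")
```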
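Similarly for the NFS case, a rough sketch of what a richer probe could cover (export listing, writability, and a data-integrity spot check); the server name and mount path are placeholders:

```python
# Hypothetical NFS probe: go beyond "is the path still exported" by also
# checking writability and doing a simple write/read-back integrity check.
import hashlib
import os
import subprocess

NFS_SERVER = "nfs-provisioner.app.svc.cluster.local"  # assumed server
MOUNT_PATH = "/mnt/nfs-test"                          # assumed mounted NFS path

def exports_ok():
    """Equivalent of today's check: the export list is still served."""
    out = subprocess.run(["showmount", "-e", NFS_SERVER],
                         capture_output=True, text=True)
    return out.returncode == 0 and "openebs" in out.stdout

def write_and_verify():
    """Write a payload, read it back, and compare md5 digests."""
    payload = os.urandom(1 << 20)  # 1 MiB of random data
    path = os.path.join(MOUNT_PATH, "liveness-probe.bin")
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        readback = f.read()
    os.remove(path)
    return hashlib.md5(payload).digest() == hashlib.md5(readback).digest()

if __name__ == "__main__":
    summary = {"exports": exports_ok(), "write_integrity": write_and_verify()}
    print(summary)  # relayed to the experiment instead of a bare exit code
```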
Requirement
Metac, which offers a lot in terms of low turnaround/development times and flexible custom resource definitions, can be used to build improved "liveness controllers" on a per-application basis, with each controller backed by a rich but focused schema for its application. These controllers can be shipped with the Litmus chaos operator, with the lifecycle of the associated custom resources managed by the experiment logic, much as it manages the previously described liveness job/pod specs today. These "liveness controllers" could also be used in a standalone capacity (not as part of a Litmus chaos experiment, per se) in a production system, or by other chaos frameworks.
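As a rough sketch of what such a controller's reconcile step could look like, assuming a metac GenericController wired to an external sync webhook; the payload field names follow metacontroller-style conventions and are assumptions, as is the MySQLLiveness resource itself:

```python
# Hypothetical sync webhook for a "MySQLLiveness" custom resource.
# metac/metacontroller-style controllers POST the watched object to a hook
# and apply whatever status the hook returns; field names here are assumed.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def evaluate_liveness(spec):
    """Placeholder for the real KPI probe (see the MySQL sketch above)."""
    # e.g. call mysql_kpis(spec["endpoint"], ...) and compare against
    # spec["thresholds"]; hard-coded here to keep the sketch short.
    return {"phase": "Live", "queriesPerSecond": 42.0, "slowQueries": 0}

class SyncHook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        watched = body.get("watch", {})  # the MySQLLiveness object (assumed key)
        status = evaluate_liveness(watched.get("spec", {}))
        response = json.dumps({"status": status}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

if __name__ == "__main__":
    HTTPServer(("", 8080), SyncHook).serve_forever()
```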
While the above description sounds like it overlaps a bit with traditional monitoring solutions built on exporters and alert rules, there is no single integrated way today to use application information to decide on its "health" -- at least not without much plumbing on the part of the SREs. These controllers can fill an important gap.
Come up with a first-cut schema of the liveness CRs for MySQL and NFS. Any other app can also be chosen. Maybe OpenEBS itself fits here too (note that today there is no integrated health indicator for the OpenEBS control plane itself).
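As a starting point for that discussion, a hypothetical first-cut MySQLLiveness resource might look like the following (expressed as a Python dict for consistency with the other sketches; every group, kind, field name, and threshold is an assumption to be refined):

```python
# Hypothetical first-cut MySQLLiveness custom resource: the spec captures the
# target and the KPI thresholds an SRE cares about; the status is filled in by
# the liveness controller. All group/kind/field names are placeholders.
mysql_liveness = {
    "apiVersion": "litmuschaos.io/v1alpha1",   # assumed group/version
    "kind": "MySQLLiveness",                   # assumed kind
    "metadata": {"name": "mysql-liveness", "namespace": "app"},
    "spec": {
        "endpoint": {"host": "mysql.app.svc.cluster.local", "port": 3306},
        "credentialsSecret": "mysql-liveness-creds",
        "pollIntervalSeconds": 10,
        "thresholds": {   # optional indicators; ignore what you don't need
            "minQueriesPerSecond": 1,
            "maxSlowQueriesPerInterval": 5,
            "maxAbortedConnectsPerInterval": 0,
        },
    },
    "status": {           # written by the controller
        "phase": "Live",  # Live | Degraded | NotLive (assumed values)
        "queriesPerSecond": 42.0,
        "slowQueries": 0,
        "abortedConnects": 0,
        "lastProbeTime": "2020-01-01T00:00:00Z",
    },
}
```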