Proposal: Designing a cloud DSL for analyzing metastable failures

Researchers have recently characterized a new class of computer system failures: metastable failures. A general blueprint for these failures is as follows:

1. The system is operating within a given environment.
2. A trigger event happens. The trigger can come in different flavors: server crash, network partition, load surge, etc.
3. The trigger pushes the system into a self-sustaining loop, in which the system is also "unavailable".
4. Even after the triggering effect is remedied, the system remains in the self-sustaining loop.
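
To make the blueprint concrete, here is a minimal toy simulation of one possible self-sustaining loop: a retry storm. Everything in it is an assumption made for illustration, not something taken from the metastability literature: the function name `simulate`, the single-queue model, the rule that clients retry once whenever the backlog exceeds a fixed threshold, and all the numbers.

```python
def simulate(ticks=60, capacity=10, base_load=8, surge=30,
             surge_start=10, surge_end=15, timeout_backlog=20):
    """Toy model: return the number of queued requests after each tick.

    All parameters are hypothetical. The server handles `capacity`
    requests per tick, and `base_load` new requests arrive per tick.
    """
    backlog = 0
    history = []
    for t in range(ticks):
        arrivals = base_load
        if surge_start <= t < surge_end:
            # The trigger: a temporary load surge.
            arrivals += surge
        if backlog > timeout_backlog:
            # Assumed retry rule: once the backlog is long enough that
            # clients time out, every baseline request is sent twice.
            arrivals += base_load
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history


with_trigger = simulate()
without_trigger = simulate(surge=0)
print("backlog at end, with trigger:   ", with_trigger[-1])
print("backlog at end, without trigger:", without_trigger[-1])
```

In this sketch the baseline load (8) is below capacity (10), so without the surge the backlog never grows. With the surge, the backlog crosses the retry threshold, the retries push the offered load to 16 per tick, and the backlog keeps growing long after the surge has ended at tick 15.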

I have been trying to rigorously define metastability as a property of a suitably general system abstraction. We have a working definition that seems to capture the gist of it. However, since we are interested in actual computer systems that might experience metastability, it would be great if we could:

1. express a generic "computer system", especially one that resembles the cloud in that it is made of clusters of homogeneous machines, and
2. analyze whether a given system, expressed as in step 1, "is metastable or not".

To this end, I am planning to design a generic domain-specific language (DSL) for the cloud context. There are various challenges to this from a research perspective: getting the syntax and the semantics right, checking whether the language is accurate with respect to real-world systems, and so on. Even better, it would be super cool if the DSL itself helped us study the metastability behavior of a given system via some sort of inference engine baked into it (NetKAT is a great example of this if you are interested).

My goal for this project is far more humble: to design a primitive DSL that captures most of the essentials of a cloudesque system, and to interpret programs written in it into a trace of concrete results from executing the system. This would serve as a basis for all future inquiries concerning metastability. Of course, the interpreter would have to output information that is useful from a metastability analysis perspective.
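
As a rough illustration of what "a DSL interpreted into a trace" could look like, here is a minimal sketch of an embedded DSL in Python. It is not the proposed design: the names `Cluster`, `Workload`, and `interpret`, the per-tick execution model, and the fields recorded in the trace are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A group of identical machines (hypothetical DSL construct)."""
    name: str
    machines: int   # number of homogeneous machines
    rate: int       # requests each machine can serve per tick


@dataclass(frozen=True)
class Workload:
    """A constant stream of requests aimed at one cluster (hypothetical)."""
    target: str     # name of the cluster receiving the requests
    per_tick: int


def interpret(cluster, workload, ticks):
    """Execute the description and return a trace with one record per tick."""
    if workload.target != cluster.name:
        raise ValueError("workload targets an unknown cluster")
    capacity = cluster.machines * cluster.rate
    queued = 0
    trace = []
    for t in range(ticks):
        queued += workload.per_tick
        served = min(queued, capacity)
        queued -= served
        trace.append({"tick": t, "served": served, "queued": queued})
    return trace


for record in interpret(Cluster("web", machines=3, rate=2),
                        Workload("web", per_tick=7), ticks=3):
    print(record)
```

A trace like this records queue lengths over time, which is the kind of output a later metastability analysis would likely need, for example to check whether a backlog drains after a trigger is removed.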

For a realistic implementation, the real test for such a DSL-interpreter mash-up is accuracy with respect to some notion of a realistic modern computer system. Mimicking Kubernetes is an example that comes to mind. However, to keep things within the scope of a course project, I will only test whether my implementation is "correct". To give a flavor of this correctness, here is an example. Say I have modeled an entity in the system as an agent with one input queue and one output queue, with some primitive logic to take items from the input, process them, and then put the results on the output. A correctness criterion for this would be: "if the input queue is empty, you should not be able to get anything from it".
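
The agent example and its correctness criterion can be sketched as follows. This is only one possible reading of the criterion: the class name `Agent`, the method `step`, and the choice to signal "nothing to take" with a `False` return value are assumptions made for illustration.

```python
from collections import deque


class Agent:
    """An entity with one input queue, one output queue, and simple logic."""

    def __init__(self, process):
        self.inbox = deque()
        self.outbox = deque()
        self.process = process  # function applied to each item

    def step(self):
        """Process one item if available; return whether any work was done."""
        if not self.inbox:
            # Correctness criterion: nothing can be taken from an empty input.
            return False
        item = self.inbox.popleft()
        self.outbox.append(self.process(item))
        return True


# A check of the criterion, using a hypothetical doubling agent.
agent = Agent(lambda x: x * 2)
assert agent.step() is False      # empty input: no work, no output
assert len(agent.outbox) == 0
agent.inbox.append(3)
assert agent.step() is True
assert list(agent.outbox) == [6]
```

Checks of this form could serve as unit tests for the interpreter: each one states a property that must hold regardless of what the agent's processing logic does.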
