Dawson, S., & Jahanian, F. (1995). Deterministic fault injection of distributed systems. In Theory and Practice in Distributed Systems (pp. 178-196). Springer, Berlin, Heidelberg. I started out with a paper from 1995, which introduces a script-driven layer, called the probe and fault injection (PFI) layer. The authors define three operations on messages (filtering, manipulation and injection) that act on the protocol sitting on top of the PFI layer. This idea has been implemented in a framework called ORCHESTRA, presented in: Dawson, S., Jahanian, F., & Mitton, T. (1996, September). ORCHESTRA: A probing and fault injection environment for testing protocol implementations. In Proceedings of IEEE International Computer Performance and Dependability Symposium (p. 56). IEEE.

The next paper focuses more on stress testing systems. Its method uses UML 2.0 models to create stress test cases that put load on the system under test, preferably before release, to see the effects of heavy loads on a system. The authors use sequence diagrams, annotated with timing information, to generate valid test cases; the aim is to discover network faults by running optimized test cases. Garousi, V., Briand, L. C., & Labiche, Y. (2006, May). Traffic-aware stress testing of distributed systems based on UML models. In Proceedings of the 28th International Conference on Software Engineering (pp. 391-400).
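To make the PFI idea above a bit more concrete, here is a minimal sketch of what a script-driven interception layer could look like. This is not ORCHESTRA's actual API; all names (`Message`, `pfi_send`, the example script) are hypothetical, and it only illustrates the three operations: drop a message (filtering), alter it (manipulation), or emit extra messages (injection).

```python
# Hypothetical sketch of a PFI-style interception layer (not ORCHESTRA's actual API).
# A "fault script" inspects each outgoing message and may drop it (filtering),
# alter it (manipulation), or emit additional messages (injection).

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Message:
    src: str
    dst: str
    payload: bytes

# A fault script returns the messages that should actually reach the network:
# [] drops the message, [modified] manipulates it, [msg, extra] injects a new one.
FaultScript = Callable[[Message], Iterable[Message]]

def drop_heartbeats(msg: Message) -> Iterable[Message]:
    """Example script: filter heartbeats to simulate a silent node, corrupt votes."""
    if msg.payload.startswith(b"HEARTBEAT"):
        return []                                               # filtering
    if msg.payload.startswith(b"VOTE"):
        return [Message(msg.src, msg.dst, b"VOTE corrupted")]   # manipulation
    return [msg]                                                # pass through unchanged

def pfi_send(msg: Message, script: FaultScript, network_send) -> None:
    """Intercept a protocol-level send and apply the fault script before the network."""
    for out in script(msg):
        network_send(out)

if __name__ == "__main__":
    sent = []
    pfi_send(Message("n1", "n2", b"HEARTBEAT 42"), drop_heartbeats, sent.append)
    pfi_send(Message("n1", "n2", b"VOTE leader=n1"), drop_heartbeats, sent.append)
    print(sent)  # only the manipulated VOTE message reaches the "network"
```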
Loki is a fault injector that performs post-runtime analysis based on a partial view of the global system state. The authors attempt to assess and mitigate the damage done by correlated faults. Cukier, M., Chandra, R., Henke, D., Pistole, J., & Sanders, W. H. (1999, October). Fault injection based on a partial view of the global state of a distributed system. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (pp. 168-177). IEEE. Another injector/monitor is DEFINE: Kao, W. L., & Iyer, R. K. (1996, June). DEFINE: A distributed fault injection and monitoring environment. In Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems (pp. 252-259). IEEE.

Besides those methods, I have kept in mind the frameworks that I have seen before and catalogued in https://freshcoders.nl/nick/papers/A_Survey_of_Chaos_Engineering_Frameworks.pdf.
I also looked at predicting failures in distributed systems. A failure predictor called PreMiSE tries to locate failures by looking for anomalies in recorded key performance indicator (KPI) data, which is compared to a baseline. Mariani, L., Pezzè, M., Riganelli, O., & Xin, R. (2020). Predicting failures in multi-tier distributed systems. Journal of Systems and Software, 161, 110464. If faults cannot be monitored directly, a similar approach may be of use in Adyen's context.
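PreMiSE itself combines anomaly detection with signature-based failure classification; the sketch below only illustrates the baseline-comparison idea with a simple z-score check. The KPI names and values are hypothetical.

```python
# Toy illustration of baseline-based anomaly detection on KPI samples.
# This is NOT PreMiSE's actual algorithm; it only shows the idea of comparing
# a current KPI reading against a recorded baseline.

from statistics import mean, stdev

def is_anomalous(baseline: list[float], current: float, threshold: float = 3.0) -> bool:
    """Flag a KPI sample whose z-score against the recorded baseline exceeds the threshold."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical KPI: request latency (ms) recorded during normal operation.
latency_baseline = [102.0, 98.5, 105.2, 99.9, 101.3, 103.7]
print(is_anomalous(latency_baseline, 104.0))   # False: within normal variation
print(is_anomalous(latency_baseline, 250.0))   # True: possible precursor of a failure
```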
Finally, if tests are going to be run on a distributed system, it would be important to look at effective testing. The framework presented in this last paper can help, with tests running in parallel and stress-testing through middleware agents. El Yamany, H. F., Capretz, M. A., & Capretz, L. F. (2006, September). A multi-agent framework for testing distributed systems. In 30th Annual International Computer Software and Applications Conference (COMPSAC'06) (Vol. 2, pp. 151-156). IEEE.
Further literature search concerning testing distributed systems.
Qu, R., Hirano, S., Ohkawa, T., Kubota, T., & Nicolescu, R. (2006). Distributed unit testing. Technical Report CITR-TR-191.
Verdi is a framework for verifying distributed systems before runtime. The goal is to enable users to construct distributed systems that are reliable and fault-tolerant.
RemoteTest is a distributed testing framework that tests the functional components of distributed systems. It considers the dependencies between components, running scripts and emulating parts of the system to see where errors originated.
It is shown by
The following papers discuss how partition tolerance affects a system, the bugs it introduces, and the analysis of these situations.
Cloud recovery testing frameworks:
Gunawi, H. S., Do, T., Joshi, P., Alvaro, P., Hellerstein, J. M., Arpaci-Dusseau, A. C., ... & Borthakur, D. (2011). FATE and DESTINI: A Framework for Cloud Recovery Testing. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11). This framework is also listed in a taxonomy paper about cloud recovery testing; the following paper has some useful classifications and lists frameworks and methods for recovery testing.
Fu, M., Bass, L., & Liu, A. (2014, June). Towards a taxonomy of cloud recovery strategies. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (pp. 696-701). IEEE.
Nasiri, R., & Hosseini, S. (2014). A novel framework for cloud testing. Int J Electron Commun Comput Eng, 5(4), 850-854. This paper lists some methods and tests for testing a cloud system. Another listing of tools can be found in
Bai, X., Li, M., Chen, B., Tsai, W. T., & Gao, J. (2011, December). Cloud testing tools. In Proceedings of 2011 IEEE 6th International Symposium on Service Oriented System (SOSE) (pp. 1-12). IEEE.
Cloud systems are interesting to look at because they have similar properties to distributed systems (and sometimes are distributed systems).
Regarding resiliency testing, we can look at several patterns that are often used in distributed systems, such as timeouts, retries, circuit breakers and failover. By looking at such patterns and at manipulations of the system that would break it, we can inspect the resilience gained by using them.
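As a reference point for what these patterns actually do, here is a minimal, self-contained sketch of two of them (retry with a bounded number of attempts, and a circuit breaker). It is only an illustration, not taken from any of the cited frameworks; a resilience test would, for example, inject latency or failures into the wrapped call and check that the caller stays responsive.

```python
# Minimal sketch of two common resilience patterns: bounded retry and a circuit
# breaker. Illustrative only; parameter values are arbitrary.

import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures; try again after `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def retry(fn, attempts: int = 3, delay: float = 0.5):
    """Retry a flaky call a bounded number of times with a fixed delay (no backoff for brevity)."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```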
The following paper shows that, for certain kinds of faults, using redundancy can still mean failures for end users. The authors analysed the behaviour of well-known distributed systems such as ZooKeeper, Cassandra and Redis.
Similarly, by analysing several projects, a study of bugs in distributed cloud services was carried out, listing types of bugs and their implications. This shows common, or likely, bugs in distributed systems and may provide information about how to prevent them.
Other papers that are related to testing distributed systems, but whose relevance I have yet to establish:
I also thought about conceptualizing the topics into a mind map, along with frameworks, to better see where gaps in the literature are present, as most maps only cover their own specific scenario and scope, or only show the topics directly related to DSs. An overview of such information seems to be missing in the literature. e.g.
The following image is a work-in-progress mind map of the topics, which I created.
Overview of some methods and frameworks which are interesting to compare or extract data from.
Name | Year | Type | Summary | Contributions | Environment/requirement |
---|---|---|---|---|---|
CREB | '18 | Empirical | Identifies bug patterns and findings on why bugs occur | root-cause analysis | - |
Frisbee | '21 | FI FW | Fault injection framework with monitoring capabilities | fault classification | Kubernetes |
ThorFI | '22 | FIaaS | Combines SOTA FI solution's capabilities for ease of use | generic/all-in-one | OpenStack / K8s |
NEAT | '18 | FI impl | Shows simple network partition failures are catastrophic | classification, evaluation | OpenFlow / iptables |
Filibuster | '21 | FI impl | Relies on generated tests to attain coverage | evaluation, test generation and reduction | Python / Java |
Fallout | '21 | FIaaS | Supports chaos experiments, stress-test and logging | evaluation | Kubernetes / Docker |
Molly | '15 | FI FW | SAT-solver based execution, concolic | evaluation, test generation | - |
FATE | '11 | FI impl | Systematic failure injection | test generation | - |
In this table, I use a few abbreviations: FI = fault injection, FW = framework, FIaaS = fault injection as a service, impl = implementation, SOTA = state of the art, K8s = Kubernetes.
Even though many of the papers I have looked at call their techniques by a similar name, they either do different jobs, look at other aspects, or use different metrics for essentially the same end result. I want to classify the papers by the types of contribution they make, find the common metrics they use when quantifying their own performance, and compare relevant systems to each other. I would like to see whether it is really necessary for a new framework to be spawned each time; for example NEAT, whose authors say Jepsen did not support network partitions and unit tests, or ThorFI, which lists many existing methods and still builds its own.
Many of the papers look at either bugs in a system under network partitioning, software fault injection, or the analysis of bugs found in issue-tracking systems. All types may be useful for coming up with a classification (perhaps using an existing taxonomy), but it might be good to focus on a single type of failure / fault-injection technique.
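As one concrete example of the first category: NEAT injects partitions via OpenFlow/iptables, and Jepsen's nemesis does something similar. The sketch below shows roughly what an iptables-based partition could look like; it is only illustrative (the IP addresses are made up and the commands require root on the target node), not code from any of the cited tools.

```python
# Rough sketch of injecting and healing a network partition using iptables,
# in the spirit of tools like NEAT or Jepsen's nemesis. Illustrative only:
# assumes root privileges on the node where it runs.

import subprocess

def partition_from(peer_ips: list[str]) -> None:
    """Drop all inbound traffic from the given peers, isolating this node from them."""
    for ip in peer_ips:
        subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def heal_from(peer_ips: list[str]) -> None:
    """Remove the DROP rules again, restoring connectivity."""
    for ip in peer_ips:
        subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)

# Example (hypothetical cluster members):
# partition_from(["10.0.0.2", "10.0.0.3"])
# ... run the workload and assertions while partitioned ...
# heal_from(["10.0.0.2", "10.0.0.3"])
```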
I think a few good RQs are:
Not all solutions are applicable to every environment, so there is some question on how that could be handled as well. Porting, extensions?
For my review, a possible list of chapters is as follows:
Here are some relevant papers that may help in writing a survey about frameworks that implement fault injection supporting network partition.
IoT is related, but often not applicable. These papers are useful for seeing how and what was compared in a similar domain.
IoT comparison: https://arxiv.org/pdf/2112.09580.pdf
This DS paper (https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf) uses Jepsen and simulation to test the system.
ZERMIA, FI FW: https://link.springer.com/chapter/10.1007/978-3-030-92708-0_3
Testing Distributed Storage with P#: https://www.usenix.org/system/files/conference/fast16/fast16-papers-deligiannis.pdf
Updated RQs:
With related chapters:
In the comments of this issue, I will be collecting some of the papers relevant to the topics, which for now are