Dawson, S., & Jahanian, F. (1995). Deterministic fault injection of distributed systems. In Theory and Practice in Distributed Systems (pp. 178-196). Springer, Berlin, Heidelberg. I started out with a paper from 1995, which introduces a script-driven layer, called the probe and fault injection (PFI) layer. The authors define three operations on messages (filtering, manipulation and injection) that act on the protocol sitting on top of the PFI layer. This idea has been implemented in a framework called ORCHESTRA, presented in: Dawson, S., Jahanian, F., & Mitton, T. (1996, September). ORCHESTRA: A probing and fault injection environment for testing protocol implementations. In Proceedings of IEEE International Computer Performance and Dependability Symposium (p. 56). IEEE.

The next paper focuses more on stress testing systems. Its method uses UML 2.0 models to create stress test cases that put load on the system under test, preferably before release, to see the effects of heavy loads on a system. The authors use sequence diagrams, annotated with timing information, to generate valid test cases; the aim is to discover network faults by running optimized test cases. Garousi, V., Briand, L. C., & Labiche, Y. (2006, May). Traffic-aware stress testing of distributed systems based on UML models. In Proceedings of the 28th International Conference on Software Engineering (pp. 391-400).
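To make the PFI idea above a bit more concrete, here is a minimal sketch of what a script-driven interception layer could look like. This is not ORCHESTRA's actual API; all names (`Message`, `pfi_send`, the example script) are hypothetical, and it only illustrates the three operations: drop a message (filtering), alter it (manipulation), or emit extra messages (injection).

```python
# Hypothetical sketch of a PFI-style interception layer (not ORCHESTRA's actual API).
# A "fault script" inspects each outgoing message and may drop it (filtering),
# alter it (manipulation), or emit additional messages (injection).

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Message:
    src: str
    dst: str
    payload: bytes

# A fault script returns the messages that should actually reach the network:
# [] drops the message, [modified] manipulates it, [msg, extra] injects a new one.
FaultScript = Callable[[Message], Iterable[Message]]

def drop_heartbeats(msg: Message) -> Iterable[Message]:
    """Example script: filter heartbeats to simulate a silent node, corrupt votes."""
    if msg.payload.startswith(b"HEARTBEAT"):
        return []                                               # filtering
    if msg.payload.startswith(b"VOTE"):
        return [Message(msg.src, msg.dst, b"VOTE corrupted")]   # manipulation
    return [msg]                                                # pass through unchanged

def pfi_send(msg: Message, script: FaultScript, network_send) -> None:
    """Intercept a protocol-level send and apply the fault script before the network."""
    for out in script(msg):
        network_send(out)

if __name__ == "__main__":
    sent = []
    pfi_send(Message("n1", "n2", b"HEARTBEAT 42"), drop_heartbeats, sent.append)
    pfi_send(Message("n1", "n2", b"VOTE leader=n1"), drop_heartbeats, sent.append)
    print(sent)  # only the manipulated VOTE message reaches the "network"
```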
Loki is a fault injector that performs post-runtime analysis based on a partial view of the global system state. The authors attempt to assess and mitigate the damage done by correlated faults. Cukier, M., Chandra, R., Henke, D., Pistole, J., & Sanders, W. H. (1999, October). Fault injection based on a partial view of the global state of a distributed system. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (pp. 168-177). IEEE. Another injector/monitor is DEFINE: Kao, W. L., & Iyer, R. K. (1996, June). DEFINE: A distributed fault injection and monitoring environment. In Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems (pp. 252-259). IEEE.

Besides those methods, I have kept in mind the frameworks that I have seen before and catalogued in https://freshcoders.nl/nick/papers/A_Survey_of_Chaos_Engineering_Frameworks.pdf.
I also looked at predicting failures in distributed systems. A failure predictor called PreMiSE tries to locate failures by looking for anomalies in recorded key performance indicator (KPI) data, which is compared to a baseline. Mariani, L., Pezzè, M., Riganelli, O., & Xin, R. (2020). Predicting failures in multi-tier distributed systems. Journal of Systems and Software, 161, 110464. If faults cannot be monitored directly, a similar approach may be of use in Adyen's context.
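PreMiSE itself combines anomaly detection with signature-based failure classification; the sketch below only illustrates the baseline-comparison idea with a simple z-score check. The KPI names and values are hypothetical.

```python
# Toy illustration of baseline-based anomaly detection on KPI samples.
# This is NOT PreMiSE's actual algorithm; it only shows the idea of comparing
# a current KPI reading against a recorded baseline.

from statistics import mean, stdev

def is_anomalous(baseline: list[float], current: float, threshold: float = 3.0) -> bool:
    """Flag a KPI sample whose z-score against the recorded baseline exceeds the threshold."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical KPI: request latency (ms) recorded during normal operation.
latency_baseline = [102.0, 98.5, 105.2, 99.9, 101.3, 103.7]
print(is_anomalous(latency_baseline, 104.0))   # False: within normal variation
print(is_anomalous(latency_baseline, 250.0))   # True: possible precursor of a failure
```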
Finally, if tests are going to be run on a distributed system, it would be important to look at effective testing. The framework presented in this last paper can help, with tests running in parallel and stress-testing through middleware agents. El Yamany, H. F., Capretz, M. A., & Capretz, L. F. (2006, September). A multi-agent framework for testing distributed systems. In 30th Annual International Computer Software and Applications Conference (COMPSAC'06) (Vol. 2, pp. 151-156). IEEE.
Further literature search concerning testing distributed systems.
Qu, R., Hirano, S., Ohkawa, T., Kubota, T., & Nicolescu, R. (2006). Distributed unit testing. Technical Report CITR-TR-191.
Verdi is a framework for verifying distributed systems before runtime. The goal is to enable users to construct distributed systems that are reliable and fault-tolerant.
RemoteTest is a distributed testing framework that tests the functional components of distributed systems. It considers the dependencies between components, running scripts and emulating parts of the system to see where errors originated.
It is shown by
The following papers discuss how partition tolerance affects a system, the bugs it introduces, and the analysis of these situations.
Cloud recovery testing frameworks:
Gunawi, H. S., Do, T., Joshi, P., Alvaro, P., Hellerstein, J. M., Arpaci-Dusseau, A. C., ... & Borthakur, D. (2011). FATE and DESTINI: A Framework for Cloud Recovery Testing. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11). This framework is also listed in a taxonomy paper about cloud recovery testing; the following paper has some useful classifications and lists frameworks and methods for recovery testing.
Fu, M., Bass, L., & Liu, A. (2014, June). Towards a taxonomy of cloud recovery strategies. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (pp. 696-701). IEEE.
Nasiri, R., & Hosseini, S. (2014). A novel framework for cloud testing. Int J Electron Commun Comput Eng, 5(4), 850-854. This paper lists some methods and tests for testing a cloud system. Another listing of tools can be found in
Bai, X., Li, M., Chen, B., Tsai, W. T., & Gao, J. (2011, December). Cloud testing tools. In Proceedings of 2011 IEEE 6th International Symposium on Service Oriented System (SOSE) (pp. 1-12). IEEE.
Cloud systems are interesting to look at because they have similar properties to distributed systems (and sometimes are distributed systems).
Regarding resiliency testing, we can look at several patterns that are often used in distributed systems, such as timeouts, retries, circuit breakers and failover. By looking at such patterns and at manipulations of the system that would break it, we can inspect the resilience gained by using them.
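As a reference point for what these patterns actually do, here is a minimal, self-contained sketch of two of them (retry with a bounded number of attempts, and a circuit breaker). It is only an illustration, not taken from any of the cited frameworks; a resilience test would, for example, inject latency or failures into the wrapped call and check that the caller stays responsive.

```python
# Minimal sketch of two common resilience patterns: bounded retry and a circuit
# breaker. Illustrative only; parameter values are arbitrary.

import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures; try again after `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def retry(fn, attempts: int = 3, delay: float = 0.5):
    """Retry a flaky call a bounded number of times with a fixed delay (no backoff for brevity)."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```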
The following paper shows that, for certain kinds of faults, using redundancy can still mean failures for end users. The authors analysed the behaviour of well-known distributed systems such as ZooKeeper, Cassandra and Redis.
Similarly, by analysing several projects, a study of bugs in distributed cloud services was carried out, listing types of bugs and their implications. This shows common, or likely, bugs in distributed systems and may provide information about how to prevent them.
Other papers that are related to testing distributed systems, but whose relevance I have yet to establish:
I also thought about conceptualizing the topics into a mind map, along with frameworks, to better see where gaps in the literature are present, as most maps only cover their own specific scenario and scope, or only show the topics directly related to DSs. An overview of such information seems to be missing in the literature. e.g.
The following image is a work-in-progress mind map of the topics, which I created.
Overview of some methods and frameworks which are interesting to compare or extract data from.
Name | Year | Type | Summary | Contributions | Environment/requirement |
---|---|---|---|---|---|
CREB | '18 | Empirical | Identifies bug patterns and findings on why bugs occur | root-cause analysis | - |
Frisbee | '21 | FI FW | Fault injection framework with monitoring capabilities | fault classification | Kubernetes |
ThorFI | '22 | FIaaS | Combines SOTA FI solution's capabilities for ease of use | generic/all-in-one | OpenStack / K8s |
NEAT | '18 | FI impl | Shows simple network partition failures are catastrophic | classification, evaluation | OpenFlow / iptables |
Filibuster | '21 | FI impl | Relies on generated tests to attain coverage | evaluation, test generation and reduction | Python / Java |
Fallout | '21 | FIaaS | Supports chaos experiments, stress-test and logging | evaluation | Kubernetes / Docker |
Molly | '15 | FI FW | SAT-solver based execution, concolic | evaluation, test generation | - |
FATE | '11 | FI impl | Systematic failure injection | test generation | - |
In this table, I use a few abbreviations: FI = fault injection, FW = framework, FIaaS = fault injection as a service, impl = implementation, SOTA = state of the art, K8s = Kubernetes.
Even though many of the papers I have looked at call their techniques by a similar name, they either do different jobs, look at other aspects, or use different metrics for essentially the same end result. I want to classify the papers by the types of contribution they make, find the common metrics they use when quantifying their own performance, and compare relevant systems to each other. I would like to see whether it is really necessary for a new framework to be spawned each time; for example NEAT, whose authors say Jepsen did not support network partitions and unit tests, or ThorFI, which lists many existing methods and still builds its own.
Many of the papers look at either bugs in a system under network partitioning, software fault injection, or the analysis of bugs found in issue-tracking systems. All types may be useful for coming up with a classification (perhaps using an existing taxonomy), but it might be good to focus on a single type of failure / fault-injection technique.
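As one concrete example of the first category: NEAT injects partitions via OpenFlow/iptables, and Jepsen's nemesis does something similar. The sketch below shows roughly what an iptables-based partition could look like; it is only illustrative (the IP addresses are made up and the commands require root on the target node), not code from any of the cited tools.

```python
# Rough sketch of injecting and healing a network partition using iptables,
# in the spirit of tools like NEAT or Jepsen's nemesis. Illustrative only:
# assumes root privileges on the node where it runs.

import subprocess

def partition_from(peer_ips: list[str]) -> None:
    """Drop all inbound traffic from the given peers, isolating this node from them."""
    for ip in peer_ips:
        subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def heal_from(peer_ips: list[str]) -> None:
    """Remove the DROP rules again, restoring connectivity."""
    for ip in peer_ips:
        subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)

# Example (hypothetical cluster members):
# partition_from(["10.0.0.2", "10.0.0.3"])
# ... run the workload and assertions while partitioned ...
# heal_from(["10.0.0.2", "10.0.0.3"])
```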
I think a few good RQs are:
Not all solutions are applicable to every environment, so there is some question on how that could be handled as well. Porting, extensions?
For my review, a possible list of chapters is as follows:
Here are some relevant papers that may help in writing a survey about frameworks that implement fault injection supporting network partition.
IoT is related, but often not applicable. These papers are useful for seeing how and what was compared in a similar domain.
IoT comparison: https://arxiv.org/pdf/2112.09580.pdf
This DS paper (https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf) uses Jepsen and simulation to test the system.
ZERMIA, FI FW: https://link.springer.com/chapter/10.1007/978-3-030-92708-0_3
Testing Distributed Storage with P#: https://www.usenix.org/system/files/conference/fast16/fast16-papers-deligiannis.pdf
Updated RQs:
With related chapters:
In the comments of this issue, I will be collecting some of the papers relevant to the topics, which for now are