ikdekker / literature-study


Collection of Papers with Descriptions #1

ikdekker opened this issue 2 years ago

ikdekker commented 2 years ago

In the comments of this issue, I will be collecting some of the papers relevant to the topics, which for now are:

ikdekker commented 2 years ago

Initial list of papers, with context.

Dawson, S., & Jahanian, F. (1995). Deterministic fault injection of distributed systems. In Theory and Practice in Distributed Systems (pp. 178-196). Springer, Berlin, Heidelberg.

I started out with this paper from 1995, which introduces a script-driven probe and fault injection (PFI) layer. The authors define three message operations (filtering, manipulation, and injection) that modify the behaviour of the protocol sitting on top of the PFI layer. This idea has been implemented in a framework called ORCHESTRA, presented in: Dawson, S., Jahanian, F., & Mitton, T. (1996, September). ORCHESTRA: A probing and fault injection environment for testing protocol implementations. In Proceedings of IEEE International Computer Performance and Dependability Symposium (p. 56). IEEE.

The next paper focuses on stress testing. The method uses UML 2.0 models, specifically sequence diagrams annotated with timing information, to generate valid stress test cases. These are run against the system under test, preferably before release, to observe the effects of heavy load. The aim is to discover network faults by running optimized test cases. Garousi, V., Briand, L. C., & Labiche, Y. (2006, May). Traffic-aware stress testing of distributed systems based on UML models. In Proceedings of the 28th International Conference on Software Engineering (pp. 391-400).
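To make the PFI idea concrete, here is a minimal sketch of what such a script-driven layer could look like. This is not ORCHESTRA's actual API; the class and script names are hypothetical, and the example script simply drops or duplicates messages.

```python
import random

# Hypothetical sketch of a PFI-style layer: every outgoing message passes
# through a user-supplied script that may drop it (filtering), alter it
# (manipulation), or emit extra messages (injection).
class PFILayer:
    def __init__(self, send_fn, fault_script):
        self.send = send_fn              # underlying transport, e.g. a socket write
        self.fault_script = fault_script

    def deliver(self, message):
        for action, payload in self.fault_script(message):
            if action == "drop":
                return                   # filtering: the message is never sent
            self.send(payload)           # "send" (possibly manipulated) or "inject"

def drop_or_duplicate(message):
    if random.random() < 0.1:
        yield ("drop", None)             # filter out 10% of messages
    else:
        yield ("send", message)          # pass through unchanged
        yield ("inject", message)        # inject a duplicate to probe idempotence

layer = PFILayer(send_fn=print, fault_script=drop_or_duplicate)
layer.deliver(b"HELLO node-2")
```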

Loki is a fault injector with post-runtime analysis based on a partial view of the global state of the system. The authors attempt to assess and mitigate the damage done by correlated faults. Cukier, M., Chandra, R., Henke, D., Pistole, J., & Sanders, W. H. (1999, October). Fault injection based on a partial view of the global state of a distributed system. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (pp. 168-177). IEEE.

Another injector/monitor is DEFINE: Kao, W. L., & Iyer, R. K. (1996, June). DEFINE: A distributed fault injection and monitoring environment. In Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems (pp. 252-259). IEEE.

Besides those methods, I have kept in mind the frameworks I have seen before and catalogued in https://freshcoders.nl/nick/papers/A_Survey_of_Chaos_Engineering_Frameworks.pdf.

I also looked at predicting failures in distributed systems. A failure predictor called PreMiSE tries to locate failures by looking for anomalies in recorded key performance indicator (KPI) data, which is compared against a baseline. Mariani, L., Pezzè, M., Riganelli, O., & Xin, R. (2020). Predicting failures in multi-tier distributed systems. Journal of Systems and Software, 161, 110464. If faults cannot be monitored directly, a similar approach may be of use in Adyen's context.
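PreMiSE itself combines anomaly detection with trained failure signatures, but the baseline-comparison idea at its core can be sketched very simply. The KPI names and the z-score threshold below are made up for illustration.

```python
from statistics import mean, stdev

def detect_anomalies(baseline, current, z_threshold=3.0):
    """Flag KPIs whose current value deviates more than z_threshold
    standard deviations from a recorded baseline window."""
    anomalies = {}
    for kpi, history in baseline.items():
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(current[kpi] - mu) / sigma > z_threshold:
            anomalies[kpi] = current[kpi]
    return anomalies

# Hypothetical KPI windows recorded during normal operation.
baseline = {
    "latency_ms": [12, 14, 11, 13, 12, 15, 13],
    "error_rate": [0.01, 0.02, 0.01, 0.01, 0.02, 0.01, 0.02],
}
print(detect_anomalies(baseline, {"latency_ms": 45, "error_rate": 0.01}))
# -> {'latency_ms': 45}
```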

Finally, if tests are going to be run on a distributed system, it is important to look at effective testing. The framework presented in this last paper can help, running tests in parallel and stress-testing through middleware agents. El Yamany, H. F., Capretz, M. A., & Capretz, L. F. (2006, September). A multi-agent framework for testing distributed systems. In 30th Annual International Computer Software and Applications Conference (COMPSAC'06) (Vol. 2, pp. 151-156). IEEE.

ikdekker commented 2 years ago

Further literature search concerning testing distributed systems.

Qu, R., Hirano, S., Ohkawa, T., Kubota, T., & Nicolescu, R. (2006). Distributed unit testing. Technical Report CITR-TR-191.

Verdi is a framework for verifying distributed systems before runtime. The goal is to enable users to construct distributed systems that are provably reliable and fault-tolerant.

RemoteTest is a distributed testing framework that tests the functional components of distributed systems. It takes the dependencies between components into account, running scripts and emulating parts of the system to determine where errors originated.
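RemoteTest's scripts and emulation layer are its own, but the general idea of emulating a dependency to localize an error can be illustrated with a plain stub. The service and method names below are hypothetical.

```python
from unittest.mock import Mock

# Hypothetical service under test that depends on a remote inventory component.
class OrderService:
    def __init__(self, inventory):
        self.inventory = inventory

    def place_order(self, item, qty):
        if self.inventory.reserve(item, qty):   # a remote call in production
            return "confirmed"
        return "rejected"

# Emulate the inventory component; if these tests fail, the error
# originated in OrderService rather than in the remote dependency.
stub = Mock()
stub.reserve.return_value = True
assert OrderService(stub).place_order("disk", 2) == "confirmed"
stub.reserve.return_value = False
assert OrderService(stub).place_order("disk", 2) == "rejected"
```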

It is shown by

The following papers discuss how partition tolerance affects a system, the bugs that partitions introduce, and the analysis of these situations.

Cloud recovery testing frameworks:

Cloud systems are interesting to look at because they have properties similar to distributed systems (and sometimes are distributed systems).

Regarding resiliency testing, we can look at several patterns that are often used in distributed systems, such as timeouts, retries, circuit breakers, and failover. By looking at these patterns, and at manipulations that would break the system, we can inspect the resilience gained by using them.
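As a rough illustration of two of these patterns, here is a minimal retry loop wrapped around a simple circuit breaker; the thresholds and names are arbitrary, and production implementations in resilience libraries are considerably more elaborate.

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures the circuit opens and calls
    fail fast until cooldown seconds have passed (then one probe is allowed)."""
    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def with_retries(breaker, fn, attempts=3, backoff=0.5):
    """Retry with exponential backoff; an open breaker makes each remaining
    attempt fail fast instead of hitting the remote call."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
```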

The following paper shows that using redundancy can still mean failures for end users under certain kinds of faults. The authors analysed the behaviour of well-known distributed systems such as ZooKeeper, Cassandra, and Redis.

Similarly, a study analysing several projects catalogued bugs in distributed cloud services, listing the types of bugs and their implications. This shows which bugs are common or likely in distributed systems, and may provide information about how to prevent them.

Other papers that are related to testing distributed systems, but whose relevance I have yet to establish:

I also thought about conceptualizing the topics into a mind map, along with the frameworks, to better see where gaps in the literature are present. Most existing maps either go into one specific scenario and scope, or only show the topics directly related to DSs; an overview of this information seems to be missing from the literature.

The following image is a WIP of a mind map of the topics, which I created.

ikdekker commented 2 years ago

Overview of some methods and frameworks that are interesting to compare or extract data from.

| Name | Year | Type | Summary | Contributions | Environment/requirement |
| --- | --- | --- | --- | --- | --- |
| CREB | '18 | Empirical | Identifies bug patterns and findings on why bugs occur | root-cause analysis | - |
| Frisbee | '21 | FI FW | Fault injection framework with monitoring capabilities | fault classification | Kubernetes |
| ThorFI | '22 | FIaaS | Combines SOTA FI solutions' capabilities for ease of use | generic/all-in-one | OpenStack / K8s |
| NEAT | '18 | FI impl | Shows simple network partition failures are catastrophic | classification, evaluation | OpenFlow / iptables |
| Filibuster | '21 | FI impl | Relies on generated tests to attain coverage | evaluation, test generation and reduction | Python / Java |
| Fallout | '21 | FIaaS | Supports chaos experiments, stress tests and logging | evaluation | Kubernetes / Docker |
| Molly | '15 | FI FW | SAT-solver based execution, concolic | evaluation, test generation | - |
| FATE | '11 | FI impl | Systematic failure injection | test generation | - |

In this table, I use a few abbreviations: FI = fault injection, FW = framework, FIaaS = fault injection as a service, impl = implementation, SOTA = state of the art.

Even though many of the papers I have looked at call their techniques by a similar name, they do different jobs, look at different aspects, or use different metrics for essentially the same end result. I want to classify the papers by the types of contribution they make, find the common metrics they use when quantifying their own performance, and compare relevant systems to each other. I would also like to see whether it is really necessary for a new framework to be spawned each time; for example NEAT, which says Jepsen did not support network partitions and unit tests, or ThorFI, which listed many methods and then decided to build its own.

Many of the papers look at bugs in a system using either network partitioning, software fault injection, or analysis of bugs found in issue-tracking systems. All types may be useful for coming up with a classification (perhaps using an existing taxonomy), but it might be good to focus on a single type of failure / fault injection technique.
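If the focus ends up on network partitions, they are mechanically simple to inject; NEAT, for instance, lists iptables as one of its mechanisms. A sketch of what such an injection could look like from Python follows; it assumes root privileges on the target host, and the peer IP is a placeholder.

```python
import subprocess

def set_partition(peer_ip, enable=True):
    """Create (or heal) a partition between this node and peer_ip by
    dropping all packets in both directions via iptables."""
    action = "-A" if enable else "-D"        # append or delete the DROP rules
    for chain, flag in (("INPUT", "-s"), ("OUTPUT", "-d")):
        subprocess.run(
            ["iptables", action, chain, flag, peer_ip, "-j", "DROP"],
            check=True,
        )

# set_partition("10.0.0.2")                  # inject the partition
# ... run the workload and assertions ...
# set_partition("10.0.0.2", enable=False)    # heal it afterwards
```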

I think a few good RQs are:

Not all solutions are applicable to every environment, so there is also the question of how that could be handled; through porting, or extensions?

For my review, a possible list of chapters is as follows:

  1. Method / Search Strategy
  2. Failure Scenarios (list of possible failures that can occur, defining the scope)
  3. Industry Solutions and Resiliency Methods (how existing frameworks handle failures and how systems combat failures)
  4. Testing of Distributed Systems and Existing Frameworks (generating/executing tests, types of tests)
  5. Inspecting the Actual State (monitoring of: errors, performance, state / usefulness and similarity of parsing logs for example)
  6. Comparison of State of the Art Works (using the "normalized" monitored states to find a generic/universal rating between solutions)
  7. Discussion, Conclusion
ikdekker commented 2 years ago

Here are some relevant papers that may help in writing a survey about frameworks that implement fault injection with support for network partitions.

IoT is related, but often not applicable. These papers are useful for seeing how and what was compared in a similar domain.

IoT comparison: https://arxiv.org/pdf/2112.09580.pdf

This distributed system, https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf, uses Jepsen as well as simulation to test its behaviour.

ZERMIA, an FI FW: https://link.springer.com/chapter/10.1007/978-3-030-92708-0_3

Testing distributed storage with P#: https://www.usenix.org/system/files/conference/fast16/fast16-papers-deligiannis.pdf

Updated RQs:

  1. What types of network faults are there in the context of distributed systems?
  2. What frameworks and techniques exist for identifying these network faults?
  3. How do we (or these solutions) create tests for network faults?
  4. How do the current solutions complement each other? (Or: what are the strengths and weaknesses of the testing frameworks?)
  5. What are improvements to the current testing frameworks?

With related chapters:

  1. Failure Scenarios
  2. Industry Solutions
  3. Testing of Distributed Systems with Testing Frameworks
  4. Comparison of State of the Art Works
  5. Improvements