ID | Paper name/link | Testing Section | Year Published | Notes
1 | DeepXplore: Automated Whitebox Testing of Deep Learning Systems | Data Input, Debug and Repair (Data Resampling), Test coverage adequacy | 2017 | “Proposed a white-box differential testing technique to generate test inputs for deep learning systems. Inspired by test coverage in traditional software testing, the authors proposed neuron coverage to drive test generation.” (A minimal neuron-coverage sketch follows the table.)
2 | DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars | Data input generation, Debug and Repair (Data Resampling) | 2018 | “To create useful and effective data for autonomous driving systems, DeepTest performed a greedy search with nine different realistic image transformations: changing brightness, changing contrast, translation, scaling, horizontal shearing, rotation, blurring, fog effect, and rain effect.” (A sketch of this greedy transformation search follows the table.)
3 | Generative Adversarial Nets | Test Data Generation | 2014 | “Zhang et al. [79] applied GAN to deliver driving scene based test generation with various weather conditions. They sampled images from the Udacity Challenge dataset [77] and YouTube videos (snowy or rainy scenes), and fed them into the UNIT framework for training. The trained model takes the whole set of Udacity images as seed inputs and yields transformed images as generated tests.”
4 | DeepBillboard: Systematic Physical-World Testing of Autonomous Driving Systems | Testing Data Generation | 2018 | “Proposed DeepBillboard to generate real world adversarial billboards that can trigger potential steering errors of autonomous driving systems.”
5 | DeepCruiser: Automated Guided Testing for Stateful Deep Learning Systems | Testing Data Generation | 2018 | “To test audio-based deep learning systems, Du et al. designed a set of transformations tailored to audio inputs considering background noise and volume variation.”
6 | Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations | Testing Data Generation | 2019 | “Rabin et al. discussed the possibilities of testing code2vec (a code embedding approach) with semantic-preserving program transformations serving as test inputs.” (A variable-renaming sketch of such a transformation follows the table.)
7 | Data Validation For Machine Learning | Testing Data Generation | 2019 | “Breck et al. used synthetic training data that adhere to schema constraints to trigger the hidden assumptions in the code that don't agree with the constraints.”
8 | Perturbed Model Validation: A New Framework to Validate Model Relevance | Testing Data Generation | 2019 | “Zhang et al. used synthetic data with known distributions to test overfitting.” (A label-noise sketch of this idea follows the table.)
9 | DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems | Neuron Coverage (Test Adequacy) | 2018 | “Ma et al. extended the concept of neuron coverage. They first profile a DNN based on the training data, so they obtain the activation behaviour of each neuron against the training data. Based on this, they propose more fine-grained criteria, k-multisection neuron coverage, neuron boundary coverage, and strong neuron activation coverage to represent the major functional behaviour and corner-case behaviour of a DNN.” (A k-multisection coverage sketch follows the table.)
10 | The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction | Rule-based checking for test adequacy | 2017 | “Breck et al. offered 28 test aspects to consider and a scoring system used by Google. Their focus is to measure how well a given machine learning system is tested. The 28 test aspects are classified into four types: 1) the tests for the ML model itself; 2) the tests for the ML infrastructure used to build the model; 3) the tests for the ML data used to build the model; and 4) the tests that check whether the ML system works correctly over time. Most of them are must-check rules that can be applied to guide test generation.”
11 | Storm: Program Reduction for Testing and Debugging Probabilistic Programming Systems | Debug and Repair, Debugging Framework Development | 2019 | “Ma et al. identified the neurons responsible for the misclassification and called them ‘faulty neurons’. They resampled training data that influence such faulty neurons to help improve model performance.” “Storm is a program transformation framework that generates smaller programs to support debugging for machine learning testing.”
12 | An Empirical Study on TensorFlow Program Bugs | Bug report analysis, Bug detection in learning programs | 2018 | “Zhang et al. studied 175 TensorFlow bugs, based on the bug reports from GitHub or Stack Overflow. They studied the symptoms and root causes of the bugs, the existing challenges to detect the bugs, and how these bugs are handled. They classified TensorFlow bugs into exceptions or crashes, low correctness, low efficiency, and unknown.”
13 | Hands Off the Wheel in Autonomous Vehicles? A Systems Perspective on over a Million Miles of Field Data | Bug report Analysis | 2018 | “Banerjee et al. analysed the bug reports of autonomous driving systems from 12 autonomous vehicle manufacturers that drove a cumulative total of 1,116,605 miles in California. They used NLP technologies to classify the causes of disengagement into 10 types (a disengagement is a failure that causes control of the vehicle to switch from the software to the human driver).”
14 | TensorFlow Debugger: Debugging Dataflow Graphs for Machine Learning | Debug and Repair (Debugging Framework Development) | 2016 | “Cai et al. presented tfdbg, a debugger for ML models built on TensorFlow, containing three key components: 1) the Analyzer, which makes the structure and intermediate state of the runtime graph visible; 2) the Node Stepper, which enables clients to pause, inspect, or modify the graph at a given node; 3) the Run Stepper, which enables clients to take higher-level actions between iterations of model training.”
15 | On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems | Debug and Repair (Fix Understanding) | 2017 | “Nushi et al. proposed a human-in-the-loop approach that simulates potential fixes in different components through human computation tasks: humans were asked to simulate improved component states.”
16 | Repairing Decision-Making Programs Under Uncertainty | Debug and Repair (Program Repair) | 2017 | “Albarghouthi et al. proposed a distribution-guided inductive synthesis approach to repair decision-making programs such as machine learning programs. The purpose is to construct a new program with correct predictive output, while preserving semantics similar to those of the original program.”
17 | Systematic Testing of Convolutional Neural Networks for Autonomous Driving | General Testing Framework and Tools | 2017 | “Dreossi et al. presented a CNN testing framework that consists of three main modules: an image generator, a collection of sampling methods, and a suite of visualisation tools.”
18 | Preventing undesirable behavior of intelligent machines | General Testing Framework and Tools | 2019 | “Thomas et al. recently proposed a framework for designing machine learning algorithms, which simplifies the regulation of undesired behaviours. The framework is demonstrated to be suitable for regulating regression, classification, and reinforcement learning algorithms.”
19 | A test architecture for machine learning product | General Testing Framework and Tools | 2018 | “Nishi et al. proposed a testing framework including different evaluation aspects such as allowability, achievability, robustness, avoidability and improvability. They also discussed different levels of ML testing, such as system, software, component, and data testing.”
20 | An Empirical Study on Real Bugs for Machine Learning Programs | Bug Detection in learning program and in Framework | 2017 | “Sun et al. studied 329 real bugs from three machine learning frameworks: Scikit-learn, Paddle, and Caffe. Over 22% of the bugs were found to be compatibility problems caused by incompatible operating systems, language versions, or conflicts with hardware.”
21 | An Orchestrated Empirical Study on Deep Learning Frameworks and Platforms | Bug Detection in learning program and in Framework | 2018 | “Guo et al. investigated deep learning frameworks such as TensorFlow, Theano, and Torch. They compared the learning accuracy, model size, and robustness of different models on the MNIST and CIFAR-10 datasets.”
22 | Adaptation of General Concepts of Software Testing to Neural Networks | Bug Detection in learning program | 2018 | “Karpov et al. also highlighted the importance of testing algorithm parameters in all neural-network testing problems. The parameters include the number of neurons and their types (based on the neuron layer types), the ways the neurons interact with each other, the synapse weights, and the activation functions. However, the work currently remains unevaluated.”
23 | Security Risks in Deep Learning Implementations | Bug Detection in Learning Framework | 2018 | “Xiao et al. focused on the security vulnerabilities of popular deep learning frameworks including Caffe, TensorFlow, and Torch. They examined the code of these frameworks and found their dependencies to be very complex. Multiple vulnerabilities were identified in their implementations. The most common vulnerabilities are bugs that cause programs to crash, hang, or exhaust memory.”
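The DeepXplore entry above describes neuron coverage, the criterion used to drive test generation. Below is a minimal sketch of the idea, assuming neuron activations have already been extracted into a NumPy array and scaled to [0, 1]; the threshold and names are illustrative and not taken from the DeepXplore implementation.

```python
import numpy as np

def neuron_coverage(activations: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of neurons driven above `threshold` by at least one input."""
    # `activations` has one row per test input and one column per neuron,
    # with values assumed scaled to [0, 1].
    covered = (activations > threshold).any(axis=0)
    return covered.sum() / covered.size

# Toy usage: 100 test inputs, 32 neurons.
rng = np.random.default_rng(0)
acts = rng.random((100, 32))
print(f"neuron coverage: {neuron_coverage(acts):.2%}")
```

Test generators such as DeepXplore try to raise this number by searching for inputs that activate previously uncovered neurons.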
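The DeepTest entry summarises a greedy search over realistic image transformations. The sketch below shows the shape of that search with two pure-NumPy stand-in transformations and a placeholder coverage objective; the real tool uses nine transformations (including shear, blur, fog and rain effects) and the model's neuron coverage as the objective.

```python
import numpy as np

def brightness(img, delta):
    # Shift pixel intensities and keep them in [0, 1].
    return np.clip(img + delta, 0.0, 1.0)

def contrast(img, alpha):
    # Scale intensities around the image mean.
    return np.clip(img.mean() + alpha * (img - img.mean()), 0.0, 1.0)

def coverage_of(img):
    # Placeholder objective; a real harness would return the neuron
    # coverage obtained by running the model under test on `img`.
    return float(img.std())

def greedy_transform(seed, candidates, steps=5):
    # Greedily stack transformations while they keep improving coverage.
    best, best_cov = seed, coverage_of(seed)
    for _ in range(steps):
        improved = False
        for t in candidates:
            cand = t(best)
            cov = coverage_of(cand)
            if cov > best_cov:
                best, best_cov, improved = cand, cov, True
        if not improved:
            break
    return best

seed = np.random.default_rng(1).random((64, 64))
generated = greedy_transform(
    seed, [lambda x: brightness(x, 0.1), lambda x: contrast(x, 1.2)])
print(f"seed coverage={coverage_of(seed):.3f}  generated coverage={coverage_of(generated):.3f}")
```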
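The Rabin et al. entry concerns testing code-embedding models with semantic-preserving program transformations. Variable renaming is one such transformation; the sketch below applies it to a toy Python function using the standard ast module (ast.unparse needs Python 3.9+). The transformation choice and identifiers are illustrative, not the exact set used in the paper.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rename one local identifier; the program's behaviour is unchanged."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

src = "def area(w, h):\n    result = w * h\n    return result\n"
tree = RenameVariable("result", "tmp").visit(ast.parse(src))
print(ast.unparse(tree))   # same semantics, different identifiers
```

A code-embedding model that generalises well should produce near-identical predictions for the original and transformed programs.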
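The perturbed-model-validation entry probes overfitting with synthetic perturbations of known size. A hedged sketch of the underlying idea: retrain on copies of the training set with increasing label-noise rates and watch how training accuracy responds (a badly overfitted model keeps fitting the noise). The model, dataset and noise rates below are illustrative, and the exact metric in the paper differs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

for noise_rate in (0.0, 0.1, 0.2, 0.4):
    noisy = y.copy()
    flip = rng.random(len(y)) < noise_rate            # flip a growing fraction of labels
    noisy[flip] = 1 - noisy[flip]
    model = LogisticRegression(max_iter=1000).fit(X, noisy)
    acc = model.score(X, noisy)                       # training accuracy on the noisy labels
    print(f"label-noise rate {noise_rate:.1f} -> training accuracy {acc:.3f}")
```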
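The DeepGauge entry introduces k-multisection neuron coverage, which profiles each neuron's activation range on the training data and then counts how many of the k sections of that range the test data reaches. The sketch below is a minimal NumPy version, assuming activations are already extracted; k and the array names are illustrative.

```python
import numpy as np

def k_multisection_coverage(train_acts, test_acts, k=10):
    # Profile each neuron's [low, high] activation range on the training data.
    low = train_acts.min(axis=0)
    high = train_acts.max(axis=0)
    width = np.where(high > low, high - low, 1.0)     # avoid division by zero
    covered = np.zeros((train_acts.shape[1], k), dtype=bool)
    for acts in test_acts:                            # one activation row per test input
        section = np.floor((acts - low) / width * k).astype(int)
        in_range = (section >= 0) & (section < k)     # out-of-range hits belong to corner-case criteria
        covered[np.arange(acts.size)[in_range], section[in_range]] = True
    # Fraction of all (neuron, section) pairs hit by the test data.
    return covered.sum() / covered.size

rng = np.random.default_rng(2)
train = rng.normal(size=(1000, 16))                   # profiled activations, 16 neurons
test = rng.normal(size=(50, 16))                      # activations on the test inputs
print(f"k-multisection coverage: {k_multisection_coverage(train, test):.2%}")
```

Neuron boundary coverage and strong neuron activation coverage, also proposed in the paper, instead count the out-of-range activations that this sketch ignores.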
Related Work Research