terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

How will we use and support logging, provenance, and reproducibility within the pipeline? #149

Open dlebauer opened 7 years ago

dlebauer commented 7 years ago

Description

Please provide use cases below as comments, comment on use cases, and use the 👍 flag!

@robkooper @max-zilla @pless @terraref/developers @terraref/standards-committee @ludaesch @tmcphillips

dlebauer commented 7 years ago

Here are some use cases:

  1. Re-creating data products with updated code
    • Each year we will release an expanded set of data products, extending some in time. All data products of a given type should be generated using the same version of code. We should be able to automatically detect any modifications and their dependencies.
  2. Re-running with another method or different parameters
    • The reference team and Lemnatec are setting up parallel pipelines: we should minimize the difference between these pipelines so that they differ only in the algorithm used to get from one processing step to the next.
    • For any data product we should be able to find the input files, run them through another algorithm, and compare the output with the original data product.
    • Another user may develop an alternative algorithm. They should be able to develop and evaluate this algorithm and then deploy it (on our system or elsewhere).
    • Related: compare output generated with different parameterizations of the same algorithm.
  3. Support Journal and Funder expectations of reproducibility
    • Facilitate the archiving of code and data required by funding agencies and journals. Specific requirements are defined by these entities.
  4. Demonstrate the value of open science and reproducibility (lead by example)
    • We want to demonstrate how reproducibility enables scientific advancement.
    • If we can demonstrate the value and make it easy for users to provide reproducible examples, then the program and the community will benefit.
  5. Appropriate attribution
  6. Fix all data downstream of an error
    • Stewart the scanner operator finds bird droppings on the sensor. He isn't sure when they got there, but he wants to identify all downstream files and data points (including database records) derived from images taken since the last time he cleaned the sensor bay, and then check, flag, and/or delete them.
    • Related: Stewart finds out that the calibration matrix for sensor X was inverted. He wants to identify and regenerate all data products generated from these files using the correct calibration matrix.
  7. (Reusability): Deploy the pipeline at another institution
    • Our pipeline could be used at other institutions with similar hardware / needs. Users at Arkansas and Nebraska have expressed interest in this. NB: demand for this is influenced by the inadequacy of the OEM software.
    • How easy can we make it for end users to deploy and modify / extend our pipeline at their institution (or company)?
    • Related: the hardware manufacturer (Lemnatec) needs to develop new software. How easy is it for them to build on our infrastructure?
  8. Detect errors, assess pipeline performance
    • How can we ensure that failures are automatically detected and that the platform operator is notified?
    • More generally: how should pipeline components log their successes and failures? What information is useful? (See the sketch after this list.)
    • Also: summarize performance. ARPA-E wants us to demonstrate that we are meeting metrics such as 'capture X% of data in N hours'.
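To make the logging and traceability questions in use cases 1, 6, and 8 more concrete, here is a minimal sketch of the kind of structured record each extractor run could emit. The field names and the example extractor are illustrative assumptions, not an agreed schema:

```python
"""Minimal sketch of a per-run provenance record an extractor could emit.

All field names here are illustrative, not an agreed schema; the point is
that each run records enough (code version, parameters, inputs, outcome)
to answer the use cases above (re-run, trace, fix downstream, audit).
"""
import json
import time
import uuid


def build_run_record(extractor_name, code_version, parameters, input_files,
                     output_files, status, message=""):
    """Assemble a structured record describing one extractor run."""
    return {
        "run_id": str(uuid.uuid4()),      # unique ID for this run
        "extractor": extractor_name,      # which component ran
        "code_version": code_version,     # e.g. git commit or release tag
        "parameters": parameters,         # exact parameterization used
        "inputs": input_files,            # file IDs / paths consumed
        "outputs": output_files,          # file IDs / paths produced
        "status": status,                 # "success" or "failure"
        "message": message,               # error text, warnings, etc.
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }


if __name__ == "__main__":
    record = build_run_record(
        extractor_name="example.canopy_height",   # hypothetical extractor
        code_version="v1.2.0",
        parameters={"threshold": 0.3},
        input_files=["scanner3DTop/2017-05-01/raw.ply"],
        output_files=["scanner3DTop/2017-05-01/height.csv"],
        status="success",
    )
    # Written next to the output (or attached as metadata) so that
    # downstream products can be traced back to this run.
    print(json.dumps(record, indent=2))
```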
tmcphillips commented 7 years ago

Another use case for reproducibility is enabling researchers to independently reproduce any of the data products and other results ultimately derived from input sensor data sets and published by TERRA REF. (This may be implied by use case 3 above, 'Support Journal and Funder expectations of reproducibility').

For this as well as the other use cases for provenance/reproducibility, it may be worthwhile elaborating each use case to yield user stories sufficiently detailed to highlight what would really be required, via what sequence of steps, using what data/metadata and compute resources, and by whom, to achieve the desired reproducibility result or to answer a particular class of provenance queries.

For example, if someone wanted independently to reproduce one of the data-products/data-sets/sets-of-metadata published by TERRA Ref, would they need to use their own instances of Clowder and RabbitMQ to serve as a workflow engine, in order to re-perform all of the relevant computations? Would this be practical or even feasible? Would they easily be able to discover and install all of the relevant extractors and input data sets needed for the calculation? (This last question implies several additional provenance use cases to consider targeting.)

If someone wanted to use the centrally maintained instances of Clowder/RabbitMQ for this purpose instead (i.e. not truly independently of the official project software installations and computing infrastructure, so only ‘recomputing’ the result, not ‘reproducing’ it in the more rigorous sense), would they be able to request that the correct versions of each extractor be used for these re-computations?

Etc.

tmcphillips commented 7 years ago

It sounds like there is consensus that TERRA Ref products might be more practical to reproduce (especially by others, using their own compute resources) if one could export a representation of the effective workflow (sequence of extractors, parameter settings, and references to data sets) that ultimately yielded a particular product. Such an exported workflow representation could allow for rerunning just that workflow independently of Clowder/RabbitMQ, i.e. as a standalone software pipeline (or Python script or Makefile, etc) with the outputs of one extractor passing directly as inputs to the next extractor, thus making the workflow runnable in the absence of significant computing infrastructure (e.g. using just Python and a few pip-installable Python packages).
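As a rough illustration (not a proposal for the actual export format), an exported workflow could be as simple as a short standalone script in which each extractor becomes a plain function and the captured parameter settings are pinned at the top; everything below (step names, parameters, file names) is hypothetical:

```python
"""Rough sketch of an exported workflow run as a standalone pipeline.

The step functions and file names are placeholders; a real export would be
generated from the provenance of a specific TERRA REF product, but the shape
is the same: each extractor becomes a plain function, and outputs are passed
directly to the next step with no Clowder/RabbitMQ in the loop.
"""
from pathlib import Path

# Parameter settings captured from the original run.
PARAMS = {
    "demosaic": {"method": "bilinear"},
    "canopy_cover": {"threshold": 0.30},
}


def demosaic(raw_path: Path, out_dir: Path, method: str) -> Path:
    """Stand-in for the first extractor: raw image -> demosaicked image."""
    out = out_dir / (raw_path.stem + "_demosaicked.txt")
    out.write_text(f"demosaicked({raw_path.name}, method={method})\n")
    return out


def canopy_cover(image_path: Path, out_dir: Path, threshold: float) -> Path:
    """Stand-in for the second extractor: image -> canopy cover estimate."""
    out = out_dir / (image_path.stem + "_canopy_cover.csv")
    out.write_text(f"plot,canopy_cover,threshold\n1,0.42,{threshold}\n")
    return out


if __name__ == "__main__":
    work = Path("workflow_run")
    work.mkdir(exist_ok=True)
    raw = work / "stereoTop_example.bin"   # placeholder for a downloaded input
    raw.write_text("raw sensor bytes\n")

    step1 = demosaic(raw, work, **PARAMS["demosaic"])
    step2 = canopy_cover(step1, work, **PARAMS["canopy_cover"])
    print("final product:", step2)
```

A Makefile carrying the same steps would serve equally well; the point is that the export captures the sequence of extractors, their parameter settings, and references to the input data, and nothing more.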

I generally like the idea of making it easy to reproduce results outside of the original computational infrastructure, because such infrastructure can itself easily become hard to reproduce or to document rigorously and understandably. (VMs go part of the way, and software containers go further, but there's nothing like a from-scratch installation to convince one that a reported result can be reproduced. We've probably all heard of cases where a seemingly significant result could not be reproduced by the original researchers following a minor version upgrade of a C++ compiler.)

robkooper commented 7 years ago

Most of the code can be run independently, so you can download the code and data and run the code on the data. Any parameters used should be either documented in the code or hopefully stored as metadata.

dlebauer commented 7 years ago

The use case that @tmcphillips outlines is important - it is related to the need to be able to deploy algorithms on different platforms (e.g. deploying a pipeline on Arduino so it can compute values of interest before storing them).

I think the key is implementing an SOP and framework so that we can remove the 'most', 'should', and 'hopefully' from @robkooper's statement, and ensure that the process of downloading the code and data and running the code on the data is as easy as possible. Some of this should be captured in the READMEs and tests proposed in #160.
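As one concrete shape such a test (per #160) might take, here is a minimal pytest sketch that re-runs a step on pinned input and parameters and checks that the product is reproduced; `fake_extractor`, the input bytes, and the parameters are stand-ins for a real extractor and a published reference run:

```python
"""Sketch of a reproducibility check: re-run a pipeline step on pinned
inputs and parameters and compare against the reference output.

`fake_extractor` is a deterministic stand-in for a real extractor; in
practice the reference checksum would be recorded when the product was
first published, not computed inside the test.
"""
import hashlib
from pathlib import Path


def fake_extractor(input_path: Path, out_dir: Path, scale: float) -> Path:
    """Deterministic stand-in for a real extractor."""
    out = out_dir / "product.csv"
    out.write_text(f"value\n{len(input_path.read_bytes()) * scale}\n")
    return out


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def test_reproduces_reference_product(tmp_path):
    # Pinned input and parameters (would be downloaded/published in practice).
    input_path = tmp_path / "input.bin"
    input_path.write_bytes(b"raw sensor bytes")
    params = {"scale": 2.0}

    # Reference product and its checksum, standing in for the published ones.
    reference = fake_extractor(input_path, tmp_path, **params)
    expected = sha256(reference)

    # Re-running with the same code, data, and parameters must reproduce it.
    rerun_dir = tmp_path / "rerun"
    rerun_dir.mkdir()
    rerun = fake_extractor(input_path, rerun_dir, **params)
    assert sha256(rerun) == expected
```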