Discussion topic: Reproducibility in a Dynamic, Uncontrollable World
Brief description of issue/challenge:
Computational reproducibility has been held up as one of the aims of research in both academia and industry, and a great deal of effort in the open source community goes into making it possible to reproduce computational results exactly. However, in many contexts this is an unachievable aim: whenever you need to connect to a system that you cannot control, and which may change without notice, fully deterministic builds are impossible.
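As a concrete illustration, the minimal Python sketch below (using a hypothetical endpoint URL) records a content digest with every fetch. This does not make the run deterministic, but it does make an upstream change detectable after the fact, which is often the best that can be achieved against an uncontrolled system.

```python
import hashlib
import json
import urllib.request

# Hypothetical endpoint; any remote system outside your control would do.
URL = "https://example.org/api/data"

def fetch_with_digest(url: str) -> tuple[bytes, str]:
    """Fetch a remote resource and compute a digest of its content.

    Recording the digest alongside the results does not make the
    pipeline deterministic, but it lets a later run detect that the
    upstream system changed in the meantime.
    """
    with urllib.request.urlopen(url) as response:
        body = response.read()
    return body, hashlib.sha256(body).hexdigest()

body, digest = fetch_with_digest(URL)
print(json.dumps({"url": URL, "sha256": digest}))
```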
This is a particularly salient problem for researchers in industry who are interested in reproducibility but whose work may need to be integrated with a much larger system over which they have little control. In practice, this can lead to large "kitchen sink"-style Docker images built to ensure interoperability with the rest of a company's systems.
Similarly, an academic researcher who queries the web for the data they analyse cannot rely on their work being reproducible. The only way around this is to store every data set they query, which rapidly becomes too challenging a data maintenance problem for most researchers to handle.
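A minimal sketch of that snapshotting approach is below (in Python, with a hypothetical local archive directory). Every queried dataset is stored under a content-addressed name, so later runs can be replayed from the archive rather than the live web; the manifest makes clear how quickly the maintenance burden accumulates.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

ARCHIVE = Path("data_snapshots")  # hypothetical local archive directory

def snapshot(url: str) -> Path:
    """Download a queried dataset and store it under a content-addressed name.

    Re-running an analysis can then read from the snapshot rather than
    the live web, trading a data maintenance burden for reproducibility.
    """
    with urllib.request.urlopen(url) as response:
        body = response.read()
    digest = hashlib.sha256(body).hexdigest()
    ARCHIVE.mkdir(exist_ok=True)
    path = ARCHIVE / f"{digest}.bin"
    if not path.exists():  # identical content is stored only once
        path.write_bytes(body)
    # A small log of what was fetched, and when, aids later auditing.
    with (ARCHIVE / "manifest.tsv").open("a") as manifest:
        manifest.write(f"{time.time()}\t{url}\t{digest}\n")
    return path
```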
If we are developing open source solutions to the problems of reproducibility, how can we incorporate these considerations to cover these kinds of use cases? Are there ways in which reframing the problem could introduce new models of collaboration between the current open source software and computational reproducibility communities? Can we grow the community, and thereby establish a wider base of use cases, as a means of making software supporting reproducible computing more sustainable?
Lead/moderator:
Links to resources: