openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Infrastructure (back-end) thoughts and concerns #103

Closed flo-compbio closed 1 year ago

flo-compbio commented 4 years ago

Hey everyone,

During the call today I had a few thoughts and concerns about the current infrastructure (back-end). Disclaimer: I'm not 100% familiar with the current back-end, so please point out if I'm making any incorrect assumptions about how things currently work. I'm mostly relying on my limited experience of adding a new task and on Scott's brief descriptions of the infrastructure.

1) I really like the "snakemake approach" where new "builds" are only triggered when one of the inputs changes. Given the broad scope of this project (many tasks, many datasets, many methods), it seems clear that a new "build" should only be triggered when a new dataset or a new method gets added, and only for those specific datasets/methods (no need to re-run all the other methods on all the other datasets). Currently I think everything gets run with each new commit, and I don't think that's a sustainable solution.

2) In thinking about this snakemake approach some more, I think it's interesting to think about what should happen when a new metric gets added to a particular task. Naively, this would require all methods to run on all datasets again, which seems like a huge effort. The only way to avoid this would be to store the result of each method on each dataset. This way, any new metric could be run simply by comparing the stored method result to the "ground truth". This suggests that metric calculation should be decoupled from actually running the methods. Currently, method results are not stored anywhere.

3) The previous points 1) and 2) suggest to me that we should store both the docker images (one docker image for each method and each version thereof) and the input data (one file/directory per dataset) as files on a server (probably a cloud bucket). This would make it clear when we would have to run a method, namely when we encounter a previously unseen combination of docker image and input data. This also seems very different from how things are set up currently, where both methods and datasets are fundamentally represented as code, not as files, which in my mind makes it very difficult to implement logic for tracking what has already been run and what still needs to be run.

In summary, I really think the infrastructure needs to be modified to allow more modularity in terms of when to run which method on which dataset, and to decouple metric calculation from everything else.
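
To illustrate what I mean in points 2) and 3), here is a rough sketch (in Python, with hypothetical names -- not existing code) of how the "has this combination been run?" logic could work if both docker images and datasets were stored as files:

```python
import hashlib


def file_md5(path):
    """md5sum of a file on disk (a dataset file or an exported docker image)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def result_key(method_image, dataset_file):
    """Bucket key identifying one (method, dataset) combination."""
    return f"results/{file_md5(method_image)}/{file_md5(dataset_file)}.h5ad"


def needs_run(existing_keys, method_image, dataset_file):
    """A method only needs to run if this exact combination has never been seen."""
    return result_key(method_image, dataset_file) not in existing_keys
```

A new metric would then only need the stored result files and the ground truth, not a re-run of any method.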

LuckyMD commented 4 years ago

I can quickly add a comment for point 1. Only the Travis runs happen at every commit atm, and these will only use test datasets, which in the future will be miniature versions of the real datasets. That has to be done so that we know the code we are adding can be run at all.

scottgigante commented 4 years ago

Thanks for opening this up. Lots of thinking to do as we scale.

  1. As Malte said, this one is not really a big deal -- the CI is done with small datasets and runs quickly.
  2. The way the snakemake pipeline is currently set up, we don't have a way of selectively running just a few methods, because the full codebase cannot be linked as a snakemake dependency (e.g. if scikit-learn changes its default behaviour for an MLP, that should trigger a rerun of the corresponding method, even though nothing in our codebase has changed).
  3. One thing we have talked about is storing intermediates in S3. An example is the data loaders, which currently store intermediates in /tmp after the first run. Eventually these will be stored in S3 instead -- then any change to the raw data or the preprocessing will trigger a rebuild. We could do the same for methods and metrics: using S3 as a cache, we can check the uploaded file against a hash of the current code versions and, if anything has changed, recompute.
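
As a very rough sketch of that caching idea (hypothetical bucket and key names, not the current pipeline), it could look something like this:

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "openproblems-cache"  # hypothetical bucket name


def code_hash(*parts):
    """Hash together whatever should invalidate the cache,
    e.g. the method's source and the versions of its key dependencies."""
    h = hashlib.sha256()
    for part in parts:
        h.update(str(part).encode())
    return h.hexdigest()


def cached_or_recompute(key, current_hash, recompute):
    """Reuse the uploaded object if its stored hash matches; otherwise recompute and upload."""
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
        if head["Metadata"].get("code-hash") == current_hash:
            return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except ClientError:
        pass  # object not in the cache yet -- fall through and recompute
    result = recompute()
    s3.put_object(Bucket=BUCKET, Key=key, Body=result,
                  Metadata={"code-hash": current_hash})
    return result
```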

LuckyMD commented 4 years ago
  1. Do we want to rerun if dependencies for methods changed? It can happen that methods break when dependencies are updated. I guess that is now controlled via the docker requirements (which can be version-specific) luckily ^^.

  2. I think storing method outputs as well makes sense. Running methods is already a separate snakemake rule, no? If that's the case, then the output from that rule should already be there. Or is that also done via tempfile atm? If not, then don't you already store method outputs?

flo-compbio commented 4 years ago

Thanks Malte and Scott for your comments!

> I can quickly add a comment for point 1. Only the Travis runs happen at every commit atm, and these will only use test datasets, which in the future will be miniature versions of the real datasets. That has to be done so that we know the code we are adding can be run at all.

@LuckyMD @scottgigante Regarding testing, I think the testing system needs to be completely decoupled from the "actual" (real-world) set of methods and datasets available at any given time. I think that for each task there should be a few dummy datasets (at least one) and a few dummy methods (at least one). The test would then check whether the code can correctly load a dataset, execute a method, read a result, and calculate the metrics for each task. I don't think the testing system should include a test of whether a particular (real-world) method is running correctly. If a method (docker image) gets added that doesn't work properly, this should just result in N/A values for that method's metrics, but that's not really a problem with the back-end code itself.
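
For example (just a sketch -- the dummy objects below are hypothetical and not part of the current codebase), the tests could look roughly like this:

```python
import anndata
import numpy as np


def dummy_dataset():
    """A tiny synthetic dataset with known ground truth."""
    adata = anndata.AnnData(X=np.random.poisson(1.0, size=(50, 20)).astype(float))
    adata.obs["ground_truth"] = np.repeat(["a", "b"], 25)
    return adata


def dummy_method(adata):
    """A trivial 'method' that just copies the ground truth -- exists only to exercise the pipeline."""
    adata.obs["prediction"] = adata.obs["ground_truth"]
    return adata


def test_pipeline_roundtrip():
    """Check that the back-end can load data, execute a method, and score the result."""
    adata = dummy_dataset()
    result = dummy_method(adata)
    accuracy = (result.obs["prediction"] == result.obs["ground_truth"]).mean()
    assert accuracy == 1.0
```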

> The way the snakemake pipeline is currently set up, we don't have a way of selectively running just a few methods, because the full codebase cannot be linked as a snakemake dependency (e.g. if scikit-learn changes its default behaviour for an MLP, that should trigger a rerun of the corresponding method, even though nothing in our codebase has changed).

@scottgigante I think if a method depends on scikit-learn, then a particular version of it needs to be pre-installed in the docker container for that method (e.g. as part of a conda environment), or at the very least the docker image should contain a snapshot of the environment including the specific versions, so that the environment can then be installed via conda/pip (e.g. using conda environment files, see conda env export). Basically, I think the docker container should completely encapsulate all the dependencies of a given method. I haven't worked with docker in a while, but based on what I know this seems feasible. In my mind, the docker image is then the file that can serve as a very useful dependency for snakemake, or for a snakemake-like system that checks, say, the md5sum of the image.
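
For example, I believe something like this (a sketch; it only relies on the standard docker CLI) could give snakemake a stable "fingerprint" of a method's environment:

```python
import subprocess


def docker_image_id(image):
    """Content-addressed ID of a local docker image, e.g. 'sha256:ab12...'.
    If the image -- and hence its pinned environment -- changes, this ID changes,
    so it can stand in for "the method has changed" in a snakemake-like check."""
    out = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Id}}", image],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()
```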

> One thing we have talked about is storing intermediates in S3. An example is the data loaders, which currently store intermediates in /tmp after the first run. Eventually these will be stored in S3 instead -- then any change to the raw data or the preprocessing will trigger a rebuild. We could do the same for methods and metrics: using S3 as a cache, we can check the uploaded file against a hash of the current code versions and, if anything has changed, recompute.

@scottgigante I think this definitely goes in the direction I was thinking of. The main difference is that I'm not thinking of the things (docker images, data files) stored on S3 as "intermediates". I would make them the primary data source for the entire project. Once we have cloud storage available, I think we have to weigh the complexity burden that allowing external downloads while maintaining this type of version control puts on the system, against the benefit of being fully transparent about "preprocessing" the data. I feel like the definitions of "preprocessing" and "raw data" are quite arbitrary. If we're downloading a count matrix from GEO, why is cell or gene filtering considered "preprocessing", but read mapping and expression quantification (UMI counting) not? Are there really a lot of tangible benefits to making these "preprocessing" steps part of the infrastructure?

flo-compbio commented 4 years ago

Sorry, I just remembered that I had another comment that I forgot to mention in my initial post:

4) This is related to 2) (storing of results): since we're running methods through docker, I think it would also make sense from a technical perspective to require (as part of the "API"/specification for a particular task) that the results be written to a particular file in a particular format. I think in docker you can mount external file systems (probably even S3 buckets) that allow you to read and write data, since the docker image itself is immutable. This would also completely decouple the framework from the language a particular method is written in. So you could run an R method in a docker container and just write the results to a particular file; the result would then get read by our Python code after the container has shut down.
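
Something like this is what I have in mind (a sketch; the image name, paths, and the input/output convention are all hypothetical):

```python
import subprocess

import anndata


def run_method_container(image, workdir):
    """Run a method's docker image with a host directory mounted at /data.
    By convention the container reads /data/input.h5ad and writes
    /data/output.h5ad, regardless of the language the method is written in."""
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{workdir}:/data",
         image, "/data/input.h5ad", "/data/output.h5ad"],
        check=True,
    )
    # our Python code only ever sees files, never the method's runtime
    return anndata.read_h5ad(f"{workdir}/output.h5ad")
```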

scottgigante commented 4 years ago

This is sort of splitting into multiple discussions so I would recommend shifting the discussion of saved outputs to https://github.com/singlecellopenproblems/SingleCellOpenProblems/issues/45.

Re: decoupling from languages, I can imagine shifting to a YAML format where a method is built from a file containing method metadata (currently contained in decorators) and a script file of the form `./script /path/to/input.h5ad /path/to/output.h5ad`. I think there are some trade-offs to be managed here, principally in the realm of sandboxing -- the current setup is very easy to interact with by running a simple `git clone; pip install; python -c 'import openproblems'`. If everything is moved to language-agnostic descriptors that are run exclusively in docker containers, we lose some of that ease of use in favour of breadth of use.
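
Roughly (a sketch only -- the field names here are hypothetical, and it assumes PyYAML):

```python
import subprocess

import yaml

# Hypothetical descriptor of the kind described above
DESCRIPTOR = """
name: my_method
image: openproblems/my-method:0.1.0
script: ./run_method.sh
"""


def run_from_descriptor(descriptor_text, input_path, output_path):
    """Parse the method metadata and invoke its script on an input/output .h5ad pair."""
    meta = yaml.safe_load(descriptor_text)
    subprocess.run([meta["script"], input_path, output_path], check=True)
    return meta  # metadata (name, image, ...) for the results table
```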

I'm yet to see a use case where rpy2 can't be used to run more or less any R code, and I doubt that many other languages are in high demand at present. There is already substantial boilerplate involved in adding a method, maybe simply adding a template PR to add an R method would solve this issue without the subsequent loss of ease of use?
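
For reference, the kind of rpy2 wrapping I mean looks roughly like this (a generic sketch, not an existing helper in the repo):

```python
from rpy2 import robjects

# Define an R function inline and call it from Python.
r_scale = robjects.r("""
    function(x) {
        (x - mean(x)) / sd(x)
    }
""")

scaled = list(r_scale(robjects.FloatVector([1.0, 2.0, 3.0, 4.0])))
```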

scottgigante commented 4 years ago

Re: your other point:

> Regarding testing, I think the testing system needs to be completely decoupled from the "actual" (real-world) set of methods and datasets available at any given time. I think that for each task there should be a few dummy datasets (at least one) and a few dummy methods (at least one). The test would then check whether the code can correctly load a dataset, execute a method, read a result, and calculate the metrics for each task. I don't think the testing system should include a test of whether a particular (real-world) method is running correctly.

I tend towards keeping testing of all methods in here -- otherwise how do we know when to merge a PR? It's relatively low cost and ensures that we're not filling our repo with low-quality code.

flo-compbio commented 4 years ago

I guess my idea was that methods should not be part of the codebase but instead just be docker images. Maybe you could keep the metadata for each method in the codebase; a yaml file could be sufficient for that. It just feels to me that there are too many pieces of functionality tightly coupled together through the code and the testing system at the moment, and that the current architecture won't be able to support all the features required given the scope of the project.

LuckyMD commented 4 years ago

I think it is paramount to have the methods as part of this. Keeping the code easily accessible makes it possible for anyone to go and check what a method is doing. That makes it easier to detect cheating (which would otherwise be easy to do, as you have access to the datasets, including the ground truth, via the data loaders).

I'm not sure why you think the current architecture won't be able to support all of the features. What we need is a big S3 bucket, and some AWS servers to run the benchmarks once a month or so.

flo-compbio commented 4 years ago

@LuckyMD I think there's possibly some confusion (or maybe it's a difference of opinion) about what the back-end really needs to do. It looks to me like we need a framework that, at its core, implements some form of "version control" for both datasets and methods, is able to execute the methods on those datasets, then evaluates the results against a set of metrics and presents them on a website. Benchmarks should only be re-run and metrics re-computed for a particular combination of method and dataset if either the method or the dataset is new. Realistically, I think the framework also needs to be language-agnostic in terms of what programming language is used by the methods, since the CompBio world is firmly divided between Python and R, with perhaps a smattering of other languages. And it needs to be built on cloud infrastructure to make sure there are enough computational resources (especially memory) to run all these methods.

I think @scottgigante can probably speak better than I can about how compatible the current code is with those features. To me it seems that the idea of having a language-agnostic framework is at odds with including the methods directly in this codebase, and that the idea of version control for datasets is somewhat at odds with downloading the datasets on the fly from external sources. And I think the idea of only re-running benchmarks if the method or the dataset has changed (or is new) requires some very deliberate design choices.

Regarding your concern that there needs to be transparency in terms of the code of each method, I think a reasonable solution would be to publish electronic notebooks that document exactly how each docker image is generated (which would, for example, include a command like "pip install magic-impute==2.0.3" that shows where the code is coming from). We talked before about how this approach could be useful for providing transparency in terms of how the datasets are generated. In this scenario, if I want to "contribute" a method to the website, I would make a docker image and provide an electronic notebook documenting how I created it.

scottgigante commented 4 years ago

I am actively working on an easier integration of R code. I feel very strongly that the code for all methods should be easily accessible in the repo.