ua-snap / rasdaman-ingest

Collection of ingredients/configurations + docs for ingesting data into Rasdaman
MIT License

Enforce dependency management strategy #33

Closed kyleredilla closed 11 months ago

It would be ideal to have some well-documented strategy for ensuring that all processing code (exclusively jupyter notebooks atm) is functionally reproducible. This repo was initially set up with pipenv, so we have some Pipfiles, but it has not been used for the most part. So there are a few options that come to mind:

  1. A global environment.yml file for a conda env that works with all of the processing notebooks, plus docs for creating it and using it to run the notebooks (see the sketch after this list)

    • pro: simple and straightforward, and most likely all we need!
    • con: reproducibility depends heavily on correctly following the info provided in the docs/README. Another potential hurdle: we are usually only adding processing code, and typically not touching old processing code much once it is completed. The specific thing I worry about is that new additions to a "global" requirements file (i.e., one covering all processing code in the repo) could break old processing code without us ever knowing. This might be a non-issue -- I'm really not sure. A minor con for the most part.
  2. Set up the repo as an Anaconda Project

    • pro: might be slightly more "reproducible" (e.g., ability to export as Docker containers, and installs all dependencies with a single command, anaconda-project run notebook)
    • con: another layer of "meta" software (we'd need a parent conda env with anaconda-project installed; this repo's "project" env is then created and run from inside of that), plus bells and whistles we probably won't need for this project
  3. Apply the existing pipenv config to all existing processing code

    • pro: familiarity within the SNAP team
    • con: does not account for non-Python packages like GDAL
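
For concreteness, here is a rough sketch of what option 1 could look like -- the package list is just a placeholder, not a vetted set of dependencies:

```yaml
# environment.yml -- hypothetical global env; actual package list TBD
name: rasdaman-ingest
channels:
  - conda-forge
dependencies:
  - python=3.8
  - gdal
  - jupyter
  - rasterio
```

and everyone would create and use it the same way:

```bash
conda env create -f environment.yml
conda activate rasdaman-ingest
jupyter notebook
```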

I am leaning towards 1. above for its simplicity. I think 3. is out because it is not ideal for this type of processing (mainly because we need to make sure we are all using the same version of GDAL, but there could be other non-Python dependencies that matter as well). If we start bumping into the "backwards compatibility" con I discussed in 1., we might then switch to anaconda-project, where we could easily have different envs for each of the different processing notebooks, or groups of them anyhow (e.g. arctic_eds, iem, etc.).
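
If we did end up falling back to anaconda-project, I believe it supports named env specs, so (if I'm remembering the syntax right) per-group envs would look something like this -- the group names and packages below are placeholders:

```yaml
# anaconda-project.yml -- hypothetical sketch of per-group envs
name: rasdaman-ingest
env_specs:
  arctic_eds:
    channels: [conda-forge]
    packages: [python=3.8, gdal, jupyter]
  iem:
    channels: [conda-forge]
    packages: [python=3.8, jupyter, xarray]
commands:
  arctic_eds_notebook:
    unix: jupyter notebook
    env_spec: arctic_eds
```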

I think my vote after writing all of this would be a simple environment.yml file and usage instructions in the main README.md.

To fix this issue, we would ideally want to document the compute environment configuration strategy and ensure that it works for all processing notebooks in the repo.

charparr commented 2 years ago

Yeah, good thoughts here Kyle! Luckily we have not really hit any package or dependency issues when bouncing these notebooks back and forth to each other -- but we have both been working on Atlas. I've just been using the Pipfile and have not had issues, e.g., conda activate py38, then pipenv run jupyter notebook, etc.
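
For anyone else picking these notebooks up, my rough workflow on Atlas looks like this (py38 is just the name of my own base env):

```bash
# assumed day-to-day workflow on Atlas; env name py38 is a local convention
conda activate py38          # base env that provides Python 3.8 + pipenv
pipenv install               # create/sync a virtualenv from the repo's Pipfile
pipenv run jupyter notebook  # launch notebooks inside that virtualenv
```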

We could create a .yml on a per-coverage basis, so each ingest and coverage gets a bespoke conda environment. That way you could install whatever packages you want for a new coverage without coming back to an old coverage for a tweak and discovering that your env no longer works.

kyleredilla commented 2 years ago

We are thinking the first thing to try here will be separate environment.yml files for each ingest directory (i.e., a separate set of dependencies for each ingest.json or hook_ingest.json). This might slow development and testing slightly, since we would have to re-create and manage many similar environments and track them separately, but it has the benefit of not breaking previous preprocessing pipelines when we want to try newer / different versions of things as we continue to grow this repository, which is very important.
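
So the layout would be roughly like this -- the directory and notebook names here are just examples:

```
arctic_eds/
  environment.yml   # deps pinned for this ingest only
  ingest.json
  preprocess.ipynb
iem/
  environment.yml
  hook_ingest.json
  preprocess.ipynb
```

and each ingest's env gets created on its own:

```bash
# hypothetical per-ingest workflow; -n gives each env a distinct name
conda env create -f arctic_eds/environment.yml -n arctic-eds-ingest
conda activate arctic-eds-ingest
jupyter notebook arctic_eds/preprocess.ipynb
```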

charparr commented 11 months ago

We've long since moved on from the Anaconda Project dream, and we've found that very few of these processing notebooks need to be re-run or updated. We informally settled on the snap-geo conda env for most of our pre-processing work, though there will always be some exceptions where we need to spin up a fresh development environment. I suggest we close this ticket, but I'll punt it to you @kyleredilla for final closure -- if you want to keep it open or distill it into a new ticket, that's fine by me too!

kyleredilla commented 11 months ago

Thanks Charlie, yep, I agree that this is no longer really an issue! Global-ish has been good enough :)