hackalog / easydata

A flexible template for doing reproducible data science in Python.
MIT License

Need "CI For Environments" #234

Open hackalog opened 2 years ago

hackalog commented 2 years ago

In brief

We need environments to be shareable, reproducible, and upgradable for at least a 2-month window (ideally 6-12 months). This is deceptively non-trivial.


Problems / what we want

  1. We want participants to be able to install and load up a working environment reliably and quickly. Ideally, we would do this with a lock file that bypasses dependency resolution.
  2. Environments need to be incrementally upgradable. When we add a package, or when a package upgrades, we want to be able to update the environment easily, avoiding the headache of a full environment resolve.
  3. Building environments from lock files has to respect platform+architecture differences, so you need a lock file for each platform+architecture combination.
  4. Conda lock files that properly include the pip section don't really exist.
  5. Pip installation needs to be as first class as conda. We can't have the same package installed by both.
  6. Upgrades and resolving can be a giant headache (for all the reasons we've been dealing with the past couple of weeks). These issues and more are alluded to here: http://iscinumpy.dev/post/bound-version-constraints/
  7. To avoid this headache, we'd like to be able to test the full solve on clean environments on multiple platforms to be able to catch issues before we break the environment build. This way the changes are small and more easily debugged, rather than a giant snotball of changes that is hard to figure out.
  8. We need to be able to hand-pin versions of packages to avoid bugs when they come up, but keep track of when we can unpin again. It would be great to easily automate the testing of the unpin without breaking the environment build.
  9. Python environments become huge and, at some point, can't be resolved. We want to move to 1 env per repo, and then to more than 1 env per repo.
  10. What's cached locally affects the build. Environment specification should always be CLEAN-ROOM.
  11. We want to be platform agnostic, so a Docker container isn't the answer for this.
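One way to picture item 8 (hand pins that we can track and test removing) is the minimal sketch below. It is not part of easydata; the `Dependency` class, the `unpin_trials` helper, and the package name `examplepkg` are all hypothetical names made up for illustration. The idea is that each pin carries its reason, and CI can generate one trial spec list per pin with only that pin removed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Dependency:
    """One package entry. `pin` and `reason` are set only for hand pins."""
    name: str
    pin: Optional[str] = None      # version constraint such as "<2.0"; None means unpinned
    reason: Optional[str] = None   # why the pin exists, so we know when it can go

    def spec(self) -> str:
        """Render the entry as a package spec string."""
        return f"{self.name}{self.pin}" if self.pin else self.name


def pinned_specs(deps: List[Dependency]) -> List[str]:
    """Specs for the normal build, with every hand pin applied."""
    return [d.spec() for d in deps]


def unpin_trials(deps: List[Dependency]) -> Dict[str, List[str]]:
    """For each hand pin, build the spec list with only that one pin removed.

    Each trial can be solved in a clean CI environment; if the trial build
    passes, that pin is a candidate for removal.
    """
    trials: Dict[str, List[str]] = {}
    for target in deps:
        if target.pin is None:
            continue
        trials[target.name] = [
            d.name if d is target else d.spec() for d in deps
        ]
    return trials


# Hypothetical example: "examplepkg" and its pin are placeholders.
deps = [
    Dependency("numpy"),
    Dependency("examplepkg", pin="<2.0", reason="hypothetical upstream bug"),
]
print(pinned_specs(deps))
print(unpin_trials(deps))
```

Removing one pin per trial keeps each change small, which matches the goal in item 7 of catching problems before they pile up.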

Related Problems, but not mainline at the moment

  1. If I'm maintaining a library and associated notebooks as documentation, I'd like to be able to provide an environment (and even datasets) that work to run the notebooks so I don't have to debug individual environment issues for users.
  2. If I'm maintaining a project, I'd like to know when my dependencies are shifting in a way that is incompatible with my project. I'd like to run CI on --dev so I can know what's coming down the pipe and if there are any breaking changes, and anything that breaks my tests. The part that's tricky is when it breaks my environment before it breaks my test. It would be nice to have a "helping hand" on that step.

What have we tried and things we've looked at

  1. make + conda env export
  2. conda-lock
  3. conda-lock + Poetry
  4. mamba solver vs. conda solver

What don't we know

  1. Does anyone else care? How do people try to work around this already? This is a maintainer problem, not a user problem.
    • For web-based applications, we've heard of pip lock files, and GitHub Actions that resolve and propose security patches as they become available.
  2. What's the easiest/hackiest way to hand build an MVP that addresses the core issues? We need something that works for us for the next 2 months. We're willing to try something that is messy to do, but works.

Running Comments

Lockfiles, Environment Generation, and Windows

In working with our Windows users to determine the cause of their environment creation woes, it turns out it’s not Windows at fault here. There were issues around the version pin for igraph. Removing that pin allows the environment to be created successfully, but it still takes a REALLY long time to generate.

There’s a hack. You can create the environment without igraph (and the other two troublesome packages), and then add the three offending packages with ‘make update_environment’, and it goes much more quickly. Presumably, this is because it cuts down (or changes the order of) the dependency resolution search. Still, there’s no easy way to make this work in CI, or for end-users without manual intervention, so we need another way.
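The two-phase hack described above amounts to partitioning the dependency list. The sketch below is an illustration only: `split_dependencies` and `package_name` are hypothetical helpers, not easydata functions, and every package besides igraph in the example is a placeholder, since the other two troublesome packages are not named here. It assumes simple specs such as `numpy>=1.20` and does not handle channel prefixes like `conda-forge::numpy`.

```python
import re
from typing import Iterable, List, Tuple


def package_name(spec: str) -> str:
    """Return the bare package name from a spec such as 'numpy>=1.20' or 'python=3.9'."""
    return re.split(r"[\s=<>!~]", spec.strip(), maxsplit=1)[0].lower()


def split_dependencies(
    specs: Iterable[str], deferred: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split specs into (first_pass, second_pass).

    first_pass is used to create the environment; second_pass holds the
    troublesome packages that get added afterwards in an update step.
    """
    deferred_names = {name.lower() for name in deferred}
    first_pass: List[str] = []
    second_pass: List[str] = []
    for spec in specs:
        if package_name(spec) in deferred_names:
            second_pass.append(spec)
        else:
            first_pass.append(spec)
    return first_pass, second_pass


# Placeholder dependency list; only "igraph" comes from the discussion above.
first, second = split_dependencies(
    ["python=3.9", "numpy>=1.20", "igraph", "pandas"],
    deferred={"igraph"},
)
print("create with:", first)
print("then update with:", second)
```

This only automates the bookkeeping. It does not address the underlying problem that the result depends on the order in which packages are resolved.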

In the end, Amy and I concluded that we should generate and check in lockfiles for the major platforms we are using, and ensure conda environment generation uses those lockfiles (vs. environment.yml) if present. This got us thinking about what CI for Environments would look like. We wrote up a strawman, and we have a plan to implement it with Azure Pipelines.
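A rough sketch of the "use the lockfile if present" rule might look like the following. The file naming scheme `environment.<platform>.lock.txt`, the helper names, and the choice of conda's explicit spec-file format for the lockfile are all assumptions made for this example, not a description of what easydata does. The sketch only builds the command; it does not run conda.

```python
import platform
from pathlib import Path
from typing import List, Optional


def platform_tag(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Map the OS and CPU architecture to a conda-style platform name such as 'linux-64'."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    os_part = {"linux": "linux", "darwin": "osx", "windows": "win"}.get(system, system)
    arch_part = {
        "x86_64": "64",
        "amd64": "64",
        "arm64": "arm64",
        "aarch64": "aarch64",
    }.get(machine, machine)
    return f"{os_part}-{arch_part}"


def create_command(repo_dir: Path, env_name: str, tag: str) -> List[str]:
    """Prefer a checked-in lockfile for this platform; otherwise fall back to environment.yml."""
    # Hypothetical naming convention for per-platform lockfiles.
    lockfile = repo_dir / f"environment.{tag}.lock.txt"
    if lockfile.exists():
        # Assumes the lockfile is a conda explicit spec file, so no solve is needed.
        return ["conda", "create", "--name", env_name, "--file", str(lockfile)]
    # No lockfile for this platform: do a full solve from environment.yml.
    return [
        "conda", "env", "create",
        "--name", env_name,
        "--file", str(repo_dir / "environment.yml"),
    ]


print(create_command(Path("."), "demo_env", platform_tag()))
```

Keeping the fallback to environment.yml means platforms without a checked-in lockfile still build, just with the slower full solve.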


What I did was mostly in response to blockers/what didn’t go well that came up last week…so I’ll leave it for the next section. From what I said I’d do last week, I sort of fixed the environment.yml, and have a potential fix for CI. I have a whiteboard sketch of what DONE looks like for the preparation, but still need to transcribe it to the wiki.

We keep butting up against an issue with environment creation and maintenance that is annoying at best, and a total blocker for some participants at worst. I was wrestling with this already when our Windows users came up against the exact same thing. It all boils down to the fact that environment creation is fragile (especially if you want to maintain the flexibility to upgrade).

As @hackalog mentioned in his post, we’ve been working through the nuances of this, and what CI/CD for environment creation and maintenance might look like, especially since environment creation is probably the biggest blocker to our goal of <15 minutes from fork+clone to loading up a notebook that runs successfully.

Note on the above issues: either the conda solver was buggy or slow, or mamba was ignoring the strict channel order, or both. We thought we had solved the slow/crashing build problem by switching to mamba. UPDATE: We hadn't. Mamba wasn't building the environment correctly. In particular, mamba env does not appear trustworthy. Sigh.

References

  1. http://iscinumpy.dev/post/bound-version-constraints/
  2. https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html
  3. I wish conda-lock actually reliably worked like this: https://pythonspeed.com/articles/conda-dependency-management/