We actually use even more ways of managing environments:
For python packages:
And a bunch of ways to manage linux packages:
Some of it is Snakemake legacy.
As for conda vs. poetry: I have used both extensively, and poetry is more user-friendly and easier to install and run, both in a local environment and from a script. I did hit an issue with poetry groups recently, though, when their dependencies conflicted; I would expect groups to resolve independently. If they don't, we might need multiple poetry environments, in the same way that we have multiple envs for conda and pip-compile.
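For reference, here's a minimal sketch of how poetry groups are declared and selectively installed (the group and package names below are illustrative, not from our repo); it also shows why conflicts can still bite across groups:

```bash
# Declare dependencies in named groups (names here are hypothetical).
poetry add --group taskcluster taskcluster
poetry add --group lint black

# Groups can be installed selectively:
poetry install --only lint

# ...but `poetry lock` still resolves ALL groups together, which is why
# version conflicts between groups can fail the lock even when the groups
# are never installed into the same environment.
poetry lock
```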
Conda usually provides better reproducibility and the ability to install software beyond python packages (for example, cmake). As you mentioned, it makes some scientific packages easier to install, but it's always a pain to activate a conda env from a script, which is why I added all those unpleasant CONDA_ACTIVATE commands right before running things. Locking an env is also possible, but it's less native to conda and less convenient than poetry lock. We currently pin only high-level dependencies in our Conda envs. Also, regular Conda is incredibly slow and can hang for 10 minutes while solving an env, which is why people switch to Mamba. Conda/Mamba was the only choice for Snakemake and worked fine with Singularity, but it was not user-friendly to run locally. Snakemake was also great at reusing those conda envs without reinstalling packages on every run, as we currently do on TC.
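For context, the reason activation is painful from scripts: `conda activate` is a shell function that only exists after conda's hook has been sourced. A minimal sketch (the env name and command are placeholders):

```bash
#!/bin/bash
set -euo pipefail

# Non-interactive shells don't load conda's init block from ~/.bashrc,
# so the hook has to be sourced explicitly before `conda activate` exists.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate my-pipeline-env   # hypothetical env name

python train.py                  # placeholder for the real step command
```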
It would be great to settle on one approach for Taskcluster. My suggestion would be to have an optional Docker image for each step: if a step requires extra packages, either add them to the base image or override them in the step image, so that we don't need to install anything when running the step. Once the linux environment is set up in Docker, pip/poetry with lock files should be enough, because anything unusual (like PyTorch) can also be preinstalled in the image.
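As a rough sketch of that suggestion (the base image, package list, and paths are all illustrative, not our actual setup), a step image could bake in both the system packages and the locked python dependencies at build time:

```dockerfile
# Hypothetical step image; base, packages, and paths are placeholders.
FROM ubuntu:22.04

# System-level tools are baked in instead of installed on every task run.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip python3-venv cmake build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install the locked python dependencies into the image itself.
WORKDIR /src
COPY pyproject.toml poetry.lock ./
RUN pip3 install poetry \
    && poetry config virtualenvs.create false \
    && poetry install --only main --no-root
```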
As for Snakemake, we can just move everything related to it into a separate directory and leave its maintenance up to contributors. It's already pretty hard to keep that pipeline in sync with the latest changes, and it's not tested anyway.
Another perspective on all of this is how easy it is to test the pipeline steps locally. Right now it's not easy at all, because you might need some linux packages and also have to compile Marian. With pre-built Docker images it would become much easier, assuming we publish them to a public Docker registry. To simplify even further, we could reuse a single Docker image that has everything compiled and installed.
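To illustrate the local workflow this would enable (the registry, image name, and script path below are hypothetical):

```bash
# Pull the published image and run a single pipeline step inside it,
# with the checkout mounted in and no local Marian build required.
docker pull example.registry/translations-pipeline:latest
docker run --rm -it \
    -v "$PWD:/work" -w /work \
    example.registry/translations-pipeline:latest \
    bash -c "python pipeline/some_step.py --help"
```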
To re-state, the recommendation would approximately be:
If we run into dependency issues like the one I hit above, we can work around them by relying on a fork with a fixed dependency declaration, working to get the fix landed upstream (in the case of mtdata), or finding some other workaround.
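A sketch of the fork workaround (the fork URL and branch are hypothetical placeholders): swap the PyPI dependency for a fork whose metadata relaxes the offending pin until the fix lands upstream:

```bash
# Replace the PyPI package with a hypothetical fork that loosens
# the `requests` constraint; URL and branch are placeholders.
poetry remove mtdata
poetry add "git+https://github.com/your-org/mtdata.git#relax-requests-pin"
```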
I'm fully +1 on the above recommendations. The only thing I might suggest, as a non-blocking enhancement, is encouraging the use of Docker locally as much as possible. (Even if your development machine is Linux, you will likely run into issues eventually with differences in system tool versions or other things that cause bustage or confusion.)
I think we have mostly aligned on this, and if we want more changes we can always re-open an issue or discussion for further python changes.
We have both Conda, which was part of the original project, and poetry, which I introduced with some of the CI tooling. We should consolidate on a single tooling solution so we don't hit multiple confusing errors like in #263. I was less familiar with Conda and had trouble getting it working on my machine when I started on the project, so when implementing some CI features I used poetry due to my familiarity with it.
Our python package management should solve the following issues:
Conda

Poetry

Running scripts requires `poetry run`, typically wired through the `Makefile`, and then having to remember to use the `poetry` command correctly locally when switching between different scripts with different dependencies. The `mtdata` dependency requires a specific `requests` version, while other dependencies require a different version of `requests`. There is no way to work around this issue other than forking `mtdata`. See https://github.com/python-poetry/poetry/issues/697. In addition to this issue, when I use poetry on side projects, PyTorch can't be installed as it uses non-standardized module installation. This is somewhat concerning to me since I want to align on the practices in the ML community at large.

pyenv and pip
It would probably be worth filling in this one to consider.
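To help fill that option in, here's roughly what a pyenv-plus-pip workflow looks like (the interpreter version and file names are illustrative), using pip-tools for locking as we already do elsewhere in the repo:

```bash
# Pin an interpreter for the repo and create a plain virtualenv.
pyenv install 3.10.12            # version is illustrative
pyenv local 3.10.12              # writes .python-version
python -m venv .venv
source .venv/bin/activate

# Lock high-level requirements to exact versions with pip-tools.
pip install pip-tools
pip-compile requirements.in      # resolves and writes pinned requirements.txt
pip-sync requirements.txt        # makes the env match the lock exactly
```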