mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Consolidate on how we run python #264

Closed: gregtatum closed this issue 7 months ago

gregtatum commented 1 year ago

We have both Conda, which was part of the original project, and Poetry, which I introduced with some of the CI tooling. We should consolidate on a single tooling solution so we don't get multiple confusing errors like in #263. I was less familiar with Conda and had trouble getting it working on my machine when I started working on the project, so when implementing some CI features I used poetry due to my familiarity with it.

Our Python package management should solve the following issues:

  1. Provide a lock file for dependency management.
  2. Create a virtual environment so that modules can be installed in a predictable way.
  3. Work locally when running our utils.
  4. Work in CI and in Taskcluster.
  5. Supporting pyproject.toml is a plus.
  6. In practice, it should allow us to install and manage all of our dependencies.

Conda

  1. There is a plugin for this: https://github.com/conda/conda-lock (see the sketch after this list).
  2. This is supported.
  3. I had some trouble getting it set up, but maybe that was just me.
  4. This is what's used in Taskcluster, but not CI.
  5. I believe this is supported, but I'm not sure how good the support is.
  6. Conda is widely used in ML-related work, so it is a very practical solution. It's one of the few package managers that supports PyTorch workflows, though we aren't using PyTorch here.
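
For point 1, a minimal sketch of the conda-lock workflow; the environment file name, the platform list, and the env name are assumptions, not taken from this repo:

```bash
# conda-lock is a separate package, not part of conda itself.
pip install conda-lock

# Resolve environment.yml into a multi-platform lock file (conda-lock.yml).
conda-lock lock --file environment.yml --platform linux-64 --platform osx-64

# Create an environment from the lock file; the env name is hypothetical.
conda-lock install --name translations conda-lock.yml
```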

Poetry

  1. Yes, by default.
  2. Yes, with poetry run (see the sketch after this list).
  3. Yes, although it's a little awkward to put things in the Makefile and then have to remember to use the poetry command correctly locally when switching between scripts with different dependencies.
  4. CI uses it, but Taskcluster does not.
  5. Yes, by default.
  6. I'm running into some issues with Poetry's dogmatic, standards-based approach. For instance, the mtdata dependency requires a specific requests version, while other dependencies require a different version of requests. There is no way to work around this other than forking mtdata; see https://github.com/python-poetry/poetry/issues/697. In addition, when I use poetry on side projects, PyTorch can't be installed because it uses non-standardized module installation. This is somewhat concerning to me since I want to align with the practices of the ML community at large.
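
Against points 1–4 above, a minimal sketch of the poetry workflow; the util script path is hypothetical:

```bash
# Resolve pyproject.toml, write/consume poetry.lock (point 1), and install
# everything into a managed virtualenv (point 2).
poetry install

# Run a util inside that virtualenv (point 3); the script name is made up.
poetry run python utils/my_util.py

# Export a pinned requirements.txt for environments that only speak pip,
# e.g. CI or Taskcluster images (point 4).
poetry export --format requirements.txt --output requirements.txt
```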

pyenv and pip

It would probably be worth filling this one in for consideration.
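
A hedged sketch of what that option might look like, using pip-tools for locking; the Python version and file names are assumptions:

```bash
# Pin the interpreter per-project; pyenv writes a .python-version file.
pyenv install 3.10.11
pyenv local 3.10.11

# Point 2: a plain virtual environment.
python -m venv .venv
source .venv/bin/activate

# Point 1: a lock file via pip-tools; requirements.in lists top-level deps.
pip install pip-tools
pip-compile requirements.in          # writes a pinned requirements.txt
pip install -r requirements.txt
```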

eu9ene commented 1 year ago

We actually use even more ways of managing environments:

For python packages:

  1. pip-compile + pip install of packages for a specific step, for example pipeline/clean/requirements/clean.txt and taskcluster/ci/clean-corpus/kind.yml (see the sketch after this list)
  2. Installing packages in a Dockerfile, for example taskcluster/docker/train/Dockerfile and taskcluster/docker/base/Dockerfile. I don't know if those are still being used, though, since we switched some of the steps to generic workers.
  3. Running a separate script for kenlm or installing it through a toolchain
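
A sketch of the per-step flow in point 1; the clean.in input file is an assumption based on pip-tools conventions:

```bash
pip install pip-tools

# Re-pin the step's dependencies from its top-level list.
pip-compile pipeline/clean/requirements/clean.in \
    --output-file pipeline/clean/requirements/clean.txt

# What the task then does at run time.
pip install -r pipeline/clean/requirements/clean.txt
```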

And a bunch of ways to manage Linux packages:

  1. Singularity
  2. Docker
  3. Generic worker
  4. Toolchains
  5. Installing some Linux packages with a script in a container image

Some of it is Snakemake legacy.

As for conda vs poetry, I have used both extensively, and poetry is more user friendly and easier to install and run, both in a local env and in a script. I did encounter an issue with poetry groups recently, though, when they conflicted. I would assume they should work separately; if not, we might need multiple poetry envs in the same way as we have for conda or pip-compile.
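
For context, poetry resolves all groups together into a single poetry.lock, which is why conflicting groups clash at lock time. A sketch of how groups are declared and installed; the group names are hypothetical:

```bash
# Declare dependencies under [tool.poetry.group.<name>] in pyproject.toml.
poetry add --group lint black flake8
poetry add --group tests pytest

# Install the main dependencies plus selected groups...
poetry install --with lint,tests

# ...or a single group in isolation. Either way, all groups must resolve
# together when the lock file is generated.
poetry install --only lint
```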

Conda usually provides more reproducibility and the ability to install other software, not only Python (for example cmake). Like you mentioned, it's easier to install some scientific packages, but it's always a pain to activate a conda env in a script, which is why I added all those unpleasant CONDA_ACTIVATE commands right before running things. Locking the env is also possible, but it's a bit less native to conda and a bit less convenient than poetry lock. We currently use only high-level dependencies in our conda envs. Also, regular conda is incredibly slow and can hang for 10 minutes locking an env, which is why people switch to Mamba. Conda/Mamba was the only choice for Snakemake and worked fine with Singularity, but was not user friendly to run locally. Snakemake was also great at reusing those conda envs without reinstalling any packages each time like we do on TC.
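
For reference, activating a conda env from a non-interactive script requires sourcing conda's shell hook first, which is roughly what those CONDA_ACTIVATE commands have to do; the env name and step are hypothetical:

```bash
# 'conda activate' is a shell function, so a bare script can't call it until
# conda's hook has been sourced.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate translations

python pipeline/clean/clean_corpus.py   # hypothetical step
```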

It would be great to settle on one approach for Taskcluster. My suggestion would be to have an optional Docker image for each step: if any packages are required, either add them to the base image or override them in the step image, so that we don't need to install anything when running the step. Once you have a Linux env set up under Docker, pip/poetry with locking should be enough, because you can also preinstall things in the image in rare cases like PyTorch.
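
A sketch of that layering using the Dockerfiles mentioned above; the image tags and the step command are hypothetical:

```bash
# Build the shared base image once...
docker build -f taskcluster/docker/base/Dockerfile -t translations-base .

# ...then a per-step image on top of it, with that step's packages baked in.
docker build -f taskcluster/docker/train/Dockerfile -t translations-train .

# The step then runs with nothing to install at task time.
docker run --rm translations-train make train
```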

As for Snakemake, we can just move everything related to it to a separate directory and leave maintenance up to contributors. It's already pretty hard to sync the pipeline with the latest changes and it's not tested anyway.

eu9ene commented 1 year ago

Another perspective on all this is how easy it is to test the pipeline steps locally. Right now it's not easy at all, because you might need some Linux package and also have to compile Marian. With pre-built Docker images it would become much easier, assuming we publish them to a public Docker registry. To simplify further, we could even reuse a single Docker image that has everything compiled and installed.
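
Locally, testing a step could then look something like this; the registry path and the entry point are hypothetical:

```bash
# Mount the checkout into a published image that already has Marian and all
# the Linux packages compiled and installed.
docker run --rm -it \
    -v "$(pwd)":/translations -w /translations \
    mozilla/translations-pipeline:latest \
    bash pipeline/clean/clean-corpus.sh
```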

gregtatum commented 1 year ago

To re-state, the recommendation would approximately be:

  1. Locally
  2. CI
  3. Training runs
  4. Clean-up (ignore snakemake)

If we run into dependency issues like the one I ran into above, we can work around them by relying on a fork with a fixed dependency declaration, by getting it fixed upstream (in the case of mtdata), or by finding another way around the issue.
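
A sketch of the fork workaround with poetry; the fork URL and branch name are hypothetical:

```bash
# Point the dependency at a fork whose requests pin has been relaxed.
poetry add git+https://github.com/example/mtdata.git#relaxed-requests
```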

bhearsum commented 1 year ago

I'm fully +1 on the above recommendations. The only thing I might suggest as a non-blocking enhancement is to encourage the use of Docker locally as much as possible. (Even if your development machine is Linux, you will likely eventually run into issues with differences between system tool versions or other things that can cause bustage or confusion.)

gregtatum commented 7 months ago

I think we have mostly aligned on this, and if we want more changes we can always re-open an issue or discussion for further python changes.