tweag / FawltyDeps

Python dependency checker

[conda] Support dependency declarations from Conda's `environment.yml` files? #452

Open jherland opened 2 months ago

jherland commented 2 months ago

(found while exploring potential Conda support for FawltyDeps, see e.g. #447 for more context)

I'm following the documentation at https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html to see what file formats conda uses to encode dependency declarations.

Specifically, the following sequence of commands:

conda create --name my_conda_project python=3.8
conda activate my_conda_project
conda install requests
conda env export > environment.yml

yields the following environment.yml on my machine:

name: my_conda_project
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - brotli-python=1.0.9=py38h6a678d5_8
  - ca-certificates=2024.7.2=h06a4308_0
  - certifi=2024.7.4=py38h06a4308_0
  - charset-normalizer=3.3.2=pyhd3eb1b0_0
  - idna=3.7=py38h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.2=py38h06a4308_0
  - pysocks=1.7.1=py38h06a4308_0
  - python=3.8.19=h955ad1f_0
  - readline=8.2=h5eee18b_0
  - requests=2.32.3=py38h06a4308_0
  - setuptools=72.1.0=py38h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - urllib3=2.2.2=py38h06a4308_0
  - wheel=0.43.0=py38h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
prefix: /home/jherland/.conda/envs/my_conda_project

As with #450, the dependencies listed here do not follow the format used by pip's requirements.txt files; instead they use a Conda-specific name=version=build format.
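A minimal sketch of how such entries could be split into their parts, assuming the `name=version=build` shape seen in the export above. The `CondaPin` type and `parse_export_entry` function are hypothetical names for illustration, not part of FawltyDeps or Conda:

```python
from typing import NamedTuple, Optional


class CondaPin(NamedTuple):
    """One entry from the `dependencies:` list of an exported environment.yml."""

    name: str
    version: Optional[str]
    build: Optional[str]


def parse_export_entry(entry: str) -> CondaPin:
    """Split a 'name=version=build' string into its (up to three) parts.

    Assumption: fields are separated by single '=' characters, as in the
    `conda env export` output shown above. Richer Conda specs (ranges,
    channel prefixes) are not handled by this sketch.
    """
    parts = entry.strip().split("=", 2)
    name = parts[0]
    version = parts[1] if len(parts) > 1 else None
    build = parts[2] if len(parts) > 2 else None
    return CondaPin(name, version, build)
```

For FawltyDeps' purposes only the `name` field would matter; the version and build fields are pinning details.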

The Conda documentation states that this environment.yml can be used to reproduce the Conda environment with this command: conda env create -f environment.yml

Given that I have only stated requests (and python=3.8) as my real dependencies in this environment, the above file does not reflect these direct/intentional dependencies, but instead appears to pin all (transitive) dependencies to specific versions and build strings. As such, the above file is closer in essence to a poetry.lock file than to a pyproject.toml file.

That said, the Conda documentation has this to say about creating an environment file that is portable across platforms:

If you want to make your environment file work across platforms, you can use the conda env export --from-history flag. This will only include packages that you’ve explicitly asked for, as opposed to including every package in your environment.

Applied to my toy example above, this yields the following environment.yml file:

name: my_conda_project
channels:
  - defaults
dependencies:
  - python=3.8
  - requests
prefix: /home/jherland/.conda/envs/my_conda_project

This is clearly much closer to declaring the direct/intentional dependencies that we want to use as input to FawltyDeps.

The Conda documentation goes on to describe how to create an environment file manually, and this also yields a more minimal/appropriate file for FawltyDeps to use. In my toy example, it would look something like this:

name: my_conda_project
dependencies:
  - python=3.8
  - requests
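As a rough sketch, the declared package names could be pulled out of such a file once it has been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`, which is assumed here and not shown). The function name `conda_dependency_names` is hypothetical, and the spec handling covers only simple cases:

```python
import re
from typing import Any, Dict, List

# Characters that may follow the package name in a Conda dependency spec
# (start of a version constraint, or whitespace before one).
_SPEC_SPLIT = re.compile(r"[\s=<>!~]")


def conda_dependency_names(env: Dict[str, Any]) -> List[str]:
    """Return the package names declared under `dependencies:`.

    `env` is the already-parsed content of an environment.yml file.
    Non-string entries (such as a nested pip section) are skipped.
    """
    names = []
    for entry in env.get("dependencies") or []:
        if not isinstance(entry, str):
            continue
        spec = entry.split("::", 1)[-1]  # drop an optional "channel::" prefix
        name = _SPEC_SPLIT.split(spec.strip(), maxsplit=1)[0]
        if name:
            names.append(name)
    return names
```

For the toy example this would return `python` and `requests`; filtering out `python` itself and other non-Python packages is a separate step.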

I don't know how prevalent the environment.yml file is compared to the weird requirements.txt files described in #450, but I suspect we should consider supporting both if we want to support Conda fully.

Complications

Non-Python packages

Conda project dependencies will often include Python itself, along with non-Python dependencies. These must be properly ignored by FawltyDeps, but doing so correctly may require us to parse all dependencies, and then somehow consult a real Conda environment to deduce which of the dependencies actually provide Python import names.

To that end, there appear to be .json files in the conda-meta/ subdir of the Conda environment that list the files provided by a package, and from here we might be able to deduce which Conda packages correspond to Python packages (e.g. by looking for lib/pythonX.Y/site-packages/... paths), which can then be further mapped into import names.
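The idea above could be sketched roughly as follows. This assumes each conda-meta/*.json record has a `name` key and a `files` list of paths relative to the environment prefix; the function names are hypothetical, and compiled extension modules and namespace packages are ignored for simplicity:

```python
import json
import re
from pathlib import Path
from typing import Dict, Iterable, Set

# Matches e.g. "lib/python3.8/site-packages/requests/__init__.py" (POSIX layout)
# or "Lib/site-packages/requests/__init__.py" (Windows layout).
_SITE_PACKAGES = re.compile(r"^(?:lib/python\d+\.\d+|Lib)/site-packages/(.+)$")


def import_names_from_files(files: Iterable[str]) -> Set[str]:
    """Guess top-level import names from a package's installed file paths."""
    names = set()
    for path in files:
        match = _SITE_PACKAGES.match(path.replace("\\", "/"))
        if not match:
            continue  # not under site-packages, so not importable from there
        first, _, rest = match.group(1).partition("/")
        if rest:
            # A directory directly under site-packages: treat it as a package
            # if its name is a valid identifier (skips *.dist-info and similar).
            if first.isidentifier():
                names.add(first)
        elif first.endswith(".py") and first[:-3].isidentifier():
            names.add(first[:-3])  # single-file module
    return names


def import_names_by_package(conda_meta: Path) -> Dict[str, Set[str]]:
    """Map Conda package name -> import names, for packages that provide any."""
    result = {}
    for meta_file in sorted(conda_meta.glob("*.json")):
        record = json.loads(meta_file.read_text(encoding="utf-8"))
        names = import_names_from_files(record.get("files", []))
        if names:
            result[record.get("name", meta_file.stem)] = names
    return result
```

Packages that install nothing under site-packages (zlib, openssl and so on) simply drop out of the mapping, which is one way to identify the non-Python dependencies.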

Custom package sources

Unsurprisingly, Conda does not use PyPI to find Python packages, but rather has its own system of channels, including default channels and prioritization between channels in order to resolve conflicts.

When resolving Conda package names (and especially in conjunction with --install-deps) we would have to use/understand the same channel system to correctly map Conda dependencies into Conda packages (and from there -> Python packages -> import names).

Possibly the only sensible choice here is to use/run Conda itself to either find an existing local environment - or establish one based on the environment.yml file - and then consult this environment to build our mapping.
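One possible shape for the "find an existing local environment" half of this, sketched under the assumption that `conda env list --json` prints a JSON object with an `envs` list of environment prefixes. The helper names are hypothetical, and matching an environment by the last component of its prefix path is a simplification (it would not find the base environment, for instance):

```python
import json
import subprocess
from pathlib import Path
from typing import List, Optional


def parse_env_list(output: str) -> List[Path]:
    """Extract environment prefixes from `conda env list --json` output."""
    return [Path(prefix) for prefix in json.loads(output).get("envs", [])]


def list_env_prefixes() -> List[Path]:
    """Ask the local conda executable which environments it knows about."""
    proc = subprocess.run(
        ["conda", "env", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_env_list(proc.stdout)


def find_env_by_name(prefixes: List[Path], name: str) -> Optional[Path]:
    """Return the first prefix whose directory name equals `name`, if any."""
    for prefix in prefixes:
        if prefix.name == name:
            return prefix
    return None
```

Given the `name:` field from environment.yml, `find_env_by_name(list_env_prefixes(), "my_conda_project")` would then point at the environment to inspect, or return `None` if it still has to be created.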

Non-obvious interactions with pip?

It appears that Conda projects sometimes also use pip to manage some packages, and according to https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#using-pip-in-an-environment there are some things to be aware of when these tools are combined (see also https://conda.io/projects/conda/en/latest/user-guide/configuration/pip-interoperability.html). It remains to be seen how a combination of conda-installed and pip-installed dependencies can be best navigated and handled by FawltyDeps.
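In environment.yml itself, pip-managed packages typically appear as a nested `pip:` list inside `dependencies:`. A small sketch of separating the two kinds of entries, operating on the already-parsed YAML (parsing is assumed to happen elsewhere, and `split_conda_and_pip` is a hypothetical name):

```python
from typing import Any, Dict, List, Tuple


def split_conda_and_pip(env: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Separate Conda specs from pip requirements in a parsed environment.yml.

    Plain string entries are Conda specs. A mapping entry with a "pip" key
    holds a list of pip requirement strings.
    """
    conda_specs: List[str] = []
    pip_requirements: List[str] = []
    for entry in env.get("dependencies") or []:
        if isinstance(entry, str):
            conda_specs.append(entry)
        elif isinstance(entry, dict):
            pip_requirements.extend(entry.get("pip") or [])
    return conda_specs, pip_requirements
```

The pip requirements could presumably go through the existing requirements.txt handling, while the Conda specs would need the Conda-specific name mapping discussed above.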

ctcjab commented 2 months ago

conda-lock can be used to create a fully-resolved lockfile from an environment.yml that only declares the direct dependencies, much like pip-tools' pip-compile workflow. Here's a good article about it: https://pythonspeed.com/articles/conda-dependency-management/

jherland commented 2 months ago

Thanks! It's important for us to learn what tools are available and in use in the conda community. Also, it's good to see that there are even more tools to help make reproducible conda environments.

As far as FawltyDeps is concerned, the lockfiles produced by tools like conda-lock or pip-compile are not very interesting, as they are designed to capture the full closure of transitive dependencies, and FawltyDeps is only interested in your declaration of direct dependencies. (Passing such a lockfile to FawltyDeps will typically only generate a large list of unused - i.e. transitive - deps.)

When environment.yml is generated by conda env export (without --from-history), I would consider it more of a lockfile than a manually curated declaration of direct dependencies (which really is what FawltyDeps is designed to work with).

Hence, for FawltyDeps to be useful in a conda project with environment.yml, we want this file to only declare the direct dependencies, and not to be the product of conda env export. Do you have a sense as to what is the common practice in conda projects here?
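If FawltyDeps ever wanted to warn about an exported file, one crude heuristic would be to check whether every Conda entry is pinned as `name=version=build`. This is only a sketch with a hypothetical function name. It would miss exports made without build strings, and the `prefix:` key is no help since it appears in both export variants shown earlier:

```python
from typing import Any, Dict


def looks_like_full_export(env: Dict[str, Any]) -> bool:
    """Guess whether a parsed environment.yml came from `conda env export`.

    Heuristic: there is at least one Conda entry, and every one of them has
    the fully pinned 'name=version=build' shape.
    """
    specs = [e for e in env.get("dependencies") or [] if isinstance(e, str)]
    if not specs:
        return False
    return all(s.count("=") == 2 and "==" not in s for s in specs)
```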


To be clear, this situation is somewhat similar to the situation with requirements.txt files in many other Python projects:

Some projects manually curate their direct dependencies in a requirements.txt file, and it is thus a valid input for FawltyDeps. Other projects will use a different file and run e.g. pip-compile to generate a lockfile named requirements.txt. We currently do not differentiate between these two cases, and we instead rely on the user pointing us to the declaration of direct dependencies with --deps.

ctcjab commented 2 months ago

FawltyDeps is only interested in your declaration of direct dependencies

Right, I only mentioned conda-lock to point out that, because it has been the standard tool for producing lockfiles from environment.yml files, it allows environment.yml files to declare only direct dependencies. So I think you can simplify this task by assuming that the user has an environment.yml in which they intend to declare only direct dependencies, and that they use a proper tool like conda-lock to create lockfiles from it (rather than the other ways you mentioned that users might be (mis-)managing dependencies, e.g. conda env export --from-history, which is error-prone).

(It's too bad the conda docs you found don't mention conda-lock. That's either a significant oversight, or they're very out-of-date.)