martius-lab / cluster_utils

https://cluster-utils.readthedocs.io/stable/

How to manage dependency versions? #111

Status: Closed (luator closed this 4 months ago)

luator commented 5 months ago

We already had some discussion in the past about whether we should pin dependency versions or not (not sure where, though; I couldn't find a corresponding issue).

Currently, we have some dependencies pinned exactly (only pandas, actually) and others with a minimum version or no restriction at all. This has now led to an issue, as NumPy 2 got released and is not compatible with pandas 2.0.3.

To reduce the risk of such issues happening, I think we should not pin exact versions but only add min/max requirements if really needed (i.e. if there is some known compatibility issue).
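
As a sketch, the dependencies in `pyproject.toml` could then look roughly like this (version numbers are illustrative, not a tested recommendation):

```toml
# Excerpt of a hypothetical pyproject.toml
[project]
dependencies = [
    "pandas>=2.0",  # minimum version only, instead of the exact pandas==2.0.3 pin
    "numpy<2",      # max constraint only because of the known NumPy 2 incompatibility
]
```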

Explicitly pinning the versions of all our direct dependencies wouldn't actually avoid such issues either, as it also depends on how those dependencies specify their own requirements. For example, pandas doesn't seem to limit the numpy version, so `pip install 'pandas==2.0.3'` in a clean venv doesn't work anymore, because it installs the incompatible numpy 2.0.
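
For reference, a minimal sketch of reproducing this (commands only; the broken result is as described above, and the venv path is just an example):

```sh
# Create a clean virtual environment.
python -m venv /tmp/venv
# Install the exactly pinned pandas; pip resolves numpy freely
# and pulls in numpy 2.0, which pandas 2.0.3 cannot work with.
/tmp/venv/bin/pip install 'pandas==2.0.3'
```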

Due to this, pinning all versions would also not fully ensure reproducibility when reinstalling. Instead, I'd rather add some information to the documentation and instruct users to do something like `pip freeze > requirements.txt` after installation.
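
A rough sketch of what that documentation could tell users to do (the package name here is assumed):

```sh
# Install into a fresh environment as usual.
pip install cluster_utils

# Snapshot the exact versions of everything that was installed,
# including all transitive dependencies.
pip freeze > requirements.txt

# Later, recreate the same environment from the snapshot.
pip install -r requirements.txt
```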

luator commented 5 months ago

Side note: There are more issues with NumPy 2, so independent of what we do with the pandas dependency, we should use `numpy<2` for now.

mseitzer commented 5 months ago

Preface: I am not an expert on Python dependency management.

As developers of a library, we definitely should not pin our dependencies. Instead, we should be as permissive as possible, and only exclude incompatibilities (as you also said above). Downstream users have diverse environments, and we want to be usable in as many as possible.

Our previous discussion on pinning dependencies was in the context of the development dependencies. There the story is different: we are in full control of the environment, so we can (and possibly should) run against pinned dependencies that ensure that our tests are 1) reproducible and 2) run in the same environment for everyone (i.e. devs and CI).
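
As a minimal sketch, the pinned dev environment could be a separate requirements file shared by devs and CI (tool names and versions here are made up for illustration):

```
# requirements-dev.txt: exact pins so everyone tests in the same environment
pytest==8.2.0
ruff==0.4.4
```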

Whether to also pin the library dependencies (not only the dev dependencies) is an additional question. Some projects provide a "known-to-work" environment in the form of a `pip freeze` output that the tests are also run in. This has the advantage of reproducible and standardized environments, but the disadvantage that we are not informed about breakages caused by upstream changes (e.g. the numpy case). If we wanted to be really thorough, the CI would run twice: once against the known environment, once against the "latest" dependencies.
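
Sketched as the install steps such CI jobs might run (the pinned file name is hypothetical):

```sh
# CI job 1: known-to-work environment, fully pinned (e.g. a pip freeze snapshot).
pip install -r requirements-pinned.txt
pip install --no-deps -e .  # don't let the package pull in newer dependencies

# CI job 2: "latest" dependencies, whatever pip resolves today.
pip install -e .
```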

> Explicitly pinning the versions of all our direct dependencies wouldn't actually avoid such issues either, as it also depends on how those dependencies specify their own requirements.

Not sure this is true: at least if you pin all dependencies (i.e. including transitive dependencies), like when running `pip freeze`, you should get a reproducible environment.

> instruct users to do something like `pip freeze > requirements.txt` after installation.

I don't think this is necessary. How downstream users manage their dependencies should not be our concern. Specifically for ML projects, I expect some form of version pinning to be common practice to ensure reproducibility (well, maybe I'm too hopeful there...).

luator commented 5 months ago

So actions to take would be:

- remove the exact pandas pin and use minimum versions, adding max constraints only for known compatibility issues
- add `numpy<2` for now
- pin the development dependencies, so tests run in the same environment for everyone
- possibly maintain a "known-to-work" environment and additionally run the CI against it

I'm a bit unsure about the last point, as it again puts more maintenance effort on us and it's a bit unclear to me how much benefit it would really bring.

> Not sure this is true: at least if you pin all dependencies (i.e. including transitive dependencies), like when running `pip freeze`, you should get a reproducible environment.

I meant only the direct dependencies here (i.e. everything that is listed in `pyproject.toml`).