pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

New feature idea: create a "minimal" pip freeze option #8981

Open danielefundaro opened 4 years ago

danielefundaro commented 4 years ago

What's the problem this feature will solve?

When I create the requirements file with the pip freeze command (pip freeze > requirements.txt) there are many packages, most of which are dependencies of other packages.

Describe the solution you'd like

An additional option like pip freeze --min > requirements.txt which writes only the top-level packages needed in the project. Here is an example.

I create a project and install the tensorflow and keras packages. What I would like is a command that only provides these 2 packages, because they already install the other required packages (their dependencies).

Current situation: pip install keras tensorflow && pip freeze > requirements.txt

requirements.txt file:

absl-py==0.10.0
astunparse==1.6.3
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
freeze==3.0
gast==0.3.3
google-auth==1.22.1
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.32.0
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
Keras==2.4.3
Keras-Preprocessing==1.1.2
Markdown==3.3
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.5.2
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
termcolor==1.1.0
urllib3==1.25.10
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.3.0

With the new feature: pip install keras tensorflow && pip freeze --min > requirements.txt

requirements.txt file:

Keras==2.4.3
tensorflow==2.3.1

Additional context

I wrote a simple Python script that achieves this goal. Maybe this can help with development.

ei8fdb commented 4 years ago

Hi @DanieleFundaro. To understand what the command syntax might need to be, can you help me understand why you would want to do that?

What would you want to do with that "minimal" requirements.txt file?

danielefundaro commented 4 years ago

Hi @ei8fdb

I write apps in Python, and to make them portable for my team I create the requirements.txt file with the pip freeze command. What I would like is an additional command that displays only the packages that I really need to install, without listing all the packages. As I said in the example, keras automatically installs other packages like h5py, scipy, pyyaml, numpy, so I only need the keras package in the list of requirements; the other packages will be installed automatically, since they are in the Keras requires section.

I hope this simple explanation clarifies the idea.

groodt commented 4 years ago

This is probably the opposite of what I would expect from pip. 😄 I would hope that some day pip adds real "lock-file" support for the transitive dependency closure, similar to pipenv or poetry.

Using pip freeze as you currently are is not recommended.

I would recommend using pip-tools to handle your use-case of providing a list of direct dependencies for your team. You would check in the input requirements.in and the output requirements.txt, then feed the requirements.txt into your venv tool or similar.

python -m piptools compile requirements.in \
  --verbose \
  --generate-hashes \
  --output-file requirements.txt

pfmoore commented 4 years ago

@groodt While I am fine with pip gaining some form of lock file support, @DanieleFundaro is using requirements files in what I would consider to be a perfectly reasonable way - requirements files are a tool, with well-defined behaviour, and can be used in whatever way a user chooses to do so. Certainly the request does not count as using pip freeze in a way that is "not recommended".

@DanieleFundaro I assume that with this request you understand that someone using your "minimal" requirements file will get the versions of the main packages that you specified, but may get different versions of any dependencies - and that may (in theory) result in different behaviour? With that proviso, I think this is a reasonable feature request, and one that I've wanted on more than one occasion.

To be clear, I use requirements files in much the way the OP does, to define "what needs to be installed to build this environment" without constraining versions, so that I get a known set of packages. In my case, I don't even specify versions for the top-level requirements, so I would typically remove the version constraints even from the --min version of freeze. But I would definitely use this, in effect as a way to reconstruct the sequence of pip install commands that I used to build an environment.
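A minimal sketch of that last step (dropping the version pins from a --min style list), assuming the input only contains simple name==version lines. The strip_pins helper is hypothetical and is not part of pip.

```python
import re

# Matches the leading project name of a simple requirement line such as "Keras==2.4.3".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def strip_pins(lines):
    """Return only the project names from simple 'name==version' lines.

    Blank lines, comments, and option lines (e.g. '-e ...') are skipped.
    This sketch does not try to handle URLs or environment markers.
    """
    names = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith(("#", "-")):
            continue
        match = _NAME.match(text)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    print("\n".join(strip_pins(["Keras==2.4.3", "tensorflow==2.3.1"])))
```

Installing from the resulting names lets pip pick whatever versions are current, which is the behaviour described above.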

Edit: Just for context, from the pip-tools readme,

A set of command line tools to help you keep your pip-based packages fresh, even when you've pinned them. You do pin them, right? (In building your Python application and its dependencies for production, you want to make sure that your builds are predictable and deterministic.)

That seems perfectly fine to me, but "building your Python application and its dependencies for production" is only one way of using Python and pip, and I would not want to see pip's design ignore the (many) other ways people use them.

groodt commented 4 years ago

That seems perfectly fine to me, but "building your Python application and its dependencies for production" is only one way of using Python and pip, and I would not want to see pip's design ignore the (many) other ways people use them.

Unfortunately this flexibility comes at the cost of determinism: There should be one-- and preferably only one --obvious way to do it. Explicit is better than implicit. In the face of ambiguity, refuse the temptation to guess.

I'm of the opinion that it makes things more confusing for beginners, since the mechanism in use during development does not apply in production. This is in contrast to the approach used with other common tools that one may be familiar with such as Maven (Java), Bundler (Ruby) and Yarn (Javascript).

I feel that the recent improvements towards a more correct and deterministic resolver are a big step forwards for Python.

pfmoore commented 4 years ago

I'm of the opinion that it makes things more confusing for beginners, since the mechanism in use during development does not apply in production.

You're still assuming a "development -> production" workflow, which is not the only workflow that I (as a pip maintainer) want pip to support. Sorry, but I think we're going to have to agree to differ on this matter.

+1 from me on the general form of this feature request.

I feel that the recent improvements towards a more correct and deterministic resolver are a big step forwards for Python.

100% agreed, but I don't see why that precludes this feature.

pfmoore commented 4 years ago

For what it's worth, here's a script that returns the information the OP wanted. Recent versions of Python only, but it should be possible to backport.

import importlib.metadata
from packaging.requirements import Requirement
import re

def normalize(name):
    return re.sub(r"[-_.]+", "-", name).lower()

dependencies = set()
all_packages = set()
for dist in importlib.metadata.distributions():
    name = normalize(dist.metadata["name"])
    all_packages.add(name)
    if dist.requires:
        for req in dist.requires:
            dep = normalize(Requirement(req).name)
            dependencies.add(dep)

top_level = all_packages - dependencies
for name in sorted(top_level):
    print(f"{name}=={importlib.metadata.version(name)}")

And before anyone comments, this is no more an argument for not supporting the OP's request than the fact that you can do

import importlib.metadata

for dist in importlib.metadata.distributions():
    print(f"{dist.metadata['name']}=={dist.version}")

means that pip freeze is not needed. In both cases the script probably solves 90% of the problem, building the functionality into pip is for people who need the remaining 10%.

uranusjr commented 4 years ago

I want to add that it would need a much more concrete definition of “minimal” to work. As mentioned above, the minimalised requirements set would behave “the same” as the current, flattened list. But why would the minimal format be useful, if they are the same? Since this is considered worthwhile to be raised as a proposal, there must be some merits to omitting transitive dependencies, and we must figure them out before committing to avoid designing the wrong thing for the problem (which pip has a history of). To put it another way, what are the problems the current format cannot solve, but the minimal format can?

Considering this from the other side, what is considered “minimal” is often context-dependent. Take OP’s example, it is true that we can minimalise the requirements set to Keras==2.4.3 tensorflow==2.3.1, but how is that more useful than the full requirements list? Does it mean the project only really depends on these two packages (and others only “as a consequence”)? The project may also use requests==2.24.0 for something entirely unrelated to machine learning, but now it’s omitted simply because something else happens to depend on it. Would this be a problem? If so, “minimal” may be a wrong goal and we need to rethink what is actually wanted. And if not, what makes “minimal” more meaningful than the current format, other than pip (somehow arbitrarily) omitting some packages because it thinks they don’t need to be there?
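To make the requests case concrete, here is a small sketch on a toy dependency graph. The edges are illustrative assumptions, not the real metadata of these projects, and top_level is a hypothetical helper using the same "all packages minus dependencies" rule as the script posted earlier.

```python
# Toy dependency graph: each package maps to the packages it requires.
# These edges are made up for illustration only.
requires = {
    "keras": {"numpy", "scipy", "h5py", "pyyaml"},
    "tensorflow": {"numpy", "tensorboard"},
    "tensorboard": {"requests"},
    "requests": set(),
    "numpy": set(),
    "scipy": set(),
    "h5py": set(),
    "pyyaml": set(),
}


def top_level(graph):
    """Packages that no other package in the graph depends on."""
    all_packages = set(graph)
    dependencies = set().union(*graph.values())
    return sorted(all_packages - dependencies)


# Suppose the project imports requests directly, in addition to the ML packages.
directly_used = {"keras", "tensorflow", "requests"}
missing = directly_used - set(top_level(requires))

if __name__ == "__main__":
    print(top_level(requires))  # the "minimal" list
    print(missing)              # direct dependencies the "minimal" list lost
```

In this toy graph requests disappears from the minimal list only because tensorboard happens to require it, which is the ambiguity raised above.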

pfmoore commented 4 years ago

Agreed. For me, the feature is useful for the purpose of rebuilding a description of "what the project depends on". If I were developing a package, that should be in install_requires. If I'm developing an application, it's in a (manually maintained) requirements.txt (or a requirements.in in a pip-tools world). But if I'm doing an adhoc analysis of some data, or writing a script to automate/integrate some tools, or something like that, then it's only really captured in "the virtual environment I used while developing". That's what pip freeze --min would recover, for me. And yes, sometimes it's context dependent, and sometimes it's not quite what I need. But it's always better in practice than pip freeze for that purpose, because dependencies are mostly noise in that context (i.e., the context where I explicitly don't want a lock file).

This is a hard problem, with many possible solutions. There are tools like pip-tools and pipenv that solve it for certain workflows, at the cost of being inappropriate for others. I don't think that pip needs to compete with them ("if you need pip-tools, you know where to find it"). But I do think that pip should enable users to build their own workflows.

If people are adamant that supporting this use case is too awful to contemplate, how about simply sorting the output of pip freeze so that packages are in increasing order of "things that depend on them"? That would be relatively easy to do, and would at least help with the OP's use case.
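A rough sketch of that ordering, assuming "things that depend on them" means the number of installed projects that require a package directly. The helper names are hypothetical, and requirement names are extracted with a simple regular expression rather than a full requirement parser.

```python
import importlib.metadata
import re
from collections import Counter

# Leading project name of a requirement string such as "numpy (>=1.16) ; extra == 'x'".
_REQ_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name):
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_requirements():
    """Map each installed project to the set of project names it requires."""
    graph = {}
    for dist in importlib.metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with broken metadata
        deps = set()
        for req in dist.requires or []:
            match = _REQ_NAME.match(req)
            if match:
                deps.add(normalize(match.group(1)))
        graph[normalize(name)] = deps
    return graph


def order_by_dependents(graph):
    """Sort project names by how many projects in the graph require them directly."""
    counts = Counter(dep for deps in graph.values() for dep in deps)
    return sorted(graph, key=lambda name: (counts[name], name))


if __name__ == "__main__":
    for name in order_by_dependents(installed_requirements()):
        print(name)
```

With this ordering the packages nothing depends on come first, so the top of the output approximates the "minimal" list without dropping anything.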

To clarify further, pip freeze itself is a bit of a weird case. It's not part of pip's core functionality of installing packages, it's more in the "environment management" area. It can easily be implemented in 3rd party code, as I demonstrated above. It's only useful in virtual environments that don't have include-system-site-packages set. So before we start arguing that enhancements to pip freeze are inappropriate, maybe we could be clear on why pip freeze exists at all?

What I will say, though, is that in my own personal experience I have essentially never needed or used pip freeze. I have, however, wanted some variation on the OP's pip freeze --min on many, many occasions. If nothing else, this discussion prompted me to write my own script, which I will now keep and use for the moment, until pip gets the equivalent functionality.

groodt commented 4 years ago

What I will say, though, is that in my own personal experience I have essentially never needed or used pip freeze.

This is a good point. It's probably a big deal to remove it, since many systems in the wild may be correctly / incorrectly relying on it for a "lockfile" style mechanism to reproduce environments. It's convenient that it exists, but it does seem a little out of place I agree.

There are tools such as pipdeptree or deptree that have a lot of nice features such as machine readable json, graphviz support etc.


I have, however, wanted some variation on the OP's

Let's see what the OP thinks. @DanieleFundaro are you happy working with a pip-tools style workflow where you keep your direct dependencies in a requirements.in file (either version pinned or not) and then use piptools compile to create a fully resolved transitive-dependency closure file in the requirements.txt that can be distributed to others or CI to reproduce your environment deterministically?


Not to change topic too much, but I think this problem stems from pip being a low-level tool with some "project" level facilities, but not quite enough to be useful. It's a package installer, not a package manager. I feel that many people these days expect package managers and pip isn't one, and my perception is that this distinction is not widely communicated or understood.

I will say, that if pip was closer in spirit to tools such as yarn or bundler, then doing something like pip add requests would also modify the requirements.in file. Yes, I know pipenv and poetry do this. Having this functionality inside pip would support your adhoc development use-case as well as the common use-case of development -> production environment reproducibility.

pfmoore commented 4 years ago

I will say, that if pip was closer in spirit to tools such as yarn or bundler, then doing something like pip add requests would also modify the requirements.in file. Yes, I know pipenv and poetry do this. Having this functionality inside pip would support your adhoc development use-case as well as the common use-case of development -> production environment reproducibility.

Sigh. No it wouldn't. You're still making assumptions about my workflow. Many times, I'm either knocking up a simple script or doing some interactive experimentation in a directory that's either shared (a "scratch" directory) or is a project directory for a totally unrelated project. In those cases, I may be using a temporary venv, or one that's stored somewhere completely different than the "project" venv for that directory. Where would the requirements.in file that you suggest pip manages for me go in that case? (And no, it can't go in the venv, as that would break other uses like --target which aren't associated with a venv).

Please assume that I know what I'm talking about when I say that neither a pip-tools style workflow, nor a pipenv style workflow, suits my needs, and when I say that I would find the OP's suggestion useful for my workflow. Arguments about whether my needs are sufficient to justify adding a feature to pip are fine (I never said they were), but arguments about whether I'm using pip "correctly" are not acceptable, and frankly are becoming a little annoying. I'm a pip maintainer, and I think I have a good understanding of what "correct" usage of pip is (hint: it's anything at all, as long as it just uses pip's documented features).

That's the thing with low-level tools like pip, you can't make any assumptions about how people use them.

some "project" level facilities, but not quite enough to be useful.

Fair point, and while I don't personally want to add more project-level facilities, I'm not going to block them if someone does. I will, however, fight hard against any suggestion that pip "bless" any particular usage as the "correct" one, with the intention of desupporting or deprecating other workflows.

danielefundaro commented 4 years ago

@groodt honestly, I've never used pip-tools, because I've always been more or less comfortable with other tools. This does not mean that I cannot learn other tools, of course.

Agreed. For me, the feature is useful for the purpose of rebuilding a description of "what the project depends on". If I were developing a package, that should be in install_requires. If I'm developing an application, it's in a (manually maintained) requirements.txt (or a requirements.in in a pip-tools world). But if I'm doing an adhoc analysis of some data, or writing a script to automate/integrate some tools, or something like that, then it's only really captured in "the virtual environment I used while developing". That's what pip freeze --min would recover, for me.

This is also why I thought about this suggestion. Sometimes I develop tools that I later have to document, and when I go back to direct dependencies it becomes really hard. The only way, which is not very scalable, was to manually pin them in an external file.

groodt commented 4 years ago

@pfmoore

Please assume that I know what I'm talking about when I say that neither a pip-tools style workflow

I'm not at all suggesting you don't have a superior understanding of pip. This was not my intention at all. Please accept my apologies.


@DanieleFundaro

go back to direct dependencies it becomes really hard

Yes. I feel your pain. I used to suffer with the very same struggles for years and years and it was one of the things that almost drove me away from Python entirely. This is why I really wish that the built-in tool directed people towards simpler, more reliable methods.

Not to belabour the point, but please hear me out.

If, hypothetically, when adding new dependencies during development of an application or script or scratch experiment, one was directed to a mechanism such as pip add <pkg_name> [--output-file [default: requirements.in]] then one would always be maintaining a list of one's direct dependencies as one developed their script or application. Yes, there is a price to pay for working this way. I'm not suggesting there isn't. This mechanism does mean that inside every "project" folder there is one or more requirements.in files with direct dependencies listed. It does mean that alongside every "script" there needs to be such a file(s). It does mean that inside every "scratch" folder there may be numerous files with various names e.g. requirements.tensorflow-idea-1.in etc.
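To illustrate only the record-keeping half of that idea, here is a minimal sketch. pip has no add command; add_requirement is a hypothetical helper, and installing the package would still be a separate pip install call.

```python
from pathlib import Path


def add_requirement(requirement, path="requirements.in"):
    """Append `requirement` to the file at `path` unless it is already listed.

    Returns True if the file was changed. This only keeps the record of
    direct dependencies; it does not install anything.
    """
    target = Path(path)
    existing = target.read_text().splitlines() if target.exists() else []
    if requirement in (line.strip() for line in existing):
        return False
    existing.append(requirement)
    target.write_text("\n".join(existing) + "\n")
    return True
```

Calling add_requirement("requests") twice leaves a single requests line in requirements.in, so the file stays a plain list of what was asked for directly.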

Yes, it can get messy. Yes, these files are not linked to any particular venv. What they do however is to keep a simple record of the steps taken and dependencies that were directly added during development or experimentation. Once there is a collection of these file(s), one can use them to synchronise the transitive dependency closure into any particular venv. In a sense, the venvs are truly disposable and can be recreated or synched from any particular requirements.in style file at any time.

In any case, pip does not work like this and nor does it need to. Neither does anyone in this thread. I'm only speaking from experience that it feels like so, so, so many people are falling into dependency-hell by default, instead of having dependency-heaven by default with the option to take the child-locks off.

I guess as beginners become more aware of things such as pip-tools, pipenv or poetry then the problem is diminished. I'm not sure exactly why, but it seems that most people I see are still doing pip install X, pip install Y and then getting themselves into a pickle (sic). I wish there was a better default.

pradyunsg commented 4 years ago

TBH, I see what you're asking for but I don't think it's possible for pip to switch to a lockfile-by-default model without a significant transition period, even if we wanted to.

As @pfmoore notes, there are lots of valid workflows wherein the benefits of having reproducibility (and all the other nice things that having a lockfile gives you) don't really matter and forcing those users to change their workflow would be... well, let's just say that's off the table because I like not being on the receiving end of lots of angry users. :)

OTOH, enabling these more reproducible workflows within pip, as an opt-in "i want lockfiles", is something I can get behind tho.

The exact shape is up in the air right now, obviously. FWIW, I don't see why we can't treat requirements.txt files as a source of "things to lock" that generates a lockfile, which we can start using. Then, once the design is finalized + stabilized + battle-tested, we'd slowly move users of pip freeze to it. All of this is a lot of work though, and IDK if there's any good solution in the short term.

Mark-Joy commented 2 years ago

sigh... As a simple-minded Python beginner who doesn't care about dependencies at all, I would like to see this feature supported by pip freeze and pip list. Let us not worry about whether users use pip the recommended way or not. To me this is a simple and somewhat useful feature. How did a simple feature for simple users turn into an argument about lock-file mechanisms, which are for quite advanced users? As a result, a useful feature is blocked and ditched.

pfmoore commented 2 years ago

@Mark-Joy Did you try the script I posted earlier in this thread? In what way does it not do what you want? This feature isn't "blocked and ditched", it's simply not had enough interest for someone to want to continue the debate and/or submit a PR for it (the discussion was also derailed a bit by a digression into a lockfile-type solution, but you can ignore those comments if you're not interested in lockfiles).

Mark-Joy commented 2 years ago

In what way does it not do what you want?

@pfmoore The script as well as pip-chill can only show package and package's version. pip freeze can show other information such as editable installed packages or locally installed packages, and supports many other options as well.
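For the editable and local cases, a script can get part of the way by reading the direct_url.json file that PEP 610 defines. A minimal sketch, assuming the installer recorded that file; describe_origin is a hypothetical helper, and treating a missing file as an index install is an assumption.

```python
import importlib.metadata
import json


def describe_origin(direct_url_text):
    """Classify an install from the text of its direct_url.json, if any."""
    if direct_url_text is None:
        return "index"  # no direct_url.json: assumed to be a normal index install
    info = json.loads(direct_url_text)
    if info.get("dir_info", {}).get("editable", False):
        return "editable"
    if "vcs_info" in info:
        return "vcs"
    if info.get("url", "").startswith("file:"):
        return "local"
    return "url"


if __name__ == "__main__":
    for dist in importlib.metadata.distributions():
        origin = describe_origin(dist.read_text("direct_url.json"))
        if origin != "index":
            print(f"{dist.metadata['Name']} ({origin}): {dist.version}")
```

This still does not reproduce the exact lines pip freeze prints for such packages or its other options; it only shows where the information lives.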

Mark-Joy commented 2 years ago

@pfmoore I've just submitted a PR #10681 for this issue

khaerensml6 commented 2 years ago

Just want to chime in and say that on more than one occasion I've badly wanted this feature. In my company we work with private pip packages, when installing those having the full pip freeze output, the dependency resolver takes ages because it's checking dependencies of all low-level dependencies. In general a lot of people in my company want this feature.

pradyunsg commented 2 years ago

@khaerensml6 Your company is welcome to spend their engineering time to contribute this feature to pip.

We’ve already shared how to do this without depending on pip: https://github.com/pypa/pip/issues/8981#issuecomment-707051457

sbidoul commented 2 years ago

Pip now has pip inspect which should make it even easier to implement the behaviour described in the OP.
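A minimal sketch of how that could look, assuming the JSON report has an "installed" list whose entries carry a "metadata" object (with "name" and "version") and a "requested" flag, as described in the pip inspect documentation. Both helper names are hypothetical.

```python
import json
import subprocess
import sys


def requested_pins(report):
    """Return 'name==version' for every entry marked as requested in the report."""
    pins = []
    for item in report.get("installed", []):
        if item.get("requested"):
            meta = item["metadata"]
            pins.append(f"{meta['name']}=={meta['version']}")
    return sorted(pins, key=str.lower)


def run_pip_inspect():
    """Run `pip inspect` for the current interpreter and parse its JSON output."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "inspect"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Usage (needs a pip version that provides the inspect command):
#     print("\n".join(requested_pins(run_pip_inspect())))
```

Note that this selects packages by the requested flag rather than by "nothing depends on it", so a package that was installed explicitly and is also a dependency of something else stays in the list.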

davidgilbertson commented 2 years ago

@pfmoore thanks for the code snippet. I have made a variant that directly looks for the REQUESTED flag. E.g. so if you install numpy, then install something else that requires numpy, numpy would still show up in the list. Not better, just different :)

import importlib.metadata

for dist in importlib.metadata.distributions():
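    # dist.files is None when the distribution has no file listing, so fall back to an empty list
    requested_marker = [path for path in (dist.files or []) if path.match('*dist-info/REQUESTED')]
    if requested_marker:
        print(f"{dist.name}=={dist.version}")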

Logic copied somewhat from the pip inspect code. As noted in the docs for inspect, the REQUESTED file is only added by pip since 20.2.

uranusjr commented 2 years ago

Instead of using files, it’s more robust to use something like this

def requested(dist):
    try:
        dist.read_text("REQUESTED")
    except OSError:
        return False
    return True

davidgilbertson commented 2 years ago

@uranusjr noted, but that would return True if the file doesn't exist. What about this:

def requested(dist):
    try:
        return dist.read_text("REQUESTED") is not None
    except OSError:
        return False

uranusjr commented 2 years ago

Hmm really? I thought it raises FileNotFoundError instead. Do what works for you.
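A sketch that covers both behaviours discussed above, assuming read_text returns None for a missing file on the standard path-based distributions while other implementations may raise OSError. list_requested is a hypothetical helper. The check is "is not None" rather than truthiness because the REQUESTED file is normally empty, so its text is an empty string.

```python
import importlib.metadata


def requested(dist):
    """True if the distribution has a REQUESTED marker file in its metadata."""
    try:
        return dist.read_text("REQUESTED") is not None
    except OSError:
        return False


def list_requested():
    """Return 'name==version' for every installed distribution marked REQUESTED."""
    pins = set()
    for dist in importlib.metadata.distributions():
        name = dist.metadata.get("Name")
        if name and requested(dist):
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(list_requested()))
```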