Avoid full path enumeration on import of setuptools or pkg_resources?

ghost commented 8 years ago

Originally reported by: konstantint (Bitbucket: konstantint, GitHub: konstantint)

At the moment on my machine, it takes about 1.28 seconds to do a bare import pkg_resources, 1.47 seconds to do a bare import setuptools, 1.36 seconds to do a bare from pkg_resources import load_entry_point and 1.25 seconds to do a bare from pkg_resources import load_entry_point.
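Timings like these can be reproduced with a small script. The following is a minimal sketch, not the reporter's actual method: the helper name `time_import` is an assumption made for illustration, and the default module is a stdlib one so the sketch runs anywhere. CPython 3.7+ also offers `python -X importtime -c "import pkg_resources"` for a per-module breakdown.

```python
import importlib
import time

# Hypothetical list for illustration; replace "json" with "pkg_resources"
# or "setuptools" to reproduce the measurements discussed in the report.
MODULES = ["json"]


def time_import(module_name: str) -> float:
    """Return the wall-clock seconds spent importing ``module_name``.

    Only the first import in a process is meaningful: later imports are
    served from ``sys.modules`` and return almost immediately.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    for name in MODULES:
        print(f"import {name}: {time_import(name):.3f}s")
```

Run it in a fresh interpreter each time, since a warm filesystem cache and already-imported modules both make the numbers look better than a cold start.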

This obviously affects all of the python scripts that are installed as console entry points, because each and every one of them starts with a line like that. In code which does not rely on entry points this may be a problem whenever I want to use resource_filename to consistently access static data.

I believe this problem is decently common, yet I did not find any issue or discussion, hence I'm creating one, hoping I'm not repeating what has been said already elsewhere unnecessarily.

I am using Anaconda Python, which comes along with a fairly large package, alongside several of my own packages, which I commonly add to my path via setup.py develop; however, I do not believe this setup is anything out of the ordinary. There are 37 items on my sys.path at the moment. Profiling import pkg_resources shows that this leads to 76 calls to workingset.add_entry (timing at about a second), of which most of the time is spent in 466 calls to Distribution.from_location.
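A profile of this kind can be collected with the standard library alone. This is a sketch under the assumption that cProfile is an acceptable stand-in for whatever profiler was actually used; `profile_import` is a hypothetical helper name.

```python
import cProfile
import importlib
import io
import pstats


def profile_import(module_name: str, top: int = 10) -> str:
    """Profile a single import and return the ``top`` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        importlib.import_module(module_name)
    finally:
        profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sorting by cumulative time surfaces callers that fan out into many
    # cheap calls, which is the pattern described in the report.
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    # "colorsys" is only a placeholder; use "pkg_resources" to inspect the
    # import discussed in this thread.
    print(profile_import("colorsys"))
```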

Obviously, the reason for the problem lies in the two _call_aside methods at the end of pkg_resources which lead to a full scan of the python path at the moment when the package is imported, and the only way to alleviate it would be to somehow avoid or delay the need for this scan as much as possible.

I see two straightforward remedies: a) Make the scanning lazy. After all, if all one needs is to find a particular package, the scan could stop as soon as the corresponding package is located. At the very least this would allow me to "fix" my ipython loading problem by moving it up in the path. This might break some import rules which do not respect the precedence of the path, of which I'm not aware. b) Cache a precomputed index and update it lazily. Yes, this might require some ad-hoc rules for resolving inconsistencies, and this may lead to ugly conflicts with external tools that attempt to install multiple versions of a package, but this will basically avoid the current startup delay in 99.99% of cases and solve so many of my problems that I'd be willing to pay the price.
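Remedy a) can be illustrated with a generator that walks the path in order and stops at the first match. This is a simplified sketch and not how pkg_resources is implemented: the function names are hypothetical, only `*.dist-info` and `*.egg-info` names in plain directories are considered, and zip files or `.egg-link` files are ignored.

```python
import os
from typing import Iterable, Iterator, Optional

METADATA_SUFFIXES = (".dist-info", ".egg-info")


def _normalize(name: str) -> str:
    # Crude normalization so "Foo-Bar" and "foo_bar" compare equal.
    return name.lower().replace("-", "_").replace(".", "_")


def iter_metadata(paths: Iterable[str]) -> Iterator[str]:
    """Yield metadata paths lazily, one path entry at a time."""
    for entry in paths:
        try:
            names = sorted(os.listdir(entry))
        except OSError:
            continue  # missing directories and non-directories are skipped
        for name in names:
            if name.endswith(METADATA_SUFFIXES):
                yield os.path.join(entry, name)


def find_first(project: str, paths: Iterable[str]) -> Optional[str]:
    """Return the first metadata path for ``project``, honouring path order."""
    wanted = _normalize(project)
    for meta in iter_metadata(paths):
        stem = os.path.basename(meta).rsplit(".", 1)[0]  # e.g. "foo-1.0"
        if _normalize(stem.split("-", 1)[0]) == wanted:
            return meta  # later path entries are never listed
    return None
```

Because `iter_metadata` is a generator, path entries after the first hit are never read from disk, which is why moving a package earlier on the path would shorten the scan.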

Although both options might seem somewhat controversial, the problem itself seems to be serious enough to deserve at least some fix eventually (for example, I've recently discovered I'm reluctant to start ipython for short calculations because of its startup delay which I've now tracked back to this same issue).

I'm contemplating making a separate utility, e.g. fast_pkg_resources, which would implement the strategy b) by simply caching calls to pkg_resources in an external file, yet I thought of raising the issue here to figure out whether someone has already addressed it, whether there are plans to do something about it in the setuptools core codebase, or perhaps I'm missing something obvious.
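The caching idea in strategy b) might look roughly like the sketch below. Nothing here comes from an actual fast_pkg_resources project: the function names, the JSON cache layout, and the choice of directory mtime as the invalidation signal are all assumptions made for illustration.

```python
import json
import os
from typing import Dict, Iterable, List


def scan_entry(entry: str) -> List[str]:
    """List the metadata names found directly inside one path entry."""
    return sorted(
        name for name in os.listdir(entry)
        if name.endswith((".dist-info", ".egg-info"))
    )


def load_index(paths: Iterable[str], cache_file: str) -> Dict[str, List[str]]:
    """Return {path entry: metadata names}, rescanning only changed entries."""
    try:
        with open(cache_file, encoding="utf-8") as fh:
            cache = json.load(fh)
    except (OSError, ValueError):
        cache = {}  # no cache yet, or an unreadable one: start from scratch

    index: Dict[str, List[str]] = {}
    for entry in paths:
        try:
            mtime = os.stat(entry).st_mtime_ns
        except OSError:
            continue  # path entry does not exist
        cached = cache.get(entry)
        if cached is not None and cached["mtime"] == mtime:
            index[entry] = cached["names"]  # directory unchanged: reuse
        else:
            index[entry] = scan_entry(entry)  # new or modified: rescan
        cache[entry] = {"mtime": mtime, "names": index[entry]}

    with open(cache_file, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)
    return index
```

A directory's mtime changes when entries are added or removed, but not when a file inside an existing metadata directory is edited, so a real implementation would need the "ad-hoc rules for resolving inconsistencies" mentioned above.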

Bitbucket: https://bitbucket.org/pypa/setuptools/issue/510

jaraco commented 5 years ago

Do we require [validation of all packages] for console entry points too?

Yes - for console scripts generated by easy_install or for entry points processed by pkg_resources.

However, as you point out, console scripts generated by pip install don't have this characteristic.

Furthermore, entry points processed by entrypoints or importlib_metadata also do not have this characteristic.

Given the difficulty and risk of addressing this issue within pkg_resources, I've been focusing my efforts on making importlib_metadata, which is planned to become importlib.metadata in the stdlib, a suitable replacement for 99% of the use-cases for which applications currently rely on pkg_resources. After that transition, I'd expect setuptools might be able to rely on importlib_metadata or pkg_resources could become a private, internal implementation detail of setuptools.

That doesn't necessarily preclude someone from attempting to make pkg_resources faster and more robust for these use-cases. It just means I'm focusing my efforts elsewhere (on this topic).

cjw296 commented 4 years ago

Sorry, hard to follow above, but where has this issue ended up? I got here through a bug report on one of my libraries (https://github.com/Simplistix/configurator/pull/6) but I'm certainly interested in the wider issue.

Am I right in guessing that this is basically because at startup, pkg_resources is going to go scan for entrypoints in all installed packages? If so, this results in two problems:

Console scripts use entry points and so will be massively slow to start up. Could this also be because pkg_resource appears to do some pseudo-dependency-resolution stuff? I'd love if all of this could go away and be replaced with a simple import statement.
Anything that uses entrypoints is going to be massively slow when lots of packages are installed or the filesystem serving the python code is slow. HPC environments often have both of these ;-)

IIUC, I'd see the best solution to have entrypoints collected in a central location as part of the package installation process, rather than being collected at runtime, but would that have to be a PEP nowadays?

gaborbernat commented 4 years ago

IIUC, I'd see the best solution to have entrypoints collected in a central location as part of the package installation process, rather than being collected at runtime, but would that have to be a PEP nowadays?

This would make this a pip feature request, and at the very least a PEP would need to be formulated for this that deals with how this central location is maintained as far as CRUD operations and parallel interactions go. In theory, this could be done though. Tagging @pfmoore for thoughts. And then it's not clear how one could actually handle altering the sys.path at startup/runtime, which can extend/change this central database. Maybe every sys.path entry can have a distributions.sqlite file that basically contains everything under *.dist-info, and if not, fall back to directory discovery as is done today.
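To make the per-path-entry database idea concrete, here is a rough sketch. No such file or schema is standardised: the table layout and the function names `build_index` and `lookup` are invented for illustration, and only `entry_points.txt` is indexed rather than everything under `*.dist-info`.

```python
import configparser
import os
import sqlite3
from typing import List, Tuple


def build_index(path_entry: str, db_path: str) -> None:
    """(Re)build a table of entry points for one sys.path entry."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS entry_points")
        conn.execute(
            "CREATE TABLE entry_points (dist TEXT, grp TEXT, name TEXT, value TEXT)"
        )
        for meta in sorted(os.listdir(path_entry)):
            ep_file = os.path.join(path_entry, meta, "entry_points.txt")
            if not meta.endswith(".dist-info") or not os.path.isfile(ep_file):
                continue
            # entry_points.txt is INI-like; "=" is the only delimiter so that
            # values such as "pkg.mod:func" are not split on the colon.
            parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
            parser.optionxform = str  # keep entry point names case-sensitive
            parser.read(ep_file, encoding="utf-8")
            for group in parser.sections():
                for name, value in parser.items(group):
                    conn.execute(
                        "INSERT INTO entry_points VALUES (?, ?, ?, ?)",
                        (meta, group, name, value),
                    )
        conn.commit()
    finally:
        conn.close()


def lookup(db_path: str, group: str) -> List[Tuple[str, str]]:
    """Return (name, value) pairs for one entry point group."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name, value FROM entry_points WHERE grp = ? ORDER BY name",
            (group,),
        )
        return rows.fetchall()
    finally:
        conn.close()
```

With such an index, a lookup costs one file open per sys.path entry instead of one per installed distribution. The sketch deliberately leaves out the hard parts raised above: keeping the database in sync with installs and uninstalls, and concurrent writers.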

cjw296 commented 4 years ago

@gaborbernat - thanks for the quick response, to try and explain more about the environments where I think this causes problems:

huge numbers of python packages installed (think your average data science stack, or something like the Anaconda distribution)
only a few packages used by any one script.

I would suggest that most people would be happy for entry points to not take into account modification of sys.path in terms of finding entry points in return for making entrypoints scale. I'd be sad, but of course would have to accept, if empirical evidence suggests otherwise.

I'm not sure a database per sys.path would make things better, I was thinking more along the lines of one data store per python installation/virtualenv/etc.

gaborbernat commented 4 years ago

I would suggest that most people would be happy for entry points to not take into account modification of sys.path in terms of finding entry points in return for making entrypoints scale. I'd be sad, but of course would have to accept, if empirical evidence suggests otherwise.

While most might be ok, I don't think we can design something that would drop a feature supported for now. IMHO the only way I can see this working at the moment is tying it to the sys.path. This could decrease the disk access roughly from 1000 to 10, which IMHO could be enough.

pfmoore commented 4 years ago

Thanks for tagging me here @gaborbernat. I'll answer this with my "interoperability BDFL" hat rather than my "pip maintainer" hat, as I think that's probably more appropriate (see later for why).

Requiring installers to log entry points in a central location is definitely something that would need to be standardised via a PEP (essentially, it's an extension of PEP 376 - Database of Installed Python Distributions). As tools like setuptools couldn't rely on that database unless they were sure all installers would maintain it, a standard is the correct approach here. (And with my pip maintainer hat on, pip would only implement something like this if it were backed by a standard).

One point I would like to clarify here, though - the original post talked about console scripts, which are defined using entry points. However, the script wrappers installed by pip (which are implemented by distlib) do not use entry point discovery to work, so they do not need or use pkg_resources (they use entry point discovery when installing, but not at runtime). It's only the wrapper scripts implemented by setuptools itself (used by pip for develop mode, and for "legacy" installs that don't go via a wheel build) that use the entry point discovery mechanism at runtime.

Personally, I'd strongly advise against using that mechanism - my understanding is that it was mainly to support the mechanisms setuptools has for dynamically activating versions of a package at runtime (something that I think has been deprecated for some time). But either way, the choice to use it or not is entirely in setuptools' hands. It wouldn't make any difference for other uses of entry points, but it would address the issue for setuptools-created console script wrappers.

jaraco commented 4 years ago

Lots of work has been done to address this issue, in particular by satisfying the use-cases of pkg_resources in other places. In particular, the importlib.resources and importlib.metadata stdlib packages (and their _-separated backports) attempt to satisfy the most common use-cases around metadata (including fast entry point parsing) and package resource retrieval. At the same time, this project is deprecating egg-based installs and easy_install.

As a result, use of the pkg_resources package for these use-cases is discouraged and the module is largely deprecated.

To that end, I don't believe there's much left to do with this issue as described. Instead, downstream consumers should attempt to use the importlib features instead. If there are use-cases that are not satisfied by those packages, please feel free to raise those as separate issues.
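For readers looking for a starting point, the two most common pkg_resources use-cases mentioned in this thread (entry point lookup and reading packaged data) can be approximated with the stdlib as sketched below. The helper names are hypothetical, and `importlib.resources.files` requires Python 3.9 or newer; on older versions the `importlib_resources` backport offers the same function.

```python
import importlib.metadata as metadata
import importlib.resources as resources
from typing import List


def console_scripts() -> List[metadata.EntryPoint]:
    """Return all installed console_scripts entry points.

    ``entry_points()`` changed shape across Python versions, so both the
    selectable API (3.10+) and the older dict-like result are handled.
    """
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        return list(eps.select(group="console_scripts"))
    return list(eps.get("console_scripts", []))


def read_package_text(package: str, resource: str) -> str:
    """Read a text resource shipped inside ``package``.

    This covers the typical reason for calling
    ``pkg_resources.resource_filename``: getting at static data.
    """
    return resources.files(package).joinpath(resource).read_text(encoding="utf-8")
```

Neither helper triggers the full working-set construction that `import pkg_resources` performs, which is the source of the startup cost discussed in this issue.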

bulletmark commented 4 years ago

@jaraco I am a simple developer who uses setuptools but have avoided using entry_points because of this bug. I have been watching this bug (for a few years!) waiting for the day it gets fixed so I can start using them again. According to the current automatic script generation documentation, entry_points are still the intended way to do this but I just tried it again using pip and my programs still cop a few unacceptable extra seconds of startup delay (e.g when installed on a raspberry pi). What are you saying we should be doing now? Can you please provide a link to an example setup.py? I am using pip 20.0.2 and setuptools 46.1.3.

gaborbernat commented 4 years ago

@bulletmark did you read https://docs.python.org/3/library/importlib.metadata.html#entry-points ?

pfmoore commented 4 years ago

@bulletmark I'd recommend that you use pip's script generation, installing from wheels (or if you install from source, ensure you have the wheel project installed, so that pip builds a wheel and installs it, rather than going through the setuptools "legacy" script generation process).

The pip scripts still use exactly the same entry point mechanism, but only at install time - there's no runtime cost to the entry points.
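To show what "only at install time" means, the sketch below renders a wrapper of roughly the shape pip writes when it installs from a wheel: the entry point string is resolved once, and the resulting script contains a plain import. The function `render_wrapper` and the example spec `mypkg.cli:main` are hypothetical; the real template lives in distlib and may differ in detail.

```python
WRAPPER_TEMPLATE = """\
#!{python}
# -*- coding: utf-8 -*-
import re
import sys
from {module} import {import_name}
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\\.pyw?|\\.exe)?$', '', sys.argv[0])
    sys.exit({func}())
"""


def render_wrapper(spec: str, python: str = "/usr/bin/python3") -> str:
    """Render a console-script wrapper for an entry point like "pkg.mod:func"."""
    module, _, func = spec.partition(":")
    if not module or not func:
        raise ValueError(f"expected 'module:callable', got {spec!r}")
    # For "pkg.mod:obj.method" only the top-level object can be imported.
    import_name = func.split(".")[0]
    return WRAPPER_TEMPLATE.format(
        python=python, module=module, import_name=import_name, func=func
    )


if __name__ == "__main__":
    print(render_wrapper("mypkg.cli:main"))
```

The generated script never imports pkg_resources, so its startup cost is just that of importing the target module.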

bulletmark commented 4 years ago

@gaborbernat sorry, but I don't see how that link you quote relates to my question? Should I still be using setuptools? If so, should I not be following the current documentation to create an automatic script? I can't do what that current documentation says because the (pip) created script runs far too slowly due to the present bug. I think this bug should be re-opened, at least until that documentation is corrected, and/or some other avenue is provided for us to create automatic scripts from setuptools.

jaraco commented 4 years ago

I can't do what that current documentation says because the (pip) created script runs far too slowly due to the present bug.

@bulletmark, sorry to hear you're still having issues. I'm surprised by this report. If pip is creating the script, it should not be importing pkg_resources implicitly. However, if any of the libraries used by that command are importing pkg_resources, you'll still have the slowness. The recommendation is for those packages to use importlib.metadata. The instructions in setuptools are still accurate for how a project can/should declare console-script entry points. Do you have an example of a command whose library when installed by pip is still slow to execute?

bulletmark commented 4 years ago

@jaraco wrote:

Do you have an example of a command whose library when installed by pip is still slow to execute?

Here you go: https://github.com/bulletmark/dummysetuptestapp.

marcelm commented 4 years ago

@bulletmark If I install your test program using the most recent pip and setuptools versions, the generated wrapper does not use pkg_resources and is fast to execute.

The problem may be in the use of sudo (according to your instructions in the README), which may give you different pip and setuptools versions than when you run it without sudo.

You may want to compare output of sudo python3 -c "import sys; print(sys.executable)" with python3 -c "import sys; print(sys.executable)" to check this. Note that best practice is to avoid sudo pip install anyway.

untitaker commented 4 years ago

FWIW the issue for me has always been that pip install -e is slow, not pip install which has been working fine.

always = since ~2018

WGH- commented 4 years ago

Okay, that's odd. On my main system, the generated dummysetuptestapp indeed doesn't have pkg_resources import.

On my Orange Pi Zero board, however, the generated script somehow still imports main through load_entry_point, even though I manually updated pip and setuptools beforehand.

$ python3 -m venv env
$ ./env/bin/pip3 install -U pip setuptools
[...]
Installing collected packages: pip, setuptools
  Found existing installation: pip 9.0.1
    Uninstalling pip-9.0.1:
      Successfully uninstalled pip-9.0.1
  Found existing installation: setuptools 39.0.1
    Uninstalling setuptools-39.0.1:
      Successfully uninstalled setuptools-39.0.1
Successfully installed pip-20.0.2 setuptools-46.1.3
$ ./env/bin/pip3 freeze --all
pip==20.0.2
pkg-resources==0.0.0
setuptools==46.1.3
$ ./env/bin/pip3 install .
$ cat ./env/bin/dummysetuptestapp
#!/tmp/dummysetuptestapp/env/bin/python3
# EASY-INSTALL-ENTRY-SCRIPT: 'dummysetuptestapp==1.0','console_scripts','dummysetuptestapp'
__requires__ = 'dummysetuptestapp==1.0'
import re
import sys
from pkg_resources import load_entry_point

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(
        load_entry_point('dummysetuptestapp==1.0', 'console_scripts', 'dummysetuptestapp')()
    )

pfmoore commented 4 years ago

@WGH- I've noted this earlier, but the behaviour is different depending on whether you have wheel installed. In your example, you don't - so you get the old, slow version of the scripts. Please install wheel and try again, and you'll see the newer scripts.
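A quick way to tell which kind of wrapper ended up on disk is to look for the markers visible in the script shown above. This is a heuristic sketch with hypothetical helper names; it only inspects text and does not execute anything.

```python
import os
from typing import List

# Both markers appear in the setuptools-generated wrapper quoted in this thread.
OLD_STYLE_MARKERS = (
    "EASY-INSTALL-ENTRY-SCRIPT",
    "from pkg_resources import load_entry_point",
)


def is_old_style_wrapper(script_text: str) -> bool:
    """Return True if the script looks like a setuptools-generated wrapper."""
    return any(marker in script_text for marker in OLD_STYLE_MARKERS)


def find_old_style_wrappers(bin_dir: str) -> List[str]:
    """List files in ``bin_dir`` that look like old-style wrappers."""
    found = []
    for name in sorted(os.listdir(bin_dir)):
        path = os.path.join(bin_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, encoding="utf-8") as fh:
                head = fh.read(2048)  # the markers sit in the first few lines
        except (OSError, UnicodeDecodeError):
            continue  # binaries and unreadable files are skipped
        if is_old_style_wrapper(head):
            found.append(path)
    return found
```

Pointing `find_old_style_wrappers` at a virtualenv's `bin` directory lists the commands that would pay the pkg_resources import cost at startup.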

WGH- commented 4 years ago

Okay, turned out I had to install wheel as well. It was far from obvious that wheel somehow influences pip/setuptools (?) entry point script generation. FFS.

pfmoore commented 4 years ago

@WGH- Agreed, it's not obvious. It's because pip can't build a wheel and do its own (faster) script generation when wheel isn't present, so it falls back on the older setuptools direct installation code, which is what generates the old-style wrappers.

Maybe pip should warn in this case. I'll raise a pip issue suggesting that.

bulletmark commented 4 years ago

@pfmoore and @WGH- have identified the problem! I simply install python3-wheel, reinstall that test app, and then the startup time on my Raspberry Pi is basically the same.

pfmoore commented 4 years ago

@bulletmark Sigh, I noted this earlier, but it got lost in the discussion, my apologies for not being clearer. As noted, I'm raising a pip issue to consider warning in this case, to make it easier to see what's going on.

bulletmark commented 4 years ago

Given this situation, I still won't be using entry_points because I don't want users to suffer slow startup just because they don't have that package installed. Can this be improved other than merely outputting a warning?

pfmoore commented 4 years ago

Can this be improved other than merely outputting a warning?

You could switch your project to use pyproject.toml for defining its build dependencies. Then you can explicitly require wheel, and pip will install it for you when your users build from source. That may or may not be a bigger change than you want to make, but it should be relatively painless. In principle, all you need is a pyproject.toml file containing

[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

It does change some details of how pip builds your project, though, so please test before releasing.

WGH- commented 4 years ago

Sigh, I noted this earlier, but it got lost in the discussion, my apologies for not being clearer.

@pfmoore you were perfectly clear, the fault is mine for not reading at least some recent messages of the discussion. (the real fault is on the non-obvious pip/setuptools/wheel interaction, of course)

cjw296 commented 4 years ago

@pfmoore - when would the non-wheel case be desirable? How about vendoring in wheel so that pip never builds the degraded scripts that use pkg_resources?

pradyunsg commented 4 years ago

@cjw296 https://github.com/pypa/pip/issues/8102#issuecomment-617045981

kapsh commented 4 years ago

@pfmoore can wheel be added to build process using setup_requires parameter of setuptools.setup?

cjw296 commented 4 years ago

@pradyunsg - from a user perspective, having a really badly performing thing done because some library I don't know about isn't installed by a tool that should require it does end up feeling like a bug with the tool...

pfmoore commented 4 years ago

@pfmoore can wheel be added to build process using setup_requires parameter of setuptools.setup?

@kapsh not safely, setup_requires is deprecated because it uses easy_install to install the packages, not pip. pyproject.toml is the correct, supported way. What's the use case that works with setup_requires but not with pyproject.toml (because we'd like to fix it!)?

from a user perspective, having a really badly performing thing done because some library I don't know about isn't installed by a tool that should require it does end up feeling like a bug with the tool...

@cjw296 Agreed, up to a point. While I understand that the history isn't the point here, we're in a transition period. The setuptools wrappers were historically the approved solution, and pip did setup.py install. We moved away from setup.py install to PEP 517 (pyproject.toml and building wheels) but we're still part-way through that process (pyproject.toml adoption is still in progress, but the new wrappers depend on wheel). The transition to PEP 517 is not about better wrappers itself, but they come as a consequence.

The fix pip is progressing towards is making all builds go via PEP 517. We only support setup.py install any more to avoid breaking projects that haven't done anything about the transition yet, and break under the new process. Conversely, setuptools isn't interested in updating their wrappers as they are being phased out by the pip change (and the general move away from installing via setuptools directly).

So yes, it's a bug, but it's being fixed. The fix is just rather long-winded, for compatibility reasons, and we're doing our best to apply mitigations while the process is ongoing.

kapsh commented 4 years ago

@pfmoore sorry, I am not very proficient with setuptools and can't tell you use cases where pep-517-way would fail (personally I like to abuse pip install -e ., but that's another story). I've only used setup_requires to make setuptools_scm available to the packaging process. Yet I have to notice that documentation here https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords never mentions deprecation of setup_requires keyword, maybe you would like to fix that detail. Big red warning while building sounds useful.

Thanks for your brief on current situation, this is interesting to know about.

pfmoore commented 4 years ago

Yet I have to notice that documentation here https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords never mentions deprecation of setup_requires keyword, maybe you would like to fix that detail. Big red warning while building sounds useful.

Good point. I'm not a setuptools developer, so I'll leave it to them to pick up on that.

tgbugs commented 4 years ago

One context where the degraded setuptools scripts are generated is for any/every python package on gentoo that has an entry point. This is probably ultimately an issue that the gentoo python team (e.g. @mgorny) would have to tackle, but it affects all system installed python packages.

pganssle commented 4 years ago

Yet I have to notice that documentation here https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords never mentions deprecation of setup_requires keyword, maybe you would like to fix that detail. Big red warning while building sounds useful.

setup_requires is sort of semi-deprecated. It's not the preferred way to add things to the build dependencies, but it is compatible with PEP 517/518 and feeds into get_requires_for_build_wheel. We can probably open a separate issue to discuss this.

cjw296 commented 4 years ago

@pfmoore - okay, so I think I'm doing everything you said, but still getting entrypoint scripts built using pkg_resources:

$ pip freeze --all | egrep -i 'wheel|pip|setuptools'
pip==20.1
setuptools==46.1.3
wheel==0.34.2
$ pip install -e .
Obtaining file:///home/chris/energenie
...
Successfully installed energenie
$ cat `which check`
#!/home/chris/virtualenvs/energenie/bin/python3.5
# EASY-INSTALL-ENTRY-SCRIPT: 'energenie','console_scripts','check'
__requires__ = 'energenie'
import re
import sys
from pkg_resources import load_entry_point

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(
        load_entry_point('energenie', 'console_scripts', 'check')()
    )

What am I doing wrong?

pfmoore commented 4 years ago

@cjw296 Ah, you're using editable installs (-e). They go via setup.py develop, because they are a setuptools-specific feature, not handled via wheels or any published standard. So you get setuptools wrappers in that case, and there's no avoiding it. Sorry, I forgot to mention that case.

cjw296 commented 4 years ago

It feels like the chances of getting a non-sucky entrypoint script are really pretty small, no? (Honestly, reading through the above feels like some magic incantation, rather than a standard way to install software in one of the most popular programming languages in the world...)

cjw296 commented 4 years ago

More constructively, what's the current state of play on publishing a standard for editable installs? (-e seems pretty ubiquitous, I seem to remember flit having something too, not actually sure what conda does or if they care...)

gaborbernat commented 4 years ago

@cjw296 see https://discuss.python.org/t/third-try-on-editable-installs/3986/25

pfmoore commented 4 years ago

There's a lot of debate on editable installs, but the latest round of discussions is here.

To cut through some of the packaging community specifics there, there's one proposal that hasn't been completely written up yet (a rough spec is here) which is waiting on someone with time to build a proof of concept implementation for some build backend (probably setuptools) and for pip. There's still a lot of debate over whether this is the best approach, but TBH, we need someone to write code at this point, not to discuss ideas (we've got plenty of people willing to do that 🙂)

Edit: @gaborbernat posted a link to some additional points that I'd not spotted since I last checked the topic, so we're a bit further forward than I suggested above.

bluetech commented 4 years ago

I use Arch Linux, which installs all python packages using setup.py install (see package guidelines). So all Python executables installed through the system package manager (tox, virtualenv, meson, youtube-dl, docker-compose, borg, and many more) get the 250ms startup slowdown due to the pkg_resources import, which is unfortunate.

I reported this issue to the Arch Linux devs, and they explained that they prefer the pkg_resources method because it provides a nice informative error message if one of the dependencies is broken or missing, for example:

Traceback (most recent call last):
  File "/usr/bin/pyrsa-keygen", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3259, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3242, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3271, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 584, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 787, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'pyasn1>=0.1.3' distribution was not found and is required by rsa

compared to some import error, or worse, a silently broken program if the import is conditional.
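A comparable "missing dependency" message can in principle be produced without pkg_resources. The sketch below is only an approximation and not something proposed in this thread: the function names are hypothetical, version specifiers are not checked, and requirements carrying environment markers or extras (anything with a `;`) are skipped rather than evaluated.

```python
import re
from importlib import metadata
from typing import Iterable, List


def find_missing(requirements: Iterable[str]) -> List[str]:
    """Return the requirement strings whose project is not installed."""
    missing = []
    for req in requirements:
        if ";" in req:
            continue  # markers and extras are not evaluated in this sketch
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
        if match is None:
            continue
        try:
            metadata.version(match.group(1))
        except metadata.PackageNotFoundError:
            missing.append(req)
    return missing


def check_distribution(name: str) -> List[str]:
    """Return unmet requirements declared by the installed distribution ``name``.

    Raises ``PackageNotFoundError`` if ``name`` itself is not installed.
    """
    return find_missing(metadata.requires(name) or [])
```

A wrapper could call `check_distribution` only after an `ImportError`, which keeps the common, healthy startup path free of any metadata scanning.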

I wonder if the Python devs have any recommendations to distros on this, or if other distros do something different.

(Apologies if I missed previous discussion on this.)

pganssle commented 4 years ago

The recommendation to distros is definitely to not use setup.py install.

We are 100% planning on removing setup.py install, and for several years we haven't been fixing bugs that can be fixed by using pip. They don't have to use pip, but they should be using something equivalent. The sooner they migrate to something else the better.

mgorny commented 4 years ago

Could you please indicate what 'something equivalent' useful for distributions is? It's easy to remove features you don't need for your workflow. It's much harder to provide a good alternative, and a plan to update thousands of packages to work. Flit/poetry has already caused enough mess by not caring at all about what distributions need.

pfmoore commented 4 years ago

The recommendation is to install using pip. If a distribution doesn't like the script wrappers pip generates, they can certainly write their own (or write a tool to generate something that works as they want). As things stand, I think you'd have to overwrite the pip-created wrappers (or put your own earlier on PATH so they get priority) but it would be a reasonable request for pip install to have a flag that omits generating script wrappers.

mgorny commented 4 years ago

What advantage does pip have over setup.py install? Besides creating even bigger circular dependency graph that makes switching to a new Python version an experience wasting hundreds of hours of our time.

pganssle commented 4 years ago

I think maybe we should take this to a new issue, since we're getting a bit far off the original topic of discussion.

@mgorny If you do not like the supported workflow, or the fact that setup.py install is deprecated and unsupported, would you mind opening a new issue?

Considering that this issue is closed and seems to be a lightning rod for off-topic discussions, I recommend we lock it.

gaborbernat commented 4 years ago

What advantage does pip have over setup.py install? Besides creating even bigger circular dependency graph that makes switching to a new Python version an experience wasting hundreds of hours of our time.

Pretty much every word of pep 518 and pep 517.

FFY00 commented 4 years ago

@gaborbernat you completely missed the point, PEP 517 and 518 are completely irrelevant for what is being discussed.

pganssle commented 4 years ago

I've gone ahead and locked this topic for the sake of the inboxes of the people who followed this issue looking for updates on pkg_resources and who don't care about linux distributions. If other maintainers feel I've overstepped my bounds here, they are welcome to unlock it.

I would like to say to the Linux distributors, (and particularly the Arch Linux packagers; a distro I've been using and heartily recommending for years) — thank you for the work you've been doing. We definitely would like to continue working with you to find a reasonable way to take your important use case into account. You are always welcome to open an issue on setuptools, a thread on the packaging discourse, or even to e-mail me personally. Next time PyCon happens, we'll be having a packaging summit, and we'd be happy to have you involved.

pradyunsg commented 4 years ago

I would like to say to the Linux distributors, (and particularly the Arch Linux packagers; a distro I've been using and heartily recommending for years) — thank you for the work you've been doing. We definitely would like to continue working with you to find a reasonable way to take your important use case into account.

+1

You are always welcome to open an issue on setuptools, a thread on the packaging discourse, or even to e-mail me personally.

I'll extend the same offer from pip's side as well!

pypa / setuptools

Avoid full path enumeration on import of setuptools or pkg_resources? #510