pypa / setuptools

Official project repository for the Setuptools build system
https://pypi.org/project/setuptools/
MIT License

Avoid full path enumeration on import of setuptools or pkg_resources? #510

Closed ghost closed 4 years ago

ghost commented 8 years ago

Originally reported by: konstantint (Bitbucket: konstantint, GitHub: konstantint)


At the moment on my machine, it takes about 1.28 seconds to do a bare import pkg_resources, 1.47 seconds to do a bare import setuptools, and 1.36 seconds to do a bare from pkg_resources import load_entry_point.

This obviously affects all of the Python scripts that are installed as console entry points, because each and every one of them starts with a line like that. Even in code which does not rely on entry points, this can be a problem whenever I want to use resource_filename to access static data consistently.

I believe this problem is fairly common, yet I did not find any existing issue or discussion, hence I'm creating one, hoping I'm not unnecessarily repeating what has already been said elsewhere.

I am using Anaconda Python, which ships with a fairly large set of packages, alongside several of my own packages, which I commonly add to my path via setup.py develop; I do not believe this setup is anything out of the ordinary. There are 37 items on my sys.path at the moment. Profiling import pkg_resources shows that this leads to 76 calls to WorkingSet.add_entry (taking about a second in total), most of which is spent in 466 calls to Distribution.from_location.

Obviously, the reason for the problem lies in the two _call_aside calls at the end of pkg_resources, which trigger a full scan of the Python path at the moment the package is imported; the only way to alleviate this would be to avoid or delay the need for that scan as much as possible.

I see two straightforward remedies:

a) Make the scanning lazy. After all, if all one needs is to find a particular package, the scan could stop as soon as the corresponding package is located. At the very least this would allow me to "fix" my ipython loading problem by moving it up in the path. This might break some import rules that do not respect the precedence of the path, which I'm not aware of.

b) Cache a precomputed index and update it lazily. Yes, this might require some ad-hoc rules for resolving inconsistencies, and it may lead to ugly conflicts with external tools that attempt to install multiple versions of a package, but it would avoid the current startup delay in 99.99% of cases and solve so many of my problems that I'd be willing to pay the price.

Although both options might seem somewhat controversial, the problem itself seems serious enough to deserve at least some fix eventually (for example, I've recently discovered I'm reluctant to start ipython for short calculations because of its startup delay, which I've now tracked back to this same issue).

I'm contemplating making a separate utility, e.g. fast_pkg_resources, which would implement strategy (b) by simply caching calls to pkg_resources in an external file, yet I thought of raising the issue here first to figure out whether someone has already addressed it, whether there are plans to do something about it in the setuptools core codebase, or whether I'm missing something obvious.
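
For illustration, a minimal sketch of what such a fast_pkg_resources wrapper could look like (entirely hypothetical: fast_pkg_resources does not exist, the cache path is arbitrary, and the inconsistency handling from point (b) is hand-waved away):

# hypothetical fast_pkg_resources: memoize entry-point locations on disk so
# the full sys.path scan only happens on a cache miss (strategy (b) above)
import importlib
import json
import os

CACHE = os.path.expanduser('~/.cache/fast_pkg_resources.json')

def load_entry_point(dist, group, name):
    key = '|'.join((dist, group, name))
    cache = {}
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            cache = json.load(f)
    if key not in cache:
        import pkg_resources  # slow path: the full path scan happens here
        ep = pkg_resources.get_entry_info(dist, group, name)
        cache[key] = [ep.module_name, list(ep.attrs)]
        os.makedirs(os.path.dirname(CACHE), exist_ok=True)
        with open(CACHE, 'w') as f:
            json.dump(cache, f)
    module_name, attrs = cache[key]
    obj = importlib.import_module(module_name)
    for attr in attrs:  # walk e.g. ['main'] or ['Tool', 'run']
        obj = getattr(obj, attr)
    return obj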


jbohren commented 8 years ago

@konstantint Here's another instance, with some more comparisons: https://www.reddit.com/r/Python/comments/4auozx/setuptools_deployment_being_very_slow_compared_to/

jaraco commented 8 years ago

Is the performance better with the same packages installed using pip? What about those packages installed with pip install --egg?

As long as console entry points require the validation of all packages in the chain, I expect startup to be somewhat slow.

I worry that remedy (a) might only have modest benefits while imposing new, possibly conflicting instructions on the user about how to apply the remedy.

Remedy (b) promises a nicer use-case, but as you point out, caching is fraught with challenges.

It sounds like you have a decent grasp of the motivations behind the current implementation, so you're at a good place to draft an implementation.

jbohren commented 8 years ago

Is the performance better with the same packages installed using pip? What about those packages installed with pip install --egg?

Even when installing via pip or with --egg, it's still over 300ms for my use case. As an aside, the reason we want to decrease this startup time is so that we can use the tool in interactive tab-completion.

Carreau commented 8 years ago

Might be of interest: https://pypi.python.org/pypi/entrypoints (https://github.com/takluyver/entrypoints), but agreed that the load time is impacting a few other projects, like everything that relies on prompt_toolkit.

scopatz commented 8 years ago

And everyone that relies on pygments. I have some profiling available at https://github.com/xonsh/import-profiling where I have a nasty sys.modules['pkg_resources'] = None hack to prevent its import.

Importing pygments (profiling output omitted; the numbers are in the repository linked above):

So just by importing pkg_resources, the slowdown is ~100x. In wall-clock time, I have consistently measured the pkg_resources overhead to be at least 150-200ms. This makes pkg_resources unusable in command-line utilities that require fast startup times.

In xonsh, we have resorted to the above hacks to prevent our dependencies (pygments, prompt_toolkit) from accidentally importing it.

olliebun commented 8 years ago

I'm seeing a consistent ~150ms wall clock time as well. I'm writing a command-line utility with autocompletion, so it's a serious challenge. It's not clear how to fix this without giving up all of setuptools' advantages.

scopatz commented 8 years ago

Yesterday, I released the lazyasd package (pip install lazyasd), which has the ability to perform imports on a background thread. This was written specifically to mitigate the long pkg_resources import times.

Background thread docs and example here https://github.com/xonsh/lazyasd#background-imports

Feel free to use or copy the lazyasd module into your projects.
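
Going by the linked docs, usage looks roughly like this (a sketch; see the lazyasd README for the authoritative API):

# start loading pkg_resources on a background thread; the proxy placed in
# sys.modules only blocks if the module is touched before loading finishes
from lazyasd import load_module_in_background

pkg_resources = load_module_in_background('pkg_resources')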

ninjaaron commented 8 years ago

I wrote a tiny module called fastentrypoints that monkey patches the mechanism behind entry_points to generate scripts that don't import pkg_resources.

https://github.com/ninjaaron/fast-entry_points
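
The generated script then imports the entry point directly instead of routing through pkg_resources; roughly like this (a sketch, with mypackage.cli:main standing in for a real entry point):

# console script as emitted without pkg_resources (illustrative)
import re
import sys

from mypackage.cli import main

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())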

Fak3 commented 7 years ago

@ninjaaron Thanks for fastentrypoints. I managed to fix the distribution issue by adding it to MANIFEST.in:

include fastentrypoints.py

ninjaaron commented 7 years ago

@Fak3 Good idea! I took that crazy bit about downloading and exec-ing the code out of the fastentrypoints docs and mentioned using MANIFEST.in instead. The fastep command also now appends this line to MANIFEST.in.

I have to admit, I loved the approach I originally came up with (because it's so evil), but using MANIFEST.in is waaaaay saner.

cachedout commented 7 years ago

I certainly don't want to pile on, but I did want to chime in and say that this is an enormous problem for big Python projects right now. It has severely impacted the performance of SaltStack's CLI toolset, which takes ~2.0s to load, of which 1.9s is spent purely in pkg_resources. Unfortunately, we can't just rip out our own imports of pkg_resources, because so many of the libs we use end up importing it anyhow. (This is generally the requests package, but could be others.)

We're exploring ways to mitigate this right now but anything we can do to help out here we'd gladly contribute to. It's a big issue for us.

(We're looking at fastentrypoints by @ninjaaron today; I'll report back with any results.) :]

ninjaaron commented 7 years ago

@cachedout I don't think fastentrypoints can solve the problem if you are importing pkg_resources anyway; it only takes it out of the automatically-generated scripts. If you, or another library you depend on, import it anyway... :(

I myself have actually moved away from using requests for trivial scripts just to avoid the "tax" of importing it. I'm sure this isn't a solution for you, but you might try (or several of us might try) working with the developer of requests to move away from pkg_resources.

Also, I know the developers of xonsh (@scopatz and co., and I think they are not the only ones) have created mechanisms for lazily importing modules only when they are actually required. This kind of lazy import strategy might be appropriate for your project.

cachedout commented 7 years ago

@ninjaaron Thanks so much for the feedback! Yes, after looking at fastentrypoints it's not the right solution for us, unfortunately.

Yes, we're in the process of deprecating requests directly but we have plugins that use it so it would be very challenging to remove it entirely. I'll head over to the requests project and see if I can get an issue filed.

We do have a lazy plugin system that we really like but unfortunately it doesn't quite get us out of this problem because of the way it's written. There might be some room for improvement though, certainly. I'll be investigating.

One very ugly workaround that we did find (though likely won't use) is to simply fool the importer into skipping over pkg_resources. Somewhat surprisingly, this works at the top of a CLI script:

# This needs to be the first thing you do. Obviously, if `pkg_resources`
# is already imported, you are too late!
import sys
sys.modules['pkg_resources'] = None

# ... do work that would otherwise import pkg_resources ...

del sys.modules['pkg_resources']

I'm not necessarily advocating this in all cases but I'll leave it here as a possible workaround for others.

That said, I would still really like to hear from the setuptools folks on this if possible. Having a simple module import stat the disk almost 18,000 times, as it does in my test case, all but makes many Python projects unusable. Would you accept a PR to move away from this behavior by default or, at the least, to gate it behind an environment variable?

jaraco commented 7 years ago

I don't currently have it paged into my mind why this behavior is the way it is. I can't recall if anyone has analyzed the behavior to see if there is something that can be done. Would someone search to see if there have been solutions proposed outside this ticket and link those here? It sounds like fastentrypoints suggests a partial solution but not a general one. If one were to analyze the code, does that lead to other ideas? I'm happy to review a PR, especially one that's not too complicated.

What about moving the implicit behavior to another module, like pkg_resources.preload? Then projects and scripts that rely on the implicit behavior could try/except to import that module, and those that don't could simply import pkg_resources.

It would be a backward incompatible release, but if that clears the way for more efficient implementations, I'm for it.
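
Consumers of the implicit behavior would then opt back in explicitly; something like this (a sketch of the proposal above; pkg_resources.preload does not exist):

try:
    # hypothetical module that would carry today's eager path scan
    import pkg_resources.preload
except ImportError:
    pass  # older setuptools: the plain import below already did the scan
import pkg_resources  # under the proposal, now cheap at import time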

untitaker commented 7 years ago

This is an issue for me as well. This might be naive, but is there a reason why the scripts written by ScriptWriter can't import the entry point directly (i.e. only use pkg_resources.load_entry_point at install time, not runtime)?

ninjaaron commented 7 years ago

@untitaker No clear reason. Scripts generated with wheel do just that. fastentrypoints monkey-patches ScriptWriter for the same behavior, and it seems to work.

Apparently someone thought this was needed when they wrote it, but clearly it doesn't affect the general use-case!

untitaker commented 7 years ago

Is it possible that some other hook is also executed when load_entry_point is used? That would explain the indirection.

ninjaaron commented 7 years ago

I guess it's possible, but I think the fact that wheels don't behave this way is a pretty good indication that it's unnecessary.

I have a suspicion it's a case of getting so involved in one's own API that it seems like the obvious way to do something, even when there is a much simpler solution. We've all been there...

untitaker commented 7 years ago

I'm currently working on this, and it's more complicated than it looks: installing from eggs doesn't work with your patch.

untitaker commented 7 years ago

Ah. Testcases using this example project fail: https://github.com/pypa/setuptools/blob/1ca6f3bf272d8ba2c0d4161cc56a74c63c8afb82/setuptools/tests/test_egg_info.py#L31

The entry point uses the wrong delimiter between module and function (but something like that crashes at runtime anyway).

untitaker commented 7 years ago

@jaraco @ninjaaron See #901

anarcat commented 6 years ago

so fastentrypoints only partially solves this issue, and i hope it will not stay idle for another year here, because there are many consumers of pkg_resources. for example, it is suggested as the canonical way for a package to fetch its own version in setuptools_scm. others use pkg_resources to load their own data_files as well.

and while it's one thing to have to eat this performance cost once for something, hitting it on every module import is really unacceptable. we should, at the very least, lazy-load those call_aside functions and call them on the fly, with caching of course, as needed, in the functions that actually need them.

then we can make sense of this mess: optimize hot loops and everything. right now it's hard to even make heads or tails of all of this, because everything is mangled up in the package load. and what's with the globals() mangling that's going on in there? is that expected practice for tools that are basically part of the standard library? (i know they're not, but considering that entry_points is basically the standard way of distributing Python programs, we should consider this standard.)

here's a quick profiler run done on Python3 loading pkg_resources, including most function calls until we start getting into actual package loading (e.g. feedparser, a rather large package, is included):

$ python3 -m cProfile -s cumtime test-import 
         396313 function calls (392538 primitive calls) in 0.335 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     76/1    0.004    0.000    0.336    0.336 {built-in method builtins.exec}
        1    0.000    0.000    0.336    0.336 test-import:1(<module>)
     88/1    0.000    0.000    0.336    0.336 <frozen importlib._bootstrap>:966(_find_and_load)
     88/1    0.000    0.000    0.336    0.336 <frozen importlib._bootstrap>:939(_find_and_load_unlocked)
     88/1    0.000    0.000    0.336    0.336 <frozen importlib._bootstrap>:659(_load_unlocked)
     64/1    0.000    0.000    0.336    0.336 <frozen importlib._bootstrap_external>:667(exec_module)
    108/1    0.000    0.000    0.335    0.335 <frozen importlib._bootstrap>:214(_call_with_frames_removed)
        1    0.000    0.000    0.335    0.335 __init__.py:16(<module>)
        2    0.000    0.000    0.217    0.108 __init__.py:3002(_call_aside)
        1    0.000    0.000    0.217    0.217 __init__.py:3019(_initialize_master_working_set)
       19    0.001    0.000    0.206    0.011 __init__.py:683(add_entry)
      458    0.004    0.000    0.199    0.000 __init__.py:1992(find_on_path)
       15    0.000    0.000    0.108    0.007 __init__.py:1966(_by_version_descending)
       15    0.009    0.001    0.108    0.007 {built-in method builtins.sorted}
        1    0.000    0.000    0.104    0.104 __init__.py:641(_build_master)
        1    0.000    0.000    0.104    0.104 __init__.py:628(__init__)
    29/15    0.000    0.000    0.085    0.006 {built-in method builtins.__import__}
      445    0.003    0.000    0.076    0.000 __init__.py:2418(from_location)
     1601    0.003    0.000    0.074    0.000 __init__.py:1981(_by_version)
     1601    0.003    0.000    0.066    0.000 __init__.py:1987(<listcomp>)
     4199    0.006    0.000    0.063    0.000 version.py:24(parse)
      401    0.001    0.000    0.050    0.000 __init__.py:2760(_reload_version)
      532    0.001    0.000    0.050    0.000 re.py:278(_compile)
       87    0.000    0.000    0.049    0.001 re.py:222(compile)
       83    0.001    0.000    0.048    0.001 sre_compile.py:531(compile)
      401    0.001    0.000    0.048    0.000 __init__.py:2390(_version_from_file)
     5040    0.017    0.000    0.043    0.000 version.py:198(__init__)
        1    0.000    0.000    0.042    0.042 requirements.py:4(<module>)
  300/296    0.005    0.000    0.037    0.000 {built-in method builtins.__build_class__}
     3749    0.003    0.000    0.036    0.000 version.py:74(__init__)
     1818    0.003    0.000    0.034    0.000 __init__.py:2563(_get_metadata)
     3749    0.012    0.000    0.033    0.000 version.py:131(_legacy_cmpkey)
      401    0.001    0.000    0.032    0.000 {built-in method builtins.next}
       83    0.000    0.000    0.031    0.000 sre_parse.py:819(parse)
   337/83    0.001    0.000    0.030    0.000 sre_parse.py:429(_parse_sub)
      841    0.002    0.000    0.030    0.000 __init__.py:1376(safe_version)
      9/8    0.000    0.000    0.030    0.004 <frozen importlib._bootstrap>:630(_load_backward_compatible)
   487/88    0.011    0.000    0.030    0.000 sre_parse.py:491(_parse)
        6    0.000    0.000    0.030    0.005 __init__.py:35(load_module)
        1    0.000    0.000    0.024    0.024 pyparsing.py:61(<module>)
   153/98    0.000    0.000    0.024    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
        1    0.000    0.000    0.022    0.022 parser.py:5(<module>)
        1    0.000    0.000    0.022    0.022 feedparser.py:20(<module>)

test-import is simply import pkg_resources in a text file. here, starting Python3 with pkg_resources takes about 200ms more than without, a result consistent with others'. the ~335ms total above is probably due to profiler overhead. most of the time (217ms) is taken by _initialize_master_working_set():

https://github.com/pypa/setuptools/blob/5da3a845683ded446cad8af009d3ab9f264a944f/pkg_resources/__init__.py#L3193

Half of that time (104ms) is taken by WorkingSet._build_master():

https://github.com/pypa/setuptools/blob/5da3a845683ded446cad8af009d3ab9f264a944f/pkg_resources/__init__.py#L3205

If I were to hazard a guess as to the other half, it's the map call a little later:

https://github.com/pypa/setuptools/blob/5da3a845683ded446cad8af009d3ab9f264a944f/pkg_resources/__init__.py#L3228

find_on_path also seems like a pretty hot loop, called 500 times and doing pretty inefficient stuff like:

    if os.path.isdir(path_item) and os.access(path_item, os.R_OK):

this is probably generating hundreds of spurious stat() syscalls. and that's just scratching the surface... there might be other nice optimization opportunities here... but really, there's a lot of work done here just to list all packages, and i doubt we'll get below 10-20ms (a 10-fold improvement would be a nice first performance target) with all of that in place...
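
one way to see the syscall volume for yourself (counts will vary with the length of sys.path and the number of installed distributions):

$ strace -c -f python3 -c 'import pkg_resources'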

isn't there a simpler way to do most of what we need here? for example, I just want the version number of my package: I don't want all version numbers of all packages, or even the version number of an arbitrary package foo. couldn't a cache of the package metadata be built when the package is installed, so that it's available in a fast way? heck, the entrypoint script created by setup.py looks like this here:

#!/usr/bin/python3
# EASY-INSTALL-ENTRY-SCRIPT: 'undertime==1.0.2.dev0+ng7058042.d20180223','console_scripts','undertime'
__requires__ = 'undertime==1.0.2.dev0+ng7058042.d20180223'
import re
import sys
from pkg_resources import load_entry_point

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(
        load_entry_point('undertime==1.0.2.dev0+ng7058042.d20180223', 'console_scripts', 'undertime')()
    )

my version number is right there at the top! why do I need to go back to pkg_resources to load it again, if it was used to find the package in the first place?

if we deploy an egg, for example, couldn't we just hardcode the path to the egg there and make a special case for our own package in get_distribution()? that would solve 99% of the use cases, and, with proper lazy-loading, would fix this shit for the remaining 1%, which would pay a higher cost to look at other packages.

i know it's a major change to packaging, but it's not like we have a precious little pearl of design that we don't want to touch here. one more stitch on that frankenstein might actually make it look prettier. ;)

cachedout commented 6 years ago

A big šŸ‘ to what @anarcat writes above.

The way this is built at present is causing major performance issues for Python projects that can't simply abandon pkg_resources. A 200ms+ hit on virtually all Python executables is a big, big, big deal.

anarcat commented 6 years ago

Another thing: anecdotal evidence here shows that there might be more performance improvements possible than I thought. Indeed, not that much time is actually spent in syscalls at all; a lot of the time spent in pkg_resources is just raw userland CPU cycles. Here's an example from a program of mine which uses setuptools_scm to guess its own version number, but only if it's not available in a _version.py file (a sketch of this fallback pattern appears at the end of this comment). With the _version.py file we get our baseline, about 83ms:

$ multitime -n 100 -s 0 -q ./undertime.py
===> multitime results
1: -q ./undertime.py
            Mean        Std.Dev.    Min         Median      Max
real        0.083       0.005       0.081       0.081       0.119       
user        0.075       0.007       0.064       0.076       0.116       
sys         0.006       0.005       0.000       0.008       0.016     

Note: 6ms spent on system calls on average. Now, with pkg_resources being hit:

$ multitime -n 100 -s 0 -q ./undertime.py
===> multitime results
1: -q ./undertime.py
            Mean        Std.Dev.    Min         Median      Max
real        0.377       0.014       0.369       0.373       0.451       
user        0.345       0.015       0.316       0.344       0.416       
sys         0.026       0.009       0.004       0.024       0.048    

feeel it, feeeeeeel the pain!!!!! Whereas the program was previously running almost instantly (below the traditional 100ms threshold), it's now pushing past the load time of my homepage through a DSL link, with a lovely 380ms. Also note that the vast majority of that is spent in userland: 345ms, with only a five-fold increase in the system time (26ms).

In other words, I don't know what we do in pkg_resources, but we sure as heck spend a lot of time building and inspecting data structures that we just throw out the window a second later when the program terminates: most of the time is not spent looking at the filesystem, but actually calculating stuff.

It might very well be that find_on_path and add_entry are actually where all the action is: they are called often enough, and are slow enough, to matter. How about we optimize the heck out of those functions now? :)
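
for reference, the fallback pattern mentioned above looks roughly like this (a sketch, assuming a setuptools_scm-generated _version.py; 'mypkg' is a placeholder):

# mypkg/__init__.py: prefer the generated _version.py; only pay the
# pkg_resources cost when running from a checkout that lacks one
try:
    from ._version import version as __version__
except ImportError:
    import pkg_resources  # slow path: full working-set construction
    __version__ = pkg_resources.get_distribution('mypkg').version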

benoit-pierre commented 6 years ago

I think you're missing the point; it's not just about your version number: handling namespace packages, checking requirements, activating distributions... You can't do that without knowing what distributions are available, and load_entry_point will use that info.

anarcat commented 6 years ago

@benoit-pierre the point is exactly that: pkg_resources is doing too much stuff. it shouldn't load a model of the universe and try to guess the weather in Honolulu when all i want to know is "what's my address?" Normally, I just know; if i don't, I just step out the door and look. Most operations are just like that: local inventory, not multi-level introspection. I understand why those might be conflated at the data-structure level, but from an API consumer's perspective, this is really hurting the developer experience.

cachedout commented 6 years ago

"..from an API consumer perspective, this is really hurting the developer experience."

Not just from the developer experience.

Again, I can't emphasize this enough: programs that are written in Python which end up hitting this code path (which is a huge number of all Python executables) are experiencing a massive decrease in performance when it comes to execution time.

scopatz commented 6 years ago

Yes this continues to be the major startup time slowdown for all of my projects.

ninjaaron commented 6 years ago

I like how popular fastentrypoints is getting, but clearly the real solution is just to make entry_points fast -- not to mention all the other uses of pkg_resources. I've been using gross ~data_files~ package_data hackarounds just to avoid it in my own projects. I hesitate to use Requests in small scripts because I know it's using pkg_resources. Requests alone should be reason enough to fix this ridiculous behavior.

ninjaaron commented 6 years ago

@scopatz Yes! I'm a huge fan of xonsh! The main reason I don't use it much is because I'm constantly opening new terminals and the load time is slow. Fix pkg_resources already! The people want xonsh!

anarcat commented 6 years ago

@ninjaaron it would be great if you could share the hack you use for data_files; it's the other main reason i use pkg_resources, outside of entry_points and setuptools_scm.

i guess that if we have workarounds for entrypoints, data_files and get_version, I'd be happy with pkg_resources being slow. :p

dsully commented 6 years ago

@anarcat - Check out http://importlib-resources.readthedocs.io/en/latest/ which will also be part of Python 3.7
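
For resource access, the replacement is straightforward (a sketch; 'mypkg' and 'data.json' are placeholders):

# instead of pkg_resources.resource_string / resource_filename:
import importlib_resources  # stdlib importlib.resources on Python 3.7+

data = importlib_resources.read_text('mypkg', 'data.json')

# when an actual filesystem path is needed (extracted from zips if necessary):
with importlib_resources.path('mypkg', 'data.json') as path:
    print(path)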

ninjaaron commented 6 years ago

@anarcat looks like I misspoke: I use package_data. You have to do some stuff in setup.py, and you also have to include the required files in your MANIFEST.in (to ensure they're included if someone installs with pip). Then I locate the data file relative to the module when I need to use it. It's absurd.

Here's an example where I ship the data in the package folder: https://github.com/ninjaaron/lazydots. For that, you just use include_package_data=True in setup.py.

In another case (https://github.com/OriHoch/python-hebrew-numbers), the data comes from a repo we vendor, so you have to do some manual enumeration.

More info: https://docs.python.org/3.6/distutils/setupscript.html#installing-package-data
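
A minimal sketch of that setup (names are illustrative):

# setup.py: ship the data inside the package
from setuptools import setup, find_packages

setup(
    name='mypkg',
    packages=find_packages(),
    include_package_data=True,  # pulls in files matched by MANIFEST.in
)

# mypkg/data.py: locate the file relative to the module, no pkg_resources
import os

HERE = os.path.dirname(os.path.abspath(__file__))
WORDS = os.path.join(HERE, 'data', 'words.txt')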

warsaw commented 6 years ago

At this point I'm pretty well convinced pkg_resources can't be fixed; it needs to be replaced. And it should be replaced not by another monolithic package, but by smaller packages that each do a piece of what pkg_resources tries to do. That's why we wrote importlib.resources for 3.7 (and the importlib_resources backport), and why things like fastentrypoints are a good thing.

With Python 3, we shouldn't need pkg_resources for namespace packages; we should just get on with it, ditch Python 2, and adopt native namespace packages. The next thing I want to look at is a better way of extracting version numbers from libraries and applications. I have some thoughts here, but haven't yet put pen to paper.
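
For the namespace-package part, PEP 420 makes the pkg_resources machinery unnecessary (a sketch; project and package names are illustrative):

# native namespace packages: neither distribution ships mynamespace/__init__.py
#   project-a/mynamespace/subpkg_a/__init__.py
#   project-b/mynamespace/subpkg_b/__init__.py
# with both projects on sys.path, the namespace is stitched together automatically:
import mynamespace.subpkg_a
import mynamespace.subpkg_b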

konstantint commented 6 years ago

@warsaw importlib.resources does indeed solve the resource_filename problem perfectly (which covers about half of my own uses of pkg_resources), but what should we do with entry points (which make up the other half)? Do you know of a dedicated module for implementing these, or are there any plans for one?

anarcat commented 6 years ago

@konstantint what's wrong with fastentrypoints? couldn't that make it into the stdlib eventually?

dsully commented 6 years ago

There is also http://entrypoints.readthedocs.io - which has some entrypoint helpers.

ninjaaron commented 6 years ago

@anarcat ha! I doubt it. fastentrypoints monkey patches setuptools, which also isn't in the standard library!

fastentrypoints is really just a hack-around for situations where you're not using wheels, i.e. for development testing (something installed with pip install -e), or if someone else is creating a package and you don't have control over whether they use wheel or setup.py (many Linux distros have scripts for automatically generating their package format from Python packages that, unfortunately, don't use wheel).

For uploading to pypi, you should already be building with wheel, which solves the same problem (I even stole the script fastentrypoints generates directly from wheel... Or maybe one of the related packages, possibly distlib...).

I definitely think of fastentrypoints as a sort of stop-gap until someone fixes this broken packaging system for real.

warsaw commented 6 years ago

We're experimenting with fep, but yeah, I'm not sure that's going to be the long-term solution. Another thing to consider: IMHO we really, really want to get rid of setup.py and the whole imperative build system, which includes setuptools. I think PEP 517/518, pyproject.toml declarative build specifications, flit, etc. are the long-term way to go, so that's what we should be thinking about. I'm not as involved in distutils-sig (or maybe I try to ignore it as much as possible, and there are awesome people doing great work over there ;), but I think that's also the general consensus for where the ecosystem should be moving.

anarcat commented 6 years ago

while it'd be great to get rid of setup.py, i'd like to see this issue fixed without having to rebuild all of Python's distribution system, which could take a ... rather long time. :)

asottile commented 6 years ago

I recently tried out the brand new importlib-metadata and importlib.resources!

Here's an example PR which replaces pkg_resources.get_distribution(...).version and pkg_resources.resource_filename(...) with those modules: https://github.com/pre-commit/pre-commit/pull/846

For comparison, the "usable startup" time before vs. after:

before

$ time pre-commit --version
pre-commit 1.11.2

real    0m0.363s
user    0m0.303s
sys 0m0.043s

after

$ time pre-commit --version
pre-commit 1.11.2

real    0m0.254s
user    0m0.229s
sys 0m0.020s
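
The flavor of substitution made in that PR is roughly this (a sketch; the exact call sites are in the linked diff):

# before:
#   import pkg_resources
#   version = pkg_resources.get_distribution('pre-commit').version
# after:
import importlib_metadata  # stdlib importlib.metadata from Python 3.8

version = importlib_metadata.version('pre-commit')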

Thanks @jaraco @warsaw for these awesome improvements šŸ‘ šŸ‘ šŸ‘!

pganssle commented 6 years ago

I do not see anywhere in this issue an explanation for why pkg_resources cannot be fixed. Has anyone tried?

At this point I'm pretty well convinced pkg_resources can't be fixed, it needs to be replaced. And it should be replaced not by another monolithic package but by smaller packages that each do a piece of what pkg_resources tries to do.

@warsaw I tend to agree with switching to a unix philosophy type design, but can this particular issue about the eager path enumeration be fixed?

asottile commented 6 years ago

The import-time cost could be fixed; however, anything that depends on any API which uses the working_set will still incur the huge cost at some point. Constructing the working_set basically involves querying the filesystem for every sys.path entry (and then reading a bunch of metadata off disk). For example, iter_entry_points requires that every distribution be available so that it can be checked for a specific class of entry point (meaning things like flake8 / pytest, which use entry points as a plugin system, will still have to have something do all of that work).
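
For example, a plugin loader along these lines cannot avoid building the working set no matter how lazily it is constructed (a sketch; the entry-point group name is illustrative):

import pkg_resources  # importing already builds the master WorkingSet

# iterating a group must consult the metadata of every visible distribution:
for ep in pkg_resources.iter_entry_points('myapp.plugins'):
    plugin = ep.load()  # may additionally resolve/validate requirements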

warsaw commented 5 years ago

I think the problem is that pkg_resources has some very subtle semantics, it's used throughout the ecosystem, and somewhere somebody probably depends on the current behavior. Can it be fixed without breaking things? I'm doubtful, but all I know is that I have no interest in working on that. :)

ninjaaron commented 5 years ago

If that's the case, the answer is deprecation (and warnings) with some kind of forward-compatible kludge that implements the correct behavior.

As far as I'm concerned, backward compatibility is not an eternal obligation; you just have to give advance notice when you break it.


pganssle commented 5 years ago

@ninjaaron It's unclear who would be relying on this and why, and it's notoriously hard to warn about any default behavior (most people who do the default thing will probably not care about one specific behavior or another; plus, you generally want the default behavior to be the sane one - it is very hard to get people to opt in to that).

We can start by making this behavior occur lazily only when needed, but a full understanding of what operations use the working set and why people use them would be needed to properly optimize it.

navytux commented 5 years ago

It is sad to see this bug is going to stay with us for a long time.

pganssle commented 5 years ago

@navytux I think it's just that it's hard to fix. If you would like to take a crack at this it would be awesome to make this happen lazily.

lorencarvalho commented 5 years ago

@pganssle it's been a while since I looked, but iirc the issue is that building the WorkingSet underpins all other functionality, so if pkg_resources is on the critical path for your application (for example, finding the entry point), you are going to take the performance penalty whether it's lazily computed or not.

gaborbernat commented 5 years ago

As long as console entry points require the validation of all packages in the chain, I expect startup to be somewhat slow.

@jaraco is this right? do we require this for console entry points too? Note that if packages are installed as a wheel, the user sidesteps this, as pip does not generate a script that tries to do this validation (bumped into this via https://github.com/pypa/virtualenv/pull/1300)