Delay evaluation of re.compile to improve startup time.

chrahunt commented 5 years ago

What's the problem this feature will solve?

Speed up pip startup time, as motivated by #4768.

Describe the solution you'd like

Globally evaluate calls to re.compile lazily, only executing when needed.

Alternative Solutions

Manually delaying evaluation by refactoring code that uses re.compile at global or class scope to do so as-needed.
Create and use an explicit function for lazy-evaluated regex in our code.

A quick grep in pip/_internal shows 21 calls in our code, out of the 220 total mentioned below, so neither of these are likely to have a large impact.

Additional context

Profiling with cProfile indicates that a non-negligible part of our startup time is due to calls to re.compile. Evaluating it lazily shows some improvement. Just running pip, we see:

Metric	Before	After
all function calls	295461	216540
total time	0.224 s	0.182 s
calls to `re.compile`	220	32
time in `re.compile`	0.064 s	0.024 s

The times will always be machine/hardware-dependent, but it seems worth evaluating as an option just from the reduction in number of function calls.

To reproduce, run

test.sh

``` #!/bin/sh set -x d="$(mktemp -d)" cd "$d" python -m cProfile -o pip.cprof -m pip >/dev/null python -c 'from pstats import Stats; Stats("pip.cprof").print_stats("re.py.*$compile$")' # just for testing without requiring file patching cat < sitecustomize.py import re class Proxy: def __init__(self, source): self.__source = source self.__created = False def __getattribute__(self, name): if name.startswith('_Proxy_'): return object.__getattribute__(self, name) if not self.__created: self.__created = True self.__object = self.__source() return getattr(self.__object, name) old_compile = re.compile def re_compile(*args, **kwargs): return Proxy(lambda: old_compile(*args, **kwargs)) re.compile = re_compile EOF PYTHONPATH=$PWD python -m cProfile -o pip.cprof -m pip >/dev/null python -c 'from pstats import Stats; Stats("pip.cprof").print_stats("re.py.*$compile$")' ```

Example output

``` + mktemp -d + d=/tmp/user/1000/tmp.KtQjQEhJvT + cd /tmp/user/1000/tmp.KtQjQEhJvT + python -m cProfile -o pip.cprof -m pip + python -c from pstats import Stats; Stats("pip.cprof").strip_dirs().print_stats("re.py.*$compile$") Mon Jul 8 21:21:36 2019 pip.cprof 295461 function calls (285279 primitive calls) in 0.224 seconds Random listing order was used List reduced from 2370 to 1 due to restriction <'re.py.*\$compile\$'> ncalls tottime percall cumtime percall filename:lineno(function) 220 0.000 0.000 0.064 0.000 re.py:232(compile) + cat + PYTHONPATH=/tmp/user/1000/tmp.KtQjQEhJvT python -m cProfile -o pip.cprof -m pip + python -c from pstats import Stats; Stats("pip.cprof").strip_dirs().print_stats("re.py.*$compile$") Mon Jul 8 21:21:36 2019 pip.cprof 216540 function calls (208328 primitive calls) in 0.182 seconds Random listing order was used List reduced from 2367 to 1 due to restriction <'re.py.*\$compile\$'> ncalls tottime percall cumtime percall filename:lineno(function) 32 0.000 0.000 0.024 0.001 re.py:232(compile) ```

The important parts from the output show the numbers I included in the table above.

Run 1 (no change):

         295461 function calls (285279 primitive calls) in 0.224 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      220    0.000    0.000    0.064    0.000 re.py:232(compile)

Run 2 (lazy re.compile):

         216540 function calls (208328 primitive calls) in 0.182 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       32    0.000    0.000    0.024    0.001 re.py:232(compile)

cjerdonek commented 5 years ago

Do you have any idea how many of these compiled expressions are used during startup, and where they are defined? It seems like you could figure this out using your proxy class.

[Oh, I see now from the table it’s 32, though I still wonder which modules the unused ones are defined in, and what the used ones are needed for during startup.]

chrahunt commented 5 years ago

Here you go:

Defined

path	count
`/lib/python3.7/site-packages/pip/_vendor/pyparsing.py`	39
`/lib/python3.7/http/cookiejar.py`	20
`/lib/python3.7/site-packages/pip/_vendor/distlib/util.py`	15
`/lib/python3.7/site-packages/pip/_vendor/pytoml/parser.py`	13
`/lib/python3.7/html/parser.py`	12
`/lib/python3.7/platform.py`	10
`/lib/python3.7/configparser.py`	8
`/lib/python3.7/site-packages/pip/_vendor/pkg_resources/__init__.py`	6
`/lib/python3.7/logging/config.py`	6
`/lib/python3.7/email/feedparser.py`	5
`/lib/python3.7/textwrap.py`	5
`/lib/python3.7/_markupbase.py`	5
`/lib/python3.7/http/cookies.py`	4
`/lib/python3.7/site-packages/pip/_internal/vcs/subversion.py`	4
`/lib/python3.7/email/header.py`	3
`/lib/python3.7/email/utils.py`	3
`/lib/python3.7/distutils/sysconfig.py`	3
`/lib/python3.7/site-packages/pip/_vendor/packaging/version.py`	3
`/lib/python3.7/site-packages/pip/_vendor/packaging/specifiers.py`	3
`/lib/python3.7/json/encoder.py`	3
`/lib/python3.7/site-packages/pip/_vendor/chardet/universaldetector.py`	3
`/lib/python3.7/site-packages/pip/_internal/models/link.py`	3
`/lib/python3.7/site-packages/pip/_internal/wheel.py`	3
`/lib/python3.7/site-packages/pip/_internal/req/req_file.py`	3
`/lib/python3.7/glob.py`	2
`/lib/python3.7/http/client.py`	2
`/lib/python3.7/distutils/fancy_getopt.py`	2
`/lib/python3.7/plistlib.py`	2
`/lib/python3.7/site-packages/pip/_vendor/colorama/ansitowin32.py`	2
`/lib/python3.7/json/decoder.py`	2
`/lib/python3.7/urllib/request.py`	2
`/lib/python3.7/site-packages/pip/_vendor/requests/utils.py`	2
`/lib/python3.7/site-packages/pip/_internal/download.py`	2
`/lib/python3.7/site-packages/pip/_vendor/html5lib/_inputstream.py`	2
`/lib/python3.7/site-packages/pip/_vendor/html5lib/serializer.py`	2
`/lib/python3.7/locale.py`	1
`/lib/python3.7/string.py`	1
`/lib/python3.7/urllib/parse.py`	1
`/lib/python3.7/email/_encoded_words.py`	1
`/lib/python3.7/email/message.py`	1
`/lib/python3.7/gettext.py`	1
`/lib/python3.7/distutils/dist.py`	1
`/lib/python3.7/site-packages/pip/_vendor/packaging/utils.py`	1
`/lib/python3.7/shlex.py`	1
`/lib/python3.7/html/__init__.py`	1
`/lib/python3.7/json/scanner.py`	1
`/lib/python3.7/site-packages/pip/_vendor/idna/core.py`	1
`/lib/python3.7/encodings/idna.py`	1
`/lib/python3.7/site-packages/pip/_vendor/cachecontrol/controller.py`	1
`/lib/python3.7/site-packages/pip/_internal/utils/encoding.py`	1
`/lib/python3.7/xml/etree/ElementPath.py`	1
`/lib/python3.7/site-packages/pip/_internal/pep425tags.py`	1
`/lib/python3.7/site-packages/pip/_vendor/distlib/scripts.py`	1
`/lib/python3.7/site-packages/pip/_internal/index.py`	1
`/lib/python3.7/site-packages/pip/_vendor/pytoml/utils.py`	1
`/lib/python3.7/site-packages/pip/_internal/vcs/git.py`	1

Used

path	count
`/lib/python3.7/site-packages/pip/_vendor/pyparsing.py`	6
`/lib/python3.7/configparser.py`	5
`/lib/python3.7/site-packages/pip/_vendor/pkg_resources/__init__.py`	3
`/lib/python3.7/textwrap.py`	3
`/lib/python3.7/http/client.py`	2
`/lib/python3.7/site-packages/pip/_vendor/packaging/specifiers.py`	2
`/lib/python3.7/site-packages/pip/_vendor/packaging/version.py`	2
`/lib/python3.7/json/decoder.py`	2
`/lib/python3.7/_markupbase.py`	2
`/lib/python3.7/email/_encoded_words.py`	1
`/lib/python3.7/platform.py`	1
`/lib/python3.7/shlex.py`	1
`/lib/python3.7/http/cookies.py`	1

Generated with

test.sh

``` #!/bin/sh set -x d="$(mktemp -d)" cd "$d" cat <<'EOF' > sitecustomize.py import atexit import re import traceback from collections import Counter used = Counter() created = Counter() def format_counter(c): return ( '|path|count|\n' '|-|-|\n' + '\n'.join(f'|`{k}`|{v}|' for k, v in c.most_common()) ) def show_details(): print('**Defined**\n') print(format_counter(created)) print('\n**Used**\n') print(format_counter(used)) atexit.register(show_details) class Proxy: def __init__(self, key, origin, source): self.__key = key self.__origin = origin self.__source = source self.__created = False def __getattribute__(self, name): if name.startswith('_Proxy_'): return object.__getattribute__(self, name) if not self.__created: print(f'used {self.__origin}') used[self.__key] += 1 self.__created = True self.__object = self.__source() return getattr(self.__object, name) def get_origin(tb): caller = tb[-2] _, file, *_ = caller.split() start = file.index('/lib/') return caller, file[start:-2] old_compile = re.compile def re_compile(*args, **kwargs): origin, file = get_origin(traceback.format_stack()) print(origin) created[file] += 1 return Proxy(file, origin, lambda: old_compile(*args, **kwargs)) re.compile = re_compile EOF PYTHONPATH=$PWD python -m pip ```

cjerdonek commented 5 years ago

Thanks!

The reason I ask, by the way, is that there's another alternative, which is not to import modules / code that isn't needed for startup. Like, you could imagine pip having a minimal startup sub-package, where we put only the code needed for startup. That would give us greater control / visibility into what vendored libraries and standard library modules are needed for startup based on what (module-level) imports are in the startup sub-package.

chrahunt commented 5 years ago

I have tried tackling that, but the most difficult part for me is having enough and the right kind of testing to avoid regressions and increased overhead on development and code review.

Assuming this approach doesn't have major downsides, I don't think that they are mutually exclusive. It might be worth making a dedicated issue for each possible improvement so we can give each one focus in turn.

cjerdonek commented 5 years ago

I guess for me personally I'd be more interested in reviewing and seeing progress on the more general dependency / import issue. Like, why does pip's startup need to import http/cookiejar.py and its 20 regexes, and why does it need to e.g. load the whole download.py and index.py machinery? While it may have an effect, swapping out re.compile() seems more like it's masking / papering over a deeper issue that should be fixed. If our dependencies are structured right, there may be little need to do anything with re.compile(), and I think we'll find that we'll get lots of other benefits for free, too.

As for tackling the dependency issue, I think it's just a matter of deciding on a target for how we want things to be and an incremental plan to get there, and then to do the PR's one at a time. The PR's should mostly just be moving things. And those kinds of changes would I think already be covered by tests since things would fail quickly if e.g. a function was imported from the wrong module. We'd probably want to settle on some kind of light-weight convention or pattern where we import the command or operation code we need inside a function once the code knows what command needs to be executed. There are also many opportunities for doing things towards this goal that can be done even without an overall goal / plan, like moving utility functions out of download.py and other "refactoring" / simplification of the module dependency graph.

cjerdonek commented 5 years ago

Like, it seems like we shouldn't need to actually import all the command classes here just to get their name and summary in order to parse the command name: https://github.com/pypa/pip/blob/ddfa401dae5cdc473772f93951be2715e852681a/src/pip/_internal/commands/__init__.py#L27-L42

It looks like the command class can simply be imported here once the name is parsed from the arguments and known: https://github.com/pypa/pip/blob/ddfa401dae5cdc473772f93951be2715e852681a/src/pip/_internal/__init__.py#L76

pfmoore commented 5 years ago

I'd caution against getting sucked into looking at wider issues all at once. This is a nice, focused improvement which seems like it might help. Let's not block progress just for the sake of other potential improvements that may or may not happen. It's not like we can't do both.

So +1 from me on investigating this possibility further, and looking at other improvements in their own PRs.

cjerdonek commented 5 years ago

I filed a separate issue for what I was suggesting: https://github.com/pypa/pip/issues/6692. It's similarly focused and gets more at the root cause IMO as many of the re.compile lines will no longer be in the import path.

xavfernandez commented 5 years ago

Given that cpython automatically caches the last 512 regexes used, is there really a need to pre-compile all those regexes ?

pradyunsg commented 5 years ago

Related to #4768.

pradyunsg commented 5 years ago

is there really a need to pre-compile all those regexes?

Interesting. The idea here however, IIUC, is to defer the compiling to later, instead of pre-compiling.

pypa / pip

Delay evaluation of re.compile to improve startup time. #6690