Closed: tgolsson closed this issue 10 months ago.

Hey! Not sure if actionable, but maybe there's something here that can be done. I was investigating another issue today and ended up seeing a very slow Pants package step (~5 minutes). The issue reproduces with the simple command line `pex -vvv torch>=2 -o t2.2.pex`. This takes ~280 seconds on my machine, of which ~210-220 seconds is spent purely in the zip step. This turns out to be a 2.5 GB PEX, which admittedly is on the fat side. Unzipping this beast takes ~30 seconds, and zipping it with regular `zip` takes ~230 seconds. `zip -1` takes ~100 seconds and adds ~10% to the size; `zip -0` takes 12 seconds but doubles the size.

Seeing as compression adds the majority of the runtime, I did a very quick hack (outside of Pex) where I moved the compress step to a process pool (since it's CPU-heavy). With that, I get ~30 seconds at level 1, or ~60 seconds at level 6, so a 3-4x speed increase. It may be possible to push this a bit higher by playing with ordering. I also played around with the store-only-by-suffix capabilities, but it seems the .so files make up the bulk of both the compression potential and the compression time: only compressing text-like files gives a ~4.3 GB zip in 20 seconds.

With all that said, I'm mostly curious whether this is something that has been discussed elsewhere (I found nothing while searching), and what kind of solution might be palatable relative to the gains that can be made. I'm willing to contribute something based on the work I've done so far, or to investigate other suggested approaches.
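For the curious, a rough, minimal sketch of the kind of process-pool hack described above (not the actual patch; it leans on undocumented `zipfile` internals, keeps whole payloads in memory, skips ZIP64 handling, and uses hypothetical input names):

```python
import zipfile
import zlib
from concurrent.futures import ProcessPoolExecutor

def deflate(path):
    # CPU-bound work, done in a worker process: raw DEFLATE (wbits=-15),
    # which is the stream format zip entries expect.
    with open(path, "rb") as fp:
        data = fp.read()
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    payload = compressor.compress(data) + compressor.flush()
    return path, len(data), zlib.crc32(data) & 0xFFFFFFFF, payload

def parallel_zip(out_path, paths):
    # Splice the pre-compressed payloads into the archive. This pokes at
    # zipfile internals (fp, start_dir, filelist), so it is illustrative only.
    with ProcessPoolExecutor() as pool, zipfile.ZipFile(out_path, "w") as zf:
        for path, size, crc, payload in pool.map(deflate, paths):
            info = zipfile.ZipInfo.from_file(path)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.file_size, info.CRC, info.compress_size = size, crc, len(payload)
            info.header_offset = zf.fp.tell()
            zf.fp.write(info.FileHeader(False))
            zf.fp.write(payload)
            zf.start_dir = zf.fp.tell()  # central directory goes after the last entry
            zf.filelist.append(info)
            zf.NameToInfo[info.filename] = info

if __name__ == "__main__":
    # Hypothetical inputs; real code would walk the PEX chroot.
    parallel_zip("out.pex", ["lib_a.so", "lib_b.so"])
```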
This has come up before. Two concrete results are the support for `--layout packed` introduced in #1431 / #1438 and the `--no-compress` Pex build option introduced in #1705. The associated issues have more discussion.

If neither `--layout packed`, which amortizes the slow zip to once per wheel and is used by Pants internally for this and other reasons, nor `--no-compress` are satisfactory, the only other approaches I can see are along the lines of `--layout packed`, but made usable for monolithic PEX zips. @cosmicexplorer explored both and came up wanting. I think #2158 is probably the best entrypoint into that work.
I'll have a peek at those, thanks. We already use `layout=packed` (+ `execution_mode=venv`) in some situations. In the specific case where I hit this I was running a `python_source`, where I don't have control over that, and I'm not sure what the default is. The timings seem to end up the same as with the command posted, though.
`--no-compress` could work, I think, if I can pass it into Pants somewhere. Most of our pex building (with GPU wheels) is either to execute the PEX immediately or to unpack it into a container. We have only one use case for pex-at-rest, and that is a fraction of the size of these big GPU packages.
I still do think there is great value to being performant "by default" though, but maybe my effort is better invested into contributing to the already existing work by @cosmicexplorer -- will see if there's anything I can do there.
> I still do think there is great value to being performant "by default" though
I agree there, but the only real solution for that is faster zip support. FWICT that is a problem for native code and not really related to Pex at all. With that implemented though, Pex - and many other tools - could benefit.
To be honest though, I think trying to make Pex - or any zipapp implementation - faster for behemoths like PyTorch is fighting the wrong battle altogether. I imagine a much "simpler" way to do this is to not use a zipapp. For example, one might imagine a scie that contained all the resolved wheels for a zipapp - not pre-installed wheels like PEXes contain, but the actual wheel files downloaded from PyPI. The scie could then use PBS's Python distributions' support for `-mvenv` to create a venv and install the contained wheels. This would mean there is zero compression time or effort spent packaging the scie, since the wheels are used as-is and just cat'ed to the scie, and there is only the one-time install cost of unzipping.

Alternatively, instead of the scie containing raw wheel files, a PEX could. Pex would then need to learn how to install wheels at runtime, though; currently it lets Pip do this at build time. In this way the `.whl` contents of a PEX could be STORED by default.
> I agree there, but the only real solution for that is faster zip support. FWICT that is a problem for native code and not really related to Pex at all. With that implemented though, Pex - and many other tools - could benefit.
That is also an option, and it looks like it was explored fairly well. I will see if that can be landed; it'd definitely be good. My approach is Python-native, but probably a lot hackier since it depends a lot on zipfile internals.
> To be honest though, I think trying to make Pex - or any zipapp implementation - faster for behemoths like PyTorch is fighting the wrong battle altogether. I imagine a much "simpler" way to do this is to not use a zipapp. For example, one might imagine a scie that contained all the resolved wheels for a zipapp - not pre-installed wheels like PEXes contain, but the actual wheel files downloaded from PyPI. The scie could then use PBS's Python distributions' support for `-mvenv` to create a venv and install the contained wheels. This would mean there is zero compression time or effort spent packaging the scie, since the wheels are used as-is and just cat'ed to the scie, and there is only the one-time install cost of unzipping.
I think my stance on torch is that whatever they do, doing the opposite is likely better. My life (and yours, by extension) would be a lot better if we didn't have to think about why they decided to ship a whole copy of CUDA in their wheels, or why their native component is larger than the Linux kernel when built 🤷 Inexplicably, the situation is even worse now that more of CUDA is on PyPI.
> Alternatively, instead of the scie containing raw wheel files, a PEX could. Pex would then need to learn how to install wheels at runtime, though; currently it lets Pip do this at build time. In this way the `.whl` contents of a PEX could be STORED by default.
Hmm. That doesn't sound half bad, at least for some use cases. I guess it'd be almost the same size as well, since zip only compresses per-file. A wheel install is pretty much guaranteed to be isolated, right? I'm not sure I can fully see the implications for Pants though, or how it'd end up working in every situation (`pants package` vs `run` vs `export` ...).
> Hmm. That doesn't sound half bad, at least for some use cases. I guess it'd be almost the same size as well, since zip only compresses per-file. A wheel install is pretty much guaranteed to be isolated, right? I'm not sure I can fully see the implications for Pants though, or how it'd end up working in every situation (`pants package` vs `run` vs `export` ...).
This would be opaque to all Pex users at runtime. The PEX zipapp would contain STORED, unadulterated `.whl` files instead of today's DEFLATED installed-wheel chroots, and the packed layout would use unadulterated `.deps/X.whl` files instead of today's zipped-up installed-wheel chroots. At runtime, new Pex installer code would install from these internal files (unzip + spread as per https://packaging.python.org/en/latest/specifications/binary-distribution-format/#installing-a-wheel-distribution-1-0-py32-none-any-whl ... plus a little more, since that spec is actually wanting for how console scripts are handled in the wild) into `~/.pex/installed_wheels` (and then create a venv from there if using `--venv`), exactly as today.
I really do think this is the right way to go. Don't speed up zipping; avoid unzipping (installing wheels at build time) + zipping (back into a PEX zipapp or packed-layout ~wheel zips) altogether. There will still be an unzip on a cold cache for the 1st boot at runtime, but since `zipfile.ZipFile(zipfile.ZipFile("the.pex").open(".deps/X.whl")).extractall("here")` works and is efficient, this should be ~the same PEX 1st-boot install time as today.
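Spelled out as a runnable sketch (same placeholder names as in the snippet above):

```python
import zipfile

# Extract an inner STORED .whl straight out of the outer PEX zip without
# writing the wheel to disk first; ZipExtFile is seekable, which is all
# ZipFile needs to read the nested archive.
with zipfile.ZipFile("the.pex") as pex_zip:
    with pex_zip.open(".deps/X.whl") as whl_stream:
        with zipfile.ZipFile(whl_stream) as whl_zip:
            whl_zip.extractall("here")
```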
I experimented enough writing a PEP-427 installer today to see that it works, but you need to handle generating console scripts, since `.whl`s in the wild, for the most part, don't actually carry these in `proj-rev.data/scripts/...` as you'd hope they would given PEP-427.
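For illustration, a hedged sketch of that console-script generation step (my simplification, not Pex's actual installer code; it assumes `module:attr` entry specs): read `[console_scripts]` entries from the wheel's `entry_points.txt` and emit an executable launcher for each.

```python
import configparser
import os
import stat

# Illustrative launcher template; real installers also handle extras
# markers, Windows .exe shims, etc.
LAUNCHER = """\
#!{python}
import importlib
import sys

obj = importlib.import_module("{module}")
for part in "{attr}".split("."):
    obj = getattr(obj, part)

if __name__ == "__main__":
    sys.exit(obj())
"""

def write_console_scripts(entry_points_txt, bin_dir, python="/usr/bin/python3"):
    parser = configparser.ConfigParser()
    parser.optionxform = str  # script names are case-sensitive
    parser.read(entry_points_txt)
    if not parser.has_section("console_scripts"):
        return
    for name, spec in parser.items("console_scripts"):
        module, _, attr = spec.partition(":")
        script = os.path.join(bin_dir, name)
        with open(script, "w") as fp:
            fp.write(LAUNCHER.format(python=python, module=module, attr=attr))
        os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```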
@tgolsson I won't have solid time until the 23rd-28th, but I think I can get this knocked out and released then. I'm not sure exactly how to spell the feature activation - perhaps two new `--layout` options, one for zipapp and one for spread - but that's not too important as long as no existing users / PEX_ROOT caches are broken.
That sounds very good. My concern with Pants is mostly how far away from `pants <goal>` a potential error can occur, since I assume there are issues that could surface only when installing wheels. But since adding this feature to Pants would require work anyway, that's not going to be an immediate problem - and I'm guessing this would be opt-in per target either way.
It also seems like a good feature for Pex, regardless of Pants usage.
Noting I did not complete this during the current work stretch. It will be picked back up on December 10th when I start my next work stretch.
This should completely side-step the need for #2158 since it does better than that approach ever could by avoiding zipping altogether (and unzipping as well!).
Ok, circling back to the OP using #2298:
$ rm -rf ~/.pex/installed_wheels/
$ time python3.11 -mpex -v torch==2.1.1 -o t2.2.pex
...
pex: Building pex: 20905.4ms
pex: Adding distributions from pexes: : 0.0ms
pex: Resolving distributions for requirements: torch==2.1.1: 20902.6ms
pex: Resolving requirements.: 20902.5ms
pex: Resolving for:
/usr/bin/python3.11: 8135.2ms
pex: Calculating project names for direct requirements:
PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
pex: Installing 22 distributions: 10994.5ms
pex: Checking install: 2.2ms
pex: Configuring PEX dependencies: 2.3ms
Saving PEX file to t2.2.pex
Previous binary unexpectedly exists, cleaning: t2.2.pex
pex: Zipping PEX file.: 167895.1ms
/home/jsirois/dev/pantsbuild/jsirois-pex/pex/pex_builder.py:113: PEXWarning: The PEX zip at t2.2.pex~ is not a valid zipapp: Could not find the `__main__` module.
This is likely due to the zip requiring ZIP64 extensions due to size or the
number of file entries or both. You can work around this limitation in Python's
`zipimport` module by re-building the PEX with `--layout packed` or
`--layout loose`.
pex_warnings.warn(message)
And with `--no-pre-install-wheels`:
$ rm -rf ~/.pex/installed_wheels/
$ python3.11 -mpex -v torch==2.1.1 --no-pre-install-wheels -o t2.2.pex
...
pex: Building pex: 10125.3ms
pex: Adding distributions from pexes: : 0.0ms
pex: Resolving distributions for requirements: torch==2.1.1: 10123.1ms
pex: Resolving requirements.: 10123.1ms
pex: Resolving for:
/usr/bin/python3.11: 8274.5ms
pex: Calculating project names for direct requirements:
PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
pex: Checking build: 2.1ms
pex: Configuring PEX dependencies: 1.7ms
Saving PEX file to t2.2.pex
pex: Zipping PEX file.: 3173.1ms
/home/jsirois/dev/pantsbuild/jsirois-pex/pex/pex_builder.py:113: PEXWarning: The PEX zip at t2.2.pex~ is not a valid zipapp: Could not find the `__main__` module.
This is likely due to the zip requiring ZIP64 extensions due to size or the
number of file entries or both. You can work around this limitation in Python's
`zipimport` module by re-building the PEX with `--layout packed` or
`--layout loose`.
pex_warnings.warn(message)
So that's:

| | Status quo | Using `--no-pre-install-wheels` |
|---|---|---|
| Pre-install time (~unzip) | 10.99s | N/A |
| Zip time | 167.89s | 3.17s |
| Size (bytes) | 2680106601 | 2677995839 |
Of course, this is not a great example, since the resulting PEX cannot be run (as the warning shown in both outputs above indicates); so we can't examine the tradeoff in the 1st-boot runtime penalty for installing the wheels just in time.
And, using the OP command, but with `--layout packed --venv --venv-site-packages-copies`, which is required to work around the zipapp size issue and the indirect nvidia dependencies' failure to properly use namespace packages:
$ rm -rf ~/.pex/installed_wheels/ ~/.pex/packed_wheels/
$ python3.11 -mpex -v torch==2.1.1 --venv --venv-site-packages-copies --layout packed -o t2.2.pex
...
pex: Building pex: 20589.6ms
pex: Adding distributions from pexes: : 0.0ms
pex: Resolving distributions for requirements: torch==2.1.1: 20586.9ms
pex: Resolving requirements.: 20586.9ms
pex: Resolving for:
/usr/bin/python3.11: 8686.9ms
pex: Calculating project names for direct requirements:
PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
pex: Installing 22 distributions: 10215.5ms
pex: Checking install: 1.7ms
pex: Configuring PEX dependencies: 2.2ms
Saving PEX file to t2.2.pex
pex: Zipping PEX .bootstrap/ code.: 86.5ms
pex: Zipping 22 distributions.: 172517.1ms
$ du -sb t2.2.pex/
2679282217 t2.2.pex/
$ python3.11 -mpex -v torch==2.1.1 --venv --venv-site-packages-copies --layout packed -o t2.2.pex
...
pex: Building pex: 12982.0ms
pex: Adding distributions from pexes: : 0.1ms
pex: Resolving distributions for requirements: torch==2.1.1: 12979.3ms
pex: Resolving requirements.: 12979.2ms
pex: Resolving for:
/usr/bin/python3.11: 8217.2ms
pex: Calculating project names for direct requirements:
PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
pex: Installing 22 distributions: 3051.5ms
pex: Checking install: 1.8ms
pex: Configuring PEX dependencies: 2.2ms
Saving PEX file to t2.2.pex
pex: Zipping PEX .bootstrap/ code.: 0.0ms
pex: Zipping 22 distributions.: 0.4ms
$ du -sb t2.2.pex/
2679282217 t2.2.pex/
And with `--no-pre-install-wheels` (~the same for warm and cold cases):
$ python3.11 -mpex -v torch==2.1.1 --venv --venv-site-packages-copies --layout packed --no-pre-install-wheels -o t2.2.whls.pex
...
pex: Building pex: 10429.3ms
pex: Adding distributions from pexes: : 0.0ms
pex: Resolving distributions for requirements: torch==2.1.1: 10427.3ms
pex: Resolving requirements.: 10427.2ms
pex: Resolving for:
/usr/bin/python3.11: 8666.5ms
pex: Calculating project names for direct requirements:
PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
pex: Checking build: 1.7ms
pex: Configuring PEX dependencies: 1.7ms
Saving PEX file to t2.2.whls.pex
pex: Zipping PEX .bootstrap/ code.: 91.7ms
pex: Copying 22 distributions.: 0.2ms
$ du -sb t2.2.whls.pex/
2678537958 t2.2.whls.pex/
So that's:

| | Status quo (cold) | Status quo (warm) | Using `--no-pre-install-wheels` |
|---|---|---|---|
| Pre-install time (~unzip) | 10.22s | N/A | N/A |
| Zip / copy time | 172.52s | 0.4s | 0.2s |
| Size (bytes) | 2679282217 | 2679282217 | 2678537958 |
And at runtime:
$ hyperfine \
-w2 \
-p 'rm -rf ~/.pex/unzipped_pexes ~/.pex/venvs' \
-p 'rm -rf ~/.pex/unzipped_pexes ~/.pex/venvs' \
-p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
-p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
-p '' \
-p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
-p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
-p '' \
-n 'Status quo warm 1st' \
-n 'Status quo warm 1st parallel' \
-n 'Status quo cold 1st' \
-n 'Status quo cold 1st parallel' \
-n 'Status quo hot' \
-n 'With --no-pre-install-wheels 1st' \
-n 'With --no-pre-install-wheels 1st parallel' \
-n 'With --no-pre-install-wheels hot' \
't2.2.pex/__main__.py -c "import torch"' \
'PEX_MAX_INSTALL_JOBS=0 t2.2.pex/__main__.py -c "import torch"' \
't2.2.pex/__main__.py -c "import torch"' \
'PEX_MAX_INSTALL_JOBS=0 t2.2.pex/__main__.py -c "import torch"' \
't2.2.pex/__main__.py -c "import torch"' \
't2.2.whls.pex/__main__.py -c "import torch"' \
'PEX_MAX_INSTALL_JOBS=0 t2.2.whls.pex/__main__.py -c "import torch"' \
't2.2.whls.pex/__main__.py -c "import torch"'
Benchmark 1: Status quo warm 1st
Time (mean ± σ): 5.765 s ± 0.040 s [User: 5.017 s, System: 0.734 s]
Range (min … max): 5.717 s … 5.853 s 10 runs
Benchmark 2: Status quo warm 1st parallel
Time (mean ± σ): 5.991 s ± 0.035 s [User: 7.267 s, System: 0.885 s]
Range (min … max): 5.952 s … 6.054 s 10 runs
Benchmark 3: Status quo cold 1st
Time (mean ± σ): 26.737 s ± 0.338 s [User: 24.027 s, System: 2.683 s]
Range (min … max): 26.307 s … 27.365 s 10 runs
Benchmark 4: Status quo cold 1st parallel
Time (mean ± σ): 12.790 s ± 0.141 s [User: 30.314 s, System: 3.424 s]
Range (min … max): 12.549 s … 12.969 s 10 runs
Benchmark 5: Status quo hot
Time (mean ± σ): 889.1 ms ± 4.9 ms [User: 815.3 ms, System: 68.5 ms]
Range (min … max): 883.1 ms … 898.3 ms 10 runs
Benchmark 6: With --no-pre-install-wheels 1st
Time (mean ± σ): 29.602 s ± 0.137 s [User: 26.534 s, System: 3.034 s]
Range (min … max): 29.480 s … 29.955 s 10 runs
Benchmark 7: With --no-pre-install-wheels 1st parallel
Time (mean ± σ): 14.062 s ± 0.245 s [User: 34.360 s, System: 3.842 s]
Range (min … max): 13.780 s … 14.540 s 10 runs
Benchmark 8: With --no-pre-install-wheels hot
Time (mean ± σ): 882.1 ms ± 4.0 ms [User: 810.3 ms, System: 66.7 ms]
Range (min … max): 874.7 ms … 889.1 ms 10 runs
Summary
With --no-pre-install-wheels hot ran
1.01 ± 0.01 times faster than Status quo hot
6.54 ± 0.05 times faster than Status quo warm 1st
6.79 ± 0.05 times faster than Status quo warm 1st parallel
14.50 ± 0.17 times faster than Status quo cold 1st parallel
15.94 ± 0.29 times faster than With --no-pre-install-wheels 1st parallel
30.31 ± 0.41 times faster than Status quo cold 1st
33.56 ± 0.22 times faster than With --no-pre-install-wheels 1st
So, in summary (assuming resolve times for the build and run cases are equal and so are ignored):

| | Status quo | With `--no-pre-install-wheels` | `--no-pre-install-wheels` savings |
|---|---|---|---|
| Cold build and run 1st local machine | 188.51s | 29.80s | 84% faster |
| Cold run 1st remote machine | 26.74s | 29.60s | 11% slower |
| Cold run 1st remote machine parallel | 12.79s | 14.06s | 10% slower |
| Size (bytes) | 2679282217 | 2678537958 | 0.02% smaller |
This means that for local, internal-only use, `--no-pre-install-wheels` is always a win. Important examples are Pants's Python backend use case and @cosmicexplorer's case in #2158 of local iteration on an ML / data science project.

For cases where remote-deployment cold 1st-run start time is important (legacy lambdex use cases come to mind), `--no-pre-install-wheels` will always be a small loss.
For other cases the perf is a wash and more localized analysis is needed to decide which set of options to use.
The analysis above is at the extreme large end of PEX sizes (~2GB). I'll add the same analysis below for the extreme small end (a cowsay PEX) to button this up, assuming ~linearity between the two extremes.
Ok, for a small case I used cowsay and ansicolors deps with this 93-byte main.py and driver scripts:
$ ./build-cowsay.sh && ./perf-cowsay.sh
Benchmark 1: Build zipappi (cold)
Time (mean ± σ): 1.146 s ± 0.028 s [User: 1.075 s, System: 0.161 s]
Range (min … max): 1.110 s … 1.189 s 10 runs
Benchmark 2: Build .whl zipapp (cold)
Time (mean ± σ): 1.047 s ± 0.026 s [User: 0.914 s, System: 0.131 s]
Range (min … max): 1.011 s … 1.081 s 10 runs
Benchmark 3: Build packed (cold)
Time (mean ± σ): 1.125 s ± 0.016 s [User: 1.073 s, System: 0.136 s]
Range (min … max): 1.109 s … 1.167 s 10 runs
Benchmark 4: Build .whl packed (cold)
Time (mean ± σ): 1.034 s ± 0.008 s [User: 0.893 s, System: 0.140 s]
Range (min … max): 1.017 s … 1.042 s 10 runs
Benchmark 5: Build loose (cold)
Time (mean ± σ): 1.077 s ± 0.010 s [User: 1.030 s, System: 0.131 s]
Range (min … max): 1.062 s … 1.094 s 10 runs
Benchmark 6: Build .whl loose (cold)
Time (mean ± σ): 995.2 ms ± 17.7 ms [User: 852.2 ms, System: 142.5 ms]
Range (min … max): 972.2 ms … 1028.8 ms 10 runs
Benchmark 7: Build zipappi (warm)
Time (mean ± σ): 413.8 ms ± 12.5 ms [User: 370.8 ms, System: 43.0 ms]
Range (min … max): 399.5 ms … 437.5 ms 10 runs
Benchmark 8: Build .whl zipapp (warm)
Time (mean ± σ): 401.1 ms ± 5.4 ms [User: 345.5 ms, System: 55.5 ms]
Range (min … max): 396.0 ms … 415.1 ms 10 runs
Benchmark 9: Build packed (warm)
Time (mean ± σ): 351.6 ms ± 2.9 ms [User: 314.1 ms, System: 37.3 ms]
Range (min … max): 348.6 ms … 357.1 ms 10 runs
Benchmark 10: Build .whl packed (warm)
Time (mean ± σ): 354.5 ms ± 11.4 ms [User: 315.7 ms, System: 38.5 ms]
Range (min … max): 343.2 ms … 372.2 ms 10 runs
Benchmark 11: Build loose (warm)
Time (mean ± σ): 358.2 ms ± 2.5 ms [User: 307.2 ms, System: 50.5 ms]
Range (min … max): 354.3 ms … 364.1 ms 10 runs
Benchmark 12: Build .whl loose (warm)
Time (mean ± σ): 365.2 ms ± 19.3 ms [User: 314.1 ms, System: 51.2 ms]
Range (min … max): 352.7 ms … 415.4 ms 10 runs
Summary
Build packed (warm) ran
1.01 ± 0.03 times faster than Build .whl packed (warm)
1.02 ± 0.01 times faster than Build loose (warm)
1.04 ± 0.06 times faster than Build .whl loose (warm)
1.14 ± 0.02 times faster than Build .whl zipapp (warm)
1.18 ± 0.04 times faster than Build zipappi (warm)
2.83 ± 0.06 times faster than Build .whl loose (cold)
2.94 ± 0.03 times faster than Build .whl packed (cold)
2.98 ± 0.08 times faster than Build .whl zipapp (cold)
3.06 ± 0.04 times faster than Build loose (cold)
3.20 ± 0.05 times faster than Build packed (cold)
3.26 ± 0.08 times faster than Build zipappi (cold)
709130 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.zipapp.whls.pex
714166 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.zipapp.pex
721772 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.packed.whls.pex
723960 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.packed.pex
2543013 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.loose.whls.pex
2670261 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.loose.pex
Benchmark 1: Run zipapp cold
Time (mean ± σ): 433.1 ms ± 17.8 ms [User: 383.9 ms, System: 48.6 ms]
Range (min … max): 417.3 ms … 476.7 ms 10 runs
Benchmark 2: Run .whl zipapp cold
Time (mean ± σ): 511.4 ms ± 8.2 ms [User: 469.1 ms, System: 41.9 ms]
Range (min … max): 497.8 ms … 524.0 ms 10 runs
Benchmark 3: Run packed cold
Time (mean ± σ): 422.3 ms ± 5.1 ms [User: 375.7 ms, System: 46.3 ms]
Range (min … max): 413.4 ms … 429.8 ms 10 runs
Benchmark 4: Run .whl packed cold
Time (mean ± σ): 504.6 ms ± 7.0 ms [User: 455.2 ms, System: 49.0 ms]
Range (min … max): 493.8 ms … 515.9 ms 10 runs
Benchmark 5: Run loose cold
Time (mean ± σ): 239.7 ms ± 6.5 ms [User: 212.8 ms, System: 26.5 ms]
Range (min … max): 231.2 ms … 256.2 ms 12 runs
Benchmark 6: Run .whl loose cold
Time (mean ± σ): 332.3 ms ± 5.1 ms [User: 285.4 ms, System: 46.7 ms]
Range (min … max): 326.7 ms … 340.5 ms 10 runs
Benchmark 7: Run zipapp cold (parallel)
Time (mean ± σ): 550.6 ms ± 4.4 ms [User: 551.2 ms, System: 55.1 ms]
Range (min … max): 544.3 ms … 556.6 ms 10 runs
Benchmark 8: Run .whl zipapp coldi (parallel)
Time (mean ± σ): 586.3 ms ± 5.2 ms [User: 616.6 ms, System: 65.1 ms]
Range (min … max): 581.7 ms … 595.8 ms 10 runs
Benchmark 9: Run packed cold (parallel)
Time (mean ± σ): 545.6 ms ± 8.2 ms [User: 551.4 ms, System: 50.6 ms]
Range (min … max): 536.5 ms … 561.9 ms 10 runs
Benchmark 10: Run .whl packed cold (parallel)
Time (mean ± σ): 580.6 ms ± 4.8 ms [User: 608.2 ms, System: 64.9 ms]
Range (min … max): 573.0 ms … 588.4 ms 10 runs
Benchmark 11: Run loose cold (parallel)
Time (mean ± σ): 232.4 ms ± 2.3 ms [User: 211.8 ms, System: 20.3 ms]
Range (min … max): 229.4 ms … 237.2 ms 12 runs
Benchmark 12: Run .whl loose cold (parallel)
Time (mean ± σ): 411.7 ms ± 2.4 ms [User: 449.2 ms, System: 56.2 ms]
Range (min … max): 407.8 ms … 416.1 ms 10 runs
Summary
Run loose cold (parallel) ran
1.03 ± 0.03 times faster than Run loose cold
1.43 ± 0.03 times faster than Run .whl loose cold
1.77 ± 0.02 times faster than Run .whl loose cold (parallel)
1.82 ± 0.03 times faster than Run packed cold
1.86 ± 0.08 times faster than Run zipapp cold
2.17 ± 0.04 times faster than Run .whl packed cold
2.20 ± 0.04 times faster than Run .whl zipapp cold
2.35 ± 0.04 times faster than Run packed cold (parallel)
2.37 ± 0.03 times faster than Run zipapp cold (parallel)
2.50 ± 0.03 times faster than Run .whl packed cold (parallel)
2.52 ± 0.03 times faster than Run .whl zipapp coldi (parallel)
The summary is:

- `.whl` builds are slightly faster than the status quo, as expected (no unzipping, and, for zipapp and packed, no re-zipping is required).
- `.whl` 1st cold runs are slightly slower than the status quo, as expected (there is an extra install step at runtime).