pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

Pants test timeout includes venv creation overhead causing timeouts #19681

Open JoostvDoorn opened 1 year ago

JoostvDoorn commented 1 year ago

Describe the bug When a python_tests target has heavy dependencies (e.g. pytorch), a test can time out on a "cold" execution, because additional work is performed during the first execution to create the venv. This overhead can be significant: we observe a 'test time' of >30s for a test that takes 3s with a warm cache.

One solution would be to exclude this venv creation time from the test timeout and only measure the actual execution time of the python process. This would make the timeout argument for python_tests more useful when large, complex dependencies are used.

It is not clear whether there is also a performance component that can be addressed; if so, that should probably be filed as an issue in the pex repository.

Pants version 2.16.0 and 2.17.0rc5

OS macOS + Linux

Additional info See gist for example: https://gist.github.com/JoostvDoorn/9c0f63ed5198544a36b477502eeac4fb

To test:

rm -rf ~/.cache/pants/named_caches/
pants test :: --keep-sandboxes=on_failure --test-force
✓ //test_file.py:root succeeded in 23.09s.

Second execution using pants test :: --keep-sandboxes=on_failure --test-force is significantly faster:

✓ //test_file.py:root succeeded in 3.12s.

The bug here is that the test execution timing should reflect the actual time it takes to execute the test, and should not include the creation of the virtual environment.

For the gist example, the expectation is that the following would always pass, regardless of the state of the cache:

python_tests(
    name="root",
    timeout=15,
)
benjyw commented 1 year ago

The timeout (as set globally or on individual python_tests targets) applies only to the pytest process itself, not to any other processes (see here).

So is this extra overhead happening within that process?

benjyw commented 1 year ago

Hmm, but I can reproduce what you're talking about, with your gist.

benjyw commented 1 year ago

OK, so this is the cost of seeding the venv from the pex, and does indeed happen in the pytest process. So there is no easy way to avoid this, short of pre-seeding the venv in a separate process.

benjyw commented 1 year ago

I think the problem is wider than just timeouts though. If each test constructs its own bespoke venv, but they all include pytorch, each test will incur this penalty on first run. And in CI first run is the only run.

One solution, when there are many tests, is to run tests in batches: https://www.pantsbuild.org/docs/python-test-goal#batching-and-parallelism

This doesn't help the single test cold start time, but it amortizes that cold start over many tests.
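For illustration, a minimal BUILD sketch of opting tests into shared batches (the target name and tag value here are just examples):

python_tests(
    name="tests",
    # Tests sharing a compatibility tag (and otherwise compatible settings) can be
    # batched into one pytest process, so they share a single venv cold start.
    batch_compatibility_tag="torch-users",
)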

JoostvDoorn commented 1 year ago

I think the problem is wider than just timeouts though. If each test constructs its own bespoke venv, but they all include pytorch, each test will incur this penalty on first run. And in CI first run is the only run.

You're completely right; on the user side we can take some steps to alleviate this. We often apply the batch_compatibility_tag as well to reduce the overhead for many of our tests, and it might be a good idea for us to adjust this usage to better address this concern for pytorch (and others). In addition, we can raise minimum timeouts to a relatively high number, and we could rely on https://pypi.org/project/pytest-timeout/ as an alternative way to control test timeouts.
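For illustration, the kind of knobs we would be turning (assuming the [pytest] timeout options documented by Pants; the values are just examples):

[pytest]
# Raise the default and the ceiling so a cold venv seed doesn't trip the Pants-side timeout;
# pytest-timeout can then enforce tighter per-test limits inside the pytest process itself.
timeout_default = 120
timeout_maximum = 600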

OK, so this is the cost of seeding the venv from the pex, and does indeed happen in the pytest process. So there is no easy way to avoid this, short of pre-seeding the venv in a separate process.

Maybe @jsirois has some idea? I wonder if the venv construction for large library dependencies can be somehow optimized.

jsirois commented 1 year ago

Have you tried setting this to True?: https://www.pantsbuild.org/docs/reference-pex#venv_use_symlinks
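In pants.toml that should correspond to something like the following (assuming the option scope matches the --pex-venv-use-symlinks flag):

[pex]
venv_use_symlinks = true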

JoostvDoorn commented 1 year ago

This doesn't seem to make a difference in this case.

> rm -rf /Users/joostvandoorn/.cache/pants/named_caches/
> pants test :: --keep-sandboxes=on_failure --test-force --pex-venv-use-symlinks
11:56:42.56 [INFO] Completed: Run Pytest - //test_file.py:root - succeeded.

✓ //test_file.py:root succeeded in 23.41s.
jsirois commented 1 year ago

Ok, makes sense from a clean Pex cache. I'm not sure there is any way to speed up venv creation from a cold start save for populating the venv site-packages in parallel. That step is currently serial. I can give that a quick try in the am and report timing differences for your requirement set case in the gist.

benjyw commented 1 year ago

PS to get the gist to work, I had to add

[python]
interpreter_constraints = ["==3.10.*"]

2.16.x still has broad default ICs, which in my case captured a 3.7 interpreter that is incompatible with the fsspec>=2023.6.0 requirement.

jsirois commented 1 year ago

Ok, thanks. I'm going to cut out Pants in my testing, but the IC will still be relevant.

benjyw commented 1 year ago

--pex-venv-use-symlinks

Not that this helps in CI, but you should see a difference if your nuking command is rm -rf ~/.cache/pants/named_caches/pex_root/venvs/, because then the symlink farm can still be recreated efficiently, as the symlink targets in ~/.cache/pants/named_caches/pex_root/installed_wheels/ still exist.

So you get a sense of what's going on.
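That is, something along these lines:

# Drop only the venv trees; the installed_wheels they symlink to stay cached,
# so the symlink farm is cheap to rebuild on the next run.
rm -rf ~/.cache/pants/named_caches/pex_root/venvs/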

jsirois commented 1 year ago

I mean, stopping real fast here, just looking at the parallel downloads from the lock file:

$ pex3 lock create -r https://gist.githubusercontent.com/JoostvDoorn/9c0f63ed5198544a36b477502eeac4fb/raw/f4aec15dd827c6480790099526861b4dc4d6b1ee/requirements.txt --style universal --pip-version latest --resolver-version pip-2020-resolver --target-system linux --target-system mac --interpreter-constraint "==3.10.*" --indent 2 -o lock.json
$ rm -rf ~/.pex && time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex: Downloading FileArtifact(url='https://files.pythonhosted.org/packages/a8/cc/c36f3439f5d47c3b13833ce6687b43a040cc7638c502ac46b41e2d4f3d6f/scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', fingerprint=Fingerprint(algorithm='sha256', hash='d690e1ca993c8f7ede6d22e5637541217fc6a4d3f78b3672a6fe454dbb7eb9a7'), verified=False, filename='scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl'): 9482.3ms
pex: Downloading FileArtifact(url='https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl', fingerprint=Fingerprint(algorithm='sha256', hash='8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a'), verified=False, filename='torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl'): 54308.9ms
...
pex:         Installing 58 distributions: 11351.9ms
Saving PEX file to example.pex

real    2m7.001s
user    2m11.010s
sys     0m12.343s

Now, this is on my home network, but ~54s (1/2 the total time) just to download torch. As a sanity check I find:

$ time curl -sSL https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl -O

real    0m42.055s
user    0m0.387s
sys     0m0.627s
$ time curl -sSL https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl -O

real    0m42.124s
user    0m0.485s
sys     0m0.636s

So the contended (parallel) Pip download Pex does is slower than a straight curl of that one wheel, but not by much. Pex absolutely cannot speed that up in any way whatever besides caching, and this use case is ruling out a cache.

With a cache (re-run the last command), and:

$ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
...
pex:         Installing 58 distributions: 1586.7ms
Saving PEX file to example.pex

real    0m2.347s
user    0m1.810s
sys     0m0.350s

... more analysis to come re the remaining ~60s.

benjyw commented 1 year ago

We can, however, pre-seed the venv, so that at least the test timeout only applies to the pytest run, and not the venv seeding. This would solve the narrow problem the OP started with, and can be done entirely in Pants.
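Conceptually (a hand-wavy sketch, not actual Pants rule code - the pex name below is made up): the venv is materialized the first time anything executes from the pex, so an untimed warm-up step only has to run a no-op in it before the timed pytest invocation.

# Hypothetical warm-up step: the first execution seeds the venv under the named cache,
# so the subsequent, timed pytest run starts against a warm venv.
PEX_INTERPRETER=1 ./pytest_runner.pex -c "pass"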

jsirois commented 1 year ago

Sure - yes. Save 0 time, but save a timeout. I'll continue to dig on the raw performance aspect on the Pex side to see what can be juiced.

JoostvDoorn commented 1 year ago

Now, this is on my home network, but ~54s (1/2 the total time) just to download torch. Pex absolutely cannot speed that up in any way whatever besides caching, and this use case is ruling out a cache.

Fair enough, but we have a lot of code that depends on torch, and we only need to download torch once for our lockfile per CI run, whereas multiple different pex files will be generated and ?unzipped? as part of their first pex execution. This can be limited with batch_compatibility_tag, but there could still be 5-15 batches running that include torch as a dependency, each having to perform fairly similar bootstrapping of the venv.

jsirois commented 1 year ago

Ok, the lion's share (~90%) of the remaining time is creating the (compressed) --layout packed wheel zips.

  1. Avoid re-packing:
    
    $ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs}
    $ du -shc ~/.pex/*
    709M    /home/jsirois/.pex/downloads
    709M    /home/jsirois/.pex/packed_wheels
    743M    /home/jsirois/.pex/pip
    2.2G    total
    $ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
    ...
    pex:         Installing 58 distributions: 10896.7ms
    Saving PEX file to example.pex
    Previous binary unexpectedly exists, cleaning: example.pex

    real    0m12.915s
    user    0m42.100s
    sys     0m5.047s


  2. Force re-packing with compression (the default - Pants does this):

    $ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs,packed_wheels}
    $ du -shc ~/.pex/*
    709M    /home/jsirois/.pex/downloads
    743M    /home/jsirois/.pex/pip
    1.5G    total
    $ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex
    ...
    pex:         Installing 58 distributions: 10871.3ms
    Saving PEX file to example.pex

    real    1m8.286s
    user    1m35.345s
    sys     0m6.424s


  3. Force re-packing without compression:

    $ rm -rf ~/.pex/{bootstrap_zips,built_wheels,installed_wheels,interpreters,isolated,venvs,packed_wheels}
    $ du -shc ~/.pex/*
    709M    /home/jsirois/.pex/downloads
    743M    /home/jsirois/.pex/pip
    1.5G    total
    $ time pex -v --exe exe.py --lock lock.json --python python3.10 --layout packed -o example.pex --no-compress
    ...
    pex:         Installing 58 distributions: 10971.2ms
    Saving PEX file to example.pex

    real    0m16.502s
    user    0m45.362s
    sys     0m6.502s


So the overall time is dominated by two long poles: the download of torch within the parallel download of all needed wheels, and the zip compression of the packed wheel zips. The latter is serial, so it could be parallelized - but stepping back and trying a native zip of just the torch installed wheel shows:

$ time zip -r --quiet torch-packed.zip ~/.pex/installed_wheels/8ced00b3ba471856b993822508f77c98f48a458623596a4c43136158781e306a/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl/

real    0m47.460s
user    0m46.773s
sys     0m0.680s



So there is not much to be gained there in this case.
jsirois commented 1 year ago

@JoostvDoorn I'll let you continue with @benjyw down the path of not speeding things up in reality, but avoiding the test timeout with accounting tricks. To meaningfully speed venv creation up for this example, you definitely need to allow for caching something.

benjyw commented 1 year ago

Fair enough, but we have a lot of code that depends on torch, and we only need to download torch once for our lockfile per CI run, whereas multiple different pex files will be generated and ?unzipped? as part of their first pex execution. This can be limited with batch_compatibility_tag, but there could still be 5-15 batches running that include torch as a dependency, each having to perform fairly similar bootstrapping of the venv.

You can control the batch size, to trade off concurrency vs this overhead.
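That's the [pytest] batch_size option, if I recall the name correctly - bigger batches amortize the venv seeding over more tests at the cost of less parallelism:

[pytest]
batch_size = 256  # illustrative value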

benjyw commented 1 year ago

This is where @jriddy's work on "provided dependencies" might be valuable. If your CI image could provide torch preinstalled in site-packages...

jsirois commented 1 year ago

I think that work brings us full circle: https://github.com/pantsbuild/pex/issues/2097 @jriddy will have to grapple with that to make "provided dependencies" work for the Pants Python stack.

jsirois commented 1 year ago

I think what this really all says is that Pants probably needs to stop playing whack-a-mole and instead have someone fully embrace that they are a torch user and look at the shortcomings holistically and actually solve them all, from lock issues to speed issues.

jriddy commented 1 year ago

I think that work brings us full circle: https://github.com/pantsbuild/pex/issues/2097

Yeah, I'm aware of that one. I want to get to it; I'm just trying to find the time.

instead have someone fully embrace that they are a torch user and look at the shortcomings holistically and actually solve them all

I'm not fully sure what you mean by this, but in my experience the overlap between the actual users of torch and people who have any interest in deterministic builds is pretty small. Often ML teams have a handful of enabler engineers with far less background in ML who are trying to build tooling and pipelines and generally hold the whole dev shop side of things together. So it will take that rare someone who is both a genuine user of Torch and willing to wade through the nitty-gritty of weird native code + python packaging to understand the problem holistically.

I think "provided dependencies" + environments is probably a better way to handle this longer term, because torch and tensorflow aren't the only things like this. So giving some way to provide something resembling determinism while acquiescing to the reality of how some packages/ecosystems/organizations work seems to be a reasonable way out for these cases that we can never really solve perfectly through the pip ecosystem.

jsirois commented 1 year ago

I'm not fully sure what you mean by this, but in my experience the overlap between the actual users of torch and people who have any interest in deterministic builds is pretty small. Often ML teams have a handful of enabler engineers with far less background in ML who are trying to build tooling and pipelines and generally hold the whole dev shop side of things together. So it will take that rare someone who is both a genuine user of Torch and willing to wade through the nitty-gritty of weird native code + python packaging to understand the problem holistically.

I mean just that - Pants has historically been composed of committers that basically have no clue about the things they're integrating. For example, I learned Python packaging on the job over the course of ~10 years. I had no clue going in. We have had, FWICT, no Pants committer dig in and learn / live ML to produce a useful experience. I agree this is hard, I'm just pointing out it's a huge problem and has been for a very long time. Pants has Pants experts and next to no Pants use case problem domain experts. It's very hard being a middleman tool without the middleman expertise, but that is the position Pants is in. Much better would be if there were a way to get out of the middle somehow. Then the ecosystem expert could know little about Pants but leverage their ecosystem. Right now it's a bit of the worst of both worlds: you have to know both worlds to do good work.

jriddy commented 1 year ago

Okay, that makes sense. I actually have been that glue guy on an ML team, and I'm currently dealing with an RPM shop where we can't use Pex conventionally. So I think I do have a bit of domain expertise here. I just struggle to carve out enough time to actually make these contributions. 😅


benjyw commented 1 year ago

I mean just that - Pants has historically been composed of committers that basically have no clue about the things they're integrating. For example, I learned Python packaging on the job over the course of ~10 years. I had no clue going in. We have had, FWICT, no Pants committer dig in and learn / live ML to produce a useful experience. I agree this is hard, I'm just pointing out it's a huge problem and has been for a very long time. Pants has Pants experts and next to no Pants use case problem domain experts. It's very hard being a middleman tool without the middleman expertise, but that is the position Pants is in. Much better would be if there were a way to get out of the middle somehow. Then the ecosystem expert could know little about Pants but leverage their ecosystem. Right now it's a bit of the worst of both worlds: you have to know both worlds to do good work.

100%. This has been one of the biggest challenges of expanding Pants into use-cases the current maintainers don't understand. To make a useful plugin for X you need to know X and Pants, and that intersection is small to empty for most X.

benjyw commented 1 year ago

Specifically for the initial problem of the test timeout being consumed by the venv creation overhead: @JoostvDoorn , do you want to take a crack at that?

The idea is: right before running pytest we run a process in the same venv (so same params as the pytest process) that does enough work to create the venv, but doesn't actually run the test.

Actually, now that I think of it, this is similar to https://github.com/pantsbuild/pants/pull/19430. Maybe we wait for that one, then unify into one "force venv to materialize" rule.

JoostvDoorn commented 1 year ago

Specifically for the initial problem of the test timeout being consumed by the venv creation overhead: @JoostvDoorn , do you want to take a crack at that?

Actually, now that I think of it, this is similar to #19430. Maybe we wait for that one, then unify into one "force venv to materialize" rule.

Seems doable from my side. Will have a look at this sometime after this is merged.

benjyw commented 1 year ago

OK! So #19430 is now merged, and I think we want to generalize it to a "force a venv to exist" rule that we don't cache. See the conversation in that PR for more context.