toolchainlabs / issues

Public repo just for filing bugs and feature requests

High total runtime for `test`, even with very high cache rate #12

Closed stuhood closed 2 years ago

stuhood commented 2 years ago

For larger numbers of tests, the startup time and overhead of pants in some private repositories can be very high. At the same time, memory usage is unexpectedly high (as first reported in https://github.com/pantsbuild/pants/issues/12662).

The memory usage and total runtime are most likely related, so pursuing the metrics-gathering exercise from https://github.com/pantsbuild/pants/issues/12662 would also be very valuable in tracking down the performance issue.

stuhood commented 2 years ago

At least for the cases I have profiles for, it looks like a very large fraction of the runtime is spent in: https://github.com/pantsbuild/pants/blob/e45b00d2eb467b599966bca262405a5d74d27bdd/src/python/pants/backend/python/util_rules/local_dists.py#L80-L83

Upon inspection, it looks like those @rules are pulling the entire contents of distributions into memory, and that they're doing so per-consumer (rather than per-dist). And because the work accomplished by those @rules is not executed as an external process, it is not persistently cached in the local/remote caches.
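
One way to avoid the per-consumer reads would be to key the expensive read on a single dist, so that the engine memoizes it per dist rather than per consumer. Here is a minimal sketch of that idea, with made-up `DistContentsRequest`/`DistContents` types; it is not the actual local_dists.py code or the fix in the PR referenced below, and (as noted above) a non-process @rule is still only memoized in memory, not persistently cached:

```python
# Hypothetical, simplified sketch -- not the real local_dists.py rules.
# DistContentsRequest/DistContents are invented types used only to illustrate
# per-dist memoization of the expensive contents read.
from dataclasses import dataclass

from pants.engine.fs import Digest, DigestContents
from pants.engine.rules import Get, collect_rules, rule


@dataclass(frozen=True)
class DistContentsRequest:
    """Keyed on a single distribution's digest."""

    digest: Digest


@dataclass(frozen=True)
class DistContents:
    contents: DigestContents


@rule
async def read_one_dist(request: DistContentsRequest) -> DistContents:
    # Because this @rule takes a single dist as its input, the engine memoizes
    # the DigestContents read once per dist, instead of every consumer
    # re-reading the contents of every dist.
    contents = await Get(DigestContents, Digest, request.digest)
    return DistContents(contents)


def rules():
    return collect_rules()
```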

stuhood commented 2 years ago

https://github.com/pantsbuild/pants/pull/14551 addresses the local dist performance issue.

stuhood commented 2 years ago

Cherrypicking to 2.10.x in https://github.com/pantsbuild/pants/pull/14555

stuhood commented 2 years ago

Have also opened https://github.com/pantsbuild/pants/pull/14564, since it looks like in some cases there are no dists present at all.

benjyw commented 2 years ago

I did some benchmarking on digest subsetting (see https://github.com/pantsbuild/pants/pull/14569), and noticed the following:

So it is possible that full subset generation times of a few hundred ms add up across many consumers. But this benchmarking effort incidentally revealed something worse that is not directly related to subsetting: superlinear behavior in CreateDigest.

benjyw commented 2 years ago
[Screenshot of benchmark results: Screen Shot 2022-02-23 at 1 12 09 AM]
benjyw commented 2 years ago

See https://docs.google.com/spreadsheets/d/1f94uT7ZIwkMWewGbKtd-cfVlKzuGPv5Epj3lOZXRZ60/edit?usp=sharing

benjyw commented 2 years ago

Of course this may be a red herring for this specific issue, since I'm not sure we ever CreateDigest on significant numbers of files outside a test setting. Nonetheless it would be instructive to know why we have quadratic behavior there, as maybe it means we have it elsewhere.

stuhood commented 2 years ago

Thanks a lot!

> Of course this may be a red herring for this specific issue, since I'm not sure we ever CreateDigest on significant numbers of files outside a test setting. Nonetheless it would be instructive to know why we have quadratic behavior there, as maybe it means we have it elsewhere.

Yea, I think that CreateDigest is implemented fairly naively, by creating an independent Digest per file, and then merging all of them. I'll watch out for other consumers in the profiles, but I don't think that the initial cases I saw used CreateDigest.
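
To see why that shape can blow up, here is a toy Python model (not the real Rust engine code): if each pairwise merge walks and copies the accumulated tree, folding N single-file digests together does O(N^2) work, whereas building the tree in a single pass is linear:

```python
# Toy model of "one Digest per file, then merge them all": if each pairwise
# merge copies the accumulated tree, folding N files costs 1 + 2 + ... + N
# entry copies, i.e. quadratic work.
from __future__ import annotations

import time


def merge(left: dict[str, bytes], right: dict[str, bytes]) -> dict[str, bytes]:
    # Stand-in for merging two digest trees: copies every entry of both inputs.
    merged = dict(left)
    merged.update(right)
    return merged


def create_digest_naive(files: dict[str, bytes]) -> dict[str, bytes]:
    # One "digest" per file, folded together pairwise -- O(N^2) total copying.
    acc: dict[str, bytes] = {}
    for path, content in files.items():
        acc = merge(acc, {path: content})
    return acc


def create_digest_single_pass(files: dict[str, bytes]) -> dict[str, bytes]:
    # Build the whole tree once -- O(N).
    return dict(files)


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000, 8_000):
        files = {f"dir/file_{i}.txt": b"x" for i in range(n)}
        start = time.perf_counter()
        create_digest_naive(files)
        naive = time.perf_counter() - start
        start = time.perf_counter()
        create_digest_single_pass(files)
        single = time.perf_counter() - start
        print(f"n={n}: naive={naive:.3f}s single-pass={single:.3f}s")
```

Doubling n roughly quadruples the naive time in this model, which matches the quadratic shape seen in the benchmarking above.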

stuhood commented 2 years ago

I've opened https://github.com/pantsbuild/pants/pull/14638 to help address https://github.com/pantsbuild/pants/issues/12662. There appears to be only a small amount of overlap between memory usage and performance (less than I suspected when I opened this ticket), so we will likely break out https://github.com/pantsbuild/pants/issues/12662 into a separate workstream while continuing to work on the performance issue captured here.

stuhood commented 2 years ago

The next highest item in the profile was related to the snapshotting of sources:

pants.backend.python.goals.pytest_runner.run_python_test ->
pants.backend.python.goals.pytest_runner.setup_pytest_for_target (360) ->
pants.backend.python.util_rules.python_sources.prepare_python_sources (360): 56,868,568,602 ->
pants.core.util_rules.source_files.determine_source_files (360): 40,450,644,707 ->
pants.engine.internals.graph.hydrate_sources (7722): 274,146,863,500 ->
snapshot (4649): 106,989,925,260 ->
digest_file (5253): 64,643,182,013

After investigating whether it would be useful to implement batching (I don't think so, because inference will still need to capture or isolate per-target source batches, which might mean two different APIs), I've begun https://github.com/pantsbuild/pants/issues/13112, which should significantly reduce IO.

stuhood commented 2 years ago

There has been further progress on https://github.com/pantsbuild/pants/issues/13112: see https://github.com/pantsbuild/pants/issues/13112#issuecomment-1059521974 for a summary. The full stack of patches should be landable early next week, although some further optimization might be necessary to show the full potential.

stuhood commented 2 years ago

Most of the effort for https://github.com/pantsbuild/pants/issues/13112 has now been completed, and preliminary microbenchmarking shows that the particular codepath from https://github.com/toolchainlabs/issues/issues/12#issuecomment-1053867579 (which represented 33-66% of the total runtime) has been reduced by 56%.

So we should see anywhere between an 18% (33% * 56%) and a ~37% (66% * 56%) speedup in the production use case in 2.11.x. Once we have new production profiles, we can resume work here, but for now I'm going to put it on hold.
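
As a quick sanity check on that arithmetic (nothing Pants-specific here):

```python
# Back-of-the-envelope check: if the optimized codepath was `fraction` of total
# runtime and it got `reduction` faster, total runtime shrinks by roughly
# fraction * reduction.
reduction = 0.56
for fraction in (0.33, 0.66):
    print(f"{fraction:.0%} of runtime * {reduction:.0%} reduction -> ~{fraction * reduction:.0%} overall speedup")
```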

stuhood commented 2 years ago

Users were able to test with https://github.com/pantsbuild/pants/issues/13112, and confirmed a 30% speedup, which is great.

Source hydration / snapshotting is still the largest part of the profile (the codepath below pants.backend.python.util_rules.python_sources.prepare_python_sources is still pretty hot), but no longer by such a significant margin. I'm asking for a more detailed profile (with --streaming-workunits-level=trace).

The next item in the list below source hydration is time spent fetching from the (remote) cache. There is a known performance issue in that area: we currently eagerly fetch all cache outputs before declaring the cache hit successful. That is because we can't know whether we'll need them later, and we don't implement "backtracking" to re-run a process which we thought had hit but which actually missed -- covered by https://github.com/pantsbuild/pants/issues/11331. It's not an issue of bandwidth (the total amount fetched from the cache in these cases is actually only ~2MB): rather, it is one of latency and bounded concurrency (controlled by --remote-store-rpc-concurrency).
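
To make the tradeoff concrete, here is a toy sketch of the two behaviors; every name in it is invented and it is not the actual remote-cache code. Today a hit is only reported after all output blobs are fetched, whereas the backtracking approach in https://github.com/pantsbuild/pants/issues/11331 would report the hit immediately, fetch blobs lazily, and fall back to re-running the process if a blob turns out to be missing:

```python
# Toy sketch only: contrasts "eagerly fetch all cache outputs before declaring
# a hit" with lazy fetching plus backtracking. All names here are invented.
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class CachedResult:
    output_digests: list[str]


async def fetch_blob(digest: str) -> bytes | None:
    # Stand-in for one CAS RPC: the cost is latency and bounded concurrency,
    # not bandwidth (the payloads are tiny).
    await asyncio.sleep(0.05)
    return None if digest.startswith("evicted") else b"..."


async def rerun_process() -> bytes:
    # Stand-in for actually re-running the underlying process.
    await asyncio.sleep(1.0)
    return b"recomputed"


async def eager_hit(cached: CachedResult) -> CachedResult:
    # Current behavior: only report a hit once every output blob is present,
    # because a later consumer might need any of them and there is no way to
    # go back and re-run the process if a blob is missing.
    await asyncio.gather(*(fetch_blob(d) for d in cached.output_digests))
    return cached


async def lazy_consumer(cached: CachedResult, needed_digest: str) -> bytes:
    # With backtracking: report the hit immediately, fetch a blob only when a
    # consumer needs it, and re-run the process if the blob is missing.
    blob = await fetch_blob(needed_digest)
    return blob if blob is not None else await rerun_process()


if __name__ == "__main__":
    cached = CachedResult(output_digests=[f"blob-{i}" for i in range(8)])
    asyncio.run(eager_hit(cached))
    print(asyncio.run(lazy_consumer(cached, "evicted-blob-3")))
```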

stuhood commented 2 years ago

New trace level profiles show that pants.backend.python.util_rules.python_sources.prepare_python_sources has been knocked out of first place in favor of dependency inference: in particular, the time taken to either execute or hit the cache for inference.

The hottest path below setup_pytest_for_target looks like:

[
    {
        "method": "pants.backend.python.goals.pytest_runner.setup_pytest_for_target",
        "count": 398,
        "duration_micros": 56748332960
    },
    {
        "method": "pants.engine.internals.graph.transitive_targets",
        "count": 398,
        "duration_micros": 37749039905
    },
    {
        "method": "pants.engine.internals.graph.transitive_dependency_mapping",
        "count": 398,
        "duration_micros": 37737757996
    },
    {
        "method": "pants.engine.internals.graph.resolve_targets",
        "count": 4585,
        "duration_micros": 46571189900
    },
    {
        "method": "pants.engine.internals.graph.resolve_unexpanded_targets",
        "count": 5009,
        "duration_micros": 46579084763
    },
    {
        "method": "pants.engine.internals.graph.resolve_dependencies",
        "count": 4585,
        "duration_micros": 46556427119
    },
    {
        "method": "pants.backend.python.dependency_inference.rules.infer_python_dependencies_via_imports",
        "count": 3216,
        "duration_micros": 44731238539
    },
    {
        "method": "pants.backend.python.dependency_inference.parse_python_imports.parse_python_imports",
        "count": 3216,
        "duration_micros": 44514117183
    },
    {
        "method": "pants.engine.process.fallible_to_exec_result_or_raise",
        "count": 3216,
        "duration_micros": 28124661088
    },
    {
        "method": "process",
        "count": 3224,
        "duration_micros": 23976103909
    }
]

The dropoff between parse_python_imports and fallible_to_exec_result_or_raise is the time spent actually capturing the sources: the time spent in process is primarily the time spent accessing and then hitting the cache(s).

Surprisingly, the breakdown of time within process execution suggests that it is largely bound by local IO. The direct children of process look like:

[
    {
        "method": "remote_cache_read_speculation",
        "count": 3590,
        "duration_micros": 9230203432
    },
    {
        "method": "local_cache_read",
        "count": 3626,
        "duration_micros": 7594634338
    },
    {
        "method": "local_cache_write",
        "count": 3590,
        "duration_micros": 7303851431
    },
    {
        "method": "remote_cache_write",
        "count": 18,
        "duration_micros": 12707169
    },
    {
        "method": "remote_cache_write_setup",
        "count": 18,
        "duration_micros": 42
    }
]

Although the local cache read and write are not the hottest part of process execution, they are notable because the local read runs first and the local write runs last, both synchronously on the critical path, whereas the remote cache lookup runs in parallel with executing the process.

So, as a next step, it looks like we should try disabling the local cache entirely for this user, since reading from and writing to it looks like a bottleneck. If that improves things, we might want to make local cache lookups concurrent with remote cache lookups, and/or skip local writes when we've hit remotely. A rough sketch of the concurrent-lookup idea follows.
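
This sketch uses invented function names and is not how the engine's cache command runners are actually structured: start the local and remote reads at the same time and take whichever hits first, instead of gating the remote lookup on a synchronous local read.

```python
# Illustrative only: races a (hypothetical) local cache read against a remote
# cache read instead of doing the local read synchronously first.
from __future__ import annotations

import asyncio


async def local_cache_read(key: str) -> bytes | None:
    await asyncio.sleep(0.02)  # stand-in for a local (e.g. LMDB) read on a busy disk
    return None


async def remote_cache_read(key: str) -> bytes | None:
    await asyncio.sleep(0.05)  # stand-in for the remote cache RPC
    return b"cached result"


async def lookup(key: str) -> bytes | None:
    tasks = [
        asyncio.create_task(local_cache_read(key)),
        asyncio.create_task(remote_cache_read(key)),
    ]
    try:
        for finished in asyncio.as_completed(tasks):
            result = await finished
            if result is not None:
                return result  # first hit wins
        return None  # both missed: fall through to actually running the process
    finally:
        for task in tasks:
            task.cancel()  # don't leave the slower lookup dangling


if __name__ == "__main__":
    print(asyncio.run(lookup("some-process-fingerprint")))
```

The other option mentioned above would simply skip the trailing local_cache_write step whenever the result came from the remote cache.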

stuhood commented 2 years ago

We've made significant progress on the performance of this case (from 10-12 minutes to 3-4 minutes), so although it is definitely not as low as we'd like it to be, I'm going to call this one resolved.

Once the dust settles on the 2.11.x release, and people are stably running with the new settings (in particular, with PEX-native lockfiles and the fix for https://github.com/pantsbuild/pants/issues/14127 in place), we can revisit.

Please continue to open issues for the 2022 Performance Push. Thanks!