Triage and improve remote cache hit rate for desktop usage

stuhood commented 3 years ago

When a remote cache is configured, desktop usage (even among identical platforms) gets lower cache hit rates than CI usage. This is primarily due to inconsistent PATH/env entries between differently configured boxes.

We should:

~add enough information to workunits/metrics to validate that this is the case (i.e. to use a StreamingWorkunitHandler to compare cache lookups across different desktop machines)~ (done in #12469)
Do any sort of PATH/env filtering that we need to to improve our hitrate (potentially related to #10526 and #10769)

Some of the differences identified for a series of runs across multiple users:

PATH/LDFLAGS/CPPFLAGS
- Leaked through in order to allow for compilation of wheels.
The PEX --python-path
The python interpreter used
- Even with an identical PATH string, multiple interpreters might be identified (see #10526 and #10769).

jsirois commented 3 years ago

Agreed that measurement is critical, but we know everything downstream of a PEX build will be a cache miss by "design", right?: https://github.com/pantsbuild/pants/blob/61193e1bf5c8d9d3b77519626938906a70d5c098/src/python/pants/backend/python/util_rules/pex.py#L679-L682

stuhood commented 3 years ago

Agreed that measurement is critical, but we know everything downstream of a PEX build will be a cache miss by "design", right?: ...

This ticket is intended to cover cases where multiple desktop machines are using the same platform: I'll clarify that.

xlevus commented 2 years ago

Another cause for cache-misses we have theorized to cause issues are API keys. Some of our tests run against 3rd party systems and authenticated via an API key. With each developer having a unique key we believe this would effectively nullify any benefits of remote caching.

Given the design of the system, the API key is irrelevant and ignoring any security concerns a singular key could be used by all developers. Or hash(api_key) could safely return a constant.

stuhood commented 2 years ago

I did a little bit of thinking about this before we decided to start #13682, so will braindump some of that for now.

I believe that a path forward to allow for "adjusting" the fingerprint that is included in a Process for certain environment variables and absolute file args is essentially to reify them into types which:

Would have their actual string content applied below/after cache lookups. For example:
- a reified PATH env var containing your HOME directory would not include the HOME portion in the Process' digest, but would use the entire value at execution time.
Describe how to compute a deeper/different fingerprint for the entry which was included for cache lookups. For example:
- a reified PATH entry might be constructed by the @rule implementer by listing which (sub)processes they thought that a Process would use. Fingerprinting would then collect versions of those processes and apply some rounding to attempt to match.

The hope would be that these types would be composable, such that they didn't add much complexity to constructing a Process.

tgolsson commented 11 months ago

I enabled remote caching when upgrading to 2.18, and am seeing similar issues -- but even on the same machine by enabling our pants.ci.toml.

You can see here; that even the fingerprints match, but we get a cache miss when running with only base config:

# WITH pants.ci.toml
11:03:10.24 [DEBUG] remote cache hit for: "Building 2 requirements for ci/emote-override_py.pex from the locks/cpu.lock resolve: coloredlogs~=15.0, ruamel.yaml~=0.16.0" digest=Digest { hash: Fingerprint<4dd1225aba792cee25d7aa09c810024d91f4cb87385044c287d63e0f78b88d34>, size_bytes: 142 } 

# WITHOUT pants.ci.toml
11:49:55.05 [DEBUG] remote cache miss for: "Building 2 requirements for ci/emote-override_py.pex from the locks/cpu.lock resolve: coloredlogs~=15.0, ruamel.yaml~=0.16.0" digest=Digest { hash: Fingerprint<4dd1225aba792cee25d7aa09c810024d91f4cb87385044c287d63e0f78b88d34>, size_bytes: 142 }

This is a pex_binary that we build that only contains two files and two dependencies. We only write the cache from CI, but I assume if I did write locally it'd at least hit it that, but I don't want all users to write to cache to avoid cache pollution. The only thing that could possibly affect this from our config is that we enable pyenv in pants.ci.toml -- we do override the default resolve, but I've made sure to include that on the command line.

I've not dug more into it yet, but I'll diff the actual process invocations when I have time and see if I can prove that it's the Python interpreter location that matters.

For completeness; this is the pants.ci.toml we use:

[GLOBAL]
colors = true
print_stacktrace = true
plugins.add = [
    "hdrhistogram",
]

backend_packages.add = [
    "pants.backend.python.providers.experimental.pyenv",
]

remote_cache_write = true

[stats]
log = true

[test]
use_coverage = true

[coverage-py]
report = ["json"]
global_report = true

[pytest]
args = ["-vv", "--no-header", "--benchmark-disable"]

[python]
default_resolve = "cpu"

[oci]
rootless = false
uid_map = ["0:0:65536"]
gid_map = ["0:0:65536"]

[pyenv-python-provider]
installation_extra_env_vars = [
    "PYTHON_CONFIGURE_OPTS=--with-lto=thin",
    "PYTHON_CFLAGS=-march=native -mtune=native",
]

pantsbuild / pants

Triage and improve remote cache hit rate for desktop usage #12203