pex-tool / pex

A tool for generating .pex (Python EXecutable) files, lock files and venvs.
https://docs.pex-tool.org/
Apache License 2.0
2.54k stars 258 forks source link

Possible bug with cache TTL expiration resulting in redownloading files #1084

Closed jriddy closed 3 years ago

jriddy commented 4 years ago

Using pex version 2.1.19

The current default cache-ttl of 1 hr (3600 s) is far too short for pants use cases where you end up changing dependencies a lot over a short period of time. This results in frequent unnecessary re-downloading and rebuilding of library wheels/sdists when you're changing dependencies frequently (something that can happen very easily and almost inadvertently with pants' dependency inference). I spent a day or two trying to figure out why it was taking 5 min to build constraints files all the time

This issue is compounded by pants not having a way to control this flag (gonna make an issue over there for that), nor is there any documentation on how to set this value to never expire that I can find.

That said, I don't fully understand why TTL caching is even at play here. When you download python packages, you have hashes in the filenames that you can check against what you have in your cache. If the hashes match...there doesn't need to be a download regardless of TTL or cache expiry issues. Perhaps there are other issues at play when it comes to determining which file to download, but once that determination is made, it seems like we could safely let checksums rule the day.

I can provide logs of this and the trouble that it causes if needed.

jsirois commented 4 years ago

It probably makes sense to decouple Pants from the Pex defaults here with Pants work to plumb these options.

Moving to your hashes / why is TTL in play: unless a requirement is exact, once a TTL expires, the ranged requirement sort of has to be looked up again. I'd be surprised if the new PIp resolver does anything different here compared to the old Pip resolver (which Pex currently uses).

jriddy commented 4 years ago

Thanks for the quick feedback

why is TTL in play

I can see why it is in play for to check for new requirements, but if that check results in a found link to a file we've already downloaded, why does it have to download it again? But come to think of it, I think pip shares this behavior, with perhaps a more reasonable default.

the ranged requirement sort of has to be looked up again I'm using pinned requirements in a constraints file. I don't really want a resolution, I want the dependencies i've specified and already resolved previously (although I know pex and internally pip still have to locate files with matching tags and find wheels/sdists etc).

I think what we're really missing here for these use cases is a lock file.

jsirois commented 4 years ago

I think what we're really missing here for these use cases is a lock file.

Pants has that: https://www.pantsbuild.org/docs/python-third-party-dependencies#using-a-lockfile-strongly-recommended

jsirois commented 4 years ago

... as does Pex which Pants passes through to using pex --constraints=...

jriddy commented 4 years ago

Pants has that: https://www.pantsbuild.org/docs/python-third-party-dependencies#using-a-lockfile-strongly-recommended

... as does Pex which Pants passes through to using pex --constraints=...

And following this advice is how I discovered this issue. I keep hitting huge slowdowns in one of the pex commands pants runs to resolve the constraints file whenever it changes. And if I don't change it when my requirements change, then I run into slowdowns as it tries to resolve my requirements each time.

The constraints file is not a lock file, not if pex still has to do a resolution. A lock file is the output of a resolution, not an input to it. For real lock file support I opened a feature request: #1086

jriddy commented 4 years ago

For reference, here's a pex command that pants runs that results in sometimes extremely slow builds:

pex \
-vvvvvv \
--python-path \
/home/jrloc/.pyenv/versions/3.6.9/bin:/home/jrloc/.pyenv/versions/3.7.7/bin:/home/jrloc/.pyenv/versions/3.8.5/bin:/home/jrloc/.pyenv/versions/miniconda3-4.7.12/bin:/home/jrloc/.pyenv/shims:/home/jrloc/.pyenv/bin:/home/jrloc/.local/bin:/home/linuxbrew/.linuxbrew/bin:/home/jrloc/.cargo/bin:/home/linuxbrew/.linuxbrew/opt/fzf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \
--pex-root \
/home/jrloc/.cache/pants/named_caches/pex_root \
--tmpdir \
/tmp/tmppex \
--output-file \
requirements.pex \
--no-pypi \
--index=https://pypi.org/simple/ \
--not-zip-safe \
--no-emit-warnings \
--jobs \
2 \
--manylinux \
manylinux2014 \
--constraints \
constraints.txt \
--sources-directory=source_files \
Flask-BasicAuth==0.2.0 \
Flask==1.0.2 \
Jinja2==2.11.2 \
MarkupSafe==1.1.1 \
Pillow==8.0.0 \
PyYAML==5.3.1 \
Pygments==2.7.1 \
Werkzeug==0.16.1 \
aniso8601==8.0.0 \
apispec-webframeworks==0.5.2 \
apispec==4.0.0 \
attrs==20.2.0 \
avro-python3==1.9.0 \
backcall==0.2.0 \
boto3==1.9.100 \
botocore==1.12.253 \
certifi==2020.6.20 \
chardet==3.0.4 \
click==7.1.2 \
consulate==0.6.0 \
cytoolz==0.11.0 \
'dataclasses==0.7; python_version < "3.7"' \
decorator==4.4.2 \
dnspython3==1.15.0 \
dnspython==1.15.0 \
docutils==0.15.2 \
filelock==3.0.12 \
flask-apispec==0.10.0 \
flask-consulate==0.2.0 \
flask-restplus==0.12.1 \
flatten-dict==0.0.3.post1 \
idna==2.10 \
importlib-metadata==2.0.0 \
ipython-genutils==0.2.0 \
ipython==7.16.1 \
itsdangerous==1.1.0 \
jedi==0.17.2 \
jmespath==0.10.0 \
joblib==0.17.0 \
jsonschema==3.2.0 \
marshmallow-dataclass==8.1.0 \
marshmallow-enum==1.5.1 \
marshmallow==3.8.0 \
mypy-extensions==0.4.3 \
numpy==1.19.2 \
opencv-python-headless==4.4.0.44 \
pants==1.0.1 \
parso==0.7.1 \
pathlib2==2.3.5 \
pexpect==4.8.0 \
pickleshare==0.7.5 \
pip==20.2.4 \
prompt-toolkit==3.0.8 \
ptyprocess==0.6.0 \
pyrsistent==0.17.3 \
python-dateutil==2.8.1 \
pytz==2020.1 \
redis==3.2.1 \
requests==2.24.0 \
s3transfer==0.2.1 \
scikit-learn==0.23.2 \
scipy==1.5.3 \
setuptools==46.1.3 \
six==1.15.0 \
statsd-telegraf==3.2.1.post1 \
threadpoolctl==2.1.0 \
toolz==0.11.1 \
traitlets==4.3.3 \
typeguard==2.9.1 \
typing-extensions==3.7.4.3 \
typing-inspect==0.6.0 \
urllib3==1.25.10 \
wcwidth==0.2.5 \
webargs==6.1.1 \
wheel==0.34.2 \
zipp==3.3.1

This command has been reformatted and edited slightly to i could run it outside of pants to debug this problem

Eric-Arellano commented 4 years ago

unless a requirement is exact, once a TTL expires, the ranged requirement sort of has to be looked up again.

Ah, so, if I understand correctly, turning off cache TTL would mean a loose req like Django will never pull in new versions once cached. Makes sense.

--

Do you know why it might be taking >3 minutes to resolve Josh's constraints file when the TTL is expired, even though few dependencies changed; but only ~30 seconds if the TTL is not expired? That is, how much overhead do you expect when it expires? Do you know if we actually redownload files, for example, or solely ping to see if the version has changed?

(I'm being lazy not looking; I can dig if you're not already familiar.)

jsirois commented 4 years ago

Do you know why it might be taking >3 minutes to resolve Josh's constraints file when the TTL is expired, even though few dependencies changed; but only ~30 seconds if the TTL is not expired?

A quick search of the code over in Pants shows we do not pass --intransitive to Pex when emulating lockfile resolves. In order to emulate a lockfile you need two things:

  1. A fully pinned, fully transitive explicit set of requirements.
  2. --intransitive

Number 1 does not require passing --constraints, it just requires using a psuedo-lockfile fully pinned, fully transitive constraints file contents to be passed as requirements.

Number 2 is required since a fully pinned set of requirements, even with a fully pinned set of constraints does not guaranty closure. For example, say I have 1 requirement, requests==1.0.0 and constraints requests==1.0.0. Pex (Pip) can't know if requests 1.0.0 has ranged dependencies without reading requests distribution metadata and fully recursing. That recursion could reveal a non-pinned, non-constrained transitive dep. As such, you have to have pre-arranged the fully pinned, fully transitive property and then tell Pex (and Pip) "trust me" aka --intransitive.

jriddy commented 4 years ago

I see the--intransitive issue, that makes sense, and I'll update #1086 to reflect that. But I think there's still a bug in the caching logic here...

I've attached a log of the session I did with --cache-ttl=1 to test this. From what I can see, this cache-ttl factor is applied to both the resolution stage (which is fine), and to the download stage (which I think is not).

At the start of the attached excerpt, I can see it scanning the project for links and looking to select one that matches. Past all the "skipping links" parts, it finds the link, and performs a cache lookup for that exact file, finds it and decides to redownload it again anyways:

Oct 19 16:15:36 Given no hashes to check 2 links for project 'MarkupSafe': discarding no candidates
Oct 19 16:15:36 Using version 1.1.1 (newest of versions: 1.1.1)
Oct 19 16:15:36 Collecting MarkupSafe==1.1.1
Oct 19 16:15:36   Created temporary directory: /tmp/tmppex/pip-unpack-ztij8yvb
Oct 19 16:15:36   Looking up "https://files.pythonhosted.org/packages/4b/20/f6d7648c81cb84815d0be935d5c74cd1cc0239e43eadb1a61062d34b6543/MarkupSafe-1.1.1-cp38-cp38-manylinux1_x86_64.whl" in the cache
Oct 19 16:15:36   Current age based on date: 1203
Oct 19 16:15:36   Ignoring unknown cache-control directive: immutable
Oct 19 16:15:36   Freshness lifetime from max-age: 365000000
Oct 19 16:15:36   Freshness lifetime from request max-age: 1
Oct 19 16:15:36   https://files.pythonhosted.org:443 "GET /packages/4b/20/f6d7648c81cb84815d0be935d5c74cd1cc0239e43eadb1a61062d34b6543/MarkupSafe-1.1.1-cp38-cp38-manylinux1_x86_64.whl HTTP/1.1" 200 32690
Oct 19 16:15:36   Downloading MarkupSafe-1.1.1-cp38-cp38-manylinux1_x86_64.whl (32 kB)
Oct 19 16:15:36   Ignoring unknown cache-control directive: immutable

This is the behavior I find kinda baffling. I'm not sure what "given no hashes to check links means here", since the hashes for these files is in the URL itself. It doesn't need to download this again.

pex-logexerpt.txt

jriddy commented 4 years ago

I've added --intransivite to my test command and I'm getting the exact same behavior. It's redownloading everything. Despite the files being present in the pex cache. Is this perhaps related to the line Ignoring unknown cache-control directive: immutable in the above excerpt?

jsirois commented 3 years ago

@jriddy it appears to be the case in command line tests that Pip, with a Cache-Control: max-age=N where N > 0 (this is how --cache-ttl is plumbed) re-fetches and re-builds once the TTL expires. With Cache-Control: max-age=0 - the Pip default - a conditional GET will always be requested, but when that returns 304, the cached wheel / tgz is re-used. As such, the default setting of Pex here of --cache-ttl=3600 is not generally useful and certainly not for Pants. In the upgrade to Pip 20.3.1 in #1133 the support for setting this goes away; so Pex will return to a default of always using a conditional GET which proves to provide better latency on re-resolves. The fix to not re-resolve at all when a lockfile is used tracked by #1086 will be the proper remedy.

jriddy commented 3 years ago

Thanks for the update @jsirois.

I've mostly worked around these issues on my end by setting --cache-ttl to very high numbers, but I'm looking forward to this release, as I do change my constraints.txt files pretty often.

Thanks for the great work as always!

Eric-Arellano commented 3 years ago

Sounds like this was fixed in Pex.