pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

optimize package installation for space and speed by using copy-on-write file clones ("reflinks") and storing wheel cache unpacked #11092


glyph commented 2 years ago

What's the problem this feature will solve?

Creating a new virtual environment in a modern Python project can be quite slow, sometimes on the order of tens of seconds even on very high-end hardware, once you have a lot of dependencies. It also takes up a lot of space; my ~/.virtualenvs/ is almost 3 gigabytes, and this is a relatively new machine; and that isn't even counting my ~/.local/pipx, which is another 434M.

Describe the solution you'd like

Rather than unpacking and duplicating all the data in wheels, pip could store the cache unpacked, so all the files are already on the filesystem, and then clone them into place on copy-on-write filesystems rather than copying them. While there may be other bottlenecks, this would also reduce disk usage by an order of magnitude. (My ~/Library/Caches/pip is only 256M, and presumably all those virtualenvs contain multiple full, uncompressed copies of it!)
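
As a point of reference, here is a minimal Linux-only sketch of what "cloning into place" means at the filesystem level: the FICLONE ioctl asks the filesystem to share data blocks between two files, with a plain copy as the fallback. The helper name is illustrative, not an existing pip API.

```python
import fcntl
import shutil

FICLONE = 0x40049409  # Linux ioctl request: share extents between two files


def clone_or_copy(src: str, dst: str) -> None:
    """Try a reflink clone of src to dst; fall back to a plain copy."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return  # dst now shares data blocks with src, copy-on-write
        except OSError:
            pass  # filesystem (or platform) doesn't support reflinks
    shutil.copyfile(src, dst)  # ordinary copy as the fallback
```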

Alternative Solutions

You could get a similar reduction effect by setting up an import hook, using zipimport, or doing some kind of .pth file shenanigans, but I feel like those all have significant drawbacks.

Additional context

Given that platforms generally use shared memory-maps for shared object files, if it's done right this could additionally reduce the memory footprint of python interpreters in different virtualenvs with large C extensions loaded.


njsmith commented 2 years ago

Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache. Of course AFAIK pip doesn't actually have any policy for evicting items from the cache currently, but this would rule it out forever, and also mean that it's no longer safe for the user to blow away the cache. (Also, I suspect there are plenty of automated tools out there that do this? It's easy to imagine a Dockerfile doing rm -rf ~/.cache after completing installation.)

Hardlinks avoid these issues. I dunno if they create any new ones – might want to check with the conda folks, since they have years of experience with doing this (with hardlinks).

exarkun commented 2 years ago

A third option to consider might be reflinks.

glyph commented 2 years ago

Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache.

Hardlinks have a different version of this issue, too, which is knowing the side effects of editing the file in place. I definitely scribble on my venvs periodically because of my experience of unreliability of Python debuggers, and scribbling on all of them at once would definitely be an unwelcome surprise.

glyph commented 2 years ago

A third option to consider might be reflinks.

So I thought of this and immediately discarded the thought, because reflinks are a super obscure feature that only barely works on Btrfs, right? But your comment got me to do a little bit of research, and I discovered that they're supported on Windows, on APFS on macOS, and on Btrfs, CIFS, NFS 4.2, OCFS2, overlayfs, and XFS on Linux. Given this surprisingly wide deployment, and the relative lack of any issues with refcounting or accidental mutability, maybe it would be good to implement these first?

pfmoore commented 2 years ago

they're supported on Windows

As far as I understand, they are supported on ReFS, but this isn't the default filesystem on Windows (my laptop is still using NTFS). Unless ReFS presents itself as NTFS (and hence I'm using it without knowing) I suspect that the number of Windows environments where reflinks work is likely to be extremely small...

RonnyPfannschmidt commented 2 years ago

AFAIR Python's shutil copytree automatically tries to use reflinks when available (verification needed)

pfmoore commented 2 years ago

I don't see any references to reflink in 3.10's shutil.py...

RonnyPfannschmidt commented 2 years ago

@pfmoore they come in via the copy-file-range helpers used to optimize copying

https://docs.python.org/3/library/shutil.html#shutil-platform-dependent-efficient-copy-operations since python 3.8

potiuk commented 2 years ago

@RonnyPfannschmidt Are you sure of that?

Starting from Python 3.8, all functions involving a file copy ([copyfile()](https://docs.python.org/3.8/library/shutil.html#shutil.copyfile), [copy()](https://docs.python.org/3.8/library/shutil.html#shutil.copy), [copy2()](https://docs.python.org/3.8/library/shutil.html#shutil.copy2), [copytree()](https://docs.python.org/3.8/library/shutil.html#shutil.copytree), and [move()](https://docs.python.org/3.8/library/shutil.html#shutil.move)) may use platform-specific “fast-copy” syscalls in order to copy the file more efficiently (see [bpo-33671](https://bugs.python.org/issue33671)). “Fast-copy” means that the copying operation occurs within the kernel, avoiding the use of userspace buffers in Python as in “outfd.write(infd.read())”.

This explanation does not match reflink. Reflink is a feature of "some" filesystems (nice explanation here: https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for), and the shutil Python 3.8 implementation just mentions a "fast-copy" operation done in the kernel rather than in user space, with the optimization coming from avoiding multiple user<->kernel syscalls and userspace buffers. This is a different thing than reflinks altogether, IMHO.

I believe reflinks require explicit system calls (like, for example, what https://pypi.org/project/reflink/ provides), and they are very much tied to which filesystem you have the files on.

dstufft commented 2 years ago

On Linux, I think using os.copy_file_range will use COW copying if possible? See https://github.com/python/cpython/issues/81338
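
For reference, here is a hand-rolled sketch (not shutil's implementation) of what a copy_file_range-based copy looks like on Linux with Python 3.8+; on filesystems such as Btrfs or XFS the kernel may satisfy it with shared extents, elsewhere it is still an in-kernel copy that avoids userspace buffers.

```python
import os


def kernel_copy(src: str, dst: str) -> None:
    """Copy src to dst entirely inside the kernel via os.copy_file_range."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        remaining = os.fstat(fsrc.fileno()).st_size
        while remaining > 0:
            copied = os.copy_file_range(fsrc.fileno(), fdst.fileno(), remaining)
            if copied == 0:
                break  # source shrank underneath us; stop copying
            remaining -= copied
```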

No idea what it does on Windows, or if that function is even available on Windows, but it looks like reflink is only available on Windows with ReFS.

It doesn't appear that shutil currently uses os.copy_file_range, though https://github.com/python/cpython/issues/81338 is trying to add that feature. So pip could just take the approach of caching unpacked wheels, using shutil.copytree on them, and leaving it up to Python to implement reflink support where possible.

We'd still get performance improvements from not having to unzip into a temporary location and copy out of that, and IIRC we're using the default temporary directory by default, which is oftentimes on another filesystem, so we'd be more likely to use fast copying at a minimum as well.

RonnyPfannschmidt commented 2 years ago

Indeed, it seems like I misremembered a detail.

potiuk commented 2 years ago

TIL about reflinks. BTW, reflinks are a nice feature; it's a pity they're only available on some "obscurish" filesystems.

uranusjr commented 2 years ago

I have a feeling that this may need to be solved one layer up, in the virtual environment abstraction. Node has more or less the same problem, and the way they currently address this (in pnpm) is to share package installations between environments where possible. This would be more doable by referencing files directly from the pip cache. All the same issues with soft-linking still persist, of course, although Node has never been that friendly to development environments on Windows, so they probably just don't care that much (I didn't check).

Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache. Of course AFAIK pip doesn't actually have any policy for evicting items from the cache currently, but this would rule it out forever, and also mean that it's no longer safe for the user to blow away the cache.

And there is actually work toward this right now (see recent comments in #2984 and other issues referenced in it), so we probably don’t want to go toward this particular direction, at least not without a lot of discussion.

dstufft commented 2 years ago

I don't think you can solve this in the virtual environment abstraction? At least I'm not sure how you're envisioning that working? The virtual environment abstraction is largely just setting up sys.path; how things get installed onto that sys.path isn't really its concern, unless you have something else in mind that I'm not thinking of? Solving it there also doesn't solve it for cases that aren't inside of a virtual environment.

I think the only reasonable path here is pretty straightforward:

  1. Within the wheel cache, stop caching zipped up wheels, unpack them and cache them unpacked.
  2. Start caching things that we've downloaded as wheels within the wheel cache, unpacked as well.
    • This might mean that we want to stop caching downloads in the HTTP cache completely, since they'll always be cached inside the wheel cache anyway. Though maybe we would still want HTTP caching for sdists? I dunno.
  3. Adjust wheel installing so that instead of operating on a zipped wheel, it operates on an unzipped wheel, and uses shutil.copytree to copy out of the wheel cache (see the sketch below).
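
For illustration, a minimal sketch of step 3, under the assumption that each cache entry is an already-unpacked wheel directory; the function and argument names are made up, and it glosses over RECORD generation, entry-point scripts, and .data directories. shutil.copytree/copy2 already use the platform fast-copy syscalls on Python 3.8+, and would pick up reflink support transparently if CPython grows it.

```python
import shutil
from pathlib import Path


def install_from_unpacked_cache(cached_wheel_dir: Path, site_packages: Path) -> None:
    """Copy the contents of an unpacked wheel cache entry into site-packages."""
    for item in cached_wheel_dir.iterdir():
        target = site_packages / item.name
        if item.is_dir():
            # copytree uses copy2 internally, which in turn uses the
            # platform-specific fast-copy operations where available.
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)
```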

This has some immediate benefits:

With some immediate downsides:

Then it also has some longer term benefits:

RonnyPfannschmidt commented 2 years ago

This is practically a bit like the proposal for shared storage I tried to bring forward a while back.

hazho commented 2 years ago

Here we are talking about a few things to be improved:

  1. Decreasing the reserved disk space.
  2. Increasing the speed of installation.
  3. Decreasing the processing needed for compression/decompression, making new network requests, checking the source and its hashes for integrity, and saving new files of the same versions into each venv with their actual data, etc.
  4. Decreasing network traffic, hence saving the developer's internet data allowance when they are on a metered network.

As far as I can tell, this has only these negatives:

  1. Increased processing to handle the copying efficiently (e.g. for shutil).
  2. The time spent on discussion, planning, and development of the new generic approach.

Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache.

A soft link can point to directories or files, and is therefore much more viable. To overcome the fear about safely removing entries, we could create another utility for file linking (specifically for PyPI) that generates the links in soft format but keeps the original data for as long as any links remain; only when no links to the actual data are left would it be removed, by the last link-removal action. The approach is fairly simple: once there is more than one dependant on the same package version of a certain dependency, the real data of the dependency would be moved to another (shared) folder and all dependants would be soft-linked to its location (this way pip doesn't need to handle linking for venvs/projects with a single dependant). When a certain venv/dependant is removed, the actual data stays in the same shared place for the other dependants.

Notes:

benjaoming commented 2 years ago

It's a very interesting concept, but there must be a lot of edge cases to explore. What happens with this feature if I run do-release-upgrade on Ubuntu and my virtualenv is broken (they break 99% of the time)? I proceed to delete the virtualenv, but it keeps referencing wrongly built packages in the cache? So I proceed to wipe the cache? And then a bunch of other virtualenvs are no longer using a cache because of copy-on-write?

Is it correctly understood that this fits best for a CI environment where virtualenvs are created often? In this case, perhaps it could be possible to enable a behavior like this as non-default through a switch for pip where the copy-on-write unpacked cache can co-exist with the normal cache?

dstufft commented 2 years ago

It's a very interesting concept, but there must be a lot of edge cases to explore. What happens with this feature if I run do-release-upgrade on Ubuntu and my virtualenv is broken (they break 99% of the time)? I proceed to delete the virtualenv, but it keeps referencing wrongly built packages in the cache? So I proceed to wipe the cache? And then a bunch of other virtualenvs are no longer using a cache because of copy-on-write?

Is it correctly understood that this fits best for a CI environment where virtualenvs are created often? In this case, perhaps it could be possible to enable a behavior like this as non-default through a switch for pip where the copy-on-write unpacked cache can co-exist with the normal cache?

With my proposed idea a few posts up, there is no semantic change for any operation as it is today. Reflinks' COW properties are well suited here, but since we can't rely on them existing, we can let shutil handle that for us and just use shutil to copy an unpacked cached wheel into a virtual environment.

Without reflink it will just copy the file contents, basically the same as we're doing today except we skip unzipping the wheel (since it's already unzipped). With reflink support it will COW the files using reflinks.

hazho commented 2 years ago

So I proceed to wipe the cache? And then a bunch of other virtualenvs are no longer using a cache because of copy-on-write?

Not using the cache (for the other venvs) is not a big deal compared to the current pains we get!

RonnyPfannschmidt commented 2 years ago

@benjaoming most virtualenvs break on dist upgrade because of Python changes, not because of .so changes in wheels;

in particular, the wheels from PyPI will not break the shared libs

RonnyPfannschmidt commented 2 years ago

To elaborate, @benjaoming: the proposed change practically just changes the following: the wheels would be stored unpacked, so a fast copy/CoW copy, instead of an unzip, moves them from the cache into the virtualenv.

As such, the expected behaviour post-install will match the current mechanism 1:1.

Mr-Pepe commented 2 years ago

Creating a virtual environment is a major time sink compared to actually running the pipeline in a CI system I am working with. I like @dstufft's proposal of caching unpacked wheels because it unlocks immediate improvements without having to figure out the intricacies of sharing installed packages between environments. However, I don't have a clear idea of how big the cache would become. That's not a problem on a beefy CI server, but it should probably be opt-in.

benjaoming commented 2 years ago

as such the expected behaviour post install will match the current mechanism 1:1

@RonnyPfannschmidt Does a cache with unzipped wheels impose new constraints on pip's cache expiry mechanism? I take it the consensus here is "No", but I think it's good to ask for the sake of clarity, since caching is commonly understood to be a hard problem.

pfmoore commented 2 years ago

We're storing the cache uncompressed, so it will take up more room than storing it compressed.

Do we have any idea how much extra space this would take? Over in #11143 we're having a debate over trying to reduce the space usage of the HTTP cache, it seems inconsistent to do that and yet increase the space usage of the wheel cache without worrying about it...

(Personally, my machine is big enough that cache size isn't a significant issue, but we have enough users on space-limited systems such as the Raspberry Pi, that we can't assume disk space isn't an issue in general).

dstufft commented 2 years ago

We could potentially help systems like that by offering a flag to try hard links and, failing that, fall back to soft links, letting people opt into the space savings before reflink support is available to them.
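
As a rough illustration of that fallback chain (not an existing pip flag or code path):

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> None:
    try:
        os.link(src, dst)      # hard link: no extra space, same inode
        return
    except OSError:
        pass                   # e.g. cross-device link, or unsupported FS
    try:
        os.symlink(src, dst)   # soft link: saves space, but breaks if the cache is wiped
        return
    except OSError:
        pass                   # e.g. no symlink privilege on Windows
    shutil.copy2(src, dst)     # fall back to a real copy
```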

We could also just say that we're not super interested in this until reflink support is available in Python itself.

Or implement a cache clean up mechanism with some sort of LRU or something.

I don't know offhand how much compression a wheel achieves over uncompressed; zip file members are only compressed individually, so it won't be as high as it could be. Shouldn't be too hard to pull down a bunch of wheels from PyPI and look, though.
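
For anyone who wants to do that measurement, something along these lines would work; the glob pattern for locally built wheels in pip's cache directory is an assumption and may need adjusting per platform.

```python
import zipfile
from pathlib import Path


def size_ratio(wheel_path: Path) -> float:
    """Return unpacked size divided by zipped size for one wheel."""
    with zipfile.ZipFile(wheel_path) as zf:
        infos = zf.infolist()
        unpacked = sum(i.file_size for i in infos)
        packed = sum(i.compress_size for i in infos)
    return unpacked / packed if packed else 1.0


# Assumed location of locally built wheels; adjust for your cache directory.
for wheel in Path.home().glob(".cache/pip/wheels/**/*.whl"):
    print(f"{wheel.name}: {size_ratio(wheel):.2f}x")
```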

glyph commented 2 years ago

they're supported on Windows

As far as I understand, they are supported on ReFS, but this isn't the default filesystem on Windows (my laptop is still using NTFS). Unless ReFS presents itself as NTFS (and hence I'm using it without knowing) I suspect that the number of Windows environments where reflinks work is likely to be extremely small...

Ah, yes, I misread this. That's unfortunate. Still, their general availability on APFS potentially serves a lot of Python developers, and Btrfs is becoming the default root filesystem on more and more Linux distros.

Although https://en.wikipedia.org/wiki/ReFS looks very messy in terms of its development history and availability (it was available on most client versions of Windows until the Windows 10 Creators Update, and now it's reserved for Pro & Enterprise?), it still claims to have "the intent of becoming the 'next generation' file system after NTFS".

glyph commented 2 years ago

(I am going to try to stop falling into the rabbit hole of reading the tea leaves on Microsoft's future plans for this filesystem, but it does seem like https://github.com/microsoft/CopyOnWrite at least implies that Microsoft cares about the feature a little? )

Mr-Pepe commented 1 year ago

To provide a few numbers, I measured the time it took to install the following packages into a clean virtual environment.

Requirements ``` a2wsgi==1.6.0 alabaster==0.7.12 anyio==3.6.1 api4jenkins==1.13 arrow==1.2.3 astroid==2.12.10 attrs==22.1.0 Automat==20.2.0 Babel==2.10.3 bcrypt==4.0.0 binaryornot==0.4.4 black==22.8.0 bleach==5.0.1 bottle==0.12.23 Brotli==1.0.9 build==0.8.0 certifi==2022.9.24 cffi==1.15.1 chardet==5.0.0 charset-normalizer==2.1.1 click==8.1.3 colorama==0.4.5 commonmark==0.9.1 constantly==15.1.0 cookiecutter==2.1.1 coverage==6.5.0 CProfileV==1.0.7 cruft==2.11.1 cryptography==38.0.1 cssbeautifier==1.14.6 dash==2.6.2 dash-bootstrap-components==1.2.1 dash-core-components==2.0.0 dash-html-components==2.0.0 dash-table==5.0.0 deepdiff==5.8.1 Deprecated==1.2.13 dill==0.3.5.1 distlib==0.3.6 djlint==1.18.0 docutils==0.17.1 EditorConfig==0.12.3 elementpath==3.0.2 execnet==1.9.0 fabric==2.7.1 fastapi==0.85.0 filelock==3.8.0 fire==0.4.0 Flask==2.1.3 Flask-Compress==1.13 furl==2.1.3 gitdb==4.0.9 GitPython==3.1.27 greenlet==1.1.3 h11==0.14.0 html-tag-names==0.1.2 html-void-elements==0.1.0 hyperlink==21.0.0 idna==3.4 imagesize==1.4.1 importlib-metadata==4.12.0 incremental==21.3.0 iniconfig==1.1.1 invoke==1.7.3 isort==5.10.1 itsdangerous==2.1.2 jaraco.classes==3.2.3 jeepney==0.8.0 Jinja2==3.1.2 jinja2-time==0.2.0 joblib==1.2.0 jsbeautifier==1.14.6 keyring==23.9.3 lazy-object-proxy==1.7.1 lxml==4.9.1 MarkupSafe==2.1.1 mccabe==0.7.0 more-itertools==8.14.0 mypy==0.981 mypy-extensions==0.4.3 numpy==1.23.3 opcua==0.98.13 ordered-set==4.1.0 orderedmultidict==1.0.1 packaging==21.3 pandas==1.5.0 paramiko==2.11.0 pathlib2==2.3.7.post1 pathspec==0.10.1 pep517==0.13.0 Pillow==9.2.0 pkginfo==1.8.3 platformdirs==2.5.2 playwright==1.26.1 plotly==5.10.0 pluggy==1.0.0 prompt-toolkit==3.0.31 psutil==5.9.2 py==1.11.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.21 pydantic==1.10.2 pydocstyle==6.1.1 pyee==8.1.0 pyfakefs==4.7.0 PyGithub==1.55 Pygments==2.13.0 PyJWT==2.5.0 pylint==2.15.3 PyNaCl==1.5.0 pyOpenSSL==22.1.0 pyparsing==3.0.9 pysoem==1.0.7 pytest==7.1.3 pytest-base-url==2.0.0 pytest-cov==4.0.0 pytest-forked==1.4.0 pytest-httpserver==1.0.6 pytest-mock==3.9.0 pytest-playwright==0.3.0 pytest-randomly==3.12.0 pytest-xdist==2.5.0 python-dateutil==2.8.2 python-multipart==0.0.5 python-slugify==6.1.2 python-sonarqube-api==1.3.3 pytz==2022.2.1 PyYAML==6.0 questionary==1.10.0 readme-renderer==37.2 regex==2022.9.13 requests==2.28.1 requests-mock==1.10.0 requests-toolbelt==0.9.1 rfc3986==2.0.0 rich==12.5.1 scikit-learn==1.1.2 scipy==1.9.1 scp==0.14.4 SecretStorage==3.3.3 service-identity==21.1.0 six==1.16.0 smmap==5.0.0 sniffio==1.3.0 snowballstemmer==2.2.0 Sphinx==5.2.3 sphinx-rtd-theme==1.0.0 sphinxcontrib-applehelp==1.0.2 sphinxcontrib-devhelp==1.0.2 sphinxcontrib-htmlhelp==2.0.0 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.3 sphinxcontrib-serializinghtml==1.1.5 sphinxemoji==0.2.0 starlette==0.20.4 tabulate==0.8.10 tenacity==8.1.0 termcolor==2.0.1 text-unidecode==1.3 threadpoolctl==3.1.0 toml==0.10.2 tomli==2.0.1 tomlkit==0.11.5 torch==1.12.1 tox==3.26.0 tqdm==4.64.1 twine==4.0.1 Twisted==22.8.0 typer==0.6.1 types-cryptography==3.3.23 types-docutils==0.19.1 types-paramiko==2.11.6 types-python-dateutil==2.8.19 types-PyYAML==6.0.12 types-requests==2.28.11 types-setuptools==65.4.0.0 types-tabulate==0.8.11 types-urllib3==1.26.25 typing_extensions==4.3.0 Unidecode==1.3.6 urllib3==1.26.12 uvicorn==0.18.3 virtualenv==20.16.5 wcwidth==0.2.5 webencodings==0.5.1 websockets==10.1 Werkzeug==2.0.0 wrapt==1.14.1 xmlschema==2.1.0 xmltodict==0.13.0 zipp==3.8.1 zope.interface==5.4.0 ```

The results were:

The size of the unpacked wheels does not seem unreasonable to me. However, it looks like the compiled files would have to be cached as well to get the best performance improvements.

pradyunsg commented 1 year ago

FWIW, if someone wants to help move this forward, a prototype of this would be very welcome and should be relatively straightforward to implement with installer.

The logic you'd need to implement would be to derive from SchemeDictionaryDestination, and override write_to_fs to behave in the manner requested here -- writing the file to a cache (if it isn't written there already) and creating a hard link to it from the actual location you need to write it to. You might need to add additional arguments to __init__ depending on what exact semantics you're looking for.
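
A rough, untested sketch of that idea, assuming installer's SchemeDictionaryDestination API (a write_to_fs(scheme, path, stream, is_executable) method returning a RecordEntry, and a scheme_dict attribute); the content-addressed cache layout and the class name are made up for illustration.

```python
import hashlib
import io
import os

from installer.destinations import SchemeDictionaryDestination


class CachingDestination(SchemeDictionaryDestination):
    """Write files normally, then de-duplicate them into a shared cache
    keyed by content hash and hard-link the cached copy back into place."""

    def __init__(self, *args, cache_dir, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache_dir = cache_dir

    def write_to_fs(self, scheme, path, stream, is_executable):
        data = stream.read()
        # Let the parent handle hashing, RECORD bookkeeping and exec bits.
        record = super().write_to_fs(scheme, path, io.BytesIO(data), is_executable)

        target = os.path.join(self.scheme_dict[scheme], path)
        cached = os.path.join(self.cache_dir, hashlib.sha256(data).hexdigest())
        os.makedirs(self.cache_dir, exist_ok=True)
        if not os.path.exists(cached):
            os.link(target, cached)   # seed the cache from this install
        else:
            os.remove(target)
            os.link(cached, target)   # reuse the cached copy via a hard link
        return record
```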

Having a cross-platform prototype of this would be a major piece in helping move this forward, since I reckon it's unlikely that one of pip's existing maintainers will have the bandwidth to explore this.

RobertRosca commented 1 year ago

I did work on a proof of concept that tries to solve this issue, just in a slightly different way: it uses installer to implement a basic wheel installer that installs packages to multi-site-packages/{package_name}/{package_version}, but instead of putting reflinks/symlinks to packages inside the site-packages directory of a venv, it relies on a custom importlib finder which reads a lockfile and inserts the path to the requested version of the package into sys.path before importing.
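
As a condensed illustration of that import-hook idea (not the actual proof of concept): the sketch below delegates to PathFinder with a restricted search path instead of mutating sys.path, and the lockfile format, directory layout, and the assumption that the distribution name equals the import name are all simplifications.

```python
import importlib.abc
import importlib.machinery
import json
import sys
from pathlib import Path

MULTI_SITE = Path("multi-site-packages")                  # assumed layout: <name>/<version>/
LOCKFILE = json.loads(Path("env.lock.json").read_text())  # e.g. {"requests": "2.28.1"}


class LockfileFinder(importlib.abc.MetaPathFinder):
    """Resolve imports against the version pinned in the lockfile."""

    def find_spec(self, fullname, path=None, target=None):
        top_level = fullname.split(".")[0]
        version = LOCKFILE.get(top_level)
        if version is None:
            return None  # not managed by the lockfile; let other finders handle it
        pkg_dir = MULTI_SITE / top_level / version
        # Delegate to the normal path-based machinery, restricted to that directory.
        return importlib.machinery.PathFinder.find_spec(fullname, [str(pkg_dir)])


sys.meta_path.insert(0, LockfileFinder())
```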

Made a post on the Python forums here if anybody would like to join the discussion.