glyph opened this issue 2 years ago
Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache. Of course, AFAIK pip doesn't actually have any policy for evicting items from the cache currently, but this would rule it out forever, and also mean that it's no longer safe for the user to blow away the cache. (Also, I suspect there are plenty of automated tools out there that do this? It's easy to imagine a Dockerfile doing `rm -rf ~/.cache` after completing installation.)
Hardlinks avoid these issues. I dunno if they create any new ones – might want to check with the conda folks, since they have years of experience with doing this (with hardlinks).
A third option to consider might be reflinks.
> Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache.
Hardlinks have a different version of this issue, too: knowing the side effects of editing the file in place. I definitely scribble on my venvs periodically because of my experience with the unreliability of Python debuggers, and scribbling on all of them at once would definitely be an unwelcome surprise.
> A third option to consider might be reflinks.
So I thought of this and immediately discarded the thought, because reflinks are a super obscure feature that only barely works on Btrfs, right? But your comment got me to do a little bit of research, and I discovered they're supported on Windows, on APFS on macOS, and on Btrfs, CIFS, NFS 4.2, OCFS2, overlayfs, and XFS on Linux. Given this surprisingly wide deployment, and the relative lack of any issues with refcounting or accidental mutability, maybe it would be good to implement these first?
> they're supported on Windows
As far as I understand, they are supported on ReFS, but this isn't the default filesystem on Windows (my laptop is still using NTFS). Unless ReFS presents itself as NTFS (and hence I'm using it without knowing) I suspect that the number of Windows environments where reflinks work is likely to be extremely small...
AFAIR Python's `shutil.copytree` automatically tries to use reflinks when available (verification needed)
I don't see any references to reflink in 3.10's shutil.py...
@pfmoore they come in via the `copy_file_range` helpers used to optimize https://docs.python.org/3/library/shutil.html#shutil-platform-dependent-efficient-copy-operations since Python 3.8
@RonnyPfannschmidt Are you sure about that?
> Starting from Python 3.8, all functions involving a file copy ([copyfile()](https://docs.python.org/3.8/library/shutil.html#shutil.copyfile), [copy()](https://docs.python.org/3.8/library/shutil.html#shutil.copy), [copy2()](https://docs.python.org/3.8/library/shutil.html#shutil.copy2), [copytree()](https://docs.python.org/3.8/library/shutil.html#shutil.copytree), and [move()](https://docs.python.org/3.8/library/shutil.html#shutil.move)) may use platform-specific "fast-copy" syscalls in order to copy the file more efficiently (see [bpo-33671](https://bugs.python.org/issue33671)). "fast-copy" means that the copying operation occurs within the kernel, avoiding the use of userspace buffers in Python as in "outfd.write(infd.read())".
This explanation does not match reflinks. Reflink is a feature of *some* filesystems (nice explanation here: https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for), and the shutil Python 3.8 implementation just mentions a "fast-copy" operation done in the kernel rather than in user space, with the optimization coming from avoiding multiple user<->kernel syscalls and userspace buffers. That is a different thing than reflinks altogether, IMHO.
I believe reflinks require explicit system calls (like, for example, what https://pypi.org/project/reflink/ provides), and they are very much tied to which filesystem the files are on.
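For illustration, on Linux a reflink is requested with the `FICLONE` ioctl. This is a rough Linux-specific sketch, not pip code; the helper name is made up, and it simply falls back to a plain copy when the filesystem refuses to clone:

```python
import fcntl
import os
import shutil

# Linux FICLONE ioctl request number (from <linux/fs.h>).
FICLONE = 0x40049409

def clone_or_copy(src: str, dst: str) -> str:
    """Try a reflink (copy-on-write clone); fall back to a plain copy."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to share the underlying extents (Btrfs/XFS/etc.).
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "reflink"
        except OSError:
            pass  # filesystem doesn't support reflinks (e.g. ext4)
    shutil.copyfile(src, dst)
    return "copy"
```

Whether the clone path or the copy path is taken depends entirely on the filesystem the two paths live on, which is exactly the portability problem discussed above.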
On Linux, I think using `os.copy_file_range` will use COW copying if possible? See https://github.com/python/cpython/issues/81338
No idea what it does on Windows, or if that function is even available on Windows, but it looks like reflink is only available on Windows with ReFS.
It doesn't appear that shutil currently uses `os.copy_file_range`, though, but https://github.com/python/cpython/issues/81338 is trying to add that feature. So pip could just take the approach of caching unpacked wheels, using `shutil.copytree` on them, and leaving it up to Python to implement reflink support where possible.
We'd still get performance improvements from not having to unzip into a temporary location and copy out of that, and IIRC we're using the default temporary directory by default, which is oftentimes on another file system, so we'd be more likely to use fast copying at a minimum as well.
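A minimal sketch of that approach (the cache layout, hashing scheme, and function name here are hypothetical, not pip's actual code): unpack the wheel into a cache directory once, then copy the unpacked tree into the environment with `shutil.copytree`, letting shutil pick the fastest copy mechanism available.

```python
import hashlib
import os
import shutil
import zipfile

def install_from_unpacked_cache(wheel_path: str, cache_root: str,
                                site_packages: str) -> str:
    """Unpack the wheel into the cache once, then copy the tree out of it."""
    # Key the cache entry by the wheel's content hash (hypothetical layout).
    with open(wheel_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    entry = os.path.join(cache_root, digest)
    if not os.path.isdir(entry):
        with zipfile.ZipFile(wheel_path) as zf:
            zf.extractall(entry)  # pay the unzip cost only once
    # shutil.copytree uses platform "fast-copy" syscalls where available.
    shutil.copytree(entry, site_packages, dirs_exist_ok=True)
    return entry
```

The second and subsequent installs of the same wheel skip the unzip step entirely, which is where the immediate win comes from even without reflink support.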
Indeed, it seems like I misremembered a detail.
TIL about reflinks. BTW, reflinks are a nice feature; a pity they're only available on some "obscurish" filesystems.
I have a feeling that this may need to be solved one layer up, in the virtual environment abstraction. Node has more or less the same problem, and the way they currently address this (pnpm) is to share package installations between environments where possible. This would be more doable when referencing files directly from the pip cache. All the same issues with soft-linking still persist, of course, although Node has never been that friendly to development environments on Windows, so they probably just don't care that much (I didn't check).
> Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache. Of course AFAIK pip doesn't actually have any policy for evicting items from the cache currently, but this would rule it out forever, and also mean that it's no longer safe for the user to blow away the cache.
And there is actually work toward this right now (see recent comments in #2984 and other issues referenced in it), so we probably don’t want to go toward this particular direction, at least not without a lot of discussion.
I don't think you can solve this in the virtual environment abstraction? At least I'm not sure how you're envisioning that working? The virtual environment abstraction largely is just setting up `sys.path`; how things get installed onto that `sys.path` isn't really its concern, unless you have something else in mind that I'm not thinking of? Solving it there also doesn't solve it for cases that aren't inside of a virtual environment.
I think the only reasonable path here is pretty straightforward: use `shutil.copytree` to copy out of the wheel cache.

This has some immediate benefits:

With some immediate downsides:

Then it also has some longer term benefits: if shutil gains `os.copy_file_range` support, that's an additional speed up, e.g. via the `copy_function` argument to `shutil.copytree`.
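A sketch of what such a `copy_function` could look like (illustrative only, not a proposed pip implementation): it tries `os.copy_file_range` where available (Linux, Python 3.8+) and falls back to `shutil.copy2` everywhere else.

```python
import os
import shutil

def fast_copy(src: str, dst: str) -> None:
    """Illustrative copy_function: try os.copy_file_range, else fall back."""
    if hasattr(os, "copy_file_range"):  # Linux only, Python 3.8+
        try:
            with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
                remaining = os.fstat(fsrc.fileno()).st_size
                while remaining > 0:
                    # In-kernel copy; may be COW on filesystems that support it.
                    copied = os.copy_file_range(fsrc.fileno(), fdst.fileno(),
                                                remaining)
                    if copied == 0:
                        break
                    remaining -= copied
            shutil.copystat(src, dst)
            return
        except OSError:
            pass  # old kernel or unsupported filesystem; fall through
    shutil.copy2(src, dst)  # portable fallback

# Usage: shutil.copytree(cached_wheel_dir, target_dir, copy_function=fast_copy)
```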
This practically is a bit like the proposal for shared storage I tried to bring forward a while back.
Here, we are talking about a few things to be improved:

1. decreasing the reserved space;
2. increasing the speed of installation;
3. decreasing the processing needed for each of: compression/decompression, making new network requests, checking hashes for integrity of the source, and saving new files of the same versions in each venv with their actual data, etc.;
4. decreasing network traffic, hence saving the developer's internet data allowance while on a metered network.

And as far as I can tell, this has only these negatives:

1. increasing the processing needed to handle the copying efficiently (e.g. for shutil);
2. the time spent on discussion, planning and development of the new generic approach.
> Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache.
Soft links can point to directories or files, and are therefore much more viable. To overcome your worry about safe removal, we can create another utility for file linking (specifically for PyPI) that generates the links in soft format but keeps the original data as long as any links remain; only when the last link is removed does the remove action delete the actual data. The approach is kinda simple: once there is more than one dependant on the same version of a certain dependency, the real data of the dependency is moved to another (shared) folder and all dependants are soft-linked to its location. (This way, we don't need pip to handle linking for single-dependant projects/venvs!) When a certain venv/dependant is removed, the actual data is still in the same shared place for the other dependants.
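A toy sketch of that refcounted soft-link scheme (the sidecar `.refs.json` file and both function names are invented for illustration; a real tool would need locking and crash safety):

```python
import json
import os
import shutil

def _load_refs(meta: str) -> list:
    if os.path.exists(meta):
        with open(meta) as f:
            return json.load(f)
    return []

def add_link(shared_entry: str, link_path: str) -> None:
    """Soft-link a dependant to the shared copy and record the reference."""
    meta = shared_entry + ".refs.json"  # hypothetical sidecar refcount file
    refs = _load_refs(meta)
    os.symlink(shared_entry, link_path)
    refs.append(link_path)
    with open(meta, "w") as f:
        json.dump(refs, f)

def remove_link(shared_entry: str, link_path: str) -> None:
    """Drop one reference; delete the shared data only when none remain."""
    meta = shared_entry + ".refs.json"
    refs = _load_refs(meta)
    os.unlink(link_path)
    if link_path in refs:
        refs.remove(link_path)
    if refs:
        with open(meta, "w") as f:
            json.dump(refs, f)
    else:
        # last dependant gone: now it is safe to remove the real data
        shutil.rmtree(shared_entry, ignore_errors=True)
        os.remove(meta)
```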
Notes:
It's a very interesting concept, but there must be a lot of edge cases to explore. What happens with this feature if I run `do-release-upgrade` on Ubuntu and my virtualenv is broken (they break 99% of the time)? I proceed to delete the virtualenv, but it keeps referencing wrongly built packages in the cache? So I proceed to wipe the cache? And then a bunch of other virtualenvs are no longer using a cache because of copy-on-write?

Is it correctly understood that this fits best for a CI environment where virtualenvs are created often? In this case, perhaps it could be possible to enable a behavior like this as non-default, through a switch for pip, where the copy-on-write unpacked cache can co-exist with the normal cache?
> It's a very interesting concept, but there must be a lot of edge cases to explore. What happens with this feature if I run do-release-upgrade on Ubuntu and my virtualenv is broken (they break 99% of the time)? I proceed to delete the virtualenv, but it keeps referencing wrongly built packages in the cache? So I proceed to wipe the cache? And then a bunch of other virtualenvs are no longer using a cache because of copy-on-write?
>
> Is it correctly understood that this fits best for a CI environment where virtualenvs are created often? In this case, perhaps it could be possible to enable a behavior like this as non-default through a switch for pip where the copy-on-write unpacked cache can co-exist with the normal cache?
With my proposed idea a few posts up, there is no semantic change for any operation as it works today. Reflinks' COW properties are well suited here, but since we can't rely on them existing, we can let shutil handle that for us, and then just use shutil to copy an unpacked cached wheel into a virtual environment.
Without reflink it will just copy the file contents, basically the same as we're doing today except we skip unzipping the wheel (since it's already unzipped). With reflink support it will COW the files using reflinks.
> So I proceed to wipe the cache? And then a bunch of other virtualenvs are no longer using a cache because of copy-on-write?
Other venvs not using the cache is not a big deal, compared to the current pains we get!
@benjaoming most virtualenvs break on dist upgrade because of Python changes, not because of `.so` changes in wheels; in particular, wheels from PyPI will not break the shared libs.
To elaborate, @benjaoming, the proposed change tactically just changes the following: the wheels would be stored unpacked, using fast copy/COW copy instead of "unzip" to put them from the cache into the virtualenv.
as such the expected behaviour post install will match the current mechanism 1:1
Creating a virtual environment is a major time sink compared to actually running the pipeline in a CI system I am working with. I like @dstufft 's proposal of caching unpacked wheels because it unlocks immediate improvements without having to figure out the intricacies of sharing installed packages between environments. However, I don't have a clear idea how big the cache would become. That's not a problem on a beefy CI server but should probably be opt-in.
> as such the expected behaviour post install will match the current mechanism 1:1
@RonnyPfannschmidt Does a cache w/unzipped wheels impose new constraints on pip's cache expiry mechanism? I take it that the consensus here is "No" - but I think it's good to ask for the sake of clarity, since caching is commonly understood as a hard problem.
We're storing the cache uncompressed, so it will take up more room than storing it compressed.
Do we have any idea how much extra space this would take? Over in #11143 we're having a debate over trying to reduce the space usage of the HTTP cache, it seems inconsistent to do that and yet increase the space usage of the wheel cache without worrying about it...
(Personally, my machine is big enough that cache size isn't a significant issue, but we have enough users on space-limited systems such as the Raspberry Pi, that we can't assume disk space isn't an issue in general).
We could potentially help systems like that by offering a flag to try and issue hard links, and failing that fall back to soft links and let people opt into space saving prior to reflink support being available for them.
We could also just say that we're not super interested in this until reflink support is available in Python itself.
Or implement a cache clean up mechanism with some sort of LRU or something.
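The opt-in space-saving flag described above could behave roughly like this (the function name and fallback order are illustrative, not pip's actual behavior): try a hard link first, then a soft link, and only copy as a last resort.

```python
import os
import shutil

def space_saving_place(cached_file: str, target: str) -> str:
    """Hypothetical opt-in placement: hard link, then soft link, then copy."""
    try:
        os.link(cached_file, target)       # same filesystem, shares the inode
        return "hardlink"
    except OSError:
        pass
    try:
        os.symlink(cached_file, target)    # works across filesystems
        return "symlink"
    except OSError:
        pass
    shutil.copy2(cached_file, target)      # always works, uses full space
    return "copy"
```

The hard-link and soft-link outcomes use essentially no extra space, at the cost of the mutability and cache-eviction caveats discussed earlier in the thread.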
I don't know offhand how much compression a wheel achieves over uncompressed; zip file members are only stored individually compressed, so it won't be as high as it could be. Shouldn't be too hard to pull down a bunch of wheels from PyPI and look, though.
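As a rough way to gather those numbers, a helper like this (`wheel_compression_ratio` is a made-up name; wheels are plain zip files) could be run over a batch of downloaded wheels:

```python
import zipfile

def wheel_compression_ratio(wheel_path: str) -> float:
    """Uncompressed size / compressed size over a wheel's zip members."""
    with zipfile.ZipFile(wheel_path) as zf:
        infos = [i for i in zf.infolist() if not i.is_dir()]
        uncompressed = sum(i.file_size for i in infos)
        compressed = sum(i.compress_size for i in infos)
    return uncompressed / compressed if compressed else 1.0
```

Averaging this over popular wheels would give a concrete estimate of how much larger an unpacked cache would be than the current compressed one.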
> > they're supported on Windows
>
> As far as I understand, they are supported on ReFS, but this isn't the default filesystem on Windows (my laptop is still using NTFS). Unless ReFS presents itself as NTFS (and hence I'm using it without knowing) I suspect that the number of Windows environments where reflinks work is likely to be extremely small...
Ah, yes, I misread this. That's unfortunate. Still, their general availability on APFS potentially serves a lot of Python developers, and Btrfs is coming to more and more Linux distros as the default root filesystem.
Although https://en.wikipedia.org/wiki/ReFS looks very messy in terms of its development history/availability (it was available on most client versions of Windows until the Windows 10 Creators Update, and now it's reserved for Pro & Enterprise?), it still claims it has "the intent of becoming the 'next generation' file system after NTFS".
(I am going to try to stop falling into the rabbit hole of reading the tea leaves on Microsoft's future plans for this filesystem, but it does seem like https://github.com/microsoft/CopyOnWrite at least implies that Microsoft cares about the feature a little? )
To provide a few numbers I measured the time it took to install the following packages into a clean virtual environment.
The results were:

- 35s spent in `_install_wheel()`, of which: `compileall.compile_file`, `ZipBackedFile.save()`
- 988MiB `http` cache
- 1.7MiB `wheels` cache
- 2.3GiB in total when unpacking the wheel files
- 2.4GiB resulting venv size
The size of the unpacked wheels does not seem unreasonable to me. However, it looks like the compiled files would have to be cached as well to get the best performance improvements.
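One way to cache the compiled files as well, sketched here with made-up names (not pip behavior), is to byte-compile the unpacked cache entry once so later `copytree` calls carry the `.pyc` files along for free. A caveat: the `co_filename` baked into those `.pyc` files would point at the cache path, not the venv.

```python
import compileall
import os
import shutil

def install_with_pyc(entry_dir: str, site_packages: str) -> None:
    """Byte-compile the cached tree once, then copy sources + .pyc together."""
    # Compile in the cache so every later install skips compileall entirely.
    compileall.compile_dir(entry_dir, quiet=1)
    shutil.copytree(entry_dir, site_packages, dirs_exist_ok=True)
```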
FWIW, if someone wants to help move this forward, a prototype of this would be very welcome and should be relatively straightforward to implement with `installer`.
The logic you'd need to implement would be to derive from `SchemeDictionaryDestination`, and override `write_to_fs` to behave in the manner requested here -- writing the file to a cache (if it isn't written there already) and creating a hard link to it from the actual location you need to write it to. You might need to add additional arguments to `__init__` depending on what exact semantics you're looking for.
Having a cross-platform prototype of this would be a major piece in helping move this forward; since I reckon it's unlikely that one of pip's existing maintainers will have the bandwidth to explore this.
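The cache-then-hardlink step described above could be factored into a small helper that a hypothetical `write_to_fs` override would call for each file. The content-addressed cache layout and the function name are made up for illustration:

```python
import hashlib
import os

def link_from_cache(cache_root: str, stream, dest_path: str) -> str:
    """Store stream content in a content-addressed cache, then hard link it.

    `stream` is any binary file-like object, like write_to_fs receives.
    """
    data = stream.read()
    digest = hashlib.sha256(data).hexdigest()
    cached = os.path.join(cache_root, digest[:2], digest)
    if not os.path.exists(cached):
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        with open(cached, "wb") as f:
            f.write(data)
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    os.link(cached, dest_path)  # shares the inode with the cache copy
    return cached
```

Installing the same file into two environments then consumes the disk space only once, since every destination is a hard link to the single cached copy.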
I did work on a proof of concept that tries to solve this issue in a slightly different way: it uses `installer` to implement a basic wheel installer that installs packages to `multi-site-packages/{package_name}/{package_version}`, but instead of putting reflinks/symlinks to packages inside the `site-packages` directory of a venv, it relies on a custom `importlib` finder which reads a lockfile and inserts the path to the requested version of the package into `sys.path` before importing.
Made a post on the Python forums here if anybody would like to join the discussion.
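A toy version of such a finder might look like this (the class name and the lockfile mapping are illustrative, not the PoC's actual code): a meta-path finder that restricts the search for locked top-level names to their pinned version directory.

```python
import importlib.machinery
import sys

class LockfileFinder:
    """Meta-path finder resolving locked packages from versioned directories.

    `locked` maps a top-level name to the directory holding the pinned
    version, e.g. {"requests": ".../multi-site-packages/requests/2.31.0"}.
    """

    def __init__(self, locked):
        self.locked = locked

    def find_spec(self, fullname, path=None, target=None):
        top = fullname.partition(".")[0]
        if top not in self.locked:
            return None  # let the normal import machinery handle it
        # Search only the locked directory for this distribution.
        return importlib.machinery.PathFinder.find_spec(
            fullname, [self.locked[top]])

# Usage: sys.meta_path.insert(0, LockfileFinder(locked_mapping))
```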
**What's the problem this feature will solve?**
Creating a new virtual environment in a modern Python project can be quite slow, sometimes on the order of tens of seconds even on very high-end hardware, once you have a lot of dependencies. It also takes up a lot of space; my `~/.virtualenvs/` is almost 3 gigabytes, and this is a relatively new machine; and that isn't even counting my `~/.local/pipx`, which is another 434M.

**Describe the solution you'd like**
Rather than unpacking and duplicating all the data in wheels, pip could store the cache unpacked, so all the files are already on the filesystem, and then clone them into place on copy-on-write filesystems rather than copying them. While there may be other bottlenecks, this would also reduce disk usage by an order of magnitude. (My `~/Library/Caches/pip` is only 256M, and presumably all those virtualenvs contain multiple full, uncompressed copies of it!)

**Alternative Solutions**
You could get a similar reduction effect by setting up an import hook, using zipimport, or doing some kind of `.pth` file shenanigans, but I feel like those all have significant drawbacks.

**Additional context**
Given that platforms generally use shared memory-maps for shared object files, if it's done right this could additionally reduce the memory footprint of python interpreters in different virtualenvs with large C extensions loaded.