Sign me up; though for virtualenv I'd have some perhaps unusual requirements (which may end up falling out of scope, in which case virtualenv couldn't use it, to be fair):
If we’re going the sans-I/O route, it would make most sense to me to structure this project as a collection of tools—parsers and “unparsers” (is this an actual word?) needed to convert a wheel into an “installation”.
It likely also makes sense to implement a wrapper API that calls those tools in order, which can both be used as-is, or serve as an example if someone wants to plug the tools into a different I/O source. The API @xavfernandez used in cric makes a lot of sense to me, although I'd prefer it taking a file-like object, and making the scheme a structured object.
I don’t have experience with sans-I/O interface design, so it makes most sense to me if we can implement the wrapper API first with the installation steps suggested by @dholth, and slowly pick it apart into the two-layer design. But I’ll happily follow and learn from people who do know how this kind of interface works 🙂
The README says the project is for installing Python wheels
“unparsers”
"compiler", or "formatter", depending on context
I'll (potentially foolishly) toss my hat in the ring to help out. 😉
And I agree with @uranusjr that this project should decompose into individual functions so that it can compose into a complete installation tool (probably based on the general approach outlined in PEP 427). That should give @gaborbernat the flexibility he's after -- especially if this is kept sans-I/O as that won't make assumptions about how things are stored -- while still being useful to build up to a single pep517-like "install this wheel" function for the common case.
I spent some time thinking about this, reading the discussions and... here's an initial design to iterate upon. We'll likely need higher-level functions too (eg. to go directly from wheel to installed in the same interpreter), but before that, I'd like to pin down what the internals look like. :)
```
installer
    __init__.py
    core.py
        -> describes abstraction + holds the sans-I/O logic

        Scheme = Literal['purelib', 'platlib', 'headers', 'scripts', 'data']

        WheelSource
            name: str
            version: packaging.Version
            get_all_names() -> List[str]
            get_info(filename: str) -> ZipInfo or KeyError
            open(zipinfo) -> BufferedReader

        Destination
            write_to(scheme_key: Scheme, path: str, file: BufferedReader) -> None
            compile_py(scheme_key: Scheme, path: str) -> None
            write_record(scheme_key: Scheme) -> None

        validator = (source: Source) -> bool
        make_script = (specification: str, gui: bool = False) -> BufferedReader

        Installer
            __init__(name, script_maker: ScriptMaker, validators: List[validator])
                name -> ends up in INSTALLER
                     -> normalized lower-case string matching "[a-z0-9_\-\.]"
            install(source: WheelSource, destination: Destination)
                1. check WHEEL file
                2. validators[*](source)
                3. "spread" using zipinfo objects in-memory
                4. destination.write_to(<every non-script file>)
                5. destination.write_to(<every script file w/ rewritten shebang>)
                6. destination.write_to(make_script(<every console_scripts entry>, gui=False))
                7. destination.write_to(make_script(<every gui_scripts entry>, gui=True))
                8. destination.compile_py(<every .py file>)
                9. destination.write_record(...)

    shell.py
        -> Provides a few concrete implementations for Source, Destination,
           validator and make_script

        WheelFileSource
            __init__(filename)
            ...
        UnpackedWheelSource
            __init__(directory)
            ...

        # Script Makers
        -> something built on top of simple_launcher?

        # Validators
        validate_record(...)

        # Destinations
        DirectoryMappedDestination
            __init__(scheme_location_mapping: Dict[Scheme, str])
            ...
```
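To make that outline a bit more concrete, here's a rough Python sketch of the core.py abstractions; the names follow the outline, but the typing details and method bodies are my own illustration rather than settled API.

```python
# Rough, illustrative sketch of the abstractions outlined above -- not settled API.
# Assumes Python 3.8+ for typing.Literal.
from typing import BinaryIO, Callable, List, Literal
from zipfile import ZipInfo

Scheme = Literal["purelib", "platlib", "headers", "scripts", "data"]


class WheelSource:
    """Something that knows how to provide the contents of a wheel."""

    name: str
    version: str  # the outline uses packaging.Version

    def get_all_names(self) -> List[str]:
        """Return the names of all files in the wheel."""
        raise NotImplementedError

    def get_info(self, filename: str) -> ZipInfo:
        """Return metadata for a member; raise KeyError if it does not exist."""
        raise NotImplementedError

    def open(self, zipinfo: ZipInfo) -> BinaryIO:
        """Return a readable binary stream for the given member."""
        raise NotImplementedError


class Destination:
    """Somewhere the files from a wheel can be written to."""

    def write_to(self, scheme_key: Scheme, path: str, file: BinaryIO) -> None:
        raise NotImplementedError

    def compile_py(self, scheme_key: Scheme, path: str) -> None:
        raise NotImplementedError

    def write_record(self, scheme_key: Scheme) -> None:
        raise NotImplementedError


# Callables rather than classes, per the outline.
Validator = Callable[[WheelSource], bool]
MakeScript = Callable[[str, bool], BinaryIO]
```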
Notes:

- Destination implementations can use shutil.copyfileobj, if they know their corresponding Source is coming from disk.
- Should .pyc files be in RECORD? Should we be compiling unconditionally? Installer defers this to the Destination, but what behavior do we provide in shell.py out-of-the-box?

A note I missed: ScriptMaker w/ make_console_script and make_gui_script? ... to get rid of the need for a boolean parameter (gui: bool) on the function.

- Are you tying the WheelSource interface to zipfile.ZipFiles? Did you want something more generic for the return-value of WheelSource.get_info, like a new data-class defined in "core.py"?
- Why does WheelSource.open require an info object, and not the filename?
- Would it be worthwhile to have more direct access to wheel metadata in WheelSource, such as a get_wheel_metadata method? This could help if it's decided to split wheel metadata ("entry_points.txt", "METADATA", etc) from package data, see in discourse: Making the wheel format more flexible
- scheme_key is an argument to every method in Destination. Another option is to have a Destination subclass for each scheme-type (ie a PurelibDestination subclass, a ScriptsDestination subclass, etc), and compile_py would only be available for Python files. This might lock in scheme types though
- Is write_record supposed to be called for each (used) scheme, or called once? Does it require state in Destination of what's been installed?
- validator returning True/False is limited. Instead, it could raise a ValidationError (defined in "core.py") subclass with descriptive message, which Installer.install can catch and handle or leave for callers
- Passing destination as a parameter to Installer.install rather than an initialisation parameter implies that you intend to allow reuse of the Installer for different destinations

Are you tying the WheelSource interface to zipfile.ZipFiles? Did you want something more generic for the return-value of WheelSource.get_info, like a new data-class defined in "core.py"?
Yes. No, since wheels are currently "literally a zip file". I'm open to changing this, if there's a good reason to do so.
The main goal here is to have a good way to preserve file metadata. I'm open to using something else entirely. :)
Why does WheelSource.open require an info object, and not the filename?
🤷♂️ I'm fine w/ either.
scheme_key is an argument to every method in Destination. Another option is to have a Destination subclass for each scheme-type (ie a PurelibDestination subclass, a ScriptsDestination subclass, etc), and compile_py would only be available for Python files. This might lock in scheme types though
This is intentional -- I don't think having 5 classes that need to be subclassed by the users would be a good idea, especially given how uniformly they're treated. See https://www.python.org/dev/peps/pep-0427/#installing-a-wheel-distribution-1-0-py32-none-any-whl for how the wheel install procedure works.
Is write_record supposed to be called for each (used) scheme, or called once? Does it require state in Destination of what's been installed?
Just once; with the scheme based on Root-Is-Purelib's value.
validator returning True/False is limited. Instead, it could raise a ValidationError (defined in "core.py") subclass with descriptive message, which Installer.install can catch and handle or leave for callers
Good call.
Would it be worthwhile to have more direct access to wheel metadata in WheelSource, such as a get_wheel_metadata method?
Hmm... Maybe?
I think you can count on random access to the files in the .dist-info directory and encourage sequential access to the rest of the data.
There's not too much important metadata in ZipInfo from wheel's perspective. I suppose we care about the +x bit, the 'is_dir' flag, the timestamp which cannot be before 1980 but might be fixed to a single date by the wheel generator for reproducible builds. ZIP should also make it possible to support symlinks, don't think anyone has tried that in wheel. Then there is the compression method which is also part of ZipInfo but might not be exposed per-file in a higher level wheel library.
In bdist_wheel's WheelFile RECORD gets automatically written on close. WheelFile calculates the hashes as you are writing data to each member.
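For reference, computing a RECORD entry's hash incrementally (the way WheelFile does while members are written) looks roughly like this; the helper name is mine, but the value format is the one PEP 376 / the wheel spec require.

```python
import base64
import hashlib


def record_hash_and_size(chunks):
    """Hash a member's bytes incrementally and format the digest the way
    RECORD expects it: "sha256=" + urlsafe base64 without padding."""
    digest = hashlib.sha256()
    size = 0
    for chunk in chunks:
        digest.update(chunk)
        size += len(chunk)
    encoded = base64.urlsafe_b64encode(digest.digest()).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded, size


# e.g. record_hash_and_size(iter(lambda: fileobj.read(8192), b""))
```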
I do want to avoid depending on zipfile constructs though, since it would hard-marry the implementation to the module. This would be a problem if we want to support e.g. an "exploded" wheel directory. I would suggest making the interface more like importlib.metadata.Distribution, i.e. only interact with relative paths and raw bytes/str.
Thanks to zipfile.Path we could use a pathlib.Path interface, @uranusjr.
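Something like this, for instance (zipfile.Path is Python 3.8+; the wheel filename is just an example):

```python
import zipfile

# A pathlib-flavoured view onto a wheel's contents.
wheel = zipfile.Path(zipfile.ZipFile("sampleproject-1.3.0-py2.py3-none-any.whl"))
dist_info = next(p for p in wheel.iterdir() if p.name.endswith(".dist-info"))
metadata_text = (dist_info / "METADATA").read_text(encoding="utf-8")
record_text = (dist_info / "RECORD").read_text(encoding="utf-8")
```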
As for the proposal, my initial reaction is "where's the WHEEL file parser?" (and then the follow-up question is: if we are parsing here, would we want to provide support to generate a WHEEL file?). Same goes for reading/generating the RECORD file. Basically I was expecting a function or class for every step in the PEP 427 outline that wasn't explicitly "copy this file from here to there", for composability. And all of those would be sans-I/O since they are operating on strings or bytes and don't need to deal with files themselves to operate.
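A sans-I/O WHEEL parser along those lines could be as small as this, since WHEEL uses the same RFC 822-ish key/value format as METADATA (the function name here is only illustrative):

```python
from email.parser import Parser


def parse_wheel_file(text: str):
    """Parse the contents of a .dist-info/WHEEL file that has already been read
    into a string; returns an email.message.Message with dict-style access."""
    return Parser().parsestr(text)


# msg = parse_wheel_file(wheel_text)   # wheel_text comes from a WheelSource
# msg["Wheel-Version"]                 # e.g. "1.0"
# msg["Root-Is-Purelib"] == "true"
# msg.get_all("Tag")                   # e.g. ["py2-none-any", "py3-none-any"]
```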
I also wonder why Scheme is a list of string literals instead of an enum?
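For comparison, an enum version of the same thing would look something like:

```python
import enum


class Scheme(enum.Enum):
    purelib = "purelib"
    platlib = "platlib"
    headers = "headers"
    scripts = "scripts"
    data = "data"
```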
Write RECORD using the csv module: https://github.com/pypa/wheel/blob/master/src/wheel/wheelfile.py#L154
Filter an IO stream to calculate hashes automatically from zip file-like objects by overriding zipfile.open() (only the read direction is implemented here): https://github.com/dholth/wgc/blob/master/wgc2.py#L95 https://github.com/dholth/wgc/blob/master/wgc2.py#L26 ; bdist_wheel's is smarter and actually supports other hash algorithms, but uses a zipfile-specific monkey patch to do the math.
I'm told that getting utf-8 in and out of email-"inspired" messages like WHEEL and METADATA is what Python 3.3's email.policy module is for; this is what we did before that was available. https://github.com/pypa/wheel/blob/master/src/wheel/pkginfo.py
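For illustration, rendering RECORD rows with the csv module, mirroring what wheel's WheelFile does (the helper name and signature are mine):

```python
import csv
import io


def build_record(entries):
    """Render (path, hash, size) triples into RECORD text, without touching disk."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar='"', lineterminator="\n")
    for row in entries:
        writer.writerow(row)
    return buffer.getvalue()


# build_record([
#     ("sample/__init__.py", "sha256=...", "42"),
#     ("sample-1.0.dist-info/RECORD", "", ""),
# ])
```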
@dholth thanks! PEP 376 seems to have it all specified out. It would still be great to have a sans-I/O function, though, to make it so people don't have to read the PEP and instead could figure it out based on running help() on some function from this module.
I was thinking if we take out the io and the side effects it will be guaranteed to inconvenience no one 😜
@pradyunsg your API looks clean. Don't include a 'list all files' API, ok to include a list dist-info files API. Instead, only assume random access to the dist-info directory. dist-info is like a dict, we expect files and not directories in there.
Assume sequential access to the bulk data, like a wheel.unpackall() generator that yields (ZipInfo, readable) tuples and its pair wheel.packall(). These could yield the metadata entries (.dist-info) as well. You should be able to create a new identical wheel by piping unpackall() into packall(). It is probably an error to perform a metadata operation at the same time as a bulk operation.
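A quick sketch of what that sequential-access pair could look like (the function names follow the suggestion; everything else is assumed):

```python
import zipfile


def unpackall(path):
    """Yield (ZipInfo, readable) pairs for every member, metadata included."""
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            with archive.open(info) as member:
                yield info, member


def packall(path, members):
    """Write (ZipInfo, readable) pairs out; piping unpackall() into packall()
    should produce an equivalent wheel."""
    with zipfile.ZipFile(path, "w") as archive:
        for info, readable in members:
            archive.writestr(info, readable.read())


# packall("copy.whl", unpackall("original.whl"))
```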
The original list of schemes comes from distutils. It is supposed to be possible to add more schemes, for example it would make sense to add a license or docs path. The feature has been requested several times but we haven't been able to specify it. See https://www.gnu.org/software/automake/manual/html_node/Standard-Directory-Variables.html#Standard-Directory-Variables
I've started writing a new WheelFile implementation using what I've learned and what Python has added over the last eight years. We need a flexible implementation that is not tied to bdist_wheel. Parsers are mostly there, haven't tried to make a clean API yet. Realized the wheel format doesn't use METADATA itself though anyone who is opening a wheel would be interested in that file. https://github.com/dholth/wgc/commit/1df15b31ce4a5b51ea48076aced0552bcc38f0ef
The feature has been requested several times but we haven't been able to specify it.
I'm definitely open to making this flexible in installer but I'm curious how you're thinking we'd "add" these schemes to the format. I'm pretty sure we'd have a PEP for such a change.
Don't include a 'list all files' API, ok to include a list dist-info files API. Instead, only assume random access to the dist-info directory. dist-info is like a dict, we expect files and not directories in there.
This makes sense, yea.
Open question: Is it worth trying to get rid of BufferedReader from the API? If yes, what would you replace it with?
Users can make a BufferedReader that doesn't actually do I/O, and using it allows for optimizations (like the ones @gaborbernat wants for virtualenv) by special-casing specific implementations of that protocol/interface (eg: doing a copy instead of read-into-python-then-write).
Open question: Is it worth trying to get rid of BufferedReader from the API? If yes, what would you replace it with?
Users can make a BufferedReader that doesn't actually do I/O, and using it allows for optimizations (like the ones @gaborbernat wants for virtualenv) by special-casing specific implementations of that protocol/interface (eg: doing a copy instead of read-into-python-then-write).

Why not both? You could have a Destination.copy_from(wheel_source) method (called by Installer.install) which by default simply calls Destination.write(WheelSource.open()), but implementers can override it for any reason (including a file-system copy optimisation).
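Sketching that out (the method name follows the suggestion above; the signature and body are assumed):

```python
class Destination:
    # write_to(), compile_py(), write_record() as in the earlier outline...

    def copy_from(self, source, scheme_key, path):
        """Default behaviour: stream the member out of the source and hand it
        to write_to(). Implementations that know the source is a real file on
        disk can override this with a faster, file-system level copy."""
        info = source.get_info(path)
        with source.open(info) as file:
            self.write_to(scheme_key, path, file)
```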
If extra wheel categories were properly relocatable, as in they will still work no matter where they are installed, you could leave new ones in site-packages/{namever}.dist-info/category/ and emit a warning.
If you passed { category : destination_path, <standard distutils paths> } to the low level wheel installer then it would copy those files to the new path.
But I think a useful implementation would require a paths .json in dist-info to define completely custom paths based on substitution from known variables. For example { "docs" : "$docs_base/$namever/" } would copy your {namever}.dist-info/docs/* into a nice per-package directory. You would have to record { "docs" : "where it was installed" } in each package's installed metadata so that you could find them later.
We have this problem with data/ now. It has been broken since virtualenv was invented because the destination of the data category is not consistent depending on what kind of environment you're installing into. Your program can't find its own data. It's in a silly place.
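As a sketch of that paths .json idea, a hypothetical {category: template} mapping could be expanded with string.Template substitution from known variables (the file name, keys, and variables here are all assumptions, not specified anywhere):

```python
import json
from string import Template


def expand_custom_paths(paths_json: str, variables: dict) -> dict:
    """Expand a hypothetical {category: template} mapping from dist-info."""
    return {
        category: Template(template).substitute(variables)
        for category, template in json.loads(paths_json).items()
    }


# expand_custom_paths('{"docs": "$docs_base/$namever/"}',
#                     {"docs_base": "/usr/share/doc", "namever": "sample-1.0"})
# -> {"docs": "/usr/share/doc/sample-1.0/"}
```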
Thanks for starting this - I've been thinking about making a minimal wheel installer tool myself, so I'm happy to see someone else got there first. :slightly_smiling_face:
Perhaps I lack imagination, but aiming for Sans-IO on something that creates lots of files and directories feels like making a simple task more complicated.
I'd also suggest that it would be valuable to have a command-line interface to install a specified wheel. This could be a python -m installer style interface, not necessarily a script, as it's not something people are likely to be typing often.
@takluyver we definitely plan to use a dead-simple interface, however the idea is to allow more advanced use cases too as virtualenv has for example.
Noting that a bunch of the PEP-related standard parts could likely also live in the packaging package.
IMHO the sans-IO part would be nice but seems lower priority than having a functional installer.
As for having a CLI capable of performing a simple installation, it means computing the installation scheme ( https://github.com/pypa/pip/blob/1b4c0866ab1108162ee00bd38a0fb5657b9e9aea/src/pip/_internal/locations.py#L95-L156 ) and would likely mean supporting all the expected options (--prefix, --root, etc) or maybe only --target?
Not sure if we want to have this added complexity:
in which case would you prefer python -m installer the_wheel.whl to python -m pip install --no-deps the_wheel.whl @takluyver ?
in which case would you prefer python -m installer the_wheel.whl to python -m pip install --no-deps the_wheel.whl
I found this project via this discussion, where @hroncok mentioned that the Fedora build system uses a command like this:
python -m pip install --root $RPM_BUILD_ROOT --no-deps --disable-pip-version-check --progress-bar off --verbose --ignore-installed --no-warn-script-location pyproject-wheeldir/*.whl
I.e. they need a whole lot of options to tell pip not to do lots of things. You want those things when using it as a package manager yourself, but they don't make sense when it's a piece in the toolchain of another packaging system. Presumably, each time pip adds a new feature, someone adds another flag to that command to disable it.
This project - once it's been written - would make much more sense for that context than using pip.
supporting all the expected options (--prefix, --root, etc) or maybe only --target ?
I envisage it having a way to specify the desired install scheme (e.g. posix_prefix, prefix=...), as well as a way to specify the target directory for each category (purelib, scripts, etc.) individually. The latter would obviously be very verbose, but it should also be pretty stable.
Regarding aiming for as much of a sans-I/O approach as possible, I have found that it leads to much easier testing and better API design. For instance, to parse a RECORD file, do you really need to pass in the file path, or could a file-like object do (I would even argue just bytes, but unfortunately the PEP for wheels is written such that you essentially have to use csv and that expects a file-like object)? To me, anything that takes a file path to just read or write without doing some extra work to figure out such a path is a code smell.
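As an illustration of the kind of bottom-layer function being described, parsing RECORD from text that has already been read (no paths, no open() calls) might look like this; the helper name is mine:

```python
import csv
import io


def parse_record(record_text: str):
    """Yield (path, hash, size) triples from the contents of a RECORD file."""
    for row in csv.reader(io.StringIO(record_text)):
        if row:
            yield tuple(row)


# list(parse_record("sample/__init__.py,sha256=...,42\n"))
# -> [("sample/__init__.py", "sha256=...", "42")]
```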
Now I will also say that aiming for as much of a sans-I/O API as possible doesn't prevent an I/O-based API. But at least the bottom API layer that does the parsing and such definitely does not need to perform I/O. So for me this means parsing RECORD files shouldn't take a path and read a file on its own, but a helper function that takes a pathlib.Path object -- to work with a file system or zip file -- and a target directory and copies the files over while validating hashes is also reasonable to do.
@takluyver fair point, I had forgotten the --installer request :+1:
I envisage it having a way to specify the desired install scheme (e.g. posix_prefix, prefix=...), as well as a way to specify the target directory for each category (purelib, scripts, etc.) individually. The latter would obviously be very verbose, but it should also be pretty stable.
:+1:
I think the verbose approach is the way to go. One thing that annoys me about existing wheel-installing tools is that they all expect to run "in-environment", i.e. introspect the environment they are running in and behave accordingly. But this does not need to be the case, since installing a wheel is simply extracting and moving files around. Explicitly specifying each path would enable the user to install a wheel to any environment as long as they know its layout, which is a huge win for out-of-environment tools like Poetry and Pipenv.
Do we have a list of differences? I mean, how does pipenv/poetry install wheels differently than pip? I've also thought the layout is an interpreter specification, not an installer specification.
I've always asked distutils where it should go but you should be able to install each category where you please. You'll be happier if certain parts are on python path obviously.
I don't see the need for that unless you can bring some concrete use cases 🤷♂️ I think what distutils tells you should be enough 🤔
I don't know about Poetry; Pipenv currently just delegates the job to pip in the environment. This works because people expect pip to be in the environment anyway, but the same can't be said for any other installation tools.
The most prominent use case of this would be installing a package to an environment built from a different Python. Say you install Pipenv with Python 3.6, but build a project on Python 3.8. If the installer requires environment sniffing, you'll need to somehow inject the installer and all its dependencies into the target environment for installation, which in turn messes with dependency discovery since now you don't have a straightforward picture of what packages are "actually installed" (not injected) and available at runtime, when the installer and its dependencies are no longer injected. By contrast, if the installer can install to any layout scheme, Pipenv can use a simple python -c call to ask the environment what scheme it wants to use, and just copy files accordingly, alleviating all the problems mentioned.
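For instance, something along these lines (the interpreter path is just an example) would get the target environment's scheme without running the installer inside it:

```python
import json
import subprocess

# Ask the *target* interpreter -- not the one running the installer -- for its paths.
target_python = "/path/to/env/bin/python"  # example
output = subprocess.check_output([
    target_python,
    "-c",
    "import json, sysconfig; print(json.dumps(sysconfig.get_paths()))",
])
scheme_paths = json.loads(output)
# scheme_paths["purelib"], scheme_paths["scripts"], ... can then be handed to
# whichever Destination implementation does the copying.
```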
Makes sense; I suppose then that allowing it to be changed but defaulting to distutils is the right path ahead.
default to distutils
About this… I was thinking about defaulting to sysconfig.get_paths() which has nicely standardised keys, but TBH I've never been able to tell the difference between this and distutils.sysconfig. According to this discussion their implementations can be different, but the distutils one should be avoided?
Well, not all the time, and it depends on Python packaging. Some distributions patch things, and it happens that they patch only one of the two. pip uses distutils, so IMHO distutils is the safe choice to go with, as that is what most OSes will work with. That being said, it's possible for distutils to not exist (Debian packaging), so falling back to sysconfig is a decent idea.
FWIW, flit install uses sysconfig. I think that's the right thing to do for the long term, even if distributions don't always patch it today.
I don't think we can afford to take the moral high ground for tools meant for the entire ecosystem. flit is a brand new tool (relative to Python) and used by a small section of users, so it can be more aggressive on such topics. But the tools we design to be used by the entire ecosystem should be a lot more backwards compatible.
I don't think the decision is that important TBH. The only place we really need a default is python -m installer (if we even have that at all); the Python interface can simply require the user to pass in the paths; they are always only a few function calls away no matter what they want to use. Minimal documentation (put up some example usages and explanation) is plenty for everyone to choose whatever is preferred.
And even with python -m installer it's trivial to add a flag to switch between the two schemes.
It's also a new tool targeted - in part - at the distributions which introduce discrepancies between sysconfig and distutils.sysconfig. So if distro packaging uses it, the incentive and ability to address that discrepancy is aligned. :wink:
I thought we agreed that it's a new library primarily to be used by pip/virtualenv, and then also provides new CLI API and usage for other tools. Neither pip or virtualenv can afford to take the moral high ground on what's right and wrong without a very long period of deprecation. Status quo wins.
I don’t know about Poetry;
Also delegates to pip: https://github.com/python-poetry/poetry/tree/master/poetry/installation
I am really interested in helping out here as we need this for distribution packaging.
I think my needs are on par with the direction this is taking:
- Minimal dependencies (avoid at all cost)
I think this will depend on what versions of Python you need to target. It is easy enough to have zero dependencies in modern Python, but much less viable if you need to support Python 2.7.
- Ability to opt-out of compilation (.pyc) in the install process
This makes sense to me. I haven't thought about this at all. It is likely not too difficult (just skip one call during installation), but we need to remember this.
- Minimal dependencies (avoid at all cost)
I think this will depend on what versions of Python you need to target. It is easy enough to have zero dependencies in modern Python, but much less viable if you need to support Python 2.7.
We are trying our best to rip out Python 2. But this is reasonable. I would like to try to get 0 dependencies on the latest version(s) of Python 3.
- Ability to opt-out of compilation (.pyc) in the install process
This makes sense to me. I haven't thought about this at all. It is likely not too difficult (just skip one call during installation), but we need to remember this.
I want to contribute and review PRs so I will keep this in check.
@gaborbernat I think the key point, though, is that this project is low-level enough that backwards-compatibility is going to be via explicit path specifications. So there isn't anything being said here where using sysconfig for the CLI prevents pip using distutils to specify paths for backwards-compatibility.
I will also say that if distutils leaves the stdlib, as has been discussed for years now, sysconfig will quickly become the preferred way as the thing that's still in the stdlib. 😄
Agreed, just wanted to make sure backwards compatibility is not totally thrown out the window 😊
As a first step, we should figure out what exactly do we want this project to be. :)
Here's my answer...
installer is a pure-Python library that provides the bare essentials for implementing a Python wheel installation tool.
It provides:
re (2) above: it'll be very basic things, like providing an implementation for a "WheelSource" based on a local wheel file, a "Destination" based on a dictionary (installation scheme to local paths) and other really bare-bones implementations, that can serve as a base of simple scripts and serve as a reference for folks who wanna do more complex stuff.
re CLI: I'll also say that a CLI is "out of scope, for now" on the premise that there's no clear consensus on what that should look like. If you reckon we should have one, file a new issue. :)
re Python support: Let's support Python 2 and current Python 3 versions (3.5+) in the initial release. AFAICT, everyone who wants Python 2 support knows that we are already on the tail end of upstream packages supporting it. I don't think Python 2 users would care about incremental changes to this library TBH, so I'm gonna say this library will drop Python 2 support in Jan 2021. I don't want to bother myself with Python 2 support in 2021 -- I've dealt with enough issues around unicode and typing that it's unlikely that I'll budge on this. :)
If you're OK with this, add a 👍 reaction on this. If you think this excludes your use case, please drop a comment below.
I have a draft locally of what I think the API should look like, so I'm gonna file a PR for that once this settles down. :)
I know I'm coming to this late, but I'm a strong -1 on Python 2 support. What's the motivation here? Anyone writing a new tool now should be targeting Python 3, and I don't see any advantage in making it possible for older tools to switch to using this library.
Why put in Python 2 support for 6 months, only to rip it out in 2021? If Python 2 users want a backport, that's fine, but let the interested parties do the work.
In my view, let's just make this Python 3 only.
Well, unfortunately Python 2 is still around and distros will still have to deal with it. I think most of the other people involved here are also stuck with Python 2, and would need at least one release of the tooling supporting it.
A lot of ecosystem changes that create the need for this tooling also impact Python 2 (eg. the adoption of PEP 517 and deprecation of setup.py install).
Don't get me wrong, I really don't want to support Python 2, but I am stuck with it.
But I don't see why the current solutions for Python 2 won't still work. You can install wheels with pip. This project isn't enabling new features, just new ways of accessing existing ones.
This project is born out of https://discuss.python.org/t/3869/. Let me know here if you participated there and want push access.