Closed: dstufft closed this 8 months ago
Sounds like a good idea. It would probably need to be an optional feature of the API, as we need to keep the spec backward-compatible, and really basic "serve a directory over HTTP" indexes might not be able to support the API.
But that is a minor point. Basically +1 from me.
One pip-specific note on timing, though. It looks like pip will get range request based metadata extraction before this API gets formalised/implemented. That's fine, but I think that when this does become available, pip should drop the range approach and switch to just using this. That would be a performance regression for indexes that support range requests but not the new API, but IMO that's more acceptable than carrying the support cost for having both approaches.
I agree, it seems like a reasonable solution. If we design how the metadata is listed carefully, it’d likely also be reasonable for the files-in-a-directory use case to optionally implement.
What would the filename of this file be? Something like `pip-20.1.1-py2.py3-none-any.whl.METADATA`?
Trying to think of alternatives: since METADATA is already RFC 822 compliant, we could include the metadata as headers on the response to requests for `.whl` files. Clients that only want the metadata could call `HEAD` on the URL; clients that want both metadata and the `.whl` file itself would call `GET` and get both in a single request. This would be a bit more challenging for PyPI to implement, though.
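A sketch of what the client side of that could look like. The mapping of core-metadata fields onto HTTP response headers is entirely hypothetical (it is only a proposal in this thread), and note that repeatable fields like `Requires-Dist` would need multi-valued headers:

```python
# Hypothetical: assumes the index mirrors core-metadata fields into
# HTTP response headers, which is only a proposal in this thread.
CORE_METADATA_FIELDS = {"name", "version", "requires-python", "requires-dist"}


def metadata_from_headers(header_items):
    """Collect core-metadata fields from (name, value) header pairs,
    e.g. the result of http.client.HTTPResponse.getheaders() after a
    HEAD request. Pairs are kept (not a dict) so repeatable fields like
    Requires-Dist survive."""
    return [(n, v) for n, v in header_items if n.lower() in CORE_METADATA_FIELDS]


# Usage sketch (not executed here; the host is made up):
#   conn = http.client.HTTPSConnection("pypi.example.invalid")
#   conn.request("HEAD", "/packages/pip-20.1.1-py2.py3-none-any.whl")
#   fields = metadata_from_headers(conn.getresponse().getheaders())
```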
It would also be more challenging for mirrors like bandersnatch to implement, since they don't have any runtime components where they could add those headers, but the bigger thing is that headers can't be protected by TUF, and we definitely want this to be TUF protected.
The other option would be to embed this inside the TUF metadata itself, which is a JSON doc and has an area for arbitrary metadata to be added. However, I think that's worse for us since it's a much larger change in that sense, and it sort of violates a bit of the separation of concerns we currently have with TUF.
As far as the file name goes, I don't really have a strong opinion on it. Something like `pip-20.1.1-py2.py3-none-any.whl.METADATA` works fine for me: there's a very clear marker for which file the metadata belongs to, and in the "serve a directory over HTTP" index case, they could easily add that file too.
Got it, I wasn't thinking that TUF couldn't protect headers but that makes sense in retrospect.
I don't see any significant issues with the proposal, aside from the fact that PyPI will finally need to get into the business of unzipping/extracting/introspecting uploaded files. Do we think that should happen during the request (thus guaranteeing that the METADATA file is available immediately after upload, but potentially slowing down the request), or can it happen outside the request (by kicking off a background task)?
Within the legacy upload API we will probably want to do it inline? I don't know, that's a great question for whoever writes the actual PEP to figure out the implications of either choice 😄 . https://github.com/pypa/warehouse/issues/7730 is probably the right long term solution to that particular problem.
Alternatively, it might be nice to provide the entire `*.dist-info` directory as a separable part. Or, going the other direction, METADATA without the long description. Of course it can differ per individual wheel.
I thought about the entire `.dist-info` directory. If we did that we would probably want to re-zip it into a single artifact. It just didn't feel super worthwhile to me, as I couldn't think of a use case for accessing files other than METADATA as part of the resolution/install process, which is all this idea really cares about. Maybe there's something I'm not thinking of, though?
Agreed, anything other than METADATA feels like YAGNI. After all, the only standardised files in `.dist-info` are METADATA, RECORD and WHEEL. RECORD is not much use without the full wheel, and there's not enough in WHEEL to be worth exposing separately. So unless there's a specific use case, like there is for METADATA, I'd say let's not bother.
Off the top of my head, the entry points are the most interesting metadata not in METADATA.
Are we expecting to backfill metadata for a few versions of popular projects, particularly those that aren't released often?
What do folks think?
I quite like it. :)
> `pip-20.1.1-py2.py3-none-any.whl.METADATA`
:+1: I really like that this makes it possible for static mirrors to provide this information! :)
Not sure who else works on poetry; feel free to CC more folks in.
@abn @finswimmer @stephsamson
My main concern is the same as @ofek -- how does this work with existing uploads? Would it make sense for PyPI to have a "backfill when requested" approach for existing uploads?
I think we'd just backfill this for every `.whl` distribution that has a METADATA file in it?
Yea. It would take some time but we would presumably just do a backfill operation.
> and there's not enough in WHEEL to be worth exposing separately

In pip at least we extract and parse WHEEL first, to see if we can even understand the format of the wheel. In a future where we actually want to exercise that versioning mechanism, if we make WHEEL available from the start then we can avoid considering new wheels we wouldn't be able to use. If we don't take that approach, then projects may hesitate to release new wheels, because it would cause users' pips to fully resolve and then backtrack (or error out) when encountering a new wheel once downloaded.
> It seems to me like we could side step this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself.
Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678
> Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678
Yep! This should be doable, as long as it's part of (or relationally connected to) the `Release` or `File` models.
What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?
> What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?
Yep, exactly. We wouldn't need the METADATA filename itself stored, assuming that it can be inferred (i.e. that it's always `{release_file}.METADATA`).
Cross-post of a proposition I made on discuss.python.org:
In short: we extend the concept of the `data-requires-python` attribute to cover all the necessary metadata for pip's dependency resolution.
(There are some more details in the post over there, and in #8733)
So that is a different take on this issue. I opened a dedicated ticket in case there is enough interest for further discussion: https://github.com/pypa/warehouse/issues/8733
[note from @pradyunsg: I've trimmed content duplicated here and in #8733, about the proposed design. This is to prevent this proposal from taking over the discussion here]
This would probably require a PEP
Looks like I never said this explicitly, but yes, I definitely think this needs a PEP.
What are the next steps here @ewdurbin @dstufft @pfmoore? Writing a PEP to update the simple API standard, to allow inclusion of metadata files?
I think so, yes. Someone needs to decide how to include a link to the metadata file in the simple API (in a backward compatible way). I'd suggest following the GPG signature approach:
> If there is a metadata file for a particular distribution file, it MUST live alongside that file, with the same name and `.metadata` appended to it. So if the file `/packages/HolyGrail-1.0-py3-none-any.whl` existed and had associated metadata, the metadata would be located at `/packages/HolyGrail-1.0-py3-none-any.whl.metadata`. Metadata MAY be provided for any file type, but it is expected that package indexes will only provide it for wheels, in practice.
A PEP for that should be reasonably straightforward and non-controversial. I assume @dstufft would be PEP delegate and could approve it reasonably easily.
After that, it's a case of
It might be nice to also expose that data via the JSON API, seeing as we're going to have it available - but that's a separate question. Let's get the basic feature in place first!
By the way, if we can get this available sooner rather than later, it would save me running any more of my job that's downloading 7TB of wheels from PyPI just to extract the metadata file 🙂
Don't worry @ewdurbin - I promise I won't actually run it for all of those files, I'll pass on the hundreds of 800MB PyTorch wheels, for a start!
(@pfmoore this is a bit off topic, but I wrote some code a few months ago that leverages HTTP Range requests and the zip format to minimize the number of bytes you need to extract metadata from wheels retrieved through HTTP :) https://github.com/python-poetry/poetry/pull/1803/files#diff-5a449c4763ca5e9acfa4dbb6bc875866981cb3d8ca12121ae78035faa16999b2R525 . If it helps you in any way, I'm glad!)
And pip has similar code too, behind fast-deps. :)
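For the curious, the range-request trick mentioned above can be sketched roughly like this. This is a simplified illustration, not pip's or poetry's actual code: `fetch_range` stands in for whatever issues the HTTP `Range` request, and the class and function names are made up.

```python
import io
import zipfile


class LazyRangeFile(io.RawIOBase):
    """Seekable file-like object that fetches byte ranges on demand.

    fetch_range(start, end) must return the bytes for the inclusive
    range; over HTTP it would issue a GET with a
    "Range: bytes={start}-{end}" header.
    """

    def __init__(self, fetch_range, size):
        self._fetch = fetch_range
        self._size = size
        self._pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, n=-1):
        if n is None or n < 0:
            n = self._size - self._pos
        if n <= 0 or self._pos >= self._size:
            return b""
        end = min(self._pos + n, self._size) - 1
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data


def extract_remote_metadata(fetch_range, size):
    """Read only the zip central directory plus the METADATA member."""
    with zipfile.ZipFile(LazyRangeFile(fetch_range, size)) as zf:
        name = next(n for n in zf.namelist() if n.endswith(".dist-info/METADATA"))
        return zf.read(name)
```

Because `zipfile` only ever seeks to the end-of-central-directory record, the central directory, and the one member it is asked for, only a few small ranges of the wheel actually get transferred.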
Also, I'll note that I was thinking of adding a data-has-static-metadata to the link tag, to denote that it has static metadata available.
I’ll spend some time writing a draft in the coming days unless there’s already someone doing it (doesn’t seem the case from recent comments?)
The other option would be to expose the entire dist-info folder...
On Sat, May 8, 2021, at 1:28 PM, Tzu-ping Chung wrote:
I’ll spend some time writing a draft in the coming days unless there’s already someone doing it (doesn’t seem the case from recent comments?)
— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/pypa/warehouse/issues/8254#issuecomment-835438369, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABSZESHOYZYMNVVTI5VOM3TMVYFHANCNFSM4OYW7VJQ.
> adding a data-has-static-metadata to the link tag

What would be the advantage of this? If we follow Paul's naming proposal above, whether a link has static metadata can be canonically detected by searching for an `<a>` tag containing `{filename}.metadata`.
I'm pretty strongly inclined to only expose what there's a clearly established need for (i.e. the metadata file). It's easier to add more later than to remove something once it's added.
> Also, I'll note that I was thinking of adding a data-has-static-metadata to the link tag, to denote that it has static metadata available.
At this point, it might be worth adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).
> adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).
That would create huge index pages and would probably make things slower in practice. The metadata file contains free-form fields like Description, and I know there are projects with super long content in there (because they include their super long README). And the index would then include METADATA from every single version ever published, times every platform wheel for each version.
Even just including the relevant parts could be wasteful if there are more than a few releases in a project. You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.
> It's easier to add more later than to remove something once it's added.

This brings up a consideration about the entry's name on the index, though: `{filename}.metadata` would make it more difficult to include more files if we ever decide to. I'm currently leaning toward specifying METADATA by its "full" name instead, e.g. `distribution-1.0-py3-none-any.whl/distribution-1.0.dist-info/METADATA`.
Another silly idea would be to include the offset and length for easy range requests against the `.whl` zip file.
> That would create huge index pages [...] the index would then include METADATA from every single version ever published, times every platform wheel for each version.
>
> You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.
Agreed. There would be some waste for sure.
[Thinking out loud... I don't know what should be optimized: number of requests or size of payloads. I have no idea what those numbers are, where the critical point is. Newer protocols seem to make it relatively painless to make multiple requests.]
I’ve drafted a PEP: python/peps#1955. I think I’ll need a sponsor and a PEP delegate (can they be the same person?)
Update: This is now PEP 658 https://www.python.org/dev/peps/pep-0658/
And now the PEP was merged 🎉
Is the implementation up for grabs, or do you plan to tackle it @uranusjr?
Also, trying to expose as many design decisions as possible early to help whoever will implement: is Warehouse expected to store the metadata file on S3 / something, store the text contents in the database, or extract the file on request from the wheel (with a very long CDN cache)? And what is the expected content type of a metadata file (`text/plain`?)

The PEP is currently being discussed at https://discuss.python.org/t/8651.
Anything that isn't specific to Warehouse itself should probably be discussed there.
Ah, sorry, got confused 😅
(I’m limiting the response to Warehouse-specific; others should be posted in the Discourse thread for visibility.)
> Is the implementation up for grabs, or do you plan to tackle it?
It's up for grabs. I plan to take a look into it only if nobody does anything and I can't take it anymore. My threshold on this kind of thing is pretty high, so I suggest someone else work on this if they really want it to happen sooner rather than later 🙂
> is Warehouse expected to
> - store the metadata file on S3 / something?
> - store the text contents in the database?
> - extract the file on request from the wheel (and make sure we have a very long CDN cache)?
I haven't thought that far, tbh. I'd probably store the file separately on S3 if I had to implement it right now (I believe that's how the GPG key is stored?), but I'm not the right person to make the decision either way.
> What is the expected content type of a Metadata file? (`text/plain`?)
Good question! I almost wanted to write it in the PEP but decided otherwise, since it doesn't really matter (PEP 503 didn't specify the content type for wheels either) and I don't know the answer. I guess `text/plain` is good enough. Feel free to ask this in the Discourse thread if you feel the PEP should explicitly specify one.
> Feel free to ask this in the Discourse thread if you feel the PEP should explicitly specify one.
(not sure this needs to be in the PEP, but this will need to be in the implementation somehow)
It would help a lot in understanding how this API works if anybody could provide a Jupyter or Observable notebook with a complete example of accessing the API to fetch dependency information.
The client-side logic is basically:

```python
import hashlib


def find_metadata_content(anchor):
    """Fetch distribution metadata with PEP 658."""
    try:
        metadata_validation = anchor.attrs["data-dist-info-metadata"]
    except KeyError:
        raise PEP658NotAvailable()
    if metadata_validation == "true":
        # Metadata is available, but the index offers no hash to verify it.
        hashname = hashvalue = None
    else:
        hashname, sep, hashvalue = metadata_validation.partition("=")
        if not sep:
            raise InvalidDataDistInfoMetadataAttr()
        if hashname not in hashlib.algorithms_available:
            raise HashAlgorithmUnsupported()
    metadata_url = f"{anchor.attrs['href']}.metadata"
    metadata_content = _fetch_metadata(metadata_url)  # Download metadata over HTTP.
    if hashname is None:
        return metadata_content
    if hashlib.new(hashname, metadata_content).hexdigest() != hashvalue:
        raise HashMismatch()
    return metadata_content


# Find the `<a>` tag representing the file you want on a PEP 503 index page.
# You need to figure out this part yourself; it is out of scope of this issue and PEP 658.
anchor = _find_distribution_i_want()
metadata_content = find_metadata_content(anchor)
# Use a compatible parser (e.g. email.parser) to parse the metadata.
```

(The exception classes and the `_fetch_metadata` / `_find_distribution_i_want` helpers are left as sketches.)
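For that last step, a minimal example of parsing fetched METADATA bytes with the stdlib email parser (the field values here are invented for illustration):

```python
from email.parser import BytesHeaderParser

# Invented METADATA content, just for illustration.
raw = (
    b"Metadata-Version: 2.1\n"
    b"Name: example\n"
    b"Version: 1.0\n"
    b"Requires-Python: >=3.6\n"
    b"Requires-Dist: requests (>=2.0)\n"
    b'Requires-Dist: idna ; extra == "tls"\n'
)

msg = BytesHeaderParser().parsebytes(raw)
name, version = msg["Name"], msg["Version"]
requires = msg.get_all("Requires-Dist")  # repeatable field -> list of strings
```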
BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.
> BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.
Was there an attempt to reduce the length of the attribute? Looks like the proposed scheme is to add the anchor to each line in https://pypi.org/simple/ like this:

```diff
- <a href="/simple/abcmeta/">abcmeta</a>
+ <a href="/simple/abcmeta/#data-dist-info-metadata=sha3-256=a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a">abcmeta</a>
```

Which will increase the size from the current 17M to about 60M.
Instead of layering more hacks on top of existing hacks, may I propose creating a `/simple.csv` endpoint, which would host an extensible CSV with a header:

```
name,metahashtype,metahashvalue
abcmeta,sha3-256,a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a
```

| name | metahashtype | metahashvalue |
|---|---|---|
| abcmeta | sha3-256 | a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a |
Then anybody could get the info without a custom HTML-and-anchor parser. That would be an easier implementation both on the client and the server side.

I would also consider using a single hashing function, to reduce possible duplication when implementing distributed PyPI with content addressing later.
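If that endpoint existed, consuming it would indeed be trivial; a sketch with the stdlib csv module (the endpoint and column names are only the ones proposed above, not anything that exists):

```python
import csv
import io

# What a response from the proposed /simple.csv endpoint might look like.
SIMPLE_CSV = (
    "name,metahashtype,metahashvalue\n"
    "abcmeta,sha3-256,a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a\n"
)

# DictReader keys each row by the header line, so new columns stay backward compatible.
rows = {row["name"]: row for row in csv.DictReader(io.StringIO(SIMPLE_CSV))}
```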
> Looks like the proposed scheme is to add the anchor to each line in https://pypi.org/simple/ like this.
According to the PEP it's added to the project page (e.g. https://pypi.org/simple/pydash) rather than the index page. It would look something like this (assuming I've interpreted the PEP correctly).
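A sketch of what such an anchor might look like on a project page, and how a client could pick out the attribute. This is my reading of the PEP; the hash value is made up:

```python
from html.parser import HTMLParser

# Hypothetical project-page fragment, per my reading of PEP 658.
PAGE = """
<a href="/packages/pydash-5.0.0-py3-none-any.whl"
   data-dist-info-metadata="sha256=abc123">pydash-5.0.0-py3-none-any.whl</a>
"""


class MetadataAttrParser(HTMLParser):
    """Collect href -> data-dist-info-metadata for anchors that have it."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "data-dist-info-metadata" in attrs:
            self.found[attrs["href"]] = attrs["data-dist-info-metadata"]


parser = MetadataAttrParser()
parser.feed(PAGE)
```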
@domdfcoding is correct. The PEP describes the per-project index pages.
Currently a number of projects are trying to work around the fact that in order to resolve dependencies in Python you have to download the entire wheel in order to read the metadata. I am aware of two current strategies for working around this: one is the attempt to use the PyPI JSON API (which isn't a good solution because it's non-standard, the data model is wrong, and it's not going to be secured by TUF) and the other is attempting to use range requests to fetch only the METADATA file from the wheel before downloading the entire wheel (which isn't a good solution because TUF can currently only verify entire files, and it depends on the server supporting range requests, which not every mirror is going to support).

It seems to me like we could side step this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself. Within TUF we can ensure that these files have not been tampered with, by simply storing it as another TUF-secured target. Resolvers could then download just the metadata file for a wheel they're considering as a candidate, instead of having to download the entire wheel.

This is a pretty small delta over what already exists, so it's more likely we're going to get it done than any of the broader proposals of trying to design an entire, brand new repository API or by ALSO retrofitting the JSON API inside of TUF.

The main problems with it are that the METADATA file might be larger than needed, since it contains the entire long description of the wheel, and that it still leaves sdists unsolved (but they're not currently really solvable). I don't think either problem is too drastic, though.

What do folks think? This would probably require a PEP, and I probably don't have the spare cycles to do that right now, but I wanted to get the idea written down in case someone else felt like picking it up.

@pypa/pip-committers @pypa/pipenv-committers @sdispater (not sure who else works on poetry, feel free to CC more folks in).