pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Expose the METADATA file of wheels in the simple API #8254

Closed · dstufft closed this 8 months ago

dstufft commented 4 years ago

Currently a number of projects are trying to work around the fact that, in order to resolve dependencies in Python, you have to download an entire wheel just to read its metadata. I am aware of two current strategies for working around this. One is to use the PyPI JSON API, which isn't a good solution because it's non-standard, the data model is wrong, and it's not going to be secured by TUF. The other is to use range requests to fetch only the METADATA file from the wheel before downloading the entire wheel, which isn't a good solution because TUF can currently only verify entire files, and it depends on the server supporting range requests, which not every mirror is going to support.

It seems to me like we could sidestep this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself. Within TUF we can ensure that these files have not been tampered with, by simply storing each one as another TUF-secured target. Resolvers could then download just the metadata file for a wheel they're considering as a candidate, instead of having to download the entire wheel.
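As a minimal sketch of that extraction step (illustrative only, not Warehouse code; wheels are just zip files with a {name}-{version}.dist-info/METADATA member, and the function name here is made up):

import zipfile

def extract_wheel_metadata(wheel_path: str) -> bytes:
    """Read the dist-info METADATA member out of a wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            # METADATA lives in the single top-level .dist-info directory.
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA"):
                return wheel.read(name)
    raise ValueError(f"{wheel_path} contains no METADATA file")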

This is a pretty small delta over what already exists, so it's more likely we're going to get it done than any of the broader proposals of trying to design an entire, brand new repository API or of ALSO retrofitting the JSON API inside of TUF.

The main problems with it are that the METADATA file might be larger than needed, since it contains the wheel's entire long description, and that it still leaves sdists unsolved (but they're not currently really solvable). I don't think either problem is too drastic though.

What do folks think? This would probably require a PEP and I probably don't have the spare cycles to do that right now, but I wanted to get the idea written down in case someone else felt like picking it up.

@pypa/pip-committers @pypa/pipenv-committers @sdispater (not sure who else works on poetry, feel free to CC more folks in).

pfmoore commented 4 years ago

Sounds like a good idea. It would probably need to be an optional feature of the API, as we need to keep the spec backward-compatible, and really basic "serve a directory over HTTP" indexes might not be able to support the API.

But that is a minor point. Basically +1 from me.

One pip-specific note on timing, though. It looks like pip will get range-request-based metadata extraction before this API gets formalised/implemented. That's fine, but I think that when this does become available, pip should drop the range approach and switch to just using this. That would be a performance regression for indexes that support range requests but not the new API, but IMO that's more acceptable than carrying the support cost of having both approaches.

uranusjr commented 4 years ago

I agree, it seems like a reasonable solution. If we design how the metadata is listed carefully, it’d likely also be reasonable for the files-in-a-directory use case to optionally implement.

di commented 4 years ago

What would the filename of this file be? Something like pip-20.1.1-py2.py3-none-any.whl.METADATA?

Trying to think of alternatives: since METADATA is already RFC 822 compliant, we could include the metadata as headers on the response to requests for .whl files. Clients that only want the metadata could call HEAD on the URL, clients that want both metadata and the .whl file itself would call GET and get both in a single request. This would be a bit more challenging for PyPI to implement, though.

dstufft commented 4 years ago

It would also be more challenging for mirrors like bandersnatch to implement, since they don't have any runtime component where they could add those headers, but the bigger thing is that headers can't be protected by TUF, and we definitely want this to be TUF-protected.

The other option would be to embed this inside the TUF metadata itself, which is a JSON doc and has an area for arbitrary metadata to be added. However, I think that's worse for us since it's a much larger change in that sense, and it sort of violates the separation of concerns we currently have with TUF.

As far as the file name goes, I don't really have a strong opinion. Something like pip-20.1.1-py2.py3-none-any.whl.METADATA works fine for me: there's a very clear marker for which file the metadata belongs to, and in the "serve a directory over HTTP" case, the index could easily add that file too.

di commented 4 years ago

Got it, I wasn't thinking that TUF couldn't protect headers but that makes sense in retrospect.

I don't see any significant issues with the proposal aside from the fact that PyPI will finally need to get into the business of unzipping/extracting/introspecting uploaded files. Do we think that should happen during the request (thus guaranteeing that the METADATA file is available immediately after upload, but potentially slowing down the request) or can it happen outside the request (by kicking off a background task)?

dstufft commented 4 years ago

Within the legacy upload API we will probably want to do it inline? I don't know, that's a great question for whoever writes the actual PEP to figure out the implications of either choice 😄 . https://github.com/pypa/warehouse/issues/7730 is probably the right long term solution to that particular problem.

dholth commented 4 years ago

Alternatively it might be nice to provide the entire *.dist-info directory as a separable part. Or, going the other direction, METADATA without the long description. Of course it can differ for each individual wheel.

dstufft commented 4 years ago

I thought about the entire .dist-info directory. If we did that we would probably want to re-zip it into a single artifact. It just didn't feel super worthwhile to me, as I couldn't think of a use case for accessing files other than METADATA as part of the resolution/install process, which is all this idea really cares about. Maybe there's something I'm not thinking about though?

pfmoore commented 4 years ago

Agreed, anything other than METADATA feels like YAGNI. After all, the only standardised files in .dist-info are METADATA, RECORD and WHEEL. RECORD is not much use without the full wheel, and there's not enough in WHEEL to be worth exposing separately.

So unless there's a specific use case, like there is for METADATA, I'd say let's not bother.

dholth commented 4 years ago

Off the top of my head, the entry points are the most interesting metadata not in 'METADATA'.

ofek commented 4 years ago

Are we expecting to backfill metadata for a few versions of popular projects, particularly those that aren't released often?

pradyunsg commented 4 years ago

What do folks think?

I quite like it. :)

pip-20.1.1-py2.py3-none-any.whl.METADATA

:+1: I really like that this makes it possible for static mirrors to provide this information! :)

not sure who else works on poetry, feel free to CC more folks in

@abn @finswimmer @stephsamson


My main concern is the same as @ofek's -- how does this work with existing uploads? Would it make sense for PyPI to have a "backfill when requested" approach for existing uploads?

di commented 4 years ago

I think we'd just backfill this for every .whl distribution that has a METADATA file in it?

dstufft commented 4 years ago

Yea. It would take some time but we would presumably just do a backfill operation.


chrahunt commented 4 years ago

and there's not enough in WHEEL to be worth exposing separately

In pip at least, we extract and parse WHEEL first, to see if we can even understand the format of the wheel. In a future where we actually want to exercise that versioning mechanism, making WHEEL available from the start would let us avoid considering new wheels we wouldn't be able to use. If we don't take that approach, projects may hesitate to release new wheels, because it would cause users' pips to fully resolve and then backtrack (or error out) on encountering a new wheel once downloaded.

trishankatdatadog commented 4 years ago

It seems to me like we could sidestep this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself.

Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678

woodruffw commented 4 years ago

Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678

Yep! This should be doable, as long as it's part of (or relationally connected to) the Release or File models.

dstufft commented 4 years ago

What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?

woodruffw commented 4 years ago

What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?

Yep, exactly. We wouldn't need the METADATA filename itself stored, assuming that it can be inferred (i.e. that it's always {release_file}.METADATA).

sinoroc commented 4 years ago

Cross-post of a proposal I made on discuss.python.org:

In short: we extend the concept of the data-requires-python attribute to cover all the necessary metadata for pip's dependency resolution.

(There are some more details in the post over there, and in #8733)

So that is a different take on this issue. I opened a dedicated ticket in case there is enough interest for further discussion: https://github.com/pypa/warehouse/issues/8733

[note from @pradyunsg: I've trimmed content duplicated here and in #8733, about the proposed design. This is to prevent this proposal from taking over the discussion here]

pfmoore commented 3 years ago

This would probably require a PEP

Looks like I never said this explicitly, but yes, I definitely think this needs a PEP.

piroux commented 3 years ago

FYI: PEP-643 (Metadata for Package Source Distributions) has been approved. :rocket:

pradyunsg commented 3 years ago

What are the next steps here @ewdurbin @dstufft @pfmoore? Writing a PEP to update the simple API standard, to allow inclusion of metadata files?

pfmoore commented 3 years ago

I think so, yes. Someone needs to decide how to include a link to the metadata file in the simple API (in a backward compatible way). I'd suggest following the GPG signature approach:

If there is a metadata file for a particular distribution file it MUST live alongside that file with the same name with a .metadata appended to it. So if the file /packages/HolyGrail-1.0-py3-none-any.whl existed and had associated metadata, the metadata would be located at /packages/HolyGrail-1.0-py3-none-any.whl.metadata. Metadata MAY be provided for any file type, but it is expected that package indexes will only provide it for wheels, in practice.

A PEP for that should be reasonably straightforward and non-controversial. I assume @dstufft would be PEP delegate and could approve it reasonably easily.

After that, it's a case of

  1. Adding support for the new data into PyPI. That's probably the hardest bit.
  2. Adding support to clients like pip.

It might be nice to also expose that data via the JSON API, seeing as we're going to have it available - but that's a separate question. Let's get the basic feature in place first!

pfmoore commented 3 years ago

By the way, if we can get this available sooner rather than later, it would save me from running any more of my job that downloads 7TB of wheels from PyPI just to extract the metadata file 🙂

Don't worry @ewdurbin - I promise I won't actually run it for all of those files, I'll pass on the hundreds of 800MB PyTorch wheels, for a start!

ewjoachim commented 3 years ago

(@pfmoore this is a bit off topic, but I wrote some code a few months ago that leverages HTTP Range requests and the zip format to minimize the number of bytes you need to extract metadata from wheels retrieved through HTTP :) https://github.com/python-poetry/poetry/pull/1803/files#diff-5a449c4763ca5e9acfa4dbb6bc875866981cb3d8ca12121ae78035faa16999b2R525 . If it helps you in any way, I'm glad!)
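(For the curious, a stripped-down version of that trick is sketched below. It assumes the server honors Range requests and issues one HTTP request per read, so it is illustrative rather than efficient; the real implementations in poetry and pip buffer their reads. The class and function names are made up.)

import io
import zipfile

import requests  # any HTTP client with Range support would do

class HTTPRangeFile:
    """Minimal read-only, seekable file-like object over HTTP range requests."""

    def __init__(self, url: str):
        self.url = url
        head = requests.head(url, allow_redirects=True)
        self.size = int(head.headers["Content-Length"])
        self.pos = 0

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:  # io.SEEK_END
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        if n == 0 or self.pos >= self.size:
            return b""
        end = min(self.pos + n, self.size) - 1
        resp = requests.get(self.url, headers={"Range": f"bytes={self.pos}-{end}"})
        data = resp.content
        self.pos += len(data)
        return data

def metadata_via_range_requests(wheel_url: str) -> bytes:
    # zipfile only needs read/seek/tell, and the central directory sits at
    # the end of the archive, so only a few small slices get downloaded.
    with zipfile.ZipFile(HTTPRangeFile(wheel_url)) as zf:
        name = next(n for n in zf.namelist() if n.endswith(".dist-info/METADATA"))
        return zf.read(name)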

pradyunsg commented 3 years ago

And pip has similar code too, behind fast-deps. :)

pradyunsg commented 3 years ago

Also, I'll note that I was thinking of adding a data-has-static-metadata attribute to the link tag, to denote that it has static metadata available.

uranusjr commented 3 years ago

I’ll spend some time writing a draft in the coming days unless there’s already someone doing it (doesn’t seem to be the case from recent comments?)

dholth commented 3 years ago

The other option would be to expose the entire dist-info folder...


uranusjr commented 3 years ago

adding a data-has-static-metadata attribute to the link tag

What would be the advantage of this? If we follow Paul’s naming proposal above, whether a link has static metadata can be canonically detected by searching for an <a> tag containing f"{filename}.metadata".
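(A stdlib-only sketch of that detection, with the page-fetching part left out; the class and function names are illustrative, not from any real client:)

from html.parser import HTMLParser

class AnchorTexts(HTMLParser):
    """Collect the text content of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.texts.append(data.strip())

def has_static_metadata(project_page_html: str, filename: str) -> bool:
    parser = AnchorTexts()
    parser.feed(project_page_html)
    return f"{filename}.metadata" in parser.texts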

pfmoore commented 3 years ago

I'm pretty strongly inclined to only expose what there's a clearly established need for (i.e. the metadata file). It's easier to add more later than to remove something once it's added.

sinoroc commented 3 years ago

Also, I'll note that I was thinking of adding a data-has-static-metadata attribute to the link tag, to denote that it has static metadata available.

At this point, it might be worth adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).

uranusjr commented 3 years ago

adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).

That would create huge index pages and would probably make things slower in practice. The metadata file contains free-form fields like Description, and I know there are projects with super long content in there (because they include their super long README). And the index would then include METADATA for every single version ever published, times every platform wheel for each version.

uranusjr commented 3 years ago

Even just including the relevant parts could be wasteful if a project has more than a few releases. You'd be throwing away most of them most of the time, since it's very rare that the resolver would ever need all of them.

uranusjr commented 3 years ago

It's easier to add more later than to remove something once it's added.

This brings up a consideration on the entry’s name on the index though: f"{filename}.metadata" would make it more difficult to include more files if we ever decide to. I’m currently leaning toward specifying METADATA by its “full” name instead, e.g. distribution-1.0-py3-none-any.whl/distribution-1.0.dist-info/METADATA.

dholth commented 3 years ago

Another silly idea would be to include the offset and length of METADATA, for easy range requests against the .whl zip file.
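(Concretely: assuming the index published each METADATA member's offset and compressed size, which nothing does today, a client could fetch and inflate just that member. The sketch below also assumes the member is DEFLATE-compressed, as is typical for wheels.)

import zlib

import requests

def fetch_zip_member(url: str, offset: int, compressed_size: int) -> bytes:
    # A zip local file header is 30 bytes, followed by variable-length file
    # name and extra fields; fetch the header first to learn their sizes.
    hdr = requests.get(url, headers={"Range": f"bytes={offset}-{offset + 29}"}).content
    name_len = int.from_bytes(hdr[26:28], "little")
    extra_len = int.from_bytes(hdr[28:30], "little")
    start = offset + 30 + name_len + extra_len
    end = start + compressed_size - 1
    data = requests.get(url, headers={"Range": f"bytes={start}-{end}"}).content
    return zlib.decompress(data, wbits=-15)  # raw DEFLATE, as stored in zips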


sinoroc commented 3 years ago

That would create huge index pages [...] the index would then include METADATA from every single version ever published, times every platform wheel for each version.

You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.

Agreed. There would be some waste for sure.


[Thinking out loud... I don't know what should be optimized: number of requests or size of payloads. I have no idea what those numbers are, where the critical point is. Newer protocols seem to make it relatively painless to make multiple requests.]

uranusjr commented 3 years ago

I’ve drafted a PEP: python/peps#1955. I think I’ll need a sponsor and a PEP delegate (can they be the same person?)

Update: This is now PEP 658 https://www.python.org/dev/peps/pep-0658/

ewjoachim commented 3 years ago

And now the PEP was merged 🎉

Is the implementation up for grabs, or do you plan to tackle it @uranusjr?

Also, trying to surface as many design decisions as possible early, to help whoever will implement. In particular: is Warehouse expected to

  • store the metadata file on S3 / something?
  • store the text contents in the database?
  • extract the file on request from the wheel (and make sure we have a very long CDN cache)?

And: what is the expected content type of a Metadata file? (text/plain?)

pradyunsg commented 3 years ago

The PEP is currently being discussed at https://discuss.python.org/t/8651.

Anything that isn't specific to Warehouse itself should probably be discussed there.

ewjoachim commented 3 years ago

Ah, sorry, got confused 😅

uranusjr commented 3 years ago

(I’m limiting this response to the Warehouse-specific questions; the others should be posted in the Discourse thread for visibility.)

Is the implementation up for grabs, or do you plan to tackle it?

It’s up for grabs. I plan to take a look at it only if nobody does anything and I can’t take it anymore. My threshold on this kind of thing is pretty high, so I suggest someone else work on this if they really want it to happen sooner rather than later 🙂

is Warehouse expected to

  • store the metadata file on S3 / something ?
  • store the text contents in the database?
  • Extract the file on request from the wheel (and make sure we have a very long CDN cache)

I haven’t thought that far tbh. I’d probably store the file separately on S3 if I had to implement it right now (I believe that’s how the GPG signature is stored?) but I’m not the right person to make the decision either way.

What is the expected content type of a Metadata file? (text/plain ?)

Good question! I almost wanted to write it in the PEP but decided otherwise since it doesn’t really matter (PEP 503 didn’t specify the content type for wheels either), and I don’t know the answer. I guess text/plain is good enough. Feel free to raise this in the Discourse thread if you feel the PEP should explicitly specify one.

ewjoachim commented 3 years ago

Feel free to raise this in the Discourse thread if you feel the PEP should explicitly specify one.

(not sure this needs to be in the PEP, but this will need to be in the implementation somehow)

abitrolly commented 3 years ago

It would help a lot in understanding how this API works if anybody could provide a Jupyter or Observable notebook with a complete example of accessing the API to fetch dependency information.

uranusjr commented 3 years ago

The client-side logic is basically

import hashlib

# Placeholder exceptions; a real client would define meaningful ones.
class PEP658NotAvailable(Exception): pass
class InvalidDataDistInfoMetadataAttr(Exception): pass
class HashAlgorithmUnsupported(Exception): pass
class HashMismatch(Exception): pass

def find_metadata(anchor):
    """Fetch distribution metadata with PEP 658."""
    try:
        metadata_validation = anchor.attrs["data-dist-info-metadata"]
    except KeyError:
        raise PEP658NotAvailable()
    if metadata_validation == "true":
        # The index offers the metadata file but no hash to validate it against.
        hashname = hashvalue = None
    else:
        hashname, sep, hashvalue = metadata_validation.partition("=")
        if not sep:
            raise InvalidDataDistInfoMetadataAttr()
        if hashname not in hashlib.algorithms_available:
            raise HashAlgorithmUnsupported()
    metadata_url = f"{anchor.attrs['href']}.metadata"
    metadata_content = _fetch_metadata(metadata_url)  # Download metadata with HTTP.
    if hashname is None:
        return metadata_content
    if hashlib.new(hashname, metadata_content).hexdigest() != hashvalue:
        raise HashMismatch()
    return metadata_content

# Find the `<a>` tag representing the file you want on a PEP 503 index page.
# You need to figure out this part yourself; it is out of scope of this issue and PEP 658.
anchor = _find_distribution_i_want()

metadata_content = find_metadata(anchor)
# Use a compatible parser (e.g. email.parser) to parse the metadata.

BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.

abitrolly commented 3 years ago

BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.

Was there an attempt to reduce the length of the param? Looks like the proposed scheme is to add it to each anchor in https://pypi.org/simple/ like this.

-    <a href="/simple/abcmeta/">abcmeta</a>
+    <a href="/simple/abcmeta/#data-dist-info-metadata=sha3-256=a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a">abcmeta</a>

Which would increase the size from the current 17M to about 60M.

abitrolly commented 3 years ago

Instead of layering more hacks on top of existing hacks, may I propose creating a /simple.csv endpoint that would host an extensible CSV with a header row.

name,metahashtype,metahashvalue
abcmeta,sha3-256,a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a

Then anybody could get the info without a custom HTML and anchor parser. That would be easier to implement on both the client and the server side.

I would also consider using a single hashing function to reduce possible duplication when implementing distributed PyPI with content addressing later.
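(For what it's worth, consuming that hypothetical endpoint would only take a few lines of stdlib code. The URL and column names below come from the proposal above; PyPI does not actually serve this.)

import csv
import io
from urllib.request import urlopen

# Hypothetical endpoint from the proposal above; PyPI does not serve this.
with urlopen("https://pypi.org/simple.csv") as resp:
    reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
    hashes = {row["name"]: (row["metahashtype"], row["metahashvalue"]) for row in reader}

print(hashes.get("abcmeta"))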

domdfcoding commented 3 years ago

Looks like the proposed scheme is to add it to each anchor in https://pypi.org/simple/ like this.

According to the PEP it's added to the project page (e.g. https://pypi.org/simple/pydash) rather than the index page. It would look something like this (assuming I've interpreted the PEP correctly).
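A placeholder rendering of such a project-page anchor, with the file URL and hash values elided as "...":

<a href="https://files.pythonhosted.org/packages/.../pydash-5.0.0-py3-none-any.whl#sha256=..." data-dist-info-metadata="sha256=...">pydash-5.0.0-py3-none-any.whl</a>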

pradyunsg commented 3 years ago

@domdfcoding is correct. The PEP describes the per-project index pages.