pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Provide distribution metadata as ‘data’ attributes on links in simple index project page #8733

Closed sinoroc closed 1 year ago

sinoroc commented 4 years ago

Cross-post of a proposition I made on discuss.python.org:

In short: we extend the concept of the data-requires-python attribute to cover all the necessary metadata for pip's dependency resolution.

(There are some more details in the post over there, in case I forget to mention them here.)

Basically, we write the content of Requires-Dist in a data attribute on each link, plus whatever else is needed. It could be each metadata field in its own attribute, or all fields in one attribute. It could be Base64-encoded. It could even be the whole METADATA file Base64-encoded in one data attribute. Whatever works best.
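As a purely illustrative sketch of the idea (the attribute names `data-requires-dist` and `data-core-metadata-b64` are assumptions, not part of any standard), rendering one anchor of a PEP 503 project page with extra data attributes might look like:

```python
import base64
import html

def render_link(filename, url, requires_dist, metadata_bytes):
    # Escape the field values so they are safe inside HTML attributes,
    # and Base64-encode the raw METADATA bytes into a single attribute.
    reqs = html.escape("; ".join(requires_dist), quote=True)
    blob = base64.b64encode(metadata_bytes).decode("ascii")
    return (
        f'<a href="{html.escape(url, quote=True)}" '
        f'data-requires-dist="{reqs}" '
        f'data-core-metadata-b64="{blob}">{html.escape(filename)}</a>'
    )

tag = render_link(
    "example-1.0-py3-none-any.whl",
    "https://files.example.org/example-1.0-py3-none-any.whl",
    ["requests (>=2.0)", "packaging"],
    b"Metadata-Version: 2.1\nName: example\nVersion: 1.0\n",
)
print(tag)
```

A client could then read the attributes straight off the project page instead of downloading each distribution.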

The one big advantage is that, in the best case scenario, a single HTTP request per project is enough to complete the whole dependency resolution. We could even go as far as letting pip read that HTML page from cache and skip even more network traffic.

Drawbacks:

Advantages:

I obviously do not know enough about the existing code to make a good guess, but it seems pretty straightforward to implement.

Related:

pradyunsg commented 4 years ago

Providing the METADATA file directly (i.e. #8254) is generally going to be more extensible (in case we add more capable behaviors around Provides and other metadata fields) and doesn't tie the Core Metadata to the Simple API (or vice-versa, however you wanna look at it). This is a big reason I'd prefer the proposal in that issue instead of the one here.

most of the code paths are already here

This is not true. Regardless of what shape this takes, twine is likely not going to be affected much, and pip needs to be updated to take advantage of this additional information.

1 HTTP request per project (best case scenario)

I don't think a reduced HTTP request is worth it TBH, but folks can have a different opinion on the trade-offs here.

warehouse could read it from the distributions, but that is an entirely different story

Warehouse already does this, and provides this information in a non-standard JSON API.

and much more I guess, I count on you

We'd also need to figure out some mechanism to cleanly add values for all the relevant fields losslessly into HTML tag attributes. I don't know whether that will prove painful. ;)
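One way to sidestep HTML-escaping pitfalls entirely (a sketch, not a standardized scheme) is to base64url-encode each raw field value, so the attribute value contains only `[A-Za-z0-9_=-]` and round-trips losslessly:

```python
import base64

def encode_field(value: str) -> str:
    # base64url output needs no HTML escaping at all.
    return base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")

def decode_field(blob: str) -> str:
    return base64.urlsafe_b64decode(blob.encode("ascii")).decode("utf-8")

# A value full of characters that are awkward inside HTML attributes.
raw = 'foo (>=1.0) ; extra == "bar"'
encoded = encode_field(raw)
assert decode_field(encoded) == raw
print(encoded)
```

The trade-off is that the page is no longer human-readable and grows by roughly a third per encoded field.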

westurner commented 4 years ago

Are JSON-LD and RDFa the appropriate Linked Data specs for metadata and metadata-in-HTML?

There are parsers in JS and in Python which handle RDFa and JSON-LD metadata in the general case: no modifications to the parsers should be necessary when the schema changes.

Python & RDFa:

Python & JSON-LD:
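For illustration only (the vocabulary choice is an assumption; schema.org's SoftwareApplication type is one plausible fit), package metadata expressed as JSON-LD is itself just JSON, so even the stdlib can read the known keys, while a generic JSON-LD processor would need no changes when fields are added:

```python
import json

# Hypothetical JSON-LD description of a distribution.
doc = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "example",
  "softwareVersion": "1.0",
  "softwareRequirements": ["requests (>=2.0)", "packaging"]
}
""")
print(doc["name"], doc["softwareRequirements"])
```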

westurner commented 4 years ago

JS + JSON-LD & RDFa:

sinoroc commented 4 years ago

Providing the METADATA file directly (i.e. #8254) is generally going to be more extensible (in case we add more capable behaviors around Provides and other metadata fields) and doesn't tie the Core Metadata to the Simple API (or vice-versa, however you wanna look at it). This is a big reason I'd prefer the proposal in that issue instead of the one here.

That is why I also suggest that we could publish the full METADATA file (compressed, encoded, whatever) as data attribute, if it makes sense.

As mentioned in the other thread, an overly long long-description could be an issue, though: the HTML page could get quite heavy. So maybe a simplified version of the METADATA would do; there are fields we can be quite sure we do not need: author, maintainer, summary, long description.
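Core metadata files use an RFC 822-style header format, so the stdlib email parser can read them. A minimal sketch of producing a slimmed-down variant (the exact set of dropped fields here is a guess; the long-description body is dropped along with the headers):

```python
from email.parser import Parser

# Hypothetical drop list of fields a resolver does not need.
DROP = {"Author", "Maintainer", "Summary", "Description"}

def slim_metadata(text: str) -> str:
    msg = Parser().parsestr(text)
    # Re-emit only the headers we keep; the message body (the long
    # description in metadata 2.1+) is discarded implicitly.
    kept = [(k, v) for k, v in msg.items() if k not in DROP]
    return "".join(f"{k}: {v}\n" for k, v in kept)

metadata = (
    "Metadata-Version: 2.1\n"
    "Name: example\n"
    "Version: 1.0\n"
    "Summary: An example package\n"
    "Requires-Dist: requests (>=2.0)\n"
    "\n"
    "A very long description body...\n"
)
print(slim_metadata(metadata))
```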

METADATA is only for wheels, though. Core metadata fields should cover a much wider range of distributions (not all sdists have reliable dependency info, but I would assume that a fair amount do).

All in all, I do not see how this solution would be less extensible than the other. Distributions that predate a new field simply do not have it, and will not gain it once it is added to the spec. But I am probably missing some information needed to make a reasonable comparison.

most of the code paths are already here

This is not true. Regardless of what shape this takes, twine is likely not going to be affected much, and pip needs to be updated to take advantage of this additional information.

Yes, my mistake. I had in mind the code that is already in production, so I did not count pip's new resolver, which is still experimental.

From my point of view, that is an advantage: the code for pip's dependency resolver would become much simpler as a result, wouldn't it?

And if we compare the two propositions, it's probably a draw: in both solutions, pip's code would have to change.

1 HTTP request per project (best case scenario)

I don't think a reduced HTTP request is worth it TBH, but folks can have a different opinion on the trade-offs here.

Maybe we are not speaking about the same thing. Of course, it would be only marginally better than the download-the-METADATA-file solution. But when I wrote this proposition, I did not know about that solution. So when I say huge gain, I mean compared to the current behavior (downloading all distributions until we find one that fits).

warehouse could read it from the distributions, but that is an entirely different story

Warehouse already does this, and provides this information in a non-standard JSON API.

This is one aspect I am very much not clear on. What kind of process happens in Warehouse when a distribution is uploaded? Is the distribution somehow opened and some data extracted out of it (trove classifiers, names of the author and maintainer, etc.)? Or does all the info come from the upload (i.e. from twine)?

If some of the info is already there, that is quite good news: no need for intensive backfill operations.

We'd also need to figure out some mechanism to cleanly add values for all the relevant fields losslessly into HTML tag attributes. I don't know whether that will prove painful. ;)

I do not understand that point.

Thanks for the feedback.

sinoroc commented 4 years ago

@westurner I think in both threads we are looking for a quick way to improve pip's dependency resolution with minimal effort: only slightly extending the current simple repository API (PEP 503). I believe there are some other plans for a completely new PyPI API; your suggestions might be a better fit for those plans. Especially in this solution, I do not really want to introduce anything new, I just want to reuse a technique already in use.

Or maybe I am entirely missing the point of your messages. How do you see JS + JSON-LD & RDFa fitting in here?

sinoroc commented 1 year ago

We now have PEP 658.
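Under PEP 658, the index exposes a distribution's core metadata at the file's URL with `.metadata` appended, advertised on the anchor via a `data-dist-info-metadata` attribute whose value is either `true` or a `<hashname>=<hex digest>` pair. A minimal sketch of how a client might use that (the helper names are illustrative):

```python
def metadata_url(file_url: str) -> str:
    # PEP 658: the metadata file lives at the distribution URL + ".metadata".
    return file_url + ".metadata"

def parse_attr(value: str):
    # Attribute value is "true" (no hash advertised) or "<hash>=<digest>".
    if value == "true":
        return None
    name, _, digest = value.partition("=")
    return (name, digest)

url = metadata_url("https://files.example.org/example-1.0-py3-none-any.whl")
print(url)
print(parse_attr("sha256=deadbeef"))
```

This gives the resolver exactly what this issue asked for, one small request per candidate file, without encoding anything into HTML attributes.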