pypa / packaging-problems

An issue tracker for the problems in packaging

Request: PEP to describe current Warehouse JSON API #367

Open brainwane opened 4 years ago

brainwane commented 4 years ago

Desired: a PEP to describe the current Warehouse JSON API, to:

Does anyone want to volunteer to do this? It might take 15-20 hours total and would help a lot of folks out.

Stuff to cover

Includes:

Also, the current JSON API has some flaws, so documenting it is an opportunity to find out what people expect, how the designers expected it to be used, what people need and want, etc. But do not think your job is to fix those things. Your job is to log those things and document the existing state.

Background

Today in IRC @dstufft, @techalchemy, and I discussed:

@brainwane: I think some developers of client apps have basically said "what is the risk of depending on the current Warehouse JSON API?" and have heard confusing things back from the main Warehouse developers. Like "we're gonna change the JSON API at some point in the future but...." and then have wildly different estimates of how long it will be till then. And Warehouse maintainers haven't had the resources, the dedicated time, to collaborate with contributors who want to help develop the new JSON API, review branches, etc. So this has left everyone in a state of uncertainty, so that developers of clients of Warehouse's current API have a hard time making engineering tradeoff decisions ("it's ok to write this and know we'll have to rewrite it in a year").

@dstufft said that interoperability depends on standardizing the Warehouse API:

@dstufft: if you want to work with things that aren't warehouse, you can't rely on the JSON API really, because it's non standard and support for it varies

@techalchemy: yeah it's pretty nuanced, other indexes would have to implement the entire data model of warehouse

Donald: if you only care about working with pypi itself, then the JSON API is fine

Dan: and/or at least parse wheel dependencies to json

Donald: it's almost certainly not going away

Donald: it will probably get deprecated whenever/if we ever get a next gen api

But it may be several months if not more before volunteers can design and implement a next-generation Warehouse JSON API. So how can we help consumers and peers of the current Warehouse API work with what currently exists? We agreed that it would be a good interim step to document the current API in a PEP. We estimate "documenting what exists today is probably a less than 10 hours of total work" for the initial draft.

I would like to see this written and accepted within a few months. But as @techalchemy notes: "Even if it's not accepted, as long as it makes it to draft status it can generally start being useful. A draft PEP means pip can move toward supporting something which will force adoption ...will force other index server implementations to start considering building support."

Checklist

After this

After this, we can get some dedicated volunteer time committed, from a Warehouse expert, to write the next-gen JSON API PEP (this will be a substantial task, and I'm pretty wary of getting a grant for it because it's way more research than development), and then get that discussed and accepted, and then apply for and get money to implement it.

brainwane commented 4 years ago

@webknjaz and @hugovk, you both came to mind as people who might like to do this task -- feel free to speak up if it's something you'd be interested in doing.

webknjaz commented 4 years ago

@brainwane I haven't worked w/ that API myself and haven't written any PEPs ever. So I don't feel confident about starting this.

P.S. Your mention of #8090 is not clickable because it's a cross-repo link. I guess you can change it to pypa/warehouse#8090

hugovk commented 4 years ago

Thanks also for the suggestion. I've used the API a little bit but am in a similar situation, and also don't think I have the time to fully commit to it.

brainwane commented 4 years ago

@webknjaz link fixed, thanks.

@hugovk and @webknjaz - thanks for clearly saying no so we can move on and ask other people. :-)

@dholth and @cooperlees -- are either of you interested in taking this on, maybe part of it (see the checklist in the initial post for things that need doing)?

cooperlees commented 4 years ago

@dholth and @cooperlees -- are either of you interested in taking this on, maybe part of it (see the checklist in the initial post for things that need doing)?

I want this, so I'll commit to this. I'll even help with the new API design and possible implementation. My only warning here is that my English skills are bad and this will be my first PEP.

I'll put aside some time Sunday night to try to get the sketch outline started, and possibly we can chat on IRC more Monday, @brainwane.

pfmoore commented 4 years ago

To clarify for my benefit - is the intention here to define a standard that all indexes must implement (in the same sense that PEP 503 covers the simple API) or to define and document how Warehouse (PyPI) operates?

The former would mean that we intend to allow tools to assume the existence of such an API and would mandate that all index implementations (devpi, pypiserver, Artifactory) should implement it.¹ And IMO it would mean that we should be collecting input from the developers of those implementations as well as Warehouse.

If it's intended simply to document the Warehouse API as "the reference implementation of a JSON API" then it's not so much of an interoperability standard and we can avoid those complexities (although conversely, it would be of more limited use for general tools like pip).

¹ Yes, it could be defined as an optional API, in which case we'd need a means of querying "do you support this?"

brainwane commented 4 years ago

@pfmoore I'd say all indexes, which is why one of the items on the checklist is

Author submits as a Work-In-Progress PR to the python/peps repo, and circulates on distutils-sig/discuss.python.org for comment, and to maintainers of Warehouse clients and other indexes, and revises in response to their comments

brainwane commented 4 years ago

But also I'll defer to folks like @dstufft @techalchemy @Julian @mplanchard @fschulze on Paul's question.

@cooperlees:

I want this, so I'll commit to this.

@cooperlees in saying this, you are a model of an open source citizen. Thank you. :-)

I'll even help with the new API design and possible implementation. My only warning here is that my English skills are bad and this will be my first PEP.

Everybody's got a first time. :-) And I know other folks will help with refining the prose.

I'll put aside some time Sunday night to try to get the sketch outline started, and possibly we can chat on IRC more Monday, @brainwane.

Sounds great!

I think we'll also need a PEP sponsor. @jaraco @cjerdonek @zooba @gpshead @merwok are any of you open to sponsoring this?

pfmoore commented 4 years ago

Cool - sorry, I missed that item. I presume that @dstufft would be BDFL-Delegate for this.

dstufft commented 4 years ago

I would say I don't think all repositories have to implement it, but rather the goal would be to standardize it so that tooling can say "I depend on a repository that implements this API", repositories are of course free to say they don't support that API, but those tools won't work with them then. We should def try to get feedback from them though, if the answer from all of them is they can't or won't implement it, then maybe we need to think harder about the path forward for it.

One key thing I think I'd want to see in a PEP for this, is trying to explicitly document what use cases we're trying to make the current JSON API good for. Are we looking to just standardize it to function as a general purpose "pull data from PyPI" API, or are we looking to allow specialized tooling to use it for some purpose (as an example, do we want Bandersnatch to be able to use this for implementing mirroring? Does the current "shape" of the API allow that? If not what's our smallest change we can make to allow that? etc).

dstufft commented 4 years ago

https://github.com/devpi/devpi/issues/801 is an example of something to look at to figure out why they want this API standardized and to make sure our API actually satisfies their use case.

brainwane commented 4 years ago

Thanks Donald - sorry for misremembering and thanks for the correction.

pfmoore commented 4 years ago

I'd also like the PEP to have a clear way for clients to query servers as to whether they support this API. Just trying a query and checking the response runs the risk of people exposing a "similar, but not the same" API and clients having no way of knowing.

brainwane commented 4 years ago

@havocp and @tiegz and @katzj -- if you use Warehouse's JSON API in https://github.com/librariesio/bibliothecary/ or in the Tidelift CLI tool then check out this question:

One key thing I think I'd want to see in a PEP for this, is trying to explicitly document what use cases we're trying to make the current JSON API good for. Are we looking to just standardize it to function as a general purpose "pull data from PyPI" API, or are we looking to allow specialized tooling to use it for some purpose (as an example, do we want Bandersnatch to be able to use this for implementing mirroring? Does the current "shape" of the API allow that? If not what's our smallest change we can make to allow that? etc).

fschulze commented 4 years ago

From the devpi side a big requirement for an API are relative links from a common root, because we support multiple indexes. Besides that I don't have much input at this point.

cooperlees commented 4 years ago

I have an extreme draft up on my PEP fork here: https://github.com/cooperlees/peps/blob/warehouse_json_api/pep-9999.rst

What's the best way to have everyone be able to comment + add to? Should we use a Google doc and I transfer back to the rst? Is there a better way?

From the devpi side a big requirement for an API are relative links from a common root, because we support multiple indexes. Besides that I don't have much input at this point.

Do you mean for the releases and urls section "url" ? Wouldn't you just put an absolute URL using your domain? Can you maybe give me an example on how you'd use a relative URL and I'll maybe understand your use case better.

fschulze commented 4 years ago

@cooperlees the current JSON API is at https://pypi.org/pypi/[projectname]/json, and tools like https://github.com/peterbe/hashin/ often hardcode that absolute URL. So even though it is possible to provide an alternate URL, it will always start with /pypi. With devpi there are many indexes. Each user can create several of the form https://example.com/username/indexname, and it isn't possible for devpi to provide the PyPI JSON API for tools like that, because each index needs its own endpoint, for example https://example.com/username/indexname/+json (the + in there is to distinguish it from project names, which live at https://example.com/username/indexname/projectname/). That is what I mean by relative vs absolute URL endpoints. I hope I was able to describe it properly.

cooperlees commented 4 years ago

@cooperlees the current JSON API is at https://pypi.org/pypi/[projectname]/json, and tools like https://github.com/peterbe/hashin/ often hardcode that absolute URL. So even though it is possible to provide an alternate URL, it will always start with /pypi. With devpi there are many indexes. Each user can create several of the form https://example.com/username/indexname, and it isn't possible for devpi to provide the PyPI JSON API for tools like that, because each index needs its own endpoint, for example https://example.com/username/indexname/+json (the + in there is to distinguish it from project names, which live at https://example.com/username/indexname/projectname/). That is what I mean by relative vs absolute URL endpoints. I hope I was able to describe it properly.

Ahh, got it. Here I'd love to propose (in my PEP) that we keep the legacy URLs on PyPI (for legacy reasons) but in the standard move to something like (and implement on PyPI - I will happily do that):

Totally open to better ideas, but something like this will allow you to get your per Index JSON API :)

cooperlees commented 4 years ago

Ok - I finally sat down and described all the JSON fields whose purpose I could decipher.

Returned JSON fields I need help with:

Branch is here: https://github.com/cooperlees/peps/tree/warehouse_json_api

What's left to do before I put up a pull request for review by more PEP-savvy people? How do I get a PEP number, etc.?

I still expect this needs a lot of refinement, but I'm getting to the limits of my knowledge of the API from just using it. I think the best way forward is possibly having PyPI maintainers all take a pass at cleaning it up. I think I've done the brunt of the boring manual work of reading JSON files and trying to work out which fields we should make required, etc.

Thanks! Looking forward to closing this one out.

pradyunsg commented 4 years ago

https://github.com/cooperlees/peps/blob/warehouse_json_api/pep-9999.rst

For anyone else trying to get to the PEP quickly. :P

pradyunsg commented 4 years ago

What's left to do before I put up a pull request for review by more PEP-savvy people? How do I get a PEP number, etc.?

Brett recently answered a few questions related to this over on discuss.python.org.

In terms of the process, I think you'll also want to file a PR to packaging.python.org -- adding a page to https://github.com/pypa/packaging.python.org/tree/master/source/specifications detailing the final design that folks use/implement.

From https://discuss.python.org/t/how-to-propose-new-specs/4721/7?u=pradyunsg:

The way that I understand the situation is:

  • the PEP contains all the information like "Why did we do … and not …"
  • the PR to packaging.python.org adds a page that describes …

fschulze commented 4 years ago

I still think we should clarify the location of the API to not make it pypi.org centric. I would propose that the base for PyPI be defined as https://pypi.org/json and that all other endpoints like /json/discover/$call_name are redefined from that base, i.e. $base/discover/$call_name. It should also be made clear that tools should strive to offer a way to configure the base to be usable with non PyPI package indexes like devpi.net
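As a rough illustration of the base-relative layout proposed here, the following Python sketch resolves endpoints against a configurable base instead of a hardcoded pypi.org URL. The helper name `endpoint` is hypothetical, `call_name` is only the placeholder from the comment above, and the second base URL is an assumed devpi-style example rather than a confirmed endpoint.

```python
def endpoint(base: str, *segments: str) -> str:
    """Join path segments onto a configurable API base URL."""
    return "/".join([base.rstrip("/"), *(s.strip("/") for s in segments)])


# Base proposed for PyPI in the comment above.
PYPI_BASE = "https://pypi.org/json"

# Hypothetical base for one index on a multi-index server (an assumption).
OTHER_BASE = "https://example.com/username/indexname/+json/"

print(endpoint(PYPI_BASE, "discover", "call_name"))
print(endpoint(OTHER_BASE, "discover", "call_name"))
```

A tool that takes the base from configuration only ever builds relative paths, so the same code works against any index that exposes the API under its own root.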

cooperlees commented 4 years ago

I still think we should clarify the location of the API to not make it pypi.org centric. I would propose that the base for PyPI be defined as https://pypi.org/json and that all other endpoints like /json/discover/$call_name are redefined from that base, i.e. $base/discover/$call_name. It should also be made clear that tools should strive to offer a way to configure the base to be usable with non PyPI package indexes like devpi.net

That's the main intent and why I added the /json URLs on the PEP. Please feel free to suggest wording changes to make it clearer. I am a terrible writer. Just doing this cause I want the functionality, not cause I like writing. I actually dislike it a lot, so would appreciate ALL help I can get.

I think tools is scope creep for this PEP. This PEP is to just make a standard designed API so we can all implement it the same. Once we have that we should request tools to support it - i.e. different base Index URLs ... like pip can today.

brainwane commented 3 years ago

@kpfleming I think, based on https://discuss.python.org/t/pep-for-the-python-package-index-json-api/5717/16 , that you might want to check in here and give @cooperlees some feedback on the current draft.

mplanchard commented 3 years ago

I totally missed the ping on this back in June, but happened to see a notification about it yesterday. Thanks for thinking of us! The proposed PEP seems straightforward enough to implement, and it doesn't conflict with anything pypiserver is currently providing. I have some minor questions (let me know if you'd prefer we had this conversation over on discuss.python.org -- I don't have an account there currently so figured I'd ask here):

I also have some questions that are definitely outside the scope of the PEP, like how pip will handle backwards compatibility with the old simple API and whether the intent is for pip to eventually drop support for it, the answers to which will inform the degree of urgency in updating pypiserver to support the new API.

Definitely glad to see this effort. It'll be great to have a clear schema that we can implement again.

cooperlees commented 3 years ago

  • What would non-PyPI repositories be expected to send for last_serial, which is described as being a required field defined as "Internal PyPI serial indicating last modification"?

For pip and many tools this is not really used. Bandersnatch uses it to ask for packages that have changed since serial X. It should just be some sort of incrementing integer: on every upload you could just increment it. I would envision this could even just be 0 on your mirrors, unless you want to make your package index mirrorable by bandersnatch.
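A minimal sketch of that idea, assuming a toy in-memory index: every upload bumps a global counter, and a mirroring client asks which projects changed after the serial it last saw. The class and method names are hypothetical and do not reflect Warehouse's or bandersnatch's actual code.

```python
from typing import Dict, List


class ToyIndex:
    """In-memory stand-in for an index that tracks a global serial."""

    def __init__(self) -> None:
        self.last_serial = 0
        self._project_serial: Dict[str, int] = {}

    def upload(self, project: str) -> int:
        """Record an upload: bump the global serial and stamp the project."""
        self.last_serial += 1
        self._project_serial[project] = self.last_serial
        return self.last_serial

    def changed_since(self, serial: int) -> List[str]:
        """Return projects modified after the given serial, oldest change first."""
        changed = [(s, p) for p, s in self._project_serial.items() if s > serial]
        return [p for _, p in sorted(changed)]


index = ToyIndex()
index.upload("alpha")            # serial 1
index.upload("beta")             # serial 2
checkpoint = index.last_serial   # a mirror remembers serial 2
index.upload("alpha")            # serial 3
print(index.changed_since(checkpoint))  # ['alpha']
```

An index that never intends to be mirrored could return a constant here, which matches the suggestion of simply reporting 0.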

  • Currently pypiserver doesn't bother to parse the metadata files in the packages that are uploaded, instead using the standardized filenames to parse package names and versions. As such, populating some of the required fields in the info response would require larger changes than just adding endpoints, specifically author, author_email, license, and project_url. Given the pypiserver's goal of being able to immediately serve packages that are simply scp'ed or whatever to a server, we've avoided so far implementing a local metadata cache or anything like that. It's seeming more and more likely that we'll eventually have to do that regardless, but I'd be curious to know whether these fields are really required.

I think you should just start off pulling the size from the file, using the upload time, etc., to fill in as much metadata as you can and see how happy that makes your users. Otherwise, have a formal upload where all the metadata is calculated, and for your scp'ed files you do a best effort, IMO.
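One way to picture this best-effort approach is the Python sketch below, which derives a few fields from a wheel's filename and its filesystem entry without opening the archive. The function name and the choice of returned keys are assumptions made for illustration, and the file's modification time merely stands in for a real upload time.

```python
import os
import tempfile
from datetime import datetime, timezone


def best_effort_metadata(path: str) -> dict:
    """Derive what we can from a wheel's filename and filesystem entry."""
    filename = os.path.basename(path)
    stem, ext = os.path.splitext(filename)
    if ext != ".whl":
        raise ValueError(f"not a wheel: {filename}")
    # Wheel names are name-version(-build)?-python-abi-platform, and the name
    # and version components are escaped so they never contain "-".
    name, version = stem.split("-")[:2]
    stat = os.stat(path)
    return {
        "name": name,
        "version": version,
        "filename": filename,
        "size": stat.st_size,
        # Modification time is only a stand-in for a true upload timestamp.
        "upload_time": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }


# Demo with a throwaway file that merely has a wheel-shaped name.
with tempfile.TemporaryDirectory() as tmp:
    wheel = os.path.join(tmp, "demo_pkg-1.0-py3-none-any.whl")
    with open(wheel, "wb") as fh:
        fh.write(b"12345")
    print(best_effort_metadata(wheel))
```

Fields such as author or license cannot be recovered this way; they would still require reading the metadata inside the archive.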

I also have some questions that are definitely outside the scope of the PEP, like how pip will handle backwards compatibility with the old simple API and whether the intent is for pip to eventually drop support for it, the answers to which will inform the degree of urgency in updating pypiserver to support the new API.

I would expect that once pypi.org supports this PEP, we would go make pip use it ASAP. I would also expect pip to keep the legacy methods for a period of time and then kill the non-PEP code. I am not a pip maintainer so I can't make an authoritative decision here, but would be down to help do this work, if I ever get this PEP through. I am sure this would be a GitHub issue etc., and I would just say to be involved in those PRs / issues and follow along.

kpfleming commented 3 years ago

@brainwane Thanks for the shoutout to bring me here :-)

@cooperlees I'd be happy to collaborate on this PEP with you, acting as the copy-editor/reviewer to help ensure that the content is readable and understandable. I have both a desire for this PEP to be published (so that my company's tooling can benefit from it) and plenty of experience in document review and editing, so hopefully that will be a good combination.

cooperlees commented 3 years ago

Well I feel it's ready (and has been for quite some time) to just get polished up and have any technical issues debated out.

I'll try to rebase the commit and remind myself where we all are. I feel we just need approval from @ambv and @dstufft really.

@kpfleming - Happy for you to fork and PR, or just go comment on the latest commit with suggestions, fixes, etc.

I'd love to land it and go and implement the endpoints for Warehouse ASAP.

pfmoore commented 3 years ago

I'd love to land it and go and implement the endpoints for Warehouse ASAP.

I assume it still needs to be published for review & discussion prior to approval (as far as I've seen it's not been posted to Discourse yet)? I'm very interested in this PEP but haven't paid much attention while it was in pre-PEP stage.

kpfleming commented 3 years ago

OK, I'll put together a PR this weekend to try to get the pre-draft into a submittable state.

A question though: "go and implement the endpoints for Warehouse ASAP" implies that this PEP will result in work in Warehouse, but this PEP is supposed to document the existing API. Which way is this going to go?

cooperlees commented 3 years ago

It is documenting it, but the way it is today it cannot support self-contained mirrors for third-party indexes, particularly if they serve multiple indexes. There are also lots of little endpoints and tweaks needed to make it more complete. @pfmoore and others have made requests in regard to this. This "JSON API" was never fully designed to be an authoritative API, thus the need for this PEP.

For Example:

mplanchard commented 3 years ago

  • Currently pypiserver doesn't bother to parse the metadata files in the packages that are uploaded, instead using the standardized filenames to parse package names and versions. As such, populating some of the required fields in the info response would require larger changes than just adding endpoints, specifically author, author_email, license, and project_url. Given the pypiserver's goal of being able to immediately serve packages that are simply scp'ed or whatever to a server, we've avoided so far implementing a local metadata cache or anything like that. It's seeming more and more likely that we'll eventually have to do that regardless, but I'd be curious to know whether these fields are really required.

I think you should just start off pulling the size from the file, using the upload time, etc., to fill in as much metadata as you can and see how happy that makes your users. Otherwise, have a formal upload where all the metadata is calculated, and for your scp'ed files you do a best effort, IMO.

Yeah there's definitely metadata we can collect. I guess my concern was more about those fields in particular being specified as required in the jsonschema. I assume that means that tools using this API are free to break if those fields are absent, which does put us into a position of needing to support those required fields specifically in order to remain compatible.
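To show what "required" could mean in practice for a consumer, here is a small Python sketch that reports which of the fields named above are absent from a response's info object. The tuple of field names comes from this comment; the helper name and the sample response are hypothetical, and the draft's actual jsonschema may list other required keys.

```python
from typing import List, Sequence

# Fields this comment singles out as hard for pypiserver to populate.
REQUIRED_INFO_FIELDS = ("author", "author_email", "license", "project_url")


def missing_info_fields(
    response: dict, required: Sequence[str] = REQUIRED_INFO_FIELDS
) -> List[str]:
    """List required keys that are absent from the response's info object."""
    info = response.get("info") or {}
    return [field for field in required if field not in info]


# Made-up response from an index that only knows names and versions.
sample = {"info": {"name": "demo", "version": "1.0", "author": "A. Dev"}}
print(missing_info_fields(sample))  # ['author_email', 'license', 'project_url']
```

A strict client would reject such a response outright, which is exactly the compatibility pressure described above.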

nchepanov commented 3 years ago

Hello! I work in Bloomberg's Python Infrastructure team and have now been given time to be involved in this on a regular basis and hopefully help move this forward.

After speaking with @di, @cooperlees, and @pradyunsg and reading through all of the context, I've discovered that there seems to be a tendency toward scope creep when discussing the PEP, and I'd like to first define the scope of the proposed PEP more clearly.

If we can agree on the scope below, I will make adjustments to @cooperlees's draft and begin working on implementing a PR for pypa/warehouse.

CC: @dstufft @ewdurbin

PEP Scope

Future work

The following improvements should eventually be done, but arguably deserve their own PEPs

ewdurbin commented 3 years ago

@nchepanov Thank you for helping to bring this together!

At this moment in time, the only concern I have with your proposed course of action is regarding authentication and establishing a URL structure.

While the current JSON endpoints are highly cacheable, if that changes in the newly proposed API endpoints we would need to require authentication (or consider how we would manage a highly cacheable unauth'd endpoint and require auth for other cases).

As the stated course of this PEP would propose an API URL structure, I think we would want to at least start a draft for the "foundational" components of a new PyPI API to reference (and discuss those concerns separately). These include URL structure, versioning, and likely authentication.

also pinging @di for input.

di commented 3 years ago

I'll +1 @ewdurbin's comments: I think the only endpoints that would absolutely need auth are those that create/update/delete. The current JSON API does none of that, but I don't think we'd want to rule it out for a hypothetical future API.

nchepanov commented 3 years ago

@ewdurbin @di thank you for taking the time to review the outline!

Let me make sure I understand your concerns correctly:

It is important to you that any API URL structure this PEP may introduce allows for the ability to add authentication for create/update/delete operations or operations that are not designed to be highly cacheable.

If this understanding is correct, then we are all set:

One of the intentions of the new API is to make no changes to the shape of the API, consequently, the new API end-points have the same exact properties as the existing JSON API. In other words, the new API is both read-only and highly cacheable.

The current JSON API does none of that, but I don't think we'd want to rule it out for a hypothetical future API.

This is absolutely correct. To be more specific, the PEP suggests:

They can be extended in many ways to accommodate discoverability capabilities, auth, pagination and more.

I think we would want to at least start a draft for the "foundational" components of a new PyPI API to reference

This is definitely important, however, it's arguably outside of the scope of this PEP. If we want to make progress, carrying this PEP all the way would be the first step. Once this is done, the next PEP can address the "foundational components" of the new PyPI API.

It looks like we can take https://github.com/pypa/packaging-problems/issues/367#issuecomment-819691296 as written and proceed to update the PEP draft, and start working on the PR that implements it.

Cooper and I are waiting on your LGTM to move forward.

pfmoore commented 3 years ago

Just to clarify, am I right that the intention here is to write a PEP, which defines the API that any index which claims to support the "Package index JSON API v1" will provide? The API will be functionally identical to the existing Warehouse API, but with more precisely defined semantics.

As a Package Index Interface PEP, this would be down to @dstufft to pronounce on, I assume, and it would go through the normal PEP process, with a round of discussion in the Packaging category on Discourse before approval? You'll need a PEP sponsor to take this through the process as well. I'd advise getting the draft PEP published first, so that discussion can get under way while you're working on the PR. You should also reach out to interested parties like the devpi and Artifactory developers, who may want to add the JSON API to their index software, so they can add their feedback.

Personally, I'm very much in favour of this, not least because it will provide a good foundation for migrating the XMLRPC API to JSON in the future (but I completely agree that should be out of scope for v1). So for what it's worth, you have a +1 from me. But I do think there are some rough edges in the existing API which will need sorting out - for example, the project-level requires_dist and requires_python data, which isn't actually meaningful (even though Warehouse collects and stores it). That sort of point can be thrashed out in the public discussion, though.

nchepanov commented 3 years ago

Looks like the outline in https://github.com/pypa/packaging-problems/issues/367#issuecomment-819691296 didn't cause any major opposition. Here's what I'm going to do starting next week:

I'm new to this community / process, let me know if there's anything I'm missing or can do better.

pradyunsg commented 3 years ago

update the draft and submit a PR into python/peps

You should find a PEP sponsor before creating the PR. The process would be roughly:

pradyunsg commented 3 years ago

One thing that I just realized that I never stated publicly: I don't like having the releases key in the standardised API. That key singlehandedly hurts cachability and is very odd to include in release-specific endpoints too.

I'd like for us to remove that key on the release-specific endpoints, as part of this change/restructuring.

I know I'm asking for something that Nikita explicitly wants to keep out-of-scope ("no changes to the output"). I think removal of existing keys is less problematic for this whole effort than addition/renaming/restructuring, especially since this is information that's duplicated and unnecessarily increases response sizes.

To be abundantly clear, this doesn't directly affect any of the plans/next steps here. I feel that the weirdness of this key would become clearer once it is described in a design document (i.e. PEP).
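To make the duplication concrete, the Python sketch below drops the top-level releases key from a made-up release-specific response and compares the serialized sizes. The sample data is invented for illustration, and `without_releases` is a hypothetical helper, not part of any existing tool.

```python
import json


def without_releases(response: dict) -> dict:
    """Return a shallow copy of the response without the top-level 'releases' key."""
    return {key: value for key, value in response.items() if key != "releases"}


# Invented response for one release; 'releases' repeats data for every version.
sample = {
    "info": {"name": "demo", "version": "2.0"},
    "urls": [{"filename": "demo-2.0.tar.gz"}],
    "releases": {
        "1.0": [{"filename": "demo-1.0.tar.gz"}],
        "2.0": [{"filename": "demo-2.0.tar.gz"}],
    },
}

trimmed = without_releases(sample)
print(len(json.dumps(sample)), "->", len(json.dumps(trimmed)))
```

Because releases grows with every new upload, a release-specific response that includes it changes whenever any version is published, which is the cacheability concern raised here.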

pfmoore commented 3 years ago

See https://github.com/pypa/warehouse/issues/9536 for a use case that's not covered by the existing JSON API (mirroring the PyPI metadata).

I understand that the scope here is just to reorganise the existing API, not to add new functionality or deprecate the XMLRPC API, but I think that while we are reorganising things, we should keep known use cases in mind so that they don't get "lost in the shuffle".

nchepanov commented 3 years ago

@pfmoore the need for some form of subscription API on "what changed since I last looked" is the very motivation for our involvement in this project. It appears that standardization of the JSON API as-is is the first step to getting anywhere; without it the maintainers are not comfortable evolving the API in any direction. But ultimately, we (Bloomberg Engineering) want some form of "what changed" API that we can subscribe to.

nchepanov commented 3 years ago

The PEP Draft is ready for comments: https://discuss.python.org/t/pep-rfc-python-package-index-warehouse-json-api-v1/9205

Once we get enough feedback, I will open a PR into python/peps.

cooperlees commented 1 year ago

Should we close this? The consensus seems to be to extend the Simple API going forward, as PEP 691 did, through PEPs and requests for comments on said PEPs. This has gone round and round, and a lot of people have said we should freeze the "JSON API" and move elsewhere. Thoughts?