Determine new API URL structure for warehouse (starting with new JSON API)

ctheune commented 10 years ago

At the PyCon2014 sprint I have started to make bandersnatch easier to cache. This means moving away from XML-RPC in general.

I'm leveraging the existing /pypi//json API which already helps, but I'll need two more endpoints:

get a list of all packages and their most recent serial
get the changelog

I implemented the necessary code on a branch for PyPI: https://bitbucket.org/ctheune/pypi/branch/ctheune-bandersnatch-json

However, I don't wanna force this through but have a decision how the URLs should look like.

Ideally we can implement this in both warehouse and PyPI in a way that bandersnatch can support both of them without breaking when you guys switch the public server (and I might be on vacation. ;) )

dstufft commented 10 years ago

So I have some ideas on both a new API for accessing data about PyPI and also some rough ideas for a new mirroring API in general. I'll take a look at what you have so far.

dstufft commented 10 years ago

So if I read your PR correctly, the new URLs would be https://pypi.python.org/json/changes and https://pypi.python.org/json/packages? If that's the case then I'm not really a big fan for adding those in Warehouse.

Ideally what i'd like to do is get a nice hypermedia based API setup probably rooted at /api/. Using something based on https://jsonapi.org/ is a possibility. There are a few options and need to dive into it to figure out what exactly needs done. Ideally the new API will also replace the existing JSON api and we can deprecate the old JSON api (but leave it in place until (or if!) it's no longer getting traffic.

r1chardj0n3s commented 10 years ago

I echo @dstufft in this. The question I have is whether we go all the way to /api/v0/ to future-proof us a little too. Unless we'd be happy with /api-v1/ or similar later on?

dstufft commented 10 years ago

So there are two ways to deal with that, one way is to version using the content type, so it's always /api/ but it'll select the version based on the content type, github uses this like: Accept: application/vnd.github.beta+json or Accept: application/vnd.github.v3+json. The other way is to do /api/v0/ etc. I lean towards using the content type but we'll need to figure out in general how we want to handle versioning going forward and how the code to handle that looks like.

steveklabnik commented 10 years ago

Just let me know if I can help out at all regarding JSON API stuff.

brainwane commented 6 years ago

As I understand it, this issue (designing and implementing a new Warehouse API) is a prerequisite for integrating twine into pip and thus dealing with pypa/packaging-problems#76 and pypa/packaging-problems#60 , per @dstufft's comment in pypa/twine#127. Is that correct? If so, I'd suggest we add this to one of our upcoming milestones.

brainwane commented 6 years ago

We talked about this issue in today's bug triage meeting and folks explained to me: Even though this may be necessary for some Twine improvements, this is not a ticket we will address before launch. This is a new feature and is best suited for post-launch; Warehouse needs to be done before we can improve twine.

phildini commented 6 years ago

Hello! Quick call-out that some of us would really enjoy the JSON API containing info like owners/maintainers before the XMLRPC API is shut down. See ticket linked right above. Cheers, thanks for all your work!

brainwane commented 6 years ago

I've marked #2914 as something we should address before shutting down legacy PyPI, but developing the structure for the new API can wait till after we shut down the legacy site.

brainwane commented 6 years ago

As we develop the new API we should consider #347 as well. And I've added this issue to the list of things we might work on at the PyCon sprints.

theacodes commented 6 years ago

I would also love an API for managing both my account and my projects. For some examples of where this is useful:

We have an account that owns all of the projects our organization publishes. I want to rotate its password every week.
Likewise, I want to audit all of my organization's project and verify that no more than n people have admin access to it.
I am actually in the process of migrating all of my projects in my personal account to a new account. It would be cool to do that programmatically.

I'm happy to help with the design and discussions around this (my day job is helping design APIs and implement clients for Google Cloud Platform).

di commented 6 years ago

I am actually in the process of migrating all of my projects in my personal account to a new account. It would be cool to do that programmatically.

We could probably just call this "the ability to add/remove collaborators via API" I think, since actual account migration is probably not something that happens very often.

dstufft commented 6 years ago

I'm hoping to carve out some ideas on this soon, maybe next week? Ideally the output of this ticket is the basic framework/skeleton of the API, and then further tasks can extend the functionality of it.

Defining APIs for PyPI is a tad bit trickier than the general case, because we generally have to design for a decade+ (for instance, XMLRPC got added, but it has not aged or scaled well! From my investigations so far, GraphQL would be a similar mistake). I'm almost certain that something Hypermedia based is the way forward here, but there's a lot of different ways to take that, we'll also need to be ensure to include all of the typical scaling things like pagination and the like.

theacodes commented 6 years ago

We could probably just call this "the ability to add/remove collaborators via API" I think, since actual account migration is probably not something that happens very often.

Yep just calling a a specific use case.

I'm hoping to carve out some ideas on this soon, maybe next week? Ideally the output of this ticket is the basic framework/skeleton of the API, and then further tasks can extend the functionality of it.

Sounds good, happy to review and be around to bounce ideas off of (I'm on IRC during PST working hours as thea).

Hypermedia based is the way forward here, but there's a lot of different ways to take that, we'll also need to be ensure to include all of the typical scaling things like pagination and the like.

Agreed - REST/JSON (and to some extent RPC/JSON) has more or less stood the test of time (in tech years, at least). Happy to provide feedback on that sort of stuff as well.

brainwane commented 6 years ago

@jonparrott In our meeting on Monday Donald said he's started a design document about the new API(s). He's currently trying to focus on Python open source work on Thursdays and Fridays, so maybe you could ping him this Friday and start an Etherpad/Google Doc?

@steveklabnik and @phildini, will you be at PyCon North America later this month? Several of us will be discussing and working on the API redesign during the sprints.

steveklabnik commented 6 years ago

I will not. I wasn't aware until now :) maybe we could make some time to chat via skype or something? Honestly, at this point, @dgeb is the person you'd want to talk to, so let's cc him here too :)

theacodes commented 6 years ago

@dstufft I'll be around on IRC today if you want to chat about the API design. I'll also be at PyCon if you want to discuss in person.

di commented 6 years ago

Some notes from today's sprints:

Simple api is the biggest success story for scaling
- Hypermedia API
We should have a hypermedia API
- Allows us to change things in the background
- Move things around as we need to
- Not having a single route for everything
- Rules out GraphQL
- Bad idea: something that allows you to reduce the data you get back
- Because it's easier/better for us to just catch the full response
- Having more responses that are smaller is not a big deal with HTTP2
- Bad idea: something that lets you get data in bulk
- We have a long tail of packages
- Also uploads are very active
- This means that the cache for a bulk request will frequently be out of date
- It's easier for us to cache smaller responses about individual projects/releases
Technologies for hypermedia APIs
- Hyperschema (json schema stuff)
- Plain JSON over HTTP
- http://json-schema.org/latest/json-schema-hypermedia.html
- Actual json document + schema document
- No library implementation
- If you ignore the schema, it's just a plain JSON api
- Ideally, you don't hard-code URLs in the individual endpoints
  - These are defined by the schema instead
Versioning is just done by changing link headers, not via URL path
- Perhaps the schemas will be versioned
Would need to be in the core of warehouse
- could live at api.pypi.org?
Eventually existing APIs will just use this new API
- JSON-legacy/XML-RPC will just wrap this new API
- Then we can work in getting people off these APIs
Other things we need:
- Some standard for error responses for the entire endpoint
- Standard SUNSET header: allows the API to tell clients that endpoints are being deprecated
- Will need to bake this into the various clients, but should be done from day one

(cc @cooperlees @asmacdo @dwighthubbard)

theacodes commented 6 years ago

Was the possibility of an RPC-based API discussed at all? I'd like to know the reasons for ruling it out.

(for context, I'm not a big fan of hypermedia APIs - they tend to add an enormous amount of cognitive overhead for both users and maintainers).

Not that we should block on this by any means, just curious as to what was discussed.

merwok commented 6 years ago

There are many downsides to RPC APIs. You need to name every operation; you don’t benefit from HTTP features like caching or transparent retries since everything is a POST; you need to identify every resource ID instead of relying on URLs to identity them; you need to add authorization code in all methods instead of benefiting from Pyramid ACLs linked to the resource tree.

I find too that it’s not always easy to come up with names for all resources and define what the HTTP verbs should mean, but otherwise the uniform interface has lots of benefits.

theacodes commented 6 years ago

There are lots of ways of doing RPC APIs, and while your concerns are valid for some implementations, they aren't universal.

I mentioned this briefly during the sprints: Google's API style guide is an extremely well-done guide for creating RPC APIs that can also be exposed in a REST-like way. While it walks through the patterns and examples using protobuf, it's applicable regardless of the encoding type.

Hypermedia APIs are really bizarre to me personally. If we're going to go the hypermedia route, I would prefer us take after GitHub and be relatively conservative about it. Although I really hate the idea of using HTTP headers to convey useful information like next page tokens.

merwok commented 6 years ago

Thanks for the link!

(Same feeling about using headers. I am using JSON-API in a project and like the uniform response format with pagination links and related resources.)

theacodes commented 6 years ago

Yeah, using headers for actual data makes things transport-specific. Using full URLs for links also ties the API to a specific transport, whereas using resource names (basically just the path part of the URL) provides agnosticism. We want something that will have longevity, I think buying in deeply into hypermedia would be counterproductive to that.

dstufft commented 6 years ago

Was the possibility of an RPC-based API discussed at all? I'd like to know the reasons for ruling it out.

We didn't discuss RPC APIs though I had considered it when I was thinking about it prior to PyCon. My thinking is basically that:

The sort of stereotypical RPC API tends to use a single route URL (often times with POST) for all of it's requests. This is true of the older ones like XMLRPC and JSONRPC and some newer ones like GraphQL and I believe gRPC[1].
A large portion of our API is read-only, and much of that we want to implement in a way that can be served statically using nothing more than some files on disk and Apache/Nginx/etc. None of the gRPC frameworks seem to support this at all, and while they can maybe be twisted into making it workable, you're getting far off the beaten path where it's likely to break.
We scale our API by caching, while you can do an RPC based API [2] from what I've found of the various RPC frameworks, it's either not documented at all how to do it, or it's impossible. I delved into this with gRPC and basically just found a bunch of open issues asking for guiddance or mechanics for doing so.
Ultimately though, I said hypermedia because it's the style of API I most prefer and I think they best fit the use case of the web (after all, websites are the best example of hypermedia and they last decades just fine!). I'm happy to be convinced otherwise, but skimming through the google guide, I'm not seeing anything that stands out that makes me feel like "yea, this makes me really like RPC APIs now".

Of course similarly to RPCs, there are a lot of ways to do Hypermedia APIs ;) For instance, sticking full URLs or putting Link metadata into headers is not a mandatory part of Hypermedia APIs. Although having the option to do that is incredibly useful.

One of the key benefits of a fully hypermedia API is that the only thing you need is the root URL of the API and everything else is discovered at runtime. A historical use case where that would have been extremely useful is the upload API, it was originally living under the same domain, but as time went on we realized we needed it to live at a new, different domain. With a hypermedia API that would have been as trivial as starting to serve the root document that says the upload API resources are located at a different hostname, and all existing clients would have simply automatically started uploading to the new domain.

If you think in terms of HTML, the original API could have been written like:

<!-- https://api.pypi.org/ -->

<form method="POST" action="/upload/" name="upload">
   <!-- Some Upload Fields -->
</form>

As client could then do something like:

import urllib.parse

import html5lib
import requests

resp = requests.get("https://api.pypi.org/")
resp.raise_for_status()

data = html5lib.parse(resp.content)
upload = data.find("//form[name=upload]")

resp = requests.request(upload["method"], urllib.parse.urljoin("https://api.pypi.org/", upload["action"]))
resp.raise_for_status()

Now that code isn't perfect, but you can see here all that really got hard coded in the end client is that there is an action (via a form) named upload, and the base url. Now if we ever want to do something like, move uploading to it's own domain name (or a different URL or whatever), we simply start returning a new response like:

<!-- https://api.pypi.org/ -->

<form method="POST" action="//upload.pypi.org/" name="upload">
   <!-- Some Upload Fields -->
</form>

And all of the existing clients are not only still going to work, but they'll automatically adjust to point to the new location. Effectively instead of hardcoding URLs or structure, we're hardcoding link/action names (such as "upload" here).

Obviously HTML is a pretty crummy data exchange format for computers (great for presentation to users as a UI builder though) so we'll want some encoding that is built for data exchange. JSON is a popular one, but the big problem with JSON is there is nothing like forms or links built into it. So in order to implement something like the above, your client has to start baking assumptions into it like "when I fetch this URL, the data located in the "url" key is an URL". That isn't what we want (and isn't hypermedia!) so we have to look to other content types that do include those things.

There are a number of options for this like:

There are other ones too that offer similar things, but effectively what all of these do is attempt to marry the idea of actions and links into a data format that sucks less than HTML. Some of them rely on things like headers, while others rely on specially formatted response bodies, and others rely on a side by side schema file that can be combined with the response data to generate links/actions at runtime.

Again, I'm not particularly married to any one of these, however my personal favorite is JSON Hyper Schema. It relies on a side by side schema file that both validates the data in the response and lets you extract links/actions out of a JSON body without having to spend a lot of time formatting it in exactly the right way.

Here's a quick example of what it looks like using a demo of a blog:

Say you have a resource that can represent a blog post, you might have a JSON response that returns something like:

{
    "id": 5,
    "title": "JSON Hyper-Schema",
    "slug": "json-hyper-schema",
    "body": "My long post about JSON Hyper-Schema..."
}

You could then craft a JSON Schema document that looked like:

{
    "type": "object",
    "properties": {
        "id": {
            "type": "number"
        },
        "title": {
            "type": "string"
        },
        "urlSlug": {
            "type": "string"
        },
        "body": {
            "type": "string"
        }
    },
    "required": ["id"],
    "base": "http://api.example.com/",
    "links": [{
        "rel": "self",
        "href": "posts/{id}",
        "templateRequired": ["id"]
    }]
}

When you combine these two responses at runtime, you can fetch any of the data out of the JSON response you like, but you can also look and see "oh I can find out the link to myself by looking in links, finding the one that is rel=self, and combining it with data from the response (in this case id) to generate a href.

The full spec allows much more than this of course, you can link to other things like next/prev, where you'd include a bit in the response body that tells you what the next/previous page is, and then a link object to describe how to turn that into an URL or linking to actions that don't cleanly map to GET/POST/DELETE/etc.

One of the reasons I really like this style over the other styles, is the Hypermedia portion of it can more or less be completely ignored for just playing around use cases. Since the data itself looks like a typical "JSON over HTTP" API, you can treat it like that and hard bake in URLs etc etc, however if you do that then you risk breaking if the API gets restructured (like moving things to a different URL path, or what have you). That often times doesn't matter for quick one off scripts you're never going to run again, but for things that you plan on implementing for wider consumption or that you want to have working for a long time, you'll want to opt into the extra complexity of parsing the schema object and combining them at runtime.

Probably the biggest downside here is there are not a lot of things doing a full out Hypermedia API today, which means that tooling for them isn't super great across all languages.

Does that all make sense? I'm not currently married to any part of this of course, but that's sort of the thinking that I had.

[1] I had found some documentation that suggests maybe you can do something saner with gRPC, but it's really hard to find much documentation on how to actually use gRPC in an existing application, or really at all in Python. After spending like 2 hours unable to get even a demo gRPC service working, I gave up on it. [2] RPC and hypermedia aren't really mutually exclusive tbh, you have things like the defunct hyperglyph project which gave RPC using hypermedia.

theacodes commented 6 years ago

The example you showed for hypermedia is super problematic from my perspective because it requires clients to always do run-time discovery of the API. In practice, almost no one does this (and when forced to, it leads to overly complex client libraries). They will hardcode URLs and make assumptions, and if they don't, the will make assumptions about the links themselves (just like your hardcoded //form[name=upload]). This is really problematic in my opinion.

A true, fixed, versioned IDL (like protobuf, but I am by no means advocating for protobuf, JSON-Schema and friends also satisfy this) solves this. It is an ahead-of-time commitment to a specific interface. This is what I want. This allows clients to build with confidence.

Also, I wanted to clarify something about linking to the Google API Style Guide - I absolutely do not want warehouse to have a gRPC API, but I wanted to call out two things: that guide has well established patterns that work regardless of transport and Google uses protobuf to serve both gRPC and HTTP/JSON/REST APIs. See https://cloud.google.com/apis/design/standard_methods for examples of how we map common methods (List, Create, Get, Update, Delete) to HTTP methods. It might be a useful exercise to sketch out the API in proto or OpenAPI to see how it would look.

(I would strongly support adoption of OpenAPI specs, as we could immediately take advantage of swagger-generated clients)

steveklabnik commented 6 years ago

You don't do full discovery every single time. You keep your current position. It's one request each time, just like RPC.

On Thu, May 17, 2018 at 9:41 PM, Thea Flowers notifications@github.com wrote:

The example you showed for hypermedia is super problematic from my perspective because it requires clients to always do run-time discovery of the API. In practice, almost no one does this (and when forced to, it leads to overly complex client libraries). They will hardcode URLs and make assumptions, and if they don't, the will make assumptions about the links themselves (just like your hardcoded //form[name=upload]). This is really problematic in my opinion.

A true, fixed, versioned IDL (like protobuf, but I am by no means advocating for protobuf, JSON-Schema and friends also satisfy this) solves this. It is an ahead-of-time commitment to a specific interface. This is what I want. This allows clients to build with confidence.

Also, I wanted to clarify something about linking to the Google API Style Guide - I absolutely do not want warehouse to have a gRPC API, but I wanted to call out two things: that guide has well established patterns that work regardless of transport and Google uses protobuf to serve both gRPC and HTTP/JSON/REST APIs. See https://cloud.google.com/apis/ design/standard_methods for examples of how we map common methods (List, Create, Get, Update, Delete) to HTTP methods. It might be a useful exercise to sketch out the API in proto or OpenAPI https://github.com/OAI/OpenAPI-Specification/blob/master/examples/v3.0/petstore.yaml to see how it would look.

(I would strongly support adoption of OpenAPI specs, as we could immediately take advantage of swagger-generated clients)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pypa/warehouse/issues/284#issuecomment-390068471, or mute the thread https://github.com/notifications/unsubscribe-auth/AABsikKs0o1_7znSwi0BE4LM4ixG8cMhks5tzibggaJpZM4Byq-v .

dstufft commented 6 years ago

@theacodes I guess I'm just not seeing what an IDL actually gets us here? The example of JSON Hyper Schema has JSON-Schema as part of it, it's just instead of your client hardcoding URLs and actions all over the place, it can discover them at runtime. You can also ship them as part of your client so that a network access isn't required in the common case (unless you introduce a new schema).

dstufft commented 6 years ago

@theacodes If it would be helpful, I'm happy to jump on a call or into IRC to go over the two things in a higher bandwidth setting instead of throwing github comments back and forth. I feel like there are probably some misconceptions on both sides about RPC and Hypermedia, and perhaps a higher bandwidth mechanism would help to work out what those are?

theacodes commented 6 years ago

I don't want to hold up progress. I would love to see a design doc or proof of concept if/when we have one.

asmacdo commented 6 years ago

POC is up #4078, I'm using this etherpad to document design proposal. https://pad.sfconservancy.org/p/hypermedia_api_design

I did my best to incorporate the ideas discussed here, as well as in person discussions at pycon. I've set aside some time to keep working on this, so all feedback is welcome.

brainwane commented 4 years ago

Per discussion in IRC just now -- the author closed #4078 last year, and it's unclear whether this kind of feature would be welcome if someone were to try again at implementing it.

brainwane commented 4 years ago

Maintainers' opinions are welcome. Also, in my opinion, it would be easier to finalize design and implementation, test and review, and deploy this if we had funding for it.

brainwane commented 3 years ago

@asmacdo that Etherpad has now dissolved and reset -- did you keep a copy of your design proposal anywhere else?

Reminder to others that work on this could probably use funding.

asmacdo commented 3 years ago

@brainwane unfortunately I dont have a backup , but the PR could still be distilled down into a design proposal.

Key points:

Create a resource based, hypermedia API
Use Marshmallow to serialize
use apispec to generate OpenAPI schema

Additional necessary features

Pagination
Filtering
CDN caching (consider how pagination/filtering will affect)

di commented 2 years ago

Bit of a related update here: PEP 691 has been accepted.

pypi / warehouse

Determine new API URL structure for warehouse (starting with new JSON API) #284