racket / racket-pkg-website

A frontend for the Racket Package Catalog.

Pull based model and content negotiation? #59

Open jackfirth opened 7 years ago

jackfirth commented 7 years ago

Right now, for the package agile (chosen arbitrarily), the following URLs are used to access information about the package:

Additionally, every so often (I'm not 100% sure of the exact timing) the package server publishes the HTML from pkgd to S3 so it can be served by pkgs. This is a push-based model: S3 acts as an independent copy of the catalog that the Racket server is responsible for updating as needed. I think it would be simpler and easier for all systems involved if the catalog used a pull-based model with AWS CloudFront, content negotiation, and appropriate use of caching headers.

In this proposed setup, there would be one URL per package that served both HTML and s-exp data using the Accept and Content-Type headers, and CloudFront would act as a reverse proxy serving cached copies of package data based on the origin server's Cache-Control headers. This would remove the need for three different URLs for the same information, since CloudFront could serve cached GET requests while forwarding POST, PUT, and DELETE requests. CloudFront can serve GETs even if the origin server is down, so we'd still be able to offer a read-only view of the catalog during an outage.
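
For concreteness, here's a rough sketch (in Racket) of what the client side of that single-URL scheme might look like. The URL shape and the "application/s-expression" media type are illustrative assumptions, not something the catalog currently advertises:

```racket
#lang racket
;; Sketch of the proposed single-URL scheme from the client's point of view.
;; The URL shape and the "application/s-expression" media type are
;; illustrative assumptions; the real catalog would have to pick both.
(require net/url)

(define (pkg-url name)
  (string->url (format "https://pkgs.racket-lang.org/pkg/~a" name)))

;; Machine-readable view: ask for s-expression data and parse it with `read`.
(define (pkg-details name)
  (define in (get-pure-port (pkg-url name) '("Accept: application/s-expression")))
  (begin0 (read in) (close-input-port in)))

;; Human-readable view: the same URL, but asking for HTML instead.
(define (pkg-html name)
  (define in (get-pure-port (pkg-url name) '("Accept: text/html")))
  (begin0 (port->string in) (close-input-port in)))
```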

This would also allow tools other than browsers to cache package data and interact with the HTML view of packages. A raco pkg user could open an installed package's catalog entry without needing to know anything other than the URL the package was installed from. Scribble could do the same (this came up in discussion of racket/racket#1797) to link package mentions in docs to their catalog entries. Additionally, raco pkg would be able to reuse the same caching headers exposed to AWS to locally cache package data.

tonyg commented 7 years ago

One requirement has been that a static render of the (human- and) machine-readable forms of the package information should be available whether the dynamic part of the site is running or not. If a pull-based model were used, we'd want some way of priming the cache (effectively pushing) and invalidating individual cached items on demand. The cache priming ensures the content is available even if the dynamic server subsequently goes away. The invalidation would be used when a package changed. We'd also want the cache to continue to serve "stale" content if it couldn't reach the upstream. (We may want to alter this requirement, but as it stands the code has been written to conform to it.)
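
As a rough sketch of what priming could look like under that requirement (in Racket; the URL shape and media type are assumed for illustration), a script could simply re-fetch each changed package through the cache so a fresh copy of each representation is stored before the dynamic server might become unavailable:

```racket
#lang racket
;; Rough sketch of "priming": after a package changes, fetch its entry
;; through the cache so a fresh copy of each representation is stored before
;; the dynamic server might go away. URL shape and media type are assumed.
(require net/url)

(define cache-host "https://pkgs.racket-lang.org")

(define (prime-package! name)
  (for ([accept (in-list '("text/html" "application/s-expression"))])
    (define in (get-pure-port (string->url (format "~a/pkg/~a" cache-host name))
                              (list (format "Accept: ~a" accept))))
    (void (port->string in))
    (close-input-port in)))
```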

URL structure is another question. The "/pkgn" prefix on the logged-in version of a package page is a mistake that is there for historical reasons. I messed up when we transferred to the new website. And the use of pkgo is again because of constraints during the transition to the new site. There is a pkgs version of the same information, http://pkgs.racket-lang.org/pkg/agile, and we would be using that but for #40. (Jack, is pkgo really what the raco pkg tools use?)

I don't know how we'd make it so that a single URL worked for both static-render and logged-in render if a cache was involved. It's been a while since I had to think about such issues. Jack, how would this work?

I'm neutral on conneg. It's cool in principle but sounds hairy in practice. Lots of HTTP clients out there are too basic to handle conneg well. Perhaps it'd be OK in this case since we have control over the really important HTTP clients that'd be involved.

Actually, I think there's an argument to be made that there are multiple resources here, not just multiple representations of one resource. The human-facing page could (and probably should) be viewed as logically distinct from the machine-facing information, and the editable view as distinct from the read-only view. I think I tentatively believe this argument. Consider a browser user wishing to see the machine-readable form of the information. How does she achieve this in a conneg, single-URL setup? (Relevant: https://wiki.whatwg.org/wiki/Why_not_conneg#Negotiating_by_format)

jackfirth commented 7 years ago

Oops, I hadn't noticed the http://pkgs.racket-lang.org/pkg/agile URL and assumed it was only available from pkgo. I'm sure the pkgs URL is what the raco pkg tools use.

Warning: this is long

TL;DR: I think CloudFront (and some other CDNs, and many caching reverse-proxy implementations we could run ourselves) can address the problems you mention (including auth), and I think content negotiation plus a small hack would bring benefits worth the tradeoffs.

For caching, I'd like to clarify my understanding of the package catalog's use cases. As far as I know, the primary clients of the catalog and tools integrating with it are:

Furthermore, the catalog's caching goals are:

The following are explicitly not goals of the catalog:

So the primary concern is providing redundancy, not scale, and saving bandwidth is more important than lowering latency. The rest of my thoughts on caching for the package catalog are based on the above assumptions, so please correct me if they're wrong. The current setup achieves a lot of this, but notably raco pkg does not benefit from caching, so CI builds fail if the package server is down, and there is no documented protocol for navigating from a package's machine-readable catalog entry to its HTML representation.

Proposed catalog API

In this proposed setup, the catalog hosts only a few resources:

By default, /pkg/{name} resources return Racket data to support raco pkg, which does not send an Accept header. All modern browsers send an Accept header indicating that they prefer text/html, so this doesn't mess with browsers. We could also switch the URL form to /pkgs/{name}, where we strictly enforce content negotiation, and let /pkg/{name} serve as a deprecated fallback for older clients.
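
As a sketch of that default rule only (not the actual site's dispatch code, and with real Accept parsing such as q-values deliberately ignored), the choice of representation might look something like this:

```racket
#lang racket
;; Simplified sketch of the proposed default: browsers advertise text/html in
;; their Accept header, while raco pkg sends no Accept header and therefore
;; falls through to the s-expression representation.
(define (representation-for accept-header) ; accept-header : (or/c string? #f)
  (cond
    [(and accept-header (regexp-match? #rx"text/html" accept-header)) 'html]
    [else 's-expression]))

;; (representation-for "text/html,application/xhtml+xml") ; => 'html
;; (representation-for #f)                                ; => 's-expression (raco pkg)
```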

Other resources like JavaScript, CSS, and per-user pages would also be provided, but I won't focus on those at the moment.

Caching and CDNs

Other benefits

About multiple resources and conneg

The main reason I'm in favor of content negotiation is that it is well standardized, gives us forward compatibility, and plays nicely with caching. The HTML and Racket data views of a package are not separate resources, because they logically refer to the same underlying piece of mutable state. This is why content negotiation works nicely with caching: it makes it easy for caches to automatically invalidate all representations of a resource when that resource is edited.
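
For illustration, the cache-facing headers for GET /pkg/{name} might look roughly like the following; the specific lifetimes are placeholders, and stale-if-error is an extension (RFC 5861) whose support varies from cache to cache:

```racket
#lang racket
;; Illustrative only: headers the catalog might attach to GET /pkg/{name}.
;; "Vary: Accept" tells shared caches to store the HTML and s-expression
;; variants separately, and an unsafe method (PUT/POST/DELETE) on the same
;; URL is what lets a cache drop every variant at once. The max-age value is
;; a placeholder; stale-if-error (RFC 5861) is one way to say "keep serving
;; stale copies during an outage", though cache support for it varies.
(define pkg-cache-headers
  (list "Vary: Accept"
        "Cache-Control: public, max-age=300, stale-if-error=86400"))
```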

Editable vs. read-only views have some issues when presented as separate resources, because whether a user can edit or view a resource is a dynamic property of many different factors that a user cannot always be aware of. For instance, a user might GET an editable view only to discover that, by the time they send a request to edit the resource, they no longer have access. Or a user might have access, but be either unwilling or unable to provide credentials in the form requested by the server, such as a server that only supports cookie-based auth and a client that does not implement a cookie store. Or clients may completely ignore any forms or editing controls and construct PUT requests in a machine-readable format, since the semantics of PUT say it always represents an update to a resource. To handle such cases, it's reasonable for a server to always provide editing controls in human-readable representations and rely on metadata on the controls, or on scripting, to only display them when a user can actually use them.

Things this doesn't handle well

This protocol does not address the package catalog proactively putting package data in CloudFront before it's requested, and it does not ensure that every package has a machine-readable representation in CloudFront's caches to prevent some subset of rarely-requested packages from being unavailable in the event of a catalog outage. I'm not sure if CloudFront has a way to function as a "backup", but it wouldn't be hard to configure Varnish or Squid to do this.

Caching the list of all packages doesn't scale well as the frequency of new packages being published increases. This could be addressed with cache channels, which Squid supports. We already have some of the infrastructure needed to implement cache channels via the Atom feed.

NoScript editing of a package is tricky because HTML does not support sending PUT or DELETE requests via forms or buttons. This could be done by tunneling the form submission over a POST request, but we'd lose the automatic cache invalidation that PUT and DELETE provide.
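
As a sketch of that tunneling approach (the hidden "_method" field name is an assumption for illustration, not an existing convention in this codebase):

```racket
#lang racket
;; Sketch of the usual "method override" workaround: an HTML form can only
;; GET or POST, so the form carries a hidden field (assumed here to be named
;; "_method") and the server treats the POST as the method named there.
(define (effective-method request-method override) ; override : (or/c string? #f)
  (if (and (equal? request-method "POST") override)
      (string-upcase override)
      request-method))

;; (effective-method "POST" "put")  ; => "PUT"
;; (effective-method "POST" #f)     ; => "POST"
;; (effective-method "GET" #f)      ; => "GET"
```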

Cookie-based auth requires that the catalog server maintain session state, and raco pkg probably shouldn't implement support for cookies. The Authorization header is a more standard and open (in the sense that intermediaries can understand it) way of specifying auth, but NoScript users can only use very basic auth protocols such as Basic or Digest, which have their own problems.