racket / racket-pkg-website

A frontend for the Racket Package Catalog.

Pull based model and content negotiation? #59

Open jackfirth opened 7 years ago

jackfirth commented 7 years ago

Right now, for the package agile (chosen arbitrarily), the following URLs are used to access information about the package:

Additionally, every so often (I'm not 100% sure of the exact timing) the package server publishes the HTML from pkgd to S3 so it can be served by pkgs. This is a push-based model: S3 acts as an independent copy of the catalog that the Racket server is responsible for updating as needed. I think it would be simpler and easier for all systems involved if the catalog used a pull-based model with AWS CloudFront, content negotiation, and appropriate use of caching headers.

In this proposed setup, there would be one URL per package that served both HTML and s-exp data using the Accept and Content-Type headers, and CloudFront would act as a reverse proxy serving cached copies of package data based on the origin server's Cache-Control headers. This would remove the need for three different URLs for the same information, since CloudFront could serve cached GET requests while forwarding POST, PUT, and DELETE requests. CloudFront can serve GETs even if the origin server is down, so we'd still be able to offer a read-only view of the catalog during an outage.
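
For concreteness, here's a rough sketch (in Racket) of what the client side of that single-URL scheme might look like. The URL shape and the "application/s-expression" media type are illustrative assumptions, not something the catalog currently advertises:

```racket
#lang racket
;; Sketch of the proposed single-URL scheme from the client's point of view.
;; The URL shape and the "application/s-expression" media type are
;; illustrative assumptions; the real catalog would have to pick both.
(require net/url)

(define (pkg-url name)
  (string->url (format "https://pkgs.racket-lang.org/pkg/~a" name)))

;; Machine-readable view: ask for s-expression data and parse it with `read`.
(define (pkg-details name)
  (define in (get-pure-port (pkg-url name) '("Accept: application/s-expression")))
  (begin0 (read in) (close-input-port in)))

;; Human-readable view: the same URL, but asking for HTML instead.
(define (pkg-html name)
  (define in (get-pure-port (pkg-url name) '("Accept: text/html")))
  (begin0 (port->string in) (close-input-port in)))
```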

This would also allow tools other than browsers to cache package data and interact with the HTML view of packages. A raco pkg user could open an installed package's catalog entry without needing to know anything other than the URL the package was installed from. Scribble could do the same (this came up in discussion of racket/racket#1797) to link package mentions in docs to their catalog entries. Additionally, raco pkg would be able to reuse the same caching headers exposed to AWS to locally cache package data.

tonyg commented 7 years ago

One requirement has been that a static render of the (human- and) machine-readable forms of the package information should be available whether the dynamic part of the site is running or not. If a pull-based model were used, we'd want some way of priming the cache (effectively pushing) and invalidating individual cached items on demand. The cache priming ensures the content is available even if the dynamic server subsequently goes away. The invalidation would be used when a package changed. We'd also want the cache to continue to serve "stale" content if it couldn't reach the upstream. (We may want to alter this requirement, but as it stands the code has been written to conform to it.)
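
As a rough sketch of what priming could look like under that requirement (in Racket; the URL shape and media type are assumed for illustration), a script could simply re-fetch each changed package through the cache so a fresh copy of each representation is stored before the dynamic server might become unavailable:

```racket
#lang racket
;; Rough sketch of "priming": after a package changes, fetch its entry
;; through the cache so a fresh copy of each representation is stored before
;; the dynamic server might go away. URL shape and media type are assumed.
(require net/url)

(define cache-host "https://pkgs.racket-lang.org")

(define (prime-package! name)
  (for ([accept (in-list '("text/html" "application/s-expression"))])
    (define in (get-pure-port (string->url (format "~a/pkg/~a" cache-host name))
                              (list (format "Accept: ~a" accept))))
    (void (port->string in))
    (close-input-port in)))
```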

URL structure is another question. The "/pkgn" prefix on the logged-in version of a package page is a mistake that is there for historical reasons. I messed up when we transferred to the new website. And the use of pkgo is again because of constraints during the transition to the new site. There is a pkgs version of the same information, http://pkgs.racket-lang.org/pkg/agile, and we would be using that but for #40. (Jack, is pkgo really what the raco pkg tools use?)

I don't know how we'd make it so that a single URL worked for both static-render and logged-in render if a cache was involved. It's been a while since I had to think about such issues. Jack, how would this work?

I'm neutral on conneg. It's cool in principle but sounds hairy in practice. Lots of HTTP clients out there are too basic to handle conneg well. Perhaps it'd be OK in this case since we have control over the really important HTTP clients that'd be involved.

Actually, I think there's an argument to be made that there are multiple resources here, not just multiple representations of one resource. The human-facing page could (and probably should) be viewed as logically distinct from the machine-facing information, and the editable view as distinct from the read-only view. I think I tentatively believe this argument. Consider a browser user wishing to see the machine-readable form of the information. How does she achieve this in a conneg, single-URL setup? (Relevant: https://wiki.whatwg.org/wiki/Why_not_conneg#Negotiating_by_format)

jackfirth commented 7 years ago

Oops, I hadn't noticed the http://pkgs.racket-lang.org/pkg/agile URL and assumed it was only available from pkgo. I'm sure the pkgs URL is what the raco pkg tools use.

Warning: this is long

TL;DR: I think CloudFront (and some other CDNs, and many caching reverse-proxy implementations we could run ourselves) can address the problems you mention (including auth), and I think content negotiation plus a small hack would bring benefits worth the tradeoffs.

For caching, I'd like to clarify my understanding of the package catalog's use cases. As far as I know, the primary clients of the catalog and tools integrating with it are:

Furthermore, the catalog's caching goals are:

The following are explicitly not goals of the catalog:

So the primary concern is providing redundancy, not scale, and saving bandwidth is more important than lowering latency. The rest of my thoughts on caching for the package catalog are based on the above assumptions, so please correct me if they're wrong. The current setup achieves a lot of this, but notably raco pkg does not benefit from caching, so CI builds fail if the package server is down, and there is no documented protocol for navigating from a package's machine-readable catalog entry to its HTML representation.

Proposed catalog API

In this proposed setup, the catalog hosts only a few resources:

By default, /pkg/{name} resources return Racket data to support raco pkg, which does not send an Accept header. All modern browsers send an Accept header indicating that they prefer text/html, so this doesn't mess with browsers. We could also switch the URL form to /pkgs/{name}, where we strictly enforce content negotiation, and let /pkg/{name} serve as a deprecated fallback for older clients.
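
As a sketch of that default rule only (not the actual site's dispatch code, and with real Accept parsing such as q-values deliberately ignored), the choice of representation might look something like this:

```racket
#lang racket
;; Simplified sketch of the proposed default: browsers advertise text/html in
;; their Accept header, while raco pkg sends no Accept header and therefore
;; falls through to the s-expression representation.
(define (representation-for accept-header) ; accept-header : (or/c string? #f)
  (cond
    [(and accept-header (regexp-match? #rx"text/html" accept-header)) 'html]
    [else 's-expression]))

;; (representation-for "text/html,application/xhtml+xml") ; => 'html
;; (representation-for #f)                                ; => 's-expression (raco pkg)
```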

Other resources like JavaScript, CSS, and per-user pages would also be provided, but I won't focus on those at the moment.

Caching and CDNs

Other benefits

About multiple resources and conneg

The main reason I'm in favor of content negotiation is that it is well standardized, gives us forward compatibility, and plays nicely with caching. The HTML and Racket data views of a package are not separate resources, because they logically refer to the same underlying piece of mutable state. This is why content negotiation works nicely with caching: it makes it easy for caches to automatically invalidate all representations of a resource when that resource is edited.
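
For illustration, the cache-facing headers for GET /pkg/{name} might look roughly like the following; the specific lifetimes are placeholders, and stale-if-error is an extension (RFC 5861) whose support varies from cache to cache:

```racket
#lang racket
;; Illustrative only: headers the catalog might attach to GET /pkg/{name}.
;; "Vary: Accept" tells shared caches to store the HTML and s-expression
;; variants separately, and an unsafe method (PUT/POST/DELETE) on the same
;; URL is what lets a cache drop every variant at once. The max-age value is
;; a placeholder; stale-if-error (RFC 5861) is one way to say "keep serving
;; stale copies during an outage", though cache support for it varies.
(define pkg-cache-headers
  (list "Vary: Accept"
        "Cache-Control: public, max-age=300, stale-if-error=86400"))
```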

Editable vs. read-only views have some issues when presented as separate resources, because whether a user can edit or view a resource is a dynamic property of many different factors that a user cannot always be aware of. For instance, a user might GET an editable view only to discover that, by the time they send a request to edit the resource, they no longer have access. Or a user might have access, but be either unwilling or unable to provide credentials in the form requested by the server, such as a server that only supports cookie-based auth and a client that does not implement a cookie store. Or clients may completely ignore any forms or editing controls and construct PUT requests in a machine-readable format, since the semantics of PUT say it always represents an update to a resource. To handle such cases, it's reasonable for a server to always provide editing controls in human-readable representations and rely on metadata on the controls, or on scripting, to only display them when a user can actually use them.

Things this doesn't handle well

This protocol does not address the package catalog proactively putting package data in CloudFront before it's requested, and it does not ensure that every package has a machine-readable representation in CloudFront's caches to prevent some subset of rarely-requested packages from being unavailable in the event of a catalog outage. I'm not sure if CloudFront has a way to function as a "backup", but it wouldn't be hard to configure Varnish or Squid to do this.

Caching the list of all packages doesn't scale well as the frequency of new packages being published increases. This could be addressed with cache channels, which Squid supports. We already have some of the infrastructure needed to implement cache channels via the Atom feed.

NoScript editing of a package is tricky because HTML does not support sending PUT or DELETE requests via forms or buttons. This could be done by tunneling the form submission over a POST request, but we'd lose the automatic cache invalidation that PUT and DELETE provide.
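
As a sketch of that tunneling approach (the hidden "_method" field name is an assumption for illustration, not an existing convention in this codebase):

```racket
#lang racket
;; Sketch of the usual "method override" workaround: an HTML form can only
;; GET or POST, so the form carries a hidden field (assumed here to be named
;; "_method") and the server treats the POST as the method named there.
(define (effective-method request-method override) ; override : (or/c string? #f)
  (if (and (equal? request-method "POST") override)
      (string-upcase override)
      request-method))

;; (effective-method "POST" "put")  ; => "PUT"
;; (effective-method "POST" #f)     ; => "POST"
;; (effective-method "GET" #f)      ; => "GET"
```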

Cookie-based auth requires that the catalog server maintain session state, and raco pkg probably shouldn't implement support for cookies. The Authorization header is a more standard and open (in the sense that intermediaries can understand it) way of specifying auth, but NoScript users can only use very basic auth protocols such as Basic or Digest, which have their own problems.