jackfirth opened this issue 7 years ago
One requirement has been that a static render of the (human- and) machine-readable forms of the package information should be available whether the dynamic part of the site is running or not. If a pull-based model were used, we'd want some way of priming the cache (effectively pushing) and invalidating individual cached items on demand. The cache priming ensures the content is available even if the dynamic server subsequently goes away. The invalidation would be used when a package changed. We'd also want the cache to continue to serve "stale" content if it couldn't reach the upstream. (We may want to alter this requirement, but as it stands the code has been written to conform to it.)
URL structure is another question. The "/pkgn" prefix on the logged-in version of a package page is a mistake that is there for historical reasons: I messed up when we transferred to the new website. The use of `pkgo` is again because of constraints during the transition to the new site. There is a `pkgs` version of the same information, http://pkgs.racket-lang.org/pkg/agile, and we would be using that but for #40. (Jack, is `pkgo` really what the `raco pkg` tools use?)
I don't know how we'd make a single URL work for both the static render and the logged-in render if a cache were involved. It's been a while since I had to think about such issues. Jack, how would this work?
I'm neutral on conneg. It's cool in principle but sounds hairy in practice. Lots of HTTP clients out there are too basic to handle conneg well. Perhaps it'd be OK in this case since we have control over the really important HTTP clients that'd be involved.
Actually, I think there's an argument to be made that there are multiple resources here, not just multiple representations of one resource. The human-facing page could (and probably should) be viewed as logically distinct from the machine-facing information, and the editable view as distinct from the read-only view. I think I tentatively believe this argument. Consider a browser user wishing to see the machine-readable form of the information. How does she achieve this in a conneg, single-URL setup? (Relevant: https://wiki.whatwg.org/wiki/Why_not_conneg#Negotiating_by_format)
Oops, I hadn't noticed the http://pkgs.racket-lang.org/pkg/agile URL and assumed it was only available from `pkgo`. I'm sure the `pkgs` URL is what the `raco pkg` tools use.
TL;DR: I think CloudFront (and some other CDNs, and many caching reverse-proxy implementations we could run ourselves) can address the problems you mention (including auth), and I think content negotiation plus a small hack would bring benefits worth the tradeoffs.
For caching, I'd like to clarify my understanding of the package catalog's use cases. As far as I know, the primary clients of the catalog and tools integrating with it are:
- `raco pkg` commands, which want to read catalog information to install packages (but not publish or edit them, currently)
- docs.racket-lang.org, which may want to search the documentation of all packages in the catalog and use package metadata like description, authors, and tags.

Furthermore, the catalog's caching goals are:

The following are explicitly not goals of the catalog:

So the primary concern is providing redundancy, not scale, and saving bandwidth is more important than lowering latency. The rest of my thoughts on caching for the package catalog are based on the above assumptions, so please correct me if they're wrong. The current setup achieves a lot of this, but notably `raco pkg` does not benefit from caching, causing CI builds to fail if the package server is down, and there is no documented protocol for navigating from a package's Racket representation in the catalog to its HTML representation.
In this proposed setup, the catalog hosts only a few resources:
- `/`. This resource is cacheable, only allows `GET` or `HEAD` requests, and only provides `text/html` content.
- A `/login` resource that can be `POST`-ed to with credentials and returns an empty success response (204 No Content) that sets an auth cookie. This is not cacheable. Sending credentials could be done by Javascript with the `Authorization` header (supporting only the `Basic` auth scheme for now) or by a body with the `application/x-www-form-urlencoded` type to allow NoScript users to log in with a normal HTML form.
- A `/pkgs` resource that provides a searchable list of links to all packages in the catalog. The complete list is cacheable and supports `GET` and `HEAD` requests. Supported formats are `text/html` and some Racket-specific content type like `application/vnd.racket.catalog`. `POST` requests of type `application/vnd.racket.catalog`, `application/x-www-form-urlencoded`, and possibly `application/json` allow creating new packages and return `201 Created` responses with a `Location` header identifying the URL of the new package. Searching can be done client side with Javascript. If the list of packages gets too large, we can move search server-side into a group of `/pkgs?search={terms}` resources and use pagination or range requests to limit the number of packages returned.
- `/pkg/{name}` resources that provide `text/html` and `application/vnd.racket.catalog` representations of a package, both cacheable. `PUT` requests edit the package and `DELETE` requests delete it. Markup for editing and deleting the package can be hidden by default and revealed by Javascript that looks for the presence of an auth cookie. Packages should probably also allow `application/json` when viewing or editing the package to make browser integration easier. NoScript is tricky here and would require some tweaks.
- `/pkg/{name}.rkt` resources that only support Racket data and only allow `GET`, to let browser users (browsers always send a wildcard `*/*` option in `Accept`) view Racket data without messing with headers. This is a bit of a hack to make browser users' lives easier, but it doesn't compromise caching of the `/pkg/{name}` resources. These resources should indicate that they are not cacheable under any circumstances, because invalidating a `/pkg/agile` resource with a `PUT` request will not instruct caches to invalidate the corresponding `/pkg/agile.rkt` resource. There is a proposed extension to HTTP caching that would help here if we really wanted to make these resources cacheable.

By default, `/pkg/{name}` resources return Racket data to support `raco pkg`, which does not send an `Accept` header. All modern browsers send an `Accept` header indicating that they prefer `text/html`, so this doesn't mess with browsers. We could also switch the URL form to `/pkgs/{name}`, where we strictly enforce content negotiation, and let `/pkg/{name}` serve as a deprecated fallback for older clients.
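To make the content-negotiation part concrete, here's a rough sketch (not actual catalog code) of how a `/pkg/{name}` handler could pick a representation based on the `Accept` header using Racket's web-server. The in-memory catalog, URL layout, and the `application/vnd.racket.catalog` type are placeholders from this proposal:

```racket
#lang racket
;; Sketch only: the catalog contents and media type are placeholders,
;; not the real package server's implementation.
(require net/url
         web-server/http
         web-server/servlet-env)

(define fake-catalog
  (hash "agile" (hash 'name "agile" 'description "An example package")))

;; Browsers ask for text/html in their Accept header; raco pkg sends no
;; Accept header at all, so "no preference for HTML" means Racket data.
(define (wants-html? req)
  (define accept (headers-assq* #"accept" (request-headers/raw req)))
  (and accept (regexp-match? #rx#"text/html" (header-value accept))))

(define (pkg-response req name)
  (define pkg (hash-ref fake-catalog name #f))
  (cond
    [(not pkg)
     (response/full 404 #"Not Found" (current-seconds) #f '() '())]
    [(wants-html? req)
     (response/xexpr
      #:headers (list (header #"Vary" #"Accept"))
      `(html (body (h1 ,name) (p ,(hash-ref pkg 'description)))))]
    [else
     (response/output
      #:mime-type #"application/vnd.racket.catalog"
      #:headers (list (header #"Vary" #"Accept"))
      (lambda (out) (write pkg out)))]))

;; Route /pkg/{name}; everything else is a 404.
(define (start req)
  (match (map path/param-path (url-path (request-uri req)))
    [(list "pkg" name) (pkg-response req name)]
    [_ (response/full 404 #"Not Found" (current-seconds) #f '() '())]))

(module+ main
  (serve/servlet start #:servlet-regexp #rx"" #:port 8080 #:launch-browser? #f))
```

A client that sends no `Accept` header (as `raco pkg` does today) falls through to the Racket-data branch, which is the backwards-compatible default described above.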
Other resources like Javascript, CSS, and per-user pages would also be provided, but I won't focus on those at the moment.
- Package resources are served with a `Cache-Control` header with a low `maxage`, to reduce how stale content can be. The front page and search page could also do this, although they might benefit from different settings.
- The `Cache-Control` header includes a `stale-if-error` extension with a value of several days. This is a standardized caching extension that instructs caches that they may serve stale content for the specified duration if the origin server is unreachable for any reason. Unfortunately, CloudFront doesn't support this extension, but it does let you manually configure equivalent behavior in its administration settings. With this extension we could also allow `raco pkg` to cache catalog data locally, and CI users could persist that cache between runs. There's also a similar `stale-while-revalidate` extension that `raco pkg` could use if we wanted it to.
- `PUT` and `DELETE` requests are mandated by the HTTP 1.1 spec to invalidate caches for the resource they act on. It's unclear to me whether CloudFront respects this part of the spec, but many browsers do. CloudFront offers a custom protocol to forcibly invalidate cache entries, so if we really wanted to we could write code that invalidates CloudFront whenever a package is edited. I'd rather file a bug with CloudFront or use a different CDN though.
- Requests sent with `Cache-Control: no-cache` cause caches to revalidate. CloudFront respects `Cache-Control: no-cache`, and most browsers offer a way to perform a hard refresh that sends requests with `Cache-Control: no-cache`.
- Responses include an `ETag` header. This allows caches (including CloudFront) to send conditional requests that essentially say "GET me this package unless it hasn't changed since I last looked at it; here's the ETag from when I looked". This can save bandwidth costs, especially for very large resources that change far less often than they're accessed, such as scripts or stylesheets. Different formats should have the same ETag if they represent the same underlying data.
- Responses include a `Vary: Accept` header to indicate that a cached `text/html` response cannot be used for a request with `Accept: application/vnd.racket.catalog`. Unfortunately this means some cached responses can't be shared by non-Javascript requests from different browsers, since they use different default values for the `Accept` header. Mozilla has a list of the default `Accept` values of different browsers. A proposed header named `Key` would be a possible way to only cache each format once in spite of many different `Accept` header values, but the standard isn't finalized and I'm not sure what cache implementations support it.
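As a sanity check on how those headers fit together, here's a hedged sketch of a `/pkg/{name}` response carrying `Cache-Control` (with `stale-if-error`), an `ETag`, and `Vary: Accept`, and answering a conditional request with `304 Not Modified`. The serializer and the specific max-age values are made up for illustration:

```racket
#lang racket
;; Sketch only: package->catalog-bytes and the maxage/stale-if-error
;; values are placeholders, not part of the real catalog code.
(require web-server/http
         file/sha1)

(define (package->catalog-bytes pkg)
  (string->bytes/utf-8 (format "~s" pkg)))

(define (catalog-response req pkg)
  (define body (package->catalog-bytes pkg))
  ;; A strong ETag derived from the serialized entry.
  (define etag
    (bytes-append #"\"" (string->bytes/utf-8 (sha1 (open-input-bytes body))) #"\""))
  (define caching-headers
    (list (header #"Cache-Control" #"max-age=60, stale-if-error=604800")
          (header #"ETag" etag)
          (header #"Vary" #"Accept")))
  (define if-none-match
    (headers-assq* #"if-none-match" (request-headers/raw req)))
  (if (and if-none-match (equal? (header-value if-none-match) etag))
      ;; The client (or CloudFront) already has this version: no body needed.
      (response/full 304 #"Not Modified" (current-seconds) #f caching-headers '())
      (response/full 200 #"OK" (current-seconds)
                     #"application/vnd.racket.catalog"
                     caching-headers
                     (list body))))
```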
- Caching behavior is controlled entirely through standard `Cache-Control` values, instead of messing with some AWS console somewhere.
- `raco pkg` tools can get the same caching benefits as browsers. Presuming there are no hiccups with `Vary: Accept`, `raco pkg install` will still be able to function even if the package catalog is down, without any code changes to `raco pkg`. Future changes to `raco pkg` can transparently cache catalog metadata using the same cache that `raco pkg` already uses for downloaded package sources. Travis CI users who cache the directory used by `raco pkg` would automatically be able to eliminate the multiple round trips made by `raco pkg` to the catalog, and possibly eliminate all calls made by `raco pkg` in builds where the package's dependencies are unchanged.
- Packages can be viewed and edited from a browser without the `raco` command, and supporting NoScript users is possible too.

The main reason I'm in favor of content negotiation is that it is well standardized, gives us forward compatibility, and plays nicely with caching. The HTML and Racket data views of a package are not separate resources, because they logically refer to the same underlying piece of mutable state. This is why content negotiation works nicely with caching: it makes it easy for caches to automatically invalidate all representations of the resource that has been edited.
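On the client side, explicitly asking for the Racket representation is just a matter of setting the `Accept` header. Here's a rough sketch with `net/http-client`; the media type is the one proposed above, and nothing here reflects existing `raco pkg` behavior:

```racket
#lang racket
;; Sketch of a client (something raco pkg could hypothetically do) asking
;; the proposed single package URL for its Racket representation.
(require net/http-client)

(define-values (status headers body-port)
  (http-sendrecv "pkgs.racket-lang.org" "/pkg/agile"
                 #:ssl? #t
                 #:method #"GET"
                 #:headers (list "Accept: application/vnd.racket.catalog")))

(displayln status)   ; e.g. #"HTTP/1.1 200 OK"
(read body-port)     ; the package entry as Racket data
```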
Editable vs. read-only views have some issues when presented as separate resources, because whether a user can edit or view a resource is a dynamic property of many different factors that the user cannot always be aware of. For instance, a user might `GET` an editable view only to discover that, by the time they send a request to edit the resource, they no longer have access. Or a user might have access, but be either unwilling or unable to provide credentials in the form requested by the server, such as a server that only supports cookie-based auth and a client that does not implement a cookie store. Or clients may completely ignore any forms or editing controls and construct `PUT` requests with a machine-readable format, since the semantics of `PUT` say it always represents an update to a resource. To handle such cases, it's reasonable for a server to always provide editing controls in human-readable representations and rely on metadata on the controls, or on scripting, to only display them when a user can actually use them.
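For the "hide the controls unless the user can actually use them" approach, a sketch of what the markup could look like; the class name, cookie name, and form fields are made up, not the catalog's real templates:

```racket
#lang racket
;; Assumed markup: editing controls ship hidden in every HTML response and
;; a small script reveals them when an "auth" cookie is present.
(define (package-page name description)
  `(html
    (body
     (h1 ,name)
     (p ,description)
     (form ([class "edit-controls"] [style "display: none"]
            [method "post"] [action ,(format "/pkg/~a" name)])
           (input ([type "text"] [name "description"] [value ,description]))
           (button "Save"))
     (script
      "if (document.cookie.includes('auth=')) {"
      "  document.querySelector('.edit-controls').style.display = '';"
      "}"))))

(package-page "agile" "An example package")
```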
This protocol does not address the package catalog proactively putting package data in CloudFront before it's requested, and it does not ensure that every package has a machine-readable representation in CloudFront's caches to prevent some subset of rarely-requested packages from being unavailable in the event of a catalog outage. I'm not sure if CloudFront has a way to function as a "backup", but it wouldn't be hard to configure Varnish or Squid to do this.
Caching the list of all packages doesn't scale well as the frequency of new packages being published increases. This could be addressed with cache channels, which Squid supports. We already have some of the infrastructure needed to implement cache channels via the Atom feed.
NoScript editing of a package is tricky because HTML does not support sending PUT or DELETE requests via forms or buttons. This could be done by tunneling the form submission over a POST request, but we'd lose the automatic cache invalidation that PUT and DELETE provide.
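If we did want NoScript editing anyway, one common workaround is a Rails-style hidden `_method` field; this is a sketch of that idea, not something the catalog supports, and as noted the tunneled requests would not get the automatic cache invalidation that real `PUT`/`DELETE` requests do:

```racket
#lang racket
;; Sketch, assuming a hidden "_method" form field so NoScript users can
;; tunnel PUT/DELETE over POST.
(require web-server/http)

(define (effective-method req)
  (define override
    (and (equal? (request-method req) #"POST")
         (bindings-assq #"_method" (request-bindings/raw req))))
  (if (and override (binding:form? override))
      (string-upcase (bytes->string/utf-8 (binding:form-value override)))
      (string-upcase (bytes->string/utf-8 (request-method req)))))

;; A handler would branch on (effective-method req) -- "PUT", "DELETE",
;; etc. -- but would also have to invalidate caches itself, since the
;; request that actually travelled over the wire was a POST.
```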
Cookie-based auth requires that the catalog server maintain session state, and `raco pkg` probably shouldn't implement support for cookies. The `Authorization` header is a more standard and open (in the sense that intermediaries can understand it) way of specifying auth, but NoScript users can only use very basic auth schemes such as `Basic` or `Digest`, which have their own problems.
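For reference, the `Basic` scheme that NoScript users would be limited to is just a base64-encoded `user:password` pair in the `Authorization` header; a tiny sketch with made-up credentials:

```racket
#lang racket
;; What a Basic Authorization header looks like; credentials are made up.
(require net/base64)

(define (basic-auth-header user pass)
  (string-append "Authorization: Basic "
                 (bytes->string/utf-8
                  (base64-encode (string->bytes/utf-8 (format "~a:~a" user pass)) #""))))

(basic-auth-header "alice" "secret")
;; => "Authorization: Basic YWxpY2U6c2VjcmV0"
```

Since the encoding is trivially reversible, `Basic` is only as secure as the TLS connection carrying it, which is part of why these schemes have their own problems.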
Right now, for the package `agile` (chosen arbitrarily), the following URLs are used to access information about the package:

- http://pkgs.racket-lang.org/package/agile - S3-backed, view-only HTML of `agile` package info
- http://pkgd.racket-lang.org/pkgn/package/agile - Editable HTML of the `agile` package, not served by S3
- http://pkgo.racket-lang.org/pkg/agile / http://pkgs.racket-lang.org/pkg/agile - Raw Racket data of the `agile` package, not served by S3 and used by the `raco pkg` tools

Additionally, every so often (not 100% sure on the exact timing) the package server publishes the HTML of `pkgd` in S3 so it is served by `pkgs`. This is a push-based model: S3 acts as an independent copy of the catalog that the Racket server is responsible for updating as needed. I think it would be simpler and easier for all systems involved if the catalog used a pull-based model with AWS CloudFront, content negotiation, and appropriate use of caching headers.

In this proposed setup, there would be one URL per package that served both HTML and s-exp data using the `Accept` and `Content-Type` headers, and CloudFront would act as a reverse proxy serving cached copies of package data based on the origin server's `Cache-Control` headers. This would remove the need for three different URLs for the same server, as CloudFront could serve cached GET requests while forwarding POST, PUT, and DELETE requests. CloudFront can serve GETs even if the origin server is down, so we'd still be able to offer a read-only view of the catalog during an outage.

This would also allow tools other than browsers to cache package data and interact with the HTML view of packages. A `raco pkg` user could open an installed package's catalog entry without needing to know anything other than the URL the package was installed from. Scribble could do the same (this came up in discussion of racket/racket#1797) to link package mentions in docs to their catalog entries. Additionally, `raco pkg` would be able to reuse the same caching headers exposed to AWS to locally cache package data.