Possible race condition with asset serving during deployments

robertknight commented 3 years ago

When a new deployment happens we do a rolling update with an additional batch. I need to confirm but if it is possible that during a deployment a page load can be serviced by both the old and new versions of the app then we could have a situation where an old asset gets cached under a new URL. This might explain an issue seen by a user where we deployed a bug fix but the user still didn't see the fix some hours later.

The problem scenario is:

LMS page request hits instance running new version of app. This serves up a response with new asset URLs (ie. that have the latest cache-busting query string at the end)
Browser requests asset with new URL. Cloudflare connects to LMS app and hits instance running old version of app. Since the app ignores the query string when serving the asset, it might still serve the request even if the query string is wrong.
Cloudflare edge location caches the old version of the asset under the new URL

To re-iterate, the main condition we need to check here is whether the current deployment process makes it possible that during a single page load requests may go to both the old and new versions of the app during the deployment window.

If this problem does exist then it could affect other apps too. Usage is way up over past years, so it is possible that the issue has long existed but simply been rare.

robertknight commented 3 years ago

A change we might want to make regardless of whether our current deployment setup has this problem is to actually check the query string in asset requests and serve up an error.

robertknight commented 3 years ago

I've started some initial research on this. Some notes on options for avoiding mismatches between the version of the frontend application and the version of the backend:

Our Elastic Beanstalk environments are configured to use an AWS Application Load Balancer to route requests to instances. ALB has support for sticky sessions, which is one way that we could ensure that requests from the frontend to backend after the initial page load always hit the same version of the application: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/sticky-sessions.html. Sticky sessions rely on cookies and adds some complexity to reasoning about how requests are handled.

A less general approach would be to make all versions of assets accessible to all versions of the application, by versioning assets and deploying them separately from the application, eg. to an S3 bucket fronted by a CDN. This would solve the issue here. However it wouldn't solve the problem where client application version X+1 sends an API request to the backend which reaches application version X that doesn't have the backend code for application version X or has an incompatible version. Such API requests are fortunately not cached by Cloudflare, which is the biggest issue here. As a result we might be able to mange the issue with smart deployment practices (eg. not deploying backwards-incompatible API changes in the same release as the frontend change), similar to how we handle DB schema migrations. For API requests we can control the request headers easily, so if there is a way to control which app version handles a request based on a header, we could perhaps use that.

robertknight commented 3 years ago

Not sure if useful yet, but it is also now possible to do header-based traffic routing: https://aws.amazon.com/blogs/aws/new-advanced-request-routing-for-aws-application-load-balancers/

robertknight commented 3 years ago

The asset serving code in h and lms has been extracted into an h-assets package and a fix for this issue has been implemented there to check the cache-busting query string in the asset request URL against the expected value. If they do not match the request returns a 404 which should either not be cached downstream or be cached for a much shorter period of time.

See https://github.com/hypothesis/lms/pull/2919 and https://github.com/hypothesis/h/pull/6752.

These PRs should prevent the problem the user experienced, although it still leaves the possibility of a failing subresource request for a brief period during a deployment. The best way to fix that will probably be to deploy assets to a CDN and serve from there. That could be implemented in future with an extension to h-assets (eg. some kind of switch in the asset environment that causes it to return CDN URLs instead of local URLs).

robertknight commented 3 years ago

The h and lms apps have now been converted to use the h-assets package for asset serving. This will now check the cache-busting query string, if present, in asset URL requests and return a 404 if it is incorrect. Requests without a cache-buster will currently always succeed. This leniency was needed as there are some URLs (eg. sourcemaps and fonts) which are currently missing the cache-buster.

To summarize the comments above, this will prevent a problem where Cloudflare could end up caching old versions of assets under the URLs for new versions for an extended period after a deployment. There is still the possibility that a request at just the wrong time during a deployment could fail to load some assets on a page if the HTML request hits a new version of the app and the asset request then gets handled by an old version of the app. If this happens reloading the page should fix the issue.

We could choose to close this issue for now on the basis that the main problem, the one which can last for some time after a deployment, is fixed. There is still an issue that can happen for a brief time during a deployment however.

robertknight commented 3 years ago

There is still the possibility that a request at just the wrong time during a deployment could fail to load some assets on a page if the HTML request hits a new version of the app and the asset request then gets handled by an old version of the app. If this happens reloading the page should fix the issue.

@lyzadanger encountered an issue after this change went live where a 404 response to an asset request seemingly got cached in her browser for an extended period of time. See https://hypothes-is.slack.com/archives/C1M8NH76X/p1628253730053300.

Making an asset request with an invalid cache buster locally I see that a cache-control: max-age=14400 header is set, which translates to 4 hours.

[I]  ~/h/client > curl -i 'https://lms.hypothes.is/assets/styles/frontend_apps.css?zzz'
HTTP/2 404
date: Wed, 18 Aug 2021 10:27:24 GMT
content-type: text/html; charset=UTF-8
cache-control: max-age=14400
cf-cache-status: MISS
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
strict-transport-security: max-age=15552000; includeSubDomains; preload
x-content-type-options: nosniff
server: cloudflare
cf-ray: 680a730d196b424a-LHR
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

If I re-test the request after a few minutes, I see that the Cloudflare cache has expired, but a browser respecting the cache-control header would continue to serve the stale asset for much longer:

[I]  ~/h/client > curl -i 'https://lms.hypothes.is/assets/styles/frontend_apps.css?zzz'
HTTP/2 404
date: Wed, 18 Aug 2021 10:32:36 GMT
content-type: text/html; charset=UTF-8
cache-control: max-age=14400
cf-cache-status: EXPIRED
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
strict-transport-security: max-age=15552000; includeSubDomains; preload
x-content-type-options: nosniff
server: cloudflare
cf-ray: 680a7aaa8b5106d5-LHR
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

Update: After some more testing it seems that Cloudflare is caching 404-ing asset requests for exactly 3 minutes, which matches some forum comments I found.

robertknight commented 3 years ago

The upstream server is not setting any caching headers on the response, so the cache-control is definitely coming from Cloudflare:

[I]  ~/h/client > curl -isk 'https://lms-prod.4mngmq4cs2.us-west-1.elasticbeanstalk.com/assets/styles/frontend_apps.css?zzz'
HTTP/2 404
date: Wed, 18 Aug 2021 10:42:21 GMT
content-type: text/html; charset=UTF-8
content-length: 25073
server: nginx/1.20.0

The 4 hour time is configured in Cloudflare's Browser Cache TTL setting (see https://developers.cloudflare.com/cache/about/edge-browser-cache-ttl#browser-cache-ttl). Even if the origin sets a shorter time, that is the value that gets used. It seems like this applies regardless of response status code.

The behavior we want is for asset requests that include a cache buster and return a 2xx response to be cached for a long period of time, since these are effectively immutable. If the request does not include a cache buster or returns a 4xx response then the caching should only apply for a much shorter period of time. Cloudflare's docs say that longer cache durations from the origin server are respected, but anything shorter than the Browser Cache TTL setting gets overridden.

robertknight commented 3 years ago

For the h and lms apps that constitute most of our traffic, we should be able to set suitable browser-level caching in h-assets. I'm not sure whether there are other applications that benefit a lot from Cloudflare's Browser Cache TTL settings. It might make sense as a first step to:

Change h-assets to add caching headers for requests with valid cache busters, so that we're not reliant on downstream to do this
Set up Page Rules in Cloudflare to set the Browser Cache TTL setting to respect upstream headers for h and lms.

hypothesis / lms

Possible race condition with asset serving during deployments #2799