nodejs / build

Better build and test infra for Node.

Caching issues breaking fetching binaries #3224

Open domdomegg opened 1 year ago

domdomegg commented 1 year ago

When a new Node version is released, the list of versions at https://nodejs.org/dist/index.json is updated and the binaries are uploaded to https://nodejs.org/dist/vVERSION (e.g. https://nodejs.org/dist/v19.8.1/).

However, the website sits behind a Cloudflare cache and sometimes the data seems out of sync: the API responds saying there is a version, but the binaries are not available.

This then causes tools that try to download the latest version using this information to fail, e.g. https://github.com/actions/setup-node/issues/717
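To make the failure mode concrete, here's a minimal sketch (not actions/setup-node's actual code) of how such a tool might resolve and fetch the latest release, assuming curl and jq are available:

```bash
# Resolve the newest release from index.json (entries are newest-first).
latest=$(curl -fsSL https://nodejs.org/dist/index.json | jq -r '.[0].version')

# If index.json is fresh in the cache but the binary URL is stale or missing,
# this download can fail with a 404 even though the version "exists".
curl -fsSLO "https://nodejs.org/dist/${latest}/node-${latest}-linux-x64.tar.gz"
```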

I suspect there might be a few things contributing to this:

  1. We should set Cache-Control: no-cache headers on 404 pages. Otherwise, caching proxies in the middle may fall back to default values, resulting in stale cached data. I think this might be what's happening inside the GitHub Actions environment, but I'm not sure.
  2. We should ensure things are being purged correctly: having tried a few links from my own computer, with no particular proxy cache between me and Cloudflare, it seems a few URLs (e.g. https://nodejs.org/dist/v19.8.1/win-x64/node.exe) took at least a few hours to update after the binaries should have been there.
  3. (less important, but would be nice) We should ensure the CDN is able to serve the binaries before we update index.json. Provided the other items are taken care of, this prevents the issue in the window where Cloudflare has invalidated index.json but not yet the binary URLs (a sketch of such a check follows this list).
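For point 3, a rough sketch of what a pre-publish check could look like in a releaser script; the version and file list are illustrative, not the actual promotion tooling:

```bash
version="v19.8.1"  # illustrative; the releaser would substitute the new version
for file in "node-${version}-linux-x64.tar.gz" "win-x64/node.exe" "SHASUMS256.txt"; do
  url="https://nodejs.org/dist/${version}/${file}"
  # HEAD request only; -f makes curl exit non-zero on a 404
  curl -fsI "$url" > /dev/null || { echo "not yet served: $url" >&2; exit 1; }
done
echo "all artifacts reachable; safe to publish index.json"
```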
rvagg commented 1 year ago

index.* and the release files are "promoted" at the same time and a cache flush is triggered afterward.

I think what you're seeing is a flapping load balancer, which continues to happen and really needs to be diagnosed and fixed (I've been suspecting our main server has I/O issues, triggered by rsync and perhaps other things, so it gets overloaded and can't respond in time to health checks). When that happens, CF switches from our primary server to the backup server, but the backup server doesn't strictly stay in sync with the main server; that only happens periodically (I don't recall the frequency, probably something like every 15 minutes). So what I think you're experiencing is this:

  1. A new Node.js release (or some other event causing a traffic spike to the backend, which should only be receiving requests from CF edge nodes, btw) triggers a flurry of download attempts
  2. The primary backend has one of its pathetic panic attacks (it's not a tiny server; we used to serve all downloads directly from it, bypassing the CDN, until a couple of years ago)
  3. CF's LB flaps and switches to our backup server
  4. Backup server doesn't yet have new release files, so it's a legitimate 404
  5. Perhaps even when there's a recovery there might be a sticky 404 recorded by CF? I would hope not, but we don't trigger a cache flush on our end when this occurs, so it might be messy if they assume both our LB backends have the same content.

Even if we fix the LB flapping, it's probably not good to have them out of sync like this. I've thought for a long time that asset promotion by releasers should also have some step that causes a sync to the secondary server.

Maybe the easiest and most sensible thing to do here (regardless of the LB issues) would be to cause a sync to the secondary server before we request that CF purge their cache. Here would be the place to do that: https://github.com/nodejs/build/blob/main/ansible/www-standalone/resources/scripts/cdn-purge.sh.j2 - this script runs privileged, so prior to the curl it could call out to the rsync that gets the backend in sync. I just don't recall whether the rsync is a push from the primary or a pull from the secondary, so it might not be as simple as calling another script.
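A rough sketch of the suggested ordering (the hostname, paths, rsync direction, and variable names are assumptions; the real script is an Ansible template and its purge call may differ):

```bash
# 1. Bring the secondary backend in sync before invalidating the CDN cache.
#    Assumes a push from the primary; backup.nodejs.org and the path are guesses.
rsync -a --delete /home/dist/ backup.nodejs.org:/home/dist/

# 2. Only then ask Cloudflare to purge, so a flapped-over LB serves fresh files.
#    ZONE_ID and CF_API_TOKEN are placeholder variable names.
curl -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"purge_everything":true}'
```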

targos commented 1 year ago

https://github.com/nodejs/build/pull/3226 added Cache-Control: no-cache to 404 responses. Without it, CF was keeping the response in cache for 3 minutes: https://developers.cloudflare.com/cache/how-to/configure-cache-status-code/#edge-ttl
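One way to spot-check the new behavior from outside (the version path is deliberately nonexistent so it 404s):

```bash
# Expect a 404 whose Cache-Control header is no-cache, plus CF's cache status.
curl -sI https://nodejs.org/dist/v999.99.99/ | grep -iE '^(HTTP|cache-control|cf-cache-status)'
```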

richardlau commented 1 year ago

I found an issue where one of the index.html files no longer exists after the switch to Next.js, which meant our check-build-site.sh script was triggering a website rebuild every five minutes; that cannot have been helping. https://github.com/nodejs/build/pull/3230
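A hypothetical reduction of that failure mode (the path and rebuild step are illustrative, not the script's real contents):

```bash
# If the existence check points at a file the Next.js build no longer emits,
# it fails on every run and the five-minute cron rebuilds the site each time.
if [ ! -e /var/www/nodejs/index.html ]; then
  echo "site looks broken, triggering rebuild" >&2
  # ...rebuild invocation elided...
fi
```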

github-actions[bot] commented 6 months ago

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

domdomegg commented 6 months ago

Commenting to avoid stale closure. I think we've mitigated this somewhat, but this is still an issue.