nodejs / build

Better build and test infra for Node.

retrospective on download outage #2264

Closed Trott closed 3 years ago

Trott commented 4 years ago

For picking up once the current issue is addressed and behind us, a retrospective to figure out how to make it better next time would be a good thing, I think.

Refs: https://github.com/nodejs/node/issues/32683

MattIPv4 commented 4 years ago

An official status page for Node.js would likely be helpful in situations like this, for a couple of reasons.

Created #2265 to look in more detail at the process of creating a status page.

Trott commented 4 years ago

No one has been added to @nodejs/build-infra in literally years. 5 years or something like that? We need to break that choke-point. It's OK if there are only 4 people in the group, but it's got to be 4 highly-active people. Right now, my guess is that we have 1 highly-active person, 2 moderately-active people, and 1 effectively-inactive person.

I would like us to think about who we might trust with this level of access. I'm surprised we haven't added @sam-github to the list, to be honest.

Trott commented 4 years ago

This, of course, also raises the question of whether we are taking a risky approach by relying entirely on volunteers. Assuming the Foundation would be willing to pay to have someone on staff doing the most critical work, should we consider that? (We'll have to carefully define the most critical work.)

Trott commented 4 years ago

On-call schedule? No, wait, hear me out! The idea is that we may not trust (for example) me with the keys to the kingdom, but maybe we trust me with the cell phone numbers of the people who do have those keys. And we trust me to not abuse this and to only notify people in a real crisis. Maybe we can have someone on-call to figure out who is in a not-terrible time zone and try to get them on the phone if we have an outage like this one.

I'm not sold on this being a great idea, but I'm trying to get all the ideas out anyway.

AshCripps commented 4 years ago

I think the crux of the issue with build-infra is coverage: we have 1 US, 2 Europe and 1 AUS, which gives us decent coverage of time zones - but the European members are often harder to contact. Adding @sam-github would be a good start but still leaves a big void in the middle of the time zones. Obviously this is tricky due to the security concerns around the level of access this role gives.

MattIPv4 commented 4 years ago

Having a proper system for an emergency contact makes sense, either through an on-call system of trusted volunteers (build-infra) or a paid position that comes with the responsibility of being okay with being woken up in the middle of the night if the world falls apart.

sam-github commented 4 years ago

@MattIPv4, it may not be clear from the project activity, https://github.com/nodejs/build/graphs/contributors?from=2020-01-08&to=2020-04-06&type=c (is there an equivalent view for comments on issues and PRs? I wish there was, but it wouldn't substantially change that view), but the top 3 folks at that link are developers who primarily maintain Jenkins and CI, and the 4th (Rod) is an expert, but only 1 person.

We don't lack a system for making emergency contacts, IMO; we lack people to contact. There has been talk of getting paid infra support in, and this might change that conversation.

Cloudflare experts are few and far between in the Node.js project, or at least the number of them volunteering to join the build-wg is approximately zero.

MattIPv4 commented 4 years ago

Totally makes sense. I intended my comment to imply that if an on-call system is used, part of that is ensuring there are folks on the other end to call who can actually help.

MattIPv4 commented 4 years ago

I should also say that I’m happy to help in any way I can (eg. getting #2265 going), though I would say I’m not an expert on infra.

More than happy to help connect y’all with folks at DO or Cf if you think there’s anything specific we can help with. :)

rvagg commented 4 years ago

Talking about choke points is fine as long as you can prove that something would be different if we added more people, and in this case nothing would be different. Even if I was awake in the middle of the night watching this play out I have no idea what I would have done to "rectify" it. At the moment it looks like a Cloudflare problem that's out of our control.

I know the feeling of impotence is frustrating, but just splashing out more access to more people doesn't necessarily improve the situation. We had this same conversation when we had a disk-full problem ("I can log in to nodejs.org but can't do X"), but even if more people had access, it's not clear that those people would know what to do to fix the problem.

There's (1) access, but there's also (2) knowledge of the systems involved and (3) understanding of how it all fits together and how to not break things even further. There's a lot to break and there's a lot at stake.

Let's not try to make this look simpler than it is; you're just going to make it harder for those of us who carry the bulk of the burden, because the blame gets focused like a laser -- "but it's so simple, you're at fault because you [haven't done X | did Y]".

rvagg commented 4 years ago

My assessment so far is written up @ https://github.com/nodejs/build/issues/2263#issuecomment-610144865

Reproducing the most interesting part of that:

(Image: 20200407_cf_cached_percent_1m_samples)

What can we do better next time? 🀷‍♂️ That graph looks to me like it's mostly out of our control and this is a CF hiccup that they haven't acknowledged on their status page. I'll do some more digging but I'm exhausting all but my most outlandish theories.

This does bring into focus our edge cache invalidation process which hasn't evolved since we've pushed to full CF caching of our downloads. There's a lot of room for improvement there and options for replacing the process with an entirely better setup. It'll just take very careful work.

Trott commented 4 years ago

Talking about choke points is fine as long as you can prove that something would be different if we added more people, and in this case nothing would be different. Even if I was awake in the middle of the night watching this play out I have no idea what I would have done to "rectify" it. At the moment it looks like a Cloudflare problem that's out of our control.

I wouldn't say "nothing would be different". Having someone authoritative saying things every few hours along the lines of "We're aware of the problem and actively looking into it" and/or "Looks like a Cloudflare problem that's out of our control, but we'll provide more information when we have it" is a huge help, even if the outage lasts exactly as long as it would otherwise. These things are squishy, but I'd say that the actual outage was only 20% of the problem. A squishy 80% of the problem was not being able to communicate anything meaningful to end users. That looks bad, increases anxiety, and prompts FUD in places like reddit and Twitter.

I appreciate what you're saying, which (if I'm understanding correctly) is that finding qualified people and (perhaps especially) taking the necessary time and effort to do knowledge transfer is vastly harder than my original comments imply. And then, of course, there's also the general issue of trust. It's not enough for someone to be qualified and trained. They need to be trusted as well.

We've talked about how to improve our bus-factor more than once in the past, so perhaps step 1 is for me to go back through old email threads and meeting notes before saying much more about it.

rvagg commented 4 years ago

Cloudflare have responded to the support request that I think @mhdawson submitted. It's a fairly generic response saying that everything seems fine on their end, but does include this:

Below are the logs we are seeing on our end.

β”Œβ”€requests─┬─URL───────────────────────────────────────────────────────────────────────────────┬─cache_status─┐
β”‚ 10673589 β”‚ nodejs.org/                                                                       β”‚ bypass       β”‚
β”‚   115595 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚    50878 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚    44930 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-x64.msi                                    β”‚ expired      β”‚
β”‚    35850 β”‚ nodejs.org/dist/v13.12.0/node-v13.12.0-x64.msi                                    β”‚ expired      β”‚
β”‚    28275 β”‚ nodejs.org/en/feed/blog.xml                                                       β”‚ dynamic      β”‚
β”‚    26794 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-linux-x64.tar.xz                           β”‚ expired      β”‚
β”‚    23355 β”‚ nodejs.org/dist/v13.12.0/node-v13.12.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚    17197 β”‚ nodejs.org/dist/v13.12.0/node-v13.12.0-linux-x64.tar.xz                           β”‚ expired      β”‚
β”‚    14911 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-linux-x64.tar.xz                           β”‚ expired      β”‚
β”‚    13489 β”‚ nodejs.org/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js         β”‚ unknown      β”‚
β”‚    11315 β”‚ nodejs.org/dist/v10.15.3/node-v10.15.3-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚    10703 β”‚ nodejs.org/dist/v8.11.3/node-v8.11.3-linux-x64.tar.gz                             β”‚ expired      β”‚
β”‚     9630 β”‚ nodejs.org/en/feed/releases.xml                                                   β”‚ dynamic      β”‚
β”‚     8920 β”‚ nodejs.org/dist/com/veeva/vault/app/parent/vault-app-parent-v2/maven-metadata.xml β”‚ dynamic      β”‚
β”‚     7111 β”‚ nodejs.org/                                                                       β”‚ dynamic      β”‚
β”‚     7044 β”‚ nodejs.org/en/                                                                    β”‚ dynamic      β”‚
β”‚     6663 β”‚ nodejs.org/dist/v8.11.1/node-v8.11.1-linux-x64.tar.gz                             β”‚ expired      β”‚
β”‚     6631 β”‚ nodejs.org/dist/v13.5.0/node-v13.5.0-linux-x64.tar.xz                             β”‚ expired      β”‚
β”‚     5819 β”‚ nodejs.org/dist/v8.17.0/node-v8.17.0-linux-x64.tar.xz                             β”‚ expired      β”‚
β”‚     5475 β”‚ nodejs.org/dist/v10.16.0/node-v10.16.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     5459 β”‚ nodejs.org/en/feed/vulnerability.xml                                              β”‚ dynamic      β”‚
β”‚     5455 β”‚ nodejs.org/dist/v13.12.0/ivy.xml                                                  β”‚ dynamic      β”‚
β”‚     5192 β”‚ nodejs.org/dist/v12.16.0/node-v12.16.0-linux-x64.tar.xz                           β”‚ expired      β”‚
β”‚     5036 β”‚ nodejs.org/static/images/favicons/apple-touch-icon.png                            β”‚ unknown      β”‚
β”‚     4813 β”‚ nodejs.org/dist/v12.14.1/node-v12.14.1-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     4680 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-darwin-x64.tar.gz                          β”‚ expired      β”‚
β”‚     4419 β”‚ nodejs.org/dist/v10.16.3/node-v10.16.3-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     4378 β”‚ nodejs.org/dist/v12.13.0/node-v12.13.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     4354 β”‚ nodejs.org/dist/v8.17.0/node-v8.17.0-linux-x64.tar.gz                             β”‚ expired      β”‚
β”‚     4223 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1.pkg                                        β”‚ expired      β”‚
β”‚     4000 β”‚ nodejs.org/dist/v13.12.0/node-v13.12.0-x64.msi                                    β”‚ unknown      β”‚
β”‚     3962 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-x64.msi                                    β”‚ unknown      β”‚
β”‚     3933 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1.tar.gz                                     β”‚ expired      β”‚
β”‚     3546 β”‚ nodejs.org/static/images/favicons/favicon-16x16.png                               β”‚ unknown      β”‚
β”‚     3482 β”‚ nodejs.org/favicon.ico                                                            β”‚ unknown      β”‚
β”‚     3306 β”‚ nodejs.org/dist/v10.13.0/node-v10.13.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     3161 β”‚ nodejs.org/static/images/favicons/favicon.ico                                     β”‚ unknown      β”‚
β”‚     3054 β”‚ nodejs.org/dist/v12.13.1/node-v12.13.1-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     3040 β”‚ nodejs.org/dist/v8.9.4/node-v8.9.4-linux-x64.tar.gz                               β”‚ expired      β”‚
β”‚     2926 β”‚ nodejs.org/dist/v12.14.0/node-v12.14.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β”‚     2905 β”‚ nodejs.org/dist/v13.12.0/node-v13.12.0.tar.xz                                     β”‚ expired      β”‚
β”‚     2806 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-darwin-x64.tar.gz                          β”‚ expired      β”‚
β”‚     2568 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1.tar.xz                                     β”‚ expired      β”‚
β”‚     2513 β”‚ nodejs.org/dist/v8.11.3/node-v8.11.3-linux-x64.tar.xz                             β”‚ expired      β”‚
β”‚     2508 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-win-x64.7z                                 β”‚ expired      β”‚
β”‚     2484 β”‚ nodejs.org/dist/v10.16.3/node-v10.16.3-linux-x64.tar.xz                           β”‚ expired      β”‚
β”‚     2459 β”‚ nodejs.org/dist/index.json                                                        β”‚ expired      β”‚
β”‚     2328 β”‚ nodejs.org/static/css/styles.css                                                  β”‚ unknown      β”‚
β”‚     2275 β”‚ nodejs.org/dist/v12.16.0/node-v12.16.0-linux-x64.tar.gz                           β”‚ expired      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€requests─┬─URL───────────────────────────────────────────────────────────────┬─cache_status─┐
β”‚ 10699252 β”‚ nodejs.org/                                                       β”‚ hit          β”‚
β”‚ 10680637 β”‚ nodejs.org/en/                                                    β”‚ hit          β”‚
β”‚ 10646355 β”‚ nodejs.org/                                                       β”‚ bypass       β”‚
β”‚  4811229 β”‚ nodejs.org/dist/index.tab                                         β”‚ hit          β”‚
β”‚   374251 β”‚ nodejs.org/dist/index.json                                        β”‚ hit          β”‚
β”‚   365092 β”‚ nodejs.org/dist/                                                  β”‚ hit          β”‚
β”‚   194435 β”‚ nodejs.org/download/release/v12.16.1/SHASUMS256.txt               β”‚ hit          β”‚
β”‚   194297 β”‚ nodejs.org/download/release/v12.16.1/node-v12.16.1-headers.tar.gz β”‚ hit          β”‚
β”‚   185335 β”‚ nodejs.org/download/release/v10.19.0/node-v10.19.0-headers.tar.gz β”‚ hit          β”‚
β”‚   185157 β”‚ nodejs.org/download/release/v10.19.0/SHASUMS256.txt               β”‚ hit          β”‚
β”‚   178814 β”‚ nodejs.org/static/images/favicons/favicon.ico                     β”‚ hit          β”‚
β”‚   162766 β”‚ nodejs.org/static/images/logo.svg                                 β”‚ hit          β”‚
β”‚   150530 β”‚ nodejs.org/static/css/styles.css                                  β”‚ hit          β”‚
β”‚   150105 β”‚ nodejs.org/static/images/logos/js-green.svg                       β”‚ hit          β”‚
β”‚   149195 β”‚ nodejs.org/static/js/main.js                                      β”‚ hit          β”‚
β”‚   149171 β”‚ nodejs.org/static/images/openjs_foundation-logo.svg               β”‚ hit          β”‚
β”‚   115593 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-linux-x64.tar.gz           β”‚ expired      β”‚
β”‚   103754 β”‚ nodejs.org/dist/v12.16.1/SHASUMS256.txt                           β”‚ hit          β”‚
β”‚    99305 β”‚ nodejs.org/static/images/favicons/manifest.json                   β”‚ hit          β”‚
β”‚    88246 β”‚ nodejs.org/favicon.ico                                            β”‚ hit          β”‚
β”‚    88009 β”‚ nodejs.org/dist/v10.19.0/SHASUMS256.txt                           β”‚ hit          β”‚
β”‚    82397 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-linux-x64.tar.xz           β”‚ hit          β”‚
β”‚    82208 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-linux-x64.tar.gz           β”‚ hit          β”‚
β”‚    64676 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-linux-x64.tar.gz           β”‚ hit          β”‚
β”‚    63312 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-linux-x64.tar.xz           β”‚ hit          β”‚
β”‚    63127 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-x64.msi                    β”‚ hit          β”‚
β”‚    54063 β”‚ nodejs.org/download/release/v10.16.3/SHASUMS256.txt               β”‚ hit          β”‚
β”‚    53970 β”‚ nodejs.org/download/release/v10.16.3/node-v10.16.3-headers.tar.gz β”‚ hit          β”‚
β”‚    50878 β”‚ nodejs.org/dist/v10.19.0/node-v10.19.0-linux-x64.tar.gz           β”‚ expired      β”‚
β”‚    50542 β”‚ nodejs.org/download/release/v10.15.3/node-v10.15.3-headers.tar.gz β”‚ hit          β”‚
β”‚    49290 β”‚ nodejs.org/download/release/v10.15.3/SHASUMS256.txt               β”‚ hit          β”‚
β”‚    47824 β”‚ nodejs.org/download/release/v12.13.0/SHASUMS256.txt               β”‚ hit          β”‚
β”‚    47691 β”‚ nodejs.org/download/release/v12.13.0/node-v12.13.0-headers.tar.gz β”‚ hit          β”‚
β”‚    47097 β”‚ nodejs.org/en/download/                                           β”‚ hit          β”‚
β”‚    44939 β”‚ nodejs.org/dist/v13.12.0/SHASUMS256.txt                           β”‚ hit          β”‚
β”‚    44928 β”‚ nodejs.org/dist/v12.16.1/node-v12.16.1-x64.msi                    β”‚ expired      β”‚
β”‚    40518 β”‚ nodejs.org/dist/v8.17.0/SHASUMS256.txt                            β”‚ hit          β”‚
β”‚    39707 β”‚ nodejs.org/download/release/v8.17.0/node-v8.17.0-headers.tar.gz   β”‚ hit          β”‚
β”‚    39496 β”‚ nodejs.org/download/release/v13.12.0/node-v13.12.0-headers.tar.gz β”‚ hit          β”‚
β”‚    39272 β”‚ nodejs.org/download/release/v8.17.0/SHASUMS256.txt                β”‚ hit          β”‚
β”‚    35848 β”‚ nodejs.org/dist/v13.12.0/node-v13.12.0-x64.msi                    β”‚ expired      β”‚
β”‚    35414 β”‚ nodejs.org/download/release/v10.16.0/node-v10.16.0-headers.tar.gz β”‚ hit          β”‚
β”‚    35385 β”‚ nodejs.org/download/release/v13.12.0/SHASUMS256.txt               β”‚ hit          β”‚
β”‚    34950 β”‚ nodejs.org/api/assets/style.css                                   β”‚ hit          β”‚
β”‚    34798 β”‚ nodejs.org/api/assets/sh.css                                      β”‚ hit          β”‚
β”‚    34487 β”‚ nodejs.org/api/assets/sh_main.js                                  β”‚ hit          β”‚
β”‚    34459 β”‚ nodejs.org/api/assets/sh_javascript.min.js                        β”‚ hit          β”‚
β”‚    33177 β”‚ nodejs.org/download/release/v10.16.0/SHASUMS256.txt               β”‚ hit          β”‚
β”‚    31980 β”‚ nodejs.org/download/release/v12.14.1/node-v12.14.1-headers.tar.gz β”‚ hit          β”‚
β”‚    31674 β”‚ nodejs.org/download/release/v12.14.1/SHASUMS256.txt               β”‚ hit          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

We will only cache something when a client requests the resource and then all subsequent requests for that resource are served from the cache until the cache TTL expires.

... continued with copypasta and links to help articles


I don't think there's anything here that would explain such an anomaly.

But, as a first course of action out of this, we probably need to do a systematic review of all of the path configurations in Cloudflare and in nginx that control expiry of our resources. Our current practices mean that we can say with a high degree of certainty which resources in download/* should have near-infinite TTLs. There are some things that shouldn't, like index.json, index.tab and directory listings, but we can enumerate those.
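
As a concrete starting point for that review, one quick way to see how each class of path is currently treated is to sample the response headers at the edge. A minimal sketch (the paths are just examples):

# Spot-check how Cloudflare is treating a mix of immutable and mutable paths.
# cf-cache-status, cache-control and age show the effective caching per path.
for path in /dist/v12.16.1/node-v12.16.1-linux-x64.tar.gz /dist/index.json /dist/; do
  echo "== $path"
  curl -sI "https://nodejs.org$path" | grep -iE '^(cf-cache-status|cache-control|age):'
done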

rvagg commented 4 years ago

"Edge Cache TTL: a day" is what most of our cache rules have in Cloudflare. I think when we switched from bypass for /download/ & /dist/ we did 1 day and intended to come back to this at some point in the future and improve it. But our nightlies already effectively trigger a daily purge anyway, so this is all business as usual. It suggests tons of room for improvement, but none of it explains how we might be responsible for such a weird cache anomaly.

rvagg commented 4 years ago

I've been investigating all the paths that we have to requesting a cache purge from Cloudflare to see if there's a way we could have triggered an ongoing purge,purge,purge,purge. I think I've found one, but I can't use it to fully explain that graph.

Background

The way we do a purge is that we have a protected script on our main server that can receive a signal from a queue that there's a request to purge the cache. If it finds that signal it makes the API call. This queue can be modified by almost any user on the server, so we have various other scripts that add requests to the queue. We do this so we can keep the API call and its keys protected and don't have to expose those to whatever script & user might need to make that call. The queue is just a temp file that we touch. There's a cron job that checks the queue every minute, performs the purge if requested, and then clears the queue.
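
For context, the pattern described above is roughly the following (a minimal sketch with hypothetical file names; the real script and queue path differ):

#!/bin/bash
# Sketch of the queue-then-purge pattern (hypothetical paths, not the real script).
# Unprivileged jobs signal a purge by touching the queue file; this protected
# script, run from cron every minute, is the only thing holding the API credentials.
QUEUE=/home/nodejs/cdn-purge-queue   # hypothetical queue file

if [ -e "$QUEUE" ]; then
  # Full-zone purge via the Cloudflare API.
  curl -s -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache" \
       -H "Authorization: Bearer ${CF_API_TOKEN}" \
       -H "Content-Type: application/json" \
       --data '{"purge_everything":true}'
  rm -f "$QUEUE"                     # clear the queue once the purge has been requested
fi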

The processes that can trigger a purge are:

Any of these can trigger a full cache purge. (Yes this is not optimal, but making it optimal is going to take very careful work).

Late last year, in an attempt to improve security, I introduced a cron job on the server that changes files in /download/ to root.root ownership after 7 days in existence. Everyone who has "release" permissions has access to log in to the server as user dist, and this is the user that owns these files by default. This means that anyone with that access could tinker with these assets, which is a security concern. I raised the issue of this list of SSH keys growing out of control @ https://github.com/nodejs/Release/issues/499; more work is still needed on that front.
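
Conceptually that lock-down job is something like the following, run daily from root's crontab (the path is an assumption, not the actual location on the server):

# Hand ownership of week-old release assets to root so the shared "dist" user
# can no longer modify them (hypothetical path).
find /home/dist/nodejs -mtime +7 ! -user root -exec chown root:root {} +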

So, now with old /download/ files locked in place, they can't be overwritten by anyone without root access. But it is still possible to queue assets for auto-promotion (and manual promotion!) that belong to old releases, e.g. someone could rebuild v10.0.0 today and try to promote it. Thankfully the overwrite itself isn't possible, but the scripts don't cleanly handle this case.

The main promote script @ https://github.com/nodejs/build/blob/050fb9733697acb3dd5bb53891c2818e2c335bab/ansible/www-standalone/tools/promote/_promote.sh has an in-built assumption that it can both copy the files from staging to dist and then remove them from staging. It's run as user dist, both automatically and manually. But if it tries to promote something to a dist directory that's older than 7 days, it's going to get an error.

The bug

Here's what happened:

  1. v10.20.0-rc.1 was built on the 24th of March, available @ https://nodejs.org/download/rc/v10.20.0-rc.1/
  2. 7 days later, that whole directory and its files would have been changed from dist.dist to root.root to lock them in place
  3. A second v10.20.0-rc.1 was requested on the 3rd of April, with a different commit, but I think the error was seen fairly quickly and it was cancelled before all builders completed. But the quick ones got through:
$ ls -al /home/staging/nodejs/custom/v10.20.0-rc.1/
total 66884
drwxrwxr-x  2 staging staging     4096 Apr  3 14:16 .
drwxrwxr-x 66 staging staging     4096 Apr  6 23:45 ..
-rw-rw-r--  1 staging staging 20034139 Apr  3 14:15 node-v10.20.0-rc.1-linux-armv7l.tar.gz
-rw-rw-r--  1 staging staging        0 Apr  3 14:16 node-v10.20.0-rc.1-linux-armv7l.tar.gz.done
-rw-rw-r--  1 staging staging 11781292 Apr  3 14:16 node-v10.20.0-rc.1-linux-armv7l.tar.xz
-rw-rw-r--  1 staging staging        0 Apr  3 14:16 node-v10.20.0-rc.1-linux-armv7l.tar.xz.done
-rw-rw-r--  1 staging staging  2260992 Apr  3 14:16 node-v10.20.0-rc.1-linux-ppc64le.tar.gz
-rw-rw-r--  1 staging staging 21575960 Apr  3 14:14 node-v10.20.0-rc.1-linux-s390x.tar.gz
-rw-rw-r--  1 staging staging        0 Apr  3 14:15 node-v10.20.0-rc.1-linux-s390x.tar.gz.done
-rw-rw-r--  1 staging staging 12819760 Apr  3 14:15 node-v10.20.0-rc.1-linux-s390x.tar.xz
-rw-rw-r--  1 staging staging        0 Apr  3 14:15 node-v10.20.0-rc.1-linux-s390x.tar.xz.done
  4. A new v10.20.0-rc.2 was built quickly after with the same commit sha and it went through fine.
  5. The auto-promote script detected the new v10.20.0-rc.1 files and attempted to promote them, but failed because of the new ownership of /home/dist/nodejs/rc/v10.20.0-rc.1/. Manually running the command that cron invokes shows:
$ su dist -c '/home/staging/tools/promote/promote_nightly.sh nodejs'
Promoting v10.20.0-rc.1/node-v10.20.0-rc.1-linux-armv7l.tar.gz...
rm: cannot remove '/home/dist/nodejs/rc/v10.20.0-rc.1/node-v10.20.0-rc.1-linux-armv7l.tar.gz': Permission denied
  6. The promote script didn't properly exit when that failed (no set -e or error signal trap in that script), so it continued on with a "queue cdn purge".
  7. Repeat from step 5, every 5 minutes, queuing a cdn purge which promptly gets executed less than a minute later.

Results

This pattern could explain a purge,purge,purge,purge but it can't explain the specific window of time where we were experiencing trouble. I don't know how to get from this bug to that behaviour, but it's a partial explanation at least.

Fix

I've fixed the immediate problem for now by removing those newly staged v10.20.0-rc.1 files and re-running promotion. It ran fine, and we've now got rc.2 and rc.3, which were lingering in staging waiting for attention, plus some nightlies and v8-canary files.

My first pass at addressing this problem is to make the promotion process more robust. See the changes @ #2269, specifically https://github.com/nodejs/build/pull/2269/files#diff-03699a6e81ab58dc8fa9a218932c0588, which add a set -e and also defer the cdn purge to the very end of the script so it's only done once, if the script successfully reaches the end and needs a purge. I've also introduced logging in that same PR so we can better trace these purges if we see the same pattern again. Logs go to /home/nodejs/cdn-purge.log next to the script (and next to github-webhook.log). I've also slowed down all of these cron processes in #2268. See the notes in that PR about the compromises this involves.
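
In rough outline the hardened flow is something like this (an illustrative sketch, not the actual diff in #2269; promote_one and the queue file are hypothetical):

#!/bin/bash
# Illustrative sketch of the hardened promotion flow: abort on the first error,
# do all the per-version work, and only then queue a single CDN purge.
set -e

for version in "$@"; do
  promote_one "$version"   # hypothetical helper: copy staging -> dist, verify, clean up staging
done

# One purge request for the whole batch; the protected purge job picks this up
# on its next run and logs it to /home/nodejs/cdn-purge.log.
echo promote >> /home/nodejs/cdn-purge-queue   # hypothetical queue file, as in the earlier sketch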

These changes are live on the server but review would be appreciated.

This doesn't fully address the problems here. It's still going to error if it has to promote new files for old builds and that will lock up all promotion for that release type, but presumably that'll be noticed by someone who will complain about their builds not appearing. What needs to be done is a more robust "oh, this is not possible, I'm going to skip this one and log and/or email an error and continue to the next one".

MattIPv4 commented 4 years ago

Any of these can trigger a full cache purge. (Yes this is not optimal, but making it optimal is going to take very careful work).

To me, it seems like this work is something that really should be done, so that all the versioned files (eg https://nodejs.org/download/rc/v10.20.0-rc.1/*) can have much more effective caching, considering they never change?

The auto-promote script detected the new v10.20.0-rc.1 and attempted to promote them, but failed because of the new ownership of /home/dist/nodejs/rc/v10.20.0-rc.1/

What's the timeline there? Would this have immediately started on the 3rd? Was it late in the day?

Could it be that there's been a lower cache rate all weekend but it wasn't until Monday morning when Europe woke up that load increased enough to trip things up?

rvagg commented 4 years ago

To me, it seems like this work is something that really should be done

Yes, I'm not going to argue with that, but it's "work" and has to be "done" by someone with enough access and know-how about how all of these systems work. Let's please put that aside; it's really not helpful to dwell on it here beyond noting that it needs to be done. I, or someone else, will open an issue to explore the options ahead, including an audit of what we have in place now.

What's the timeline there

Immediately, I believe. The dates on those .done files give us the time it would have started: ~Apr 3 14:15, which is UTC. Cron would have loaded up the promote script at 14:15; it could have been a few seconds too late to see those, but by 14:20 it would have properly started.

I don't have access to older data from the Cloudflare API that would show us that timeframe (see graph above). We do have all the logs stored in GCP, but they're massive, and filtering through them ourselves to calculate the cache % around that time would be a big job. I don't think that's likely to get done.
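
If anyone ever did want to attempt it, the rough shape of that job would be something like the following, assuming the stored logs are gzipped Cloudflare Logpush NDJSON with a CacheCacheStatus field sitting in a GCS bucket (the bucket name and path layout here are assumptions):

# Rough sketch: tally cache statuses for one hour of stored edge logs.
gsutil cat gs://nodejs-cf-logs/20200406/14*.log.gz \
  | zcat \
  | jq -r '.CacheCacheStatus' \
  | sort | uniq -c | sort -rn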

Could it be that there's been a lower cache rate all weekend but it wasn't until Monday mornning when Europe woke up that load increased enough to trip things up?

Yes, very good point, day of the week is probably a key component. Our traffic patterns are US weekday heavy (very). Not so much Europe. (See how increasingly jagged our metrics have become over time due to this effect: https://nodejs.org/metrics/.) So it could be as simple as this putting stress on our primary server, but not enough to push it over the edge until the US Monday rolls around and boom, we hit our same old limits. Maybe that's enough to explain it.

Aside: https://nodejs.org/metrics/ is interesting if you look at the far right of the graphs - they're still generated from the server logs and don't account for Cloudflare logs anymore (WIP), but they show a big spike as traffic slips through from Cloudflare. That demonstrates the effect in my graph above, with more and more traffic getting bypassed.

MattIPv4 commented 4 years ago

but by 14:20 it would have properly started.

Hm, I would probably have expected this issue to surface on Friday then, as the US wakes up and goes to work. I think it's probably still likely that this is the (only) source of the issue and it just took a while to reach the breaking point.

ErisDS commented 4 years ago

It seems the bottom line is: this is a simple service managed by volunteers, and it's difficult, and potentially unreasonable, to run this service with the availability/reliability people (wrongly or rightly) expect on the back of those volunteers.

I've seen mention of introducing someone paid to manage infra, but has there been any consideration of not hosting these files internally/manually, but rather having an agreement with an established provider?

The obvious one to me would be GitHub; after all, they built one of the tools that was heavily affected, and they're charging for it. They're already talking about implementing a cache, but maybe a mirror, or fully taking over, would make more sense. I've no idea how possible or complex it would be to do, but I wanted to suggest it, and at the very least I think it'd be worth opening comms if there aren't already.

BethGriggs commented 4 years ago

I can confirm that's what happened

A second v10.20.0-rc.1 was requested on the 3rd of April, with a different commit, but I think the error was seen fairly quickly and it was cancelled before all builders completed. But the quick ones got through.

Thank you for digging in @rvagg

MattIPv4 commented 4 years ago

Shared this on Slack earlier -- I've gone through the whole of nodejs.org/download/ and categorised what's immutable & can be set to a very long cache life, and what needs to have a short life.

https://docs.google.com/spreadsheets/d/1a4ZOeHxuAXejcp0q5n753FVs0OLMx-p8Ers1aMEwFN4/edit#gid=0

I'm hoping that this will help enable the changes needed to set much longer caching on assets and to avoid purging the entire cache for each new asset promotion.

rvagg commented 4 years ago

(Image: Screenshot 2020-04-08 15 18 40)

Still trying to make this timeline fit with the events I laid out above and I can't, quite. This graph shows bytes downloaded off the main server. It's lumped per day so a bit hard to pull apart but you can see our normal weekly pattern and then the spike really hits on the 6th. Before that it seems quite normal. A trigger in the middle of the 3rd doesn't match that graph, nor the Cloudflare data previously posted.

I think we know the main areas for improvement, and I feel fairly confident I've uncovered a bug that needs to be properly fixed (for now it's patched and workable, but not ideal going forward). But I'm not confident asserting that a particular sequence led to what we saw go down during this incident.

targos commented 4 years ago

Another thing that doesn't seem explainable by the bug on our side is that the incident resolved itself without our intervention.

Trott commented 4 years ago

Something we probably could have done during this was to put a banner on the website.

MattIPv4 commented 4 years ago

Something we probably could have done during this was to put a banner on the website.

If Statuspage is implemented, they provide a nice little widget that could be added to the site; whenever there is an incident on the status page, it shows a popup on every page where the widget script is included.

(https://community.atlassian.com/t5/Statuspage-articles/Introducing-Statuspage-s-newest-feature-Status-Embed/ba-p/1102058)

sam-github commented 4 years ago

Interestingly, the fact that GitHub Actions has a fragile dependency on nodejs.org download availability has been known since October 2019.

Also known for Azure, same timeframe, presumably the same incident.

I found these as top links while trying to find out more about the downstream consumers that are so impacted by these outages, expecting to find issues from the last couple of days, not last year.

rvagg commented 4 years ago

7-day and 30-day graphs from the primary www backend from DO just now. Anomaly is clear, but we've pushed bandwidth and load way down since taking the measures to slow down possible purging rate and plug possible holes.

CPU usage in the 30-day graph dropping in April is a bit weird; no idea what that might be.

(Images: 20200410_do_www_metrics, 20200410_do_www30_metrics)

rvagg commented 4 years ago

Interesting now to see how often we purge cdn now that we have logging in place:

2020-04-07T08:10:01+00:00, nodejs, promote 
2020-04-07T10:35:01+00:00, nodejs, promote 
2020-04-07T11:35:01+00:00, nodejs, promote 
2020-04-07T12:35:01+00:00, nodejs, promote 
2020-04-07T14:35:01+00:00, nodejs, promote 
2020-04-07T15:05:01+00:00, nodejs, promote 
2020-04-07T15:35:01+00:00, nodejs, promote 
2020-04-07T16:00:01+00:00, nodejs, build-site build-site 
2020-04-07T16:35:01+00:00, nodejs, promote 
2020-04-07T18:40:01+00:00, nodejs, build-site 
2020-04-07T18:45:01+00:00, nodejs, build-site 
2020-04-07T21:35:01+00:00, nodejs, promote 
2020-04-07T22:05:02+00:00, nodejs, promote 
2020-04-07T22:35:01+00:00, nodejs, promote 
2020-04-08T03:05:01+00:00, nodejs, promote 
2020-04-08T06:35:01+00:00, nodejs, promote 
2020-04-08T07:05:01+00:00, nodejs, promote 
2020-04-08T07:35:01+00:00, nodejs, promote 
2020-04-08T10:35:01+00:00, nodejs, promote 
2020-04-08T11:35:01+00:00, nodejs, promote 
2020-04-08T12:35:01+00:00, nodejs, promote 
2020-04-08T14:05:01+00:00, nodejs, build-site build-site 
2020-04-08T16:10:01+00:00, nodejs, promote resha_release 
2020-04-08T16:15:01+00:00, nodejs, build-site 
2020-04-08T16:25:01+00:00, nodejs, build-site build-site 
2020-04-08T19:25:01+00:00, nodejs, promote resha_release 
2020-04-08T19:30:01+00:00, nodejs, build-site 
2020-04-08T19:35:01+00:00, nodejs, build-site build-site 
2020-04-09T06:35:01+00:00, nodejs, promote 
2020-04-09T07:05:01+00:00, nodejs, promote 
2020-04-09T07:35:01+00:00, nodejs, promote 
2020-04-09T10:35:01+00:00, nodejs, promote 
2020-04-09T11:05:02+00:00, nodejs, promote 
2020-04-09T11:35:01+00:00, nodejs, promote 
2020-04-09T12:35:01+00:00, nodejs, promote 
2020-04-09T17:35:01+00:00, nodejs, promote 
2020-04-09T17:55:01+00:00, nodejs, build-site 
2020-04-09T18:00:01+00:00, nodejs, build-site 
2020-04-09T18:05:01+00:00, nodejs, promote 
2020-04-09T18:35:01+00:00, nodejs, promote 
2020-04-09T19:05:01+00:00, nodejs, promote 
2020-04-09T19:35:01+00:00, nodejs, promote 

Plain "promote" are nightlies, v8-canaries and a few RCs flowing in. The ones with "resha_release" are release builds, so that would be 10.20.0 and the 12.x that went out this week. "build-site" are triggered either by updates to nodejs.org (release blog entries added and maybe other tweaks by that team) and auto checks that the site was build after the last release was made.

Still too often though. We're not maximising our Cloudflare potential.

Re the "promote" entries which are the bulk of these, I think that's mainly caused by the different speeds at which builds get pumped out. If you look at a nightly, e.g. https://nodejs.org/download/nightly/v14.0.0-nightly20200408c849f2d4f8/ and scan the timestamps, you can see that they'd be dropping in to staging over the course of ~1.5h, and we now have the cron job checking at 30 minute intervals (#2268). v8-canary builds are even worse, spanning ~2.5h, so potentially 5 purges while they dribble in each day.

A possible solution to this might be for auto-promote jobs to have a "debounce" mechanism that checks for last-drop time and waits for last+30m (or some number) before taking action rather than just taking action when something's there. It might get frustrating for people building RCs and test builds, but they're relatively infrequent and maybe we can special-case those.
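
A minimal version of that check might look like this (the staging path and the 30-minute window are assumptions):

# Debounce sketch: only promote once the newest file in staging is older than
# 30 minutes, i.e. the builders appear to have finished dribbling files in.
STAGING=/home/staging/nodejs/nightly   # hypothetical directory
newest=$(find "$STAGING" -type f -printf '%T@\n' | sort -n | tail -1)
now=$(date +%s)

if [ -n "$newest" ] && [ $(( now - ${newest%.*} )) -ge 1800 ]; then
  /home/staging/tools/promote/promote_nightly.sh nodejs   # existing promote entry point
else
  echo "staging still settling, skipping this run"
fi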

MattIPv4 commented 4 years ago

My completely honest $0.02 would be that the whole idea of purging the cache on promotion should be scrapped. The only things changing are the index files; everything else is new, so there isn't even a cache to clear there. It seems the entire nodejs.org cache gets purged just so that a couple of index files get updated a bit sooner?

I personally think that there should just be a slightly shorter cache life set on those, with a really long cache life on everything else. Sure, they might not be completely up-to-date every minute of every day, but they'll update themselves once the cache life set on them invalidates.

Edit: Certain promotions (anything going into latest), as well as site builds, still need to purge. However, these should really use targeted cache purges with a list of specific assets, so that the cache for everything else that remained unchanged isn't affected.
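
For reference, Cloudflare's purge API does accept a list of specific URLs instead of purge_everything, so a targeted purge after a site build or a latest promotion could be as small as this (credentials are placeholders, and the URL list is just an example):

# Purge only the handful of mutable files a promotion actually changes,
# rather than the entire nodejs.org cache.
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache" \
     -H "Authorization: Bearer ${CF_API_TOKEN}" \
     -H "Content-Type: application/json" \
     --data '{"files":["https://nodejs.org/dist/index.json","https://nodejs.org/dist/index.tab","https://nodejs.org/dist/"]}'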

github-actions[bot] commented 3 years ago

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.