For some reason, iojs.org redirected to nodejs.org, causing errors with nvm
(https://github.com/SheetJS/sheetjs/runs/5564331772 test failed when trying to fetch https://iojs.org/dist/v3.3.1/iojs-v3.3.1-linux-x64.tar.xz)
I wonder if the site/Cloudflare are just getting overwhelmed with traffic from people downloading the latest release?
Node v16.14.1 was released about an hour ago... https://github.com/nodejs/nodejs.org/pull/4494 / https://github.com/nodejs/node/releases/tag/v16.14.1
It's the latest LTS release, so basically everyone in the world looking for the default version of Node is downloading it right now, largely uncached (other than whatever Cloudflare has apparently cached).
Should settle down in a few hours, or by tomorrow??
Note: I have nothing to do with Node's web servers or site, I'm just a regular person speculating at this point.
@SheetJSDev I think the https://iojs.org/dist site is backed by https://nodejs.org/dist behind the scenes. (Using the Network pane of browser Dev Tools to inspect the traffic, the responses are served from nodejs.org, even when visiting iojs.org/dist.)
Edit: Actually just parts of it show as coming from nodejs.org, some resources say they're from io.js. :shrug:
Edit again: I note that the root of the io.js site (https://iojs.org/) redirects to nodejs.org... But also, I can't see where in your CI run it visibly redirected to nodejs.org. I just see that the server error caused nvm to fail to download and install io.js.
And at that... I'll hide my own comment as off-topic.
Closing as the issue isn't happening right now, but also pinging @nodejs/build in case there's anything to add here or something to be aware of.
People are still reporting this over in the OpenJS Slack, so I'm going to re-open this until that gets sorted.
I have CloudFlare LB error notices spanning from ~2:40 am UTC to ~3:40 am UTC, which I suppose correlates with the latest 16.x release. Overloaded primary server, switching to the backup. Users may have got the 500 while hitting the primary before it switched.
I don't have a good explanation beyond that, I'm not sure what's caused the overload, I don't believe that server actually has trouble serving, but we have witnessed some weird I/O issues connected with rsync that maybe all happened at the same time. Perhaps during the next release someone should be on the server watching for weirdness.
This seems to be happening again; https://iojs.org/dist/index.json is 500ing, and v12.22.11 just went out.
(after about 5-10 minutes, the 500s seemed to stop, but it's still worth looking into)
FWIW I am still seeing either "500 internal server error" or very slow file downloads (less than 50 kilobytes a second. Often less than 10 kilobytes a second, especially at the beginning of a download.)
After promoting 12.22.11 I went to rerun the same release script to promote 14.19.1 and it just hung during the bit when it fetches the promotable builds. Same behaviour from another machine on a different network. Weirdly I was able to ssh into the machine in an interactive session and manually run the command the promotion script was trying to run 😕.
> I don't have a good explanation beyond that, I'm not sure what's caused the overload, I don't believe that server actually has trouble serving, but we have witnessed some weird I/O issues connected with rsync that maybe all happened at the same time. Perhaps during the next release someone should be on the server watching for weirdness.
I logged into the machine, ran `ps -ef`, and saw we had two rsyncs in progress:

```
root     12174 12137  0 00:08 ?  00:00:00 rsync --server --sender -logDtprze.iLsfxC . /home/dist/iojs/
nodejs   27646 27638  0 00:00 ?  00:00:00 ssh benchmark rsync --server --sender -logDtprze.iLsfx . coverage-out/out/
```
The first of those, running as root, is from the backup server -- I've left that alone. I killed the second one, which is the coverage data, and suddenly my other terminal running the promotion script got unstuck. This would kind of suggest the coverage data syncing is partly, or wholly, responsible. In the past either @mhdawson or I would do an annoying manual clean-up of old coverage data to reduce the volume of stuff being synced, but I'm now going to recommend we turn off running coverage on Jenkins and the associated data shuffling to populate coverage.nodejs.org, and switch exclusively to codecov.io.
Actually as I typed the above paragraph my promotion execution has broken 😢 so there's still something up with the machine:
```
Are you sure you want to promote the v14.19.1 assets? [y/n] y
client_loop: send disconnect: Broken pipe
```
Currently there are no rsync processes -- but running the promotion script is "hung" again for me 😞.
These are the top things via `top`:

```
top - 00:33:22 up 195 days, 16:59,  1 user,  load average: 0.72, 0.80, 0.84
Tasks: 165 total,   2 running, 163 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.7 us,  3.3 sy,  0.0 ni, 83.6 id,  0.1 wa,  0.0 hi,  8.2 si,  0.1 st
KiB Mem : 16432136 total,  1251072 free,  1222728 used, 13958336 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14391196 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
25151 www-data  20   0  595256 456632 375512 S  9.0  2.8 510:58.65 nginx
25149 www-data  20   0  595796 456892 375408 S  8.3  2.8 529:56.28 nginx
25152 www-data  20   0  597244 458740 375596 S  8.0  2.8 524:21.40 nginx
25150 www-data  20   0  594556 456176 375488 S  7.3  2.8 505:23.46 nginx
25153 www-data  20   0  596728 458372 375512 S  7.0  2.8 517:06.30 nginx
25154 www-data  20   0  595496 456944 375552 S  6.3  2.8 515:04.23 nginx
   23 root      20   0       0      0      0 R  5.0  0.0 238:39.12 ksoftirqd/3
12273 telegraf  20   0 6316820  38820   7492 S  3.0  0.2  17:58.94 telegraf
   58 root      20   0       0      0      0 S  2.0  0.0   2773:53 kswapd0
    7 root      20   0       0      0      0 S  0.7  0.0 595:13.23 rcu_sched
   18 root      20   0       0      0      0 S  0.3  0.0  12:58.42 ksoftirqd/2
16257 root      20   0   42092   3784   3128 R  0.3  0.0   0:00.13 top
```
I think I'm going to bounce nginx.
Same thing happening here. I know this type of comment isn't useful in most cases, but in this case it at least serves to show that there are still users being affected 😬
Edit: it's working for me now
I did `systemctl restart nginx`. I'm still seeing rsync processes from the backup server running.
This is being reported again, probably due to an increase in traffic from the Node.js 12 release that went out earlier. This is from the Cloudflare Load Balancing activity analytics:
Grafana for the main server (DO):
`top`:

```
top - 14:15:36 up 214 days, 6:42,  1 user,  load average: 1.38, 1.00, 1.04
Tasks: 168 total,   4 running, 164 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.0 us,  2.7 sy,  0.0 ni, 75.5 id,  0.3 wa,  0.0 hi,  6.1 si, 10.3 st
KiB Mem : 16432136 total,   885560 free,  1270772 used, 14275804 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14359928 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   23 root      20   0       0      0      0 R 24.3  0.0 269:33.01 ksoftirqd/3
 6072 www-data  20   0  598296 443264 359044 S 17.3  2.7 737:22.52 nginx
 6071 www-data  20   0  599072 444024 359044 S 14.3  2.7 717:50.13 nginx
 6074 www-data  20   0  594416 439496 359044 R 14.3  2.7 751:33.93 nginx
 6069 www-data  20   0  599124 444240 359044 S 12.0  2.7 752:04.15 nginx
 6073 www-data  20   0  597752 442640 359044 S 11.6  2.7 748:45.83 nginx
 6070 www-data  20   0  598220 442908 359044 R  9.0  2.7 731:50.99 nginx
    7 root      20   0       0      0      0 S  1.3  0.0 722:24.29 rcu_sched
   18 root      20   0       0      0      0 S  0.7  0.0  14:54.21 ksoftirqd/2
   58 root      20   0       0      0      0 S  0.7  0.0   3122:16 kswapd0
  673 root      20   0  187004  42708  42360 S  0.3  0.3  92:42.30 systemd-journal
 1220 root      20   0   42092   3628   3004 R  0.3  0.0   0:00.09 top
    1 root      20   0  185104   4900   3136 S  0.0  0.0  26:18.81 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:07.94 kthreadd
    3 root      20   0       0      0      0 S  0.0  0.0   4:51.64 ksoftirqd/0
```
No idea if this is related - but wanted to report it here just in case. I've found the download server has been incredibly slow for the past 30 mins or so, with some requests hanging and never completing.
I spotted this during some `citgm-smoker-nobuild` jobs that I was running:

```
02:24:08 + curl -O https://nodejs.org/download/release/v17.9.0/node-v17.9.0-aix-ppc64.tar.gz
02:24:08   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
02:24:08                                  Dload  Upload   Total   Spent    Left  Speed
02:24:08  55 52.2M   55 29.0M    0     0  33587      0  0:27:12  0:15:06  0:12:06 42751
          55 52.2M   55 29.0M    0     0  33550      0  0:27:14  0:15:07  0:12:07 12811
          55 52.2M   55 29.0M    0     0  33538      0  0:27:14  0:15:08  0:12:06 10273Terminated
02:39:17 Build was aborted
```
Also seen in https://ci.nodejs.org/job/citgm-smoker-nobuild/1190/nodes=rhel8-s390x/console
(I hope the `citgm-smoker-nobuild` jobs that I've been running are not the cause here 😅)
I also tried locally and was getting 20-30 KB/s 🐢 . After ~30 mins or so new requests started completing as normal. I did have to abort the jobs in Jenkins to stop them hanging.
I have a flurry of notifications from CF around 6 hours ago about timeouts. The interesting thing this time is that both servers are getting timeout notifications. I'm not sure what that means. We should probably do a test to push all traffic to the backup server and invalidate cache to see what sort of load it gets and whether it can handle it. It'd be surprising to me if it can't. This might be deeper, or could be a CF issue.
And having just written that, it's flapping again right now. I'm on the main server and it's pretty sluggish, load < 3 though. What I can see happening in parallel is an `xz` compressing log files thanks to logrotate while at the same time nightlies are being uploaded, and therefore the CF cache is getting invalidated. Both servers have `nginx` at the top of their process lists sorted by CPU.
I've reniced the `xz` process to 19; it's annoying that it runs at niceness 0 and I'm not sure there's a good way to deprioritise the compression jobs when logrotate kicks them off. So what I've done in /etc/cron.daily/logrotate.conf is prefix the `logrotate` call with `nice -n 19 ionice -c3`. It seems to have calmed down after I manually reniced the running process; no flapping since then.
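For reference, a sketch of what the wrapped cron entry might look like; the path and surrounding contents are assumptions based on a typical distro logrotate cron script, and only the `nice`/`ionice` prefix reflects the change described above.

```bash
#!/bin/sh
# Hypothetical /etc/cron.daily/logrotate wrapper (contents assumed, not copied
# from the nodejs.org host). The prefix keeps logrotate -- and any xz it spawns
# to compress rotated logs -- at the lowest CPU priority and in the idle I/O class.
nice -n 19 ionice -c3 /usr/sbin/logrotate /etc/logrotate.conf
EXITVALUE=$?
if [ $EXITVALUE != 0 ]; then
    /usr/bin/logger -t logrotate "ALERT exited abnormally with [$EXITVALUE]"
fi
exit $EXITVALUE
```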
Another thing I'm noticing as I tail the logs is that the Cloudflare-Traffic-Manager is getting really chatty. Each of their edge locations is pinging both servers to check their health, which means that as Cloudflare expands their network, and as long as we maintain a fairly frequent health check, the servers are handling quite a bit of traffic even at idle. Nowhere near their capacity; it just means that background traffic is quite high, and then cache invalidation shoots it right up.
So, I'm going to try something, currently our health check is: every 30 seconds, timeout after 4 seconds, retry 3 times. I'm bumping it to: every 60 seconds, timeout after 8 seconds, retry 2 times. Anyone got a better suggestion?
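If it helps to see it concretely, this is roughly what that change looks like against Cloudflare's load balancer monitor API; the endpoint, `PATCH` method, and field names are my reading of that API, and the monitor ID/token are placeholders, so treat it as an illustration rather than the exact call used.

```bash
# Bump the health-check monitor: every 60s, 8s timeout, 2 retries.
# $MONITOR_ID and $CF_API_TOKEN are placeholders.
curl -X PATCH "https://api.cloudflare.com/client/v4/user/load_balancers/monitors/$MONITOR_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"interval": 60, "timeout": 8, "retries": 2}'
```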
I seem to have run into this a few minutes ago. It's working again now.
Same; it was happening a bunch for me within the last hour.
unable to get to any of the documentation now
Seems to be resolved now, if having a comment like this with a timestamp helps anyone look through relevant logs or whatever.
@Trott a few links are still 500ing for me
It's interesting we did not get reports after 18.x was launched. I'd expect if it was driven by downloads we might have seen something yesterday.
It definitely seems more driven by the build/release process itself than by downloads later.
Makes me think that it might be driven more by CI runners always pulling down the latest for a release, so there is a massive swarm at the beginning of the release at the same time as the cache invalidation
Wonder if delaying the index.tab update till the cache is hydrated might alleviate it
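A rough sketch of what that could look like: warm the CDN with the most commonly requested assets of the newly promoted release before index.tab/index.json are updated. The script, version, and asset list below are purely illustrative assumptions, not existing release tooling.

```bash
#!/usr/bin/env bash
# Hypothetical cache-warming pass run between "assets promoted" and
# "index.tab/index.json updated", so the first wave of CI runners hits a warm cache.
set -euo pipefail

VERSION="v16.14.1"   # hypothetical newly promoted release
ASSETS=(
  "node-${VERSION}-linux-x64.tar.xz"
  "node-${VERSION}-darwin-arm64.tar.gz"
  "node-${VERSION}-win-x64.zip"
  "SHASUMS256.txt"
)

for asset in "${ASSETS[@]}"; do
  # A plain GET pulls each object into the cache of whichever edge location
  # serves this request; broader coverage would need requests from several
  # regions or tiered caching.
  if curl -sfSL -o /dev/null "https://nodejs.org/dist/${VERSION}/${asset}"; then
    echo "warmed ${asset}"
  else
    echo "failed to warm ${asset}" >&2
  fi
done
```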
> Makes me think that it might be driven more by CI runners always pulling down the latest for a release, so there is a massive swarm at the beginning of the release at the same time as the cache invalidation
I also think the global cache invalidation as part of a trigger is the cause. Too much spiky traffic.
Got this issue today:
After re-running the pipeline twice, the download succeeded on the third try.
FWIW I've captured this from CloudFlare (note that the graph is in my local time, which is +1 ahead of the UTC used in the table).
I am having an issue with browsing/downloading the files at the moment.
I had one or two intermittent "500 Internal Server Error" messages in my browser trying to view the latest 16.x dir, and downloads have seemed slower than usual for the past few minutes for any file I tried (16.x tarball, 17.x tarball, 18.x tarball...).
(Mentioning in case anyone in the Node.js org wants to take a look and gather some more diagnostic data. Any insight you can add is much appreciated; maybe it can improve the reliability of the downloads.)
cc @richardlau @Trott it might actually not be a bad idea to possibly have the downloads also served from Vercel or some other CDN.... (I also remember @mcollina was interested in this topic)
All downloads are served via cloudflare.
@mcollina are we sure about that? These 500 errors are coming from NGINX, so how are they served by Cloudflare? Are they cached on Cloudflare, or what?
Yes, everything is cached on Cloudflare. There might be some misconfiguration along the way, but the traffic served by our nginx should be minimal.
Everything on https://nodejs.org is served via the CloudFlare CDN. One of the issues we have, though, is that we have a very crude CloudFlare CDN cache invalidation process -- basically every time the website is updated (e.g. something merged to main) or a set of builds is promoted from staging to download, we invalidate the entire cache. This is obviously less than efficient -- there was some discussion over this that stalled out in https://github.com/nodejs/build/issues/2375.
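For context, this is roughly what those purges amount to at the Cloudflare API level, alongside the more targeted purge-by-URL variant that the discussion in nodejs/build#2375 was heading towards. Zone ID and token are placeholders, and the actual tooling may wrap this differently.

```bash
# What the current process effectively does: invalidate the whole zone's cache.
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"purge_everything": true}'

# A more targeted alternative: purge only the files that actually changed,
# e.g. the release indexes, leaving the immutable tarballs cached.
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"files": ["https://nodejs.org/dist/index.json", "https://nodejs.org/dist/index.tab"]}'
```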
FWIW these are yesterday's logs of when our tooling requested a CloudFlare CDN purge. The "promote" entries are nightly builds (the four entries are two each for the nightly from nodejs/node `main` and the V8 canary builds). The "build-site" entries are caused by a website rebuild. Each entry here requests invalidation of our entire CloudFlare CDN cache.
```
2023-03-13T06:35:01+00:00, nodejs, promote
2023-03-13T07:05:01+00:00, nodejs, promote
2023-03-13T10:35:01+00:00, nodejs, promote
2023-03-13T11:05:01+00:00, nodejs, promote
2023-03-13T19:05:01+00:00, nodejs, build-site
2023-03-13T19:15:01+00:00, nodejs, build-site
2023-03-13T19:25:01+00:00, nodejs, build-site
2023-03-13T19:45:02+00:00, nodejs, build-site
2023-03-13T20:55:01+00:00, nodejs, build-site
2023-03-13T21:05:01+00:00, nodejs, build-site
2023-03-13T21:25:01+00:00, nodejs, build-site
2023-03-13T21:35:01+00:00, nodejs, build-site
```
We can close this issue. We did an enormous amount of work with #5149 and I'm convinced this issue is largely solved for now. And for the main website, this will also (pretty much) not be a thing anymore.
We are still continuously making small changes here and there to improve things and exploring other ideas to better our infra.
With the latest changes to the current website build infrastructure (and the way we serve the website) (https://github.com/nodejs/build/issues/3366), this issue should now have minimal impact and probably not even surface anymore.
Seeing node-gyp failures while installing some Node.js packages:

```
gyp ERR! stack Error: 500 status code downloading checksum
```

The error reproduces with wget:

```
wget https://nodejs.org/download/release/v8.17.0/SHASUMS256.txt
--2023-07-31 18:38:31--  https://nodejs.org/download/release/v8.17.0/SHASUMS256.txt
Resolving nodejs.org (nodejs.org)... 104.20.23.46, 104.20.22.46
Connecting to nodejs.org (nodejs.org)|104.20.23.46|:443... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2023-07-31 18:38:52 ERROR 500: Internal Server Error.
```
As some have noticed, these issues still tend to happen; this is mostly due to canary and nightly releases of Node.js triggering a complete cache purge of the CDN. (This issue is being tracked.)
Although the website migration should have reduced the pain, it still seems noticeable enough.
Also seeing intermittent failures in our CI builds:
```
14:04:14 npm ERR! gyp http GET https://nodejs.org/download/release/v15.14.0/node-v15.14.0-headers.tar.gz
14:04:14 npm ERR! gyp http 500 https://nodejs.org/download/release/v15.14.0/node-v15.14.0-headers.tar.gz
```
We've identified an issue on Cloudflare that is simply disabling the cache for anything under `/download`; we're waiting for someone with `write` access to make the changes on Cloudflare.
The fix got deployed around an hour ago, meaning that requests towards `nodejs.org/download` should be cached again.
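A quick way to sanity-check that from the outside (the URL is just an example asset): repeated requests should show `cf-cache-status` going from MISS to HIT once the object is back in the CDN cache.

```bash
# HEAD the same asset twice; cf-cache-status should be MISS on the first hit
# from a given edge location and HIT afterwards. Example URL only.
for i in 1 2; do
  curl -sI "https://nodejs.org/download/release/v18.17.1/SHASUMS256.txt" \
    | grep -i '^cf-cache-status'
done
```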
Seems to work nominally now.
One suggestion: since this issue was ongoing for some time and the status page showed no anomaly while it occurred, I would suggest adding a check on a few random binary versions, so that if the issue happens again it will be easily spotted and reflected on the status page.
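A minimal sketch of that kind of check, assuming a couple of arbitrary release URLs and a made-up speed threshold -- none of this is existing monitoring, just an illustration of what the status page could run.

```bash
#!/usr/bin/env bash
# Synthetic download check: fail on non-200 responses or crawling transfer speeds.
set -u

URLS=(
  "https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.xz"
  "https://nodejs.org/dist/v16.20.2/node-v16.20.2-darwin-arm64.tar.gz"
)
MIN_BYTES_PER_SEC=100000   # flag anything slower than ~100 KB/s

status=0
for url in "${URLS[@]}"; do
  # %{http_code} and %{speed_download} are standard curl --write-out variables.
  read -r code speed < <(curl -so /dev/null -w '%{http_code} %{speed_download}\n' "$url")
  speed=${speed%.*}   # older curl reports the speed as a float; keep the integer part
  if [ "$code" != "200" ] || [ "${speed:-0}" -lt "$MIN_BYTES_PER_SEC" ]; then
    echo "UNHEALTHY: $url (status=$code, bytes/s=${speed:-0})"
    status=1
  fi
done
exit $status
```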
I'm getting a 500 on https://nodejs.org/dist/v19.9.0/node-v19.9.0-linux-x64.tar.gz
This file downloaded fine as recently as yesterday: https://github.com/PolicyEngine/policyengine-app/actions/runs/5774738256/job/15652016833#step:3:10
No longer getting a 500 on https://nodejs.org/dist/v19.9.0/node-v19.9.0-linux-x64.tar.gz
Hello! I'm seeing intermittent 500s or very long downloads on https://nodejs.org/dist/v16.20.2/node-v16.20.2.tar.gz.
Problem still exists, downloading is too slow:
```
$ wget https://nodejs.org/download/release/v17.9.0/node-v17.9.0.tar.xz
--2023-08-30 11:44:53--  https://nodejs.org/download/release/v17.9.0/node-v17.9.0.tar.xz
Resolving nodejs.org (nodejs.org)... 104.20.23.46, 104.20.22.46
Connecting to nodejs.org (nodejs.org)|104.20.23.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35778144 (34M) [application/x-xz]
Saving to: ‘node-v17.9.0.tar.xz’

node-v17.9.0.tar.xz   23%[=======>            ]   8.17M  15.5KB/s    in 5m 12s

2023-08-30 11:50:06 (26.8 KB/s) - Connection closed at byte 8568562. Retrying.

--2023-08-30 11:50:07--  (try: 2)  https://nodejs.org/download/release/v17.9.0/node-v17.9.0.tar.xz
Connecting to nodejs.org (nodejs.org)|104.20.23.46|:443... connected.
HTTP request sent, awaiting response...
```
Hi. I'm getting 500s on https://nodejs.org/dist/v16.13.2/node-v16.13.2-win-x64.7z
We're seeing 500s on https://nodejs.org/dist/v18.17.1/node-v18.17.1-darwin-arm64.tar.gz
I actually just filed a parallel ticket for that one @pughpugh before I saw this: https://github.com/nodejs/nodejs.org/issues/5818 . Not sure why Cloudflare is caching 500s
Edited by the Node.js Website Team
⚠️ PLEASE AVOID CREATING DUPLICATED ISSUES
Learn more about this incident at https://nodejs.org/en/blog/announcements/node-js-march-17-incident
tl;dr: The Node.js website team is aware of ongoing issues with intermittent download instability. More details: https://github.com/nodejs/build/issues/1993#issuecomment-1699231827
Original Issue Below
When trying to get files off of nodejs.org/dist/... or nodejs.org/download/..., I get a server error. (Error page served by nginx.)
Full error message page (HTML snippet, click to expand):

```html
500 Internal Server Error
```
Browsing around the dirs, like https://nodejs.org/dist/latest-v16.x/, seems to work. Also, downloading really small files such as https://nodejs.org/dist/latest-v16.x/SHASUMS256.txt seems to work sporadically, whereas downloading tarballs doesn't seem to work.
Given that the outage seems sporadic: Maybe it's a resource exhaustion issue over at the server? Running out of RAM or something?? I don't know.
Edit to add: The error message page seems to be served by Cloudflare, going by the `server: cloudflare` response header when looking in browser dev tools. (Actually, that's probably not what that header means.) So I guess this is a Cloudflare issue?
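One way to dig a little further is to look at the full set of response headers; interpreting them is still guesswork on my part, since Cloudflare rewrites `server` to `cloudflare` and adds a `cf-ray` ID even when it is only proxying an error from the origin, while `cf-cache-status` at least hints at whether the cache was involved. (Example URL below.)

```bash
# Show the status line plus the headers that identify the CDN layer.
curl -sI "https://nodejs.org/dist/v16.14.1/node-v16.14.1-linux-x64.tar.xz" \
  | grep -iE '^(HTTP|server|cf-ray|cf-cache-status)'
```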