richardlau commented 11 months ago

It doesn't look like we have a tracking issue for this, although much has been discussed spread out over several Slack discussion threads. See also https://github.com/nodejs/TSC/issues/1416.

Summary

Our DigitalOcean-hosted droplet for our www server (one of two servers behind a Cloudflare load balancer) has become very unreliable this year and seems to be getting worse. Over the last few weeks it "works" for about a day (or less) and then runs out of file descriptors (error messages are visible in the nginx error logs). Cloudflare then believes the server to be unhealthy (=cannot reach the /traffic-manager endpoint?) and switches over to the other server (called Joyent, but it now resides on Equinix Metal).

Prior to the last few weeks the "unhealthy" state was temporary -- eventually CF would decide the server was healthy again and switch back to the DO server. Now, however, the DO server appears to remain unhealthy in CF until we restart nginx.
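
For anyone who wants to check the endpoint by hand, here is a minimal sketch of a probe. It assumes the health check amounts to a plain HTTP GET that must return a 2xx status within a timeout; the actual Cloudflare monitor settings are not stated in this thread, and the `probe` function and its default timeout are hypothetical.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds.

    Assumption: this only approximates a load balancer health check;
    the real monitor may use different methods, headers, or thresholds.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `probe("https://<server>/traffic-manager")` with the server's address substituted in would return `False` when nginx cannot accept the connection, which is roughly what CF would be seeing.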

richardlau commented 11 months ago

Over the last few weeks, due to the DO server not automatically recovering without intervention, we've been predominantly serving from Equinix Metal (Joyent).

AFAICT the Equinix Metal server is not suffering the same issues as the DO server. While we are occasionally getting load balancer alert emails through to the build alias from CF, it's nowhere near the frequency we were getting them for the DO server.

I don't think we've applied all the nginx tweaks made on the DO server to the Equinix Metal one, so it might be worth looking at the differences there. In particular, I think the connection limits are lower/not set on the Equinix Metal server. Another difference between the two servers is that nightly/v8-canary/release builds are pushed (from our release machines via scp) to the DO server -- an rsyncmirror.service runs on the Equinix Metal server, periodically pulling things from the DO one.

richardlau commented 11 months ago

Oh and while I have no evidence that suggests it would solve/address any of the current issues, we really should plan how we're going to update the server to a later OS (and probably nginx as I assume the one in the apt repository is old). It may be worth considering creating a replacement server from scratch vs a risky upgrade of the existing server(s).

targos commented 11 months ago

> It may be worth considering creating a replacement server from scratch vs a risky upgrade of the existing server(s).

Absolutely agree.

ovflowd commented 11 months ago

> (=cannot reach the /traffic-manager endpoint?) and switches over to the other server)

Which is even worse because that endpoint is a pure HTTP response with no file access, and for not being able to handle that...

ovflowd commented 11 months ago

> It may be worth considering creating a replacement server from scratch vs a risky upgrade of the existing server(s).

Big +1

MoLow commented 11 months ago

add to the build agenda so we can discuss how to proceed on this

targos commented 11 months ago

> > (=cannot reach the /traffic-manager endpoint?) and switches over to the other server)

> Which is even worse because that endpoint is a pure HTTP response with no file access, and for not being able to handle that...

As I understand it, the problem is that nginx reaches the maximum open files limit and cannot accept new connections (including those that come from the CF load balancer health checks).
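
To make that concrete, here is a rough sketch of how one could compare a process's open descriptor count with its soft limit on Linux. It assumes the standard `/proc/<pid>/fd` and `/proc/<pid>/limits` layout; the helper names and the 0.9 threshold are hypothetical, and none of this is taken from the server's actual monitoring.

```python
import os
from pathlib import Path
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Optional[int]:
    """Return the soft 'Max open files' value from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid: int) -> Tuple[int, Optional[int]]:
    """Return (open descriptor count, soft limit) for `pid` (Linux only)."""
    proc = Path("/proc") / str(pid)
    open_count = len(os.listdir(proc / "fd"))
    soft_limit = parse_max_open_files((proc / "limits").read_text())
    return open_count, soft_limit


def is_near_limit(open_count: int, soft_limit: Optional[int],
                  threshold: float = 0.9) -> bool:
    """True when usage is at or above `threshold` of the soft limit."""
    if soft_limit is None:
        return False
    return open_count >= threshold * soft_limit
```

Running `is_near_limit(*fd_usage(pid))` against an nginx worker PID (as root or as the nginx user, since `/proc/<pid>/fd` is otherwise unreadable) would show whether the worker is close to the point where it can no longer accept connections.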

richardlau commented 11 months ago

Just on the point re. creating a new server -- our existing server was created five years ago and is on the basic plan (perhaps that was all that was available then?). Theoretically it has a 2 Gbps maximum network throughput but I don't think I've seen the droplet hit that, even when we raised the open file limit on the droplet.

"CPU-Optimized Droplets with Premium CPUs" have a higher throughput limit of 10 Gbps but will cost more. I don't have access to (nor do I particularly want access to) billing for our DO account, so I don't know what our current droplet is costing vs our credits. I don't know what our credits are on the DO account either, but I do know that we've run out in the past. If we decide to go with a larger droplet then we should loop in the OpenJS Foundation.

targos commented 7 months ago

I forgot to mention somewhere that when we upgraded the DO server (https://github.com/nodejs/build/issues/3564), we also bumped it to Premium Intel CPUs.

nodejs / build

DigitalOcean www server #3424