serverpod / serverpod

Serverpod is a next-generation app and web server, explicitly built for the Flutter and Dart ecosystem.

502 temporary server errors on deploy #2525

Open lukehutch opened 3 months ago

lukehutch commented 3 months ago

When I run the deploy workflow from GitHub, I get 502 server errors for a few minutes when trying to connect to the server from my app.

The deploy workflow is supposed to leave the old server running until the new server has finished starting, so this shouldn't happen. (There's a Terraform setting for this. I can't find it right now, but I previously saw that the option was set.)

This means I have to be very strategic about when I restart my server (based on the time when the fewest users are online), even just to update website content :-( This is not a good situation.

Also, I wonder if this will affect the autoscaler...

Is there a policy change that can be made to minimize or eliminate downtime?
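
For reference, the expected behaviour described above (the old server keeps running until the new one has finished starting) can be sketched as a surge-first rollout. This is only an illustrative simulation in Python, not the Serverpod or Terraform implementation. The function name `surge_first_rollout` and the `max_surge` parameter are hypothetical names chosen for this sketch.

```python
def surge_first_rollout(old_instances, max_surge=1):
    """Simulate a surge-first rollout (illustrative only).

    Returns the number of serving instances after each step. New
    instances are added, and assumed healthy, before the old ones
    they replace are removed.
    """
    serving = list(old_instances)
    remaining_old = list(old_instances)
    counts = [len(serving)]
    created = 0
    while remaining_old:
        batch = remaining_old[:max_surge]
        # Step 1: start the replacements and wait until they are healthy.
        for _ in batch:
            serving.append(f"new-{created}")
            created += 1
        counts.append(len(serving))
        # Step 2: only now remove the old instances in this batch.
        for name in batch:
            serving.remove(name)
            remaining_old.remove(name)
        counts.append(len(serving))
    return counts


print(surge_first_rollout(["old-0", "old-1"]))  # [2, 3, 2, 3, 2]
```

In this model the serving count never drops below the starting size, so a load balancer always has a healthy backend. The 502 responses reported here suggest that the real rollout is, at least for a while, removing or routing to instances that are not ready.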

vlidholt commented 3 months ago

I did not experience this when I set up and tested the Terraform scripts. Are you using GCP? Did you make any modifications to the scripts that could affect this?

lukehutch commented 3 months ago

I am using GCP, and I have neither made changes to the Terraform scripts nor manually configured anything.

lukehutch commented 3 months ago

This is what these errors look like, for the record:

statusCode = 502, ServerpodClientException: Unknown error, data: 
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

They can last 5 or 10 minutes after deployment.
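
As a stopgap on the client side, transient 502 responses like the one above can be retried with exponential backoff. The following is a minimal sketch in Python, not Serverpod client code. `TransientServerError`, `call_with_retry`, and all parameter names are hypothetical and stand in for whatever exception and request function the real client exposes. It only smooths over short gaps and would not hide an outage lasting several minutes.

```python
import time


class TransientServerError(Exception):
    """Hypothetical exception carrying the HTTP status code of a failed call."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_retry(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request(), retrying on 502/503/504 with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Other status codes, and the final failed attempt, are re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientServerError as err:
            retryable = err.status_code in (502, 503, 504)
            if not retryable or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the backoff schedule can be checked in tests without actually waiting.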

lukehutch commented 3 months ago

@vlidholt I really need a zero-downtime way to deploy server updates. Currently, the server is unavailable for 5 to 15 minutes after a deploy from GitHub, and occasionally for much longer. Is there anything that can be tweaked in the Terraform scripts to reduce or eliminate downtime?

lukehutch commented 3 months ago

@vlidholt two hours after the last server deploy, I am left without any VM instances at all!

This is a very serious problem... how do I get to the bottom of what went wrong?