
GCP deployment doc fixes #2272

Open lukehutch opened 2 weeks ago

lukehutch commented 2 weeks ago

I'm going through https://docs.serverpod.dev/deployments/deploying-to-gce-terraform , and I'll catalog any more issues I find here, with the goal of hopefully helping improve the docs...

(1) "On the right-hand side, click on the Deploy to GCP item" should say left-hand side, not right-hand side.

(2) When I tried to deploy the Docker container, I got:

Error: An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/click/click/projectname_server'. No such file or directory

(3) terraform init spells "deploying" wrong:

var.DATABASE_PASSWORD_STAGING The staging database password, you can find it in the config/passwords.yaml file (no need to specify if you aren't deployning a staging environment).

(4) The docs need more info on how to ensure that the production and staging environments have Redis enabled. I believe this means setting enable_redis = true in main.tf, and also adding a Redis password to the production and staging sections of passwords.yaml (back when I set up Serverpod, it only added a Redis password to the development section). This also needs to happen before the passwords are installed as GitHub secrets. Honestly, I'm not even sure how I'm supposed to verify that a Redis instance is running alongside my server (a possible check is sketched after the error below).

In particular, trying to enable it gives this error:

Error: Error creating Instance: googleapi: Error 403: Google Cloud Memorystore for Redis API has not been used in project clicksocial-app before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/redis.googleapis.com/overview?project=clicksocial-app then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
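
For what it's worth, the fix the error message suggests can also be done from the CLI, and there is a way to check that an instance actually exists afterwards. A hedged sketch -- the region is my assumption, so substitute whatever region your main.tf deploys to:

# Enable the Memorystore API that the Terraform module needs
$ gcloud services enable redis.googleapis.com --project=clicksocial-app

# Then verify that a Redis instance was actually created
$ gcloud redis instances list --region=us-central1 --project=clicksocial-app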

(5) Serverpod is still using Postgres 14... it's probably worth upgrading the default to 16, or at least mentioning in the docs how to change the Postgres version used. Also, the server Docker image is built with Dart 3.2.5, and I'm using Dart 3.4 -- I don't know how this dependency is determined by the Docker build process, but I want to change it.
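
A quick way to confirm which version a Cloud SQL instance is actually running (assuming the default serverpod-production-database instance name; yours may differ):

# Prints e.g. POSTGRES_14
$ gcloud sql instances describe serverpod-production-database \
      --format='value(databaseVersion)'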

(6) In addition to verifying the main domain, I had to verify these two (a CLI sketch follows the errors):

╷
│ Error: googleapi: Error 403: Another user owns the domain storage.clicksocial.app or a parent domain. You can either verify domain ownership at https://search.google.com/search-console/welcome?new_domain_name=storage.clicksocial.app or find the current owner and ask that person to create the bucket for you, forbidden
│ 
│   with module.serverpod_production.google_storage_bucket.public[0],
│   on .terraform/modules/serverpod_production/storage.tf line 3, in resource "google_storage_bucket" "public":
│    3: resource "google_storage_bucket" "public" {
│ 
╵

╷
│ Error: googleapi: Error 403: Another user owns the domain private-storage.clicksocial.app or a parent domain. You can either verify domain ownership at https://search.google.com/search-console/welcome?new_domain_name=private-storage.clicksocial.app or find the current owner and ask that person to create the bucket for you, forbidden
│ 
│   with module.serverpod_production.google_storage_bucket.private[0],
│   on .terraform/modules/serverpod_production/storage.tf line 33, in resource "google_storage_bucket" "private":
│   33: resource "google_storage_bucket" "private" {
│ 
╵
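
In case it helps anyone else: I believe ownership can also be checked and (re)initiated from the CLI rather than Search Console -- a hedged sketch:

# List the domains the active account has already verified
$ gcloud domains list-user-verified

# Kick off the verification flow for a bucket subdomain
$ gcloud domains verify storage.clicksocial.app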

(7) There are no instructions on how to retry the deployment if it fails with errors like these. I assume I can just run terraform apply again, but given how much work it does, most of which succeeded, I'm scared to run it again!

lukehutch commented 2 weeks ago

On the last point:

https://community.gruntwork.io/t/cleanup-of-terraform-apply-partial-fails/420

lukehutch commented 2 weeks ago

(8) OK, so to properly re-run Terraform I had to set deletion_protection to false in terraform.tfstate, then run terraform destroy (this should be documented too).
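
For anyone following along: editing terraform.tfstate directly is discouraged (see the maintainer's note further down). I believe the supported route is to flip the flag in the Terraform config instead, roughly:

# In main.tf, set deletion_protection = false on the database module/resource
$ terraform apply      # push the flag change to GCP first
$ terraform destroy    # now the protected resources can be deleted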

(9) Terraform asks for the Postgres password, but not the Redis password -- not sure if this is a problem...

lukehutch commented 2 weeks ago

On (6), I had to manually add my service account email address serverpod-terraform@clicksocial-app.iam.gserviceaccount.com as an owner of the three domains clicksocial.app, storage.clicksocial.app, and private-storage.clicksocial.app.

That resolved the errors in (6), but I had also tried to manually create the storage buckets, and I got this error, so I had to remove the buckets and re-run the script (probably just my bad, but I wanted to add this for completeness):

╷
│ Error: googleapi: Error 409: Your previous request to create the named bucket succeeded and you already own it., conflict
│ 
│   with module.serverpod_production.google_storage_bucket.public[0],
│   on .terraform/modules/serverpod_production/storage.tf line 3, in resource "google_storage_bucket" "public":
│    3: resource "google_storage_bucket" "public" {
│ 
╵
╷
│ Error: googleapi: Error 409: Your previous request to create the named bucket succeeded and you already own it., conflict
│ 
│   with module.serverpod_production.google_storage_bucket.private[0],
│   on .terraform/modules/serverpod_production/storage.tf line 33, in resource "google_storage_bucket" "private":
│   33: resource "google_storage_bucket" "private" {
│ 
╵
lukehutch commented 2 weeks ago

(10) The domain database.clicksocial.app was not created automatically, and I'm not sure how to set that up...

(11) I copied the database server IP address from SQL > Connections > Summary > Public IP address, but psql cannot connect to that IP (I assume PostgreSQL is supposed to be running on the standard port on that server?).

Yes, I added my local machine's public IP address as an authorized network in the previous step, but that didn't work: I had to add the suffix /32 to the IP address to get it to take. (Maybe I should have used /0? Not sure...)
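
The equivalent from the CLI, with the /32 suffix it required -- a sketch with a placeholder IP, and note that --authorized-networks replaces the existing list rather than appending to it:

$ gcloud sql instances patch serverpod-production-database \
      --authorized-networks=203.0.113.7/32

# Then connect using the public IP from SQL > Connections
$ psql -h <public-ip> -U postgres -d postgres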

lukehutch commented 2 weeks ago

(12) The database creation commands in definition.sql do not include a directive to switch to the correct database, as specified in production.yaml under database: name:. This means the tables are created in the wrong database -- postgres, in my case (my database name is click). At a minimum, the instructions need to tell the user to manually create the database with create database DBNAME; and then switch to that database before running definition.sql, either using a database frontend or the psql directive \c DBNAME -- something like the sketch below.
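
Concretely, something like this (click is my database name from production.yaml; host and user are placeholders):

# Create the database named under database: name: in production.yaml
$ psql -h <public-ip> -U postgres -c 'CREATE DATABASE click;'

# Run the table definitions inside that database, not in postgres
$ psql -h <public-ip> -U postgres -d click -f definition.sql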

lukehutch commented 2 weeks ago

On (10), it turns out the domain name database.clicksocial.app was in fact created under Cloud DNS > Zone details, but it does not seem to be propagating to public DNS (Google DNS changes are usually very fast, and I waited a long time for this one, but it hasn't propagated...)

[screenshot]

The only domain name out of this list that is working is clicksocial.app.

I verified in Google Domains that the domain clicksocial.app is pointing to the Google Cloud nameservers, which it was. There were no subdomains listed in Google Domains, e.g. the database. subdomain...

Also storage.clicksocial.app is listed above, but not the private-storage.clicksocial.app that I was asked to verify ownership of before.
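
A hedged way to tell apart "the record never made it into the zone" from "the zone is right but resolution is broken": query the Cloud DNS nameserver directly, then a public resolver (substitute the nameservers your zone actually lists):

# Ask the zone's own nameserver -- bypasses propagation entirely
$ dig +short database.clicksocial.app @ns-cloud-d1.googledomains.com

# Ask a public resolver -- this is what the API server actually sees
$ dig +short database.clicksocial.app @8.8.8.8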

lukehutch commented 2 weeks ago

(13) I can't push changes in the server code to GCE, because it tells me (hours after launching the VM instance) that the instance group is "transforming":

[screenshot]

Clicking on the name tells me the VM is being recreated:

[screenshot]

I see the note at the end that some things may take hours to complete, but it specifically calls out DNS... probably a similar note should be added about VM creation...

However I'm not convinced that this is an instance of just not waiting long enough. SSH'ing into the server through the GCE UI and running docker ps tells me the docker container is up, so the VM and the database are running. But the health check keeps failing:

[screenshot]

The health check is just checking port 8080 is open on the server, and that is failing, so I checked what the firewall rules are:

$ gcloud compute firewall-rules list
NAME                               NETWORK                       DIRECTION  PRIORITY  ALLOW                         DENY  DISABLED
default-allow-icmp                 default                       INGRESS    65534     icmp                                False
default-allow-internal             default                       INGRESS    65534     tcp:0-65535,udp:0-65535,icmp        False
default-allow-rdp                  default                       INGRESS    65534     tcp:3389                            False
default-allow-ssh                  default                       INGRESS    65534     tcp:22                              False
serverpod-production-instance      serverpod-production-network  INGRESS    1000      tcp:8080-8082                       False
serverpod-production-instance-ssh  serverpod-production-network  INGRESS    1000      tcp:22                              False

The port looks like it's open. But if I try to connect to port 8080 from another machine, using the IP address rather than the domain name, I can't connect...

In fact, if I try to connect to any of the domains other than database.clicksocial.app using telnet, ports 8080 and 8081 are not open, but on 80 and 443 I at least get:

$ telnet api.clicksocial.app 80
Trying 34.117.183.214...
Connected to api.clicksocial.app.
Escape character is '^]'.
Connection closed by foreign host.

which means the port is technically open.

This doesn't seem right for the API server, which is purported to run on port 8080.

Also, the API server's VM instance has a different IP address than any of the IP addresses that the domains are mapped to in the zone details. If I SSH to that server, I can confirm that there is a process running on port 8080, and no process bound to port 80:

$ sudo ss -tulpn
Netid           State            Recv-Q           Send-Q                       Local Address:Port                      Peer Address:Port           Process                                              
udp             UNCONN           0                0                                127.0.0.1:323                            0.0.0.0:*               users:(("chronyd",pid=496,fd=5))                    
udp             UNCONN           0                0                               127.0.0.54:53                             0.0.0.0:*               users:(("systemd-resolve",pid=356,fd=16))           
udp             UNCONN           0                0                            127.0.0.53%lo:53                             0.0.0.0:*               users:(("systemd-resolve",pid=356,fd=14))           
udp             UNCONN           0                0                          10.128.0.2%eth0:68                             0.0.0.0:*               users:(("systemd-network",pid=221,fd=18))           
udp             UNCONN           0                0                                    [::1]:323                               [::]:*               users:(("chronyd",pid=496,fd=6))                    
tcp             LISTEN           0                4096                         127.0.0.53%lo:53                             0.0.0.0:*               users:(("systemd-resolve",pid=356,fd=15))           
tcp             LISTEN           0                4096                            127.0.0.54:53                             0.0.0.0:*               users:(("systemd-resolve",pid=356,fd=17))           
tcp             LISTEN           0                128                                0.0.0.0:22                             0.0.0.0:*               users:(("sshd",pid=564,fd=3))                       
tcp             LISTEN           0                4096                             127.0.0.1:35335                          0.0.0.0:*               users:(("containerd",pid=434,fd=8))                 
tcp             LISTEN           0                4096                               0.0.0.0:8082                           0.0.0.0:*               users:(("docker-proxy",pid=909,fd=4))               
tcp             LISTEN           0                4096                               0.0.0.0:8081                           0.0.0.0:*               users:(("docker-proxy",pid=929,fd=4))               
tcp             LISTEN           0                4096                               0.0.0.0:8080                           0.0.0.0:*               users:(("docker-proxy",pid=949,fd=4))               
tcp             LISTEN           0                128                                   [::]:22                                [::]:*               users:(("sshd",pid=564,fd=4))                       
tcp             LISTEN           0                4096                                  [::]:8082                              [::]:*               users:(("docker-proxy",pid=915,fd=4))               
tcp             LISTEN           0                4096                                  [::]:8081                              [::]:*               users:(("docker-proxy",pid=935,fd=4))               
tcp             LISTEN           0                4096                                  [::]:8080                              [::]:*               users:(("docker-proxy",pid=955,fd=4))       

So this is not just a case of the true IP address being behind a load balancer -- in fact, api.clicksocial.app seems to be pointed to the wrong VM instance or something.
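
For anyone debugging the same thing, these are the checks I'd suggest -- hedged, and the backend service name is a placeholder (terraform state list or the listing below will show the real one):

# On the VM: is the container actually answering on 8080?
$ curl -sv http://localhost:8080/

# From a workstation: what does the load balancer think of its backends?
$ gcloud compute backend-services list
$ gcloud compute backend-services get-health <backend-service-name> --global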

lukehutch commented 2 weeks ago

(14) The TTL of 60 seconds (from the screenshot a couple of comments back) is very, very short... shouldn't it be 1800 or 3600 by default?

lukehutch commented 2 weeks ago

OK, after many hours of searching, I think I found the problem. The instance template has health-check traffic disabled (as well as HTTP and HTTPS, although the ports that need to be open are 8080-8082). I presume I now need to figure out where in the tf scripts this is set up, and fix it...

[screenshot]

lukehutch commented 2 weeks ago

Sorry for the play-by-play...

OK, I finally found the firewall rule, and it has the GFE load balancer ranges of 130.211.0.0/22 and 35.191.0.0/16.

[screenshot]
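
To double-check the same thing from the CLI (the rule name is whatever your firewall listing shows):

# sourceRanges should include 130.211.0.0/22 and 35.191.0.0/16,
# and allowed should cover tcp:8080-8082
$ gcloud compute firewall-rules describe <rule-name>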

lukehutch commented 2 weeks ago

(15) It would help a lot if there were some general info in the docs about which parts of the GCP console are useful to look at -- for example, I finally found the Logs Explorer, under Cloud > VM Instances > Logs.

lukehutch commented 2 weeks ago

(16) Issue with terraform destroy: it can't delete everything without manual steps (a possible cleanup sequence is sketched after the errors):

╷
│ Error: Error when reading or editing ManagedZone: googleapi: Error 400: The resource named 'serverpod-production-private' cannot be deleted because it contains one or more 'resource records'., containerNotEmpty
│ 
│ 
╵
╷
│ Error: Error, failed to deleteuser postgres in instance serverpod-production-database: googleapi: Error 400: Invalid request: failed to delete user postgres: . role "postgres" cannot be dropped because some objects depend on it Details: owner of database click
│ 84 objects in database click., invalid
│ 
│ 
╵
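
Presumably the private zone has to be emptied and the owned database dropped before destroy can finish -- a hedged sketch of that cleanup (record name/type are whatever the listing shows):

# Empty the private managed zone that Terraform is trying to delete
$ gcloud dns record-sets list --zone=serverpod-production-private
$ gcloud dns record-sets delete <record-name> \
      --zone=serverpod-production-private --type=A

# Drop the database owned by the postgres role, then retry
$ gcloud sql databases delete click --instance=serverpod-production-database
$ terraform destroy
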
lukehutch commented 2 weeks ago

Based on the logs, the API server process was refusing to start up with the error

Failed to connect to the database. Retrying in 10 seconds. SocketException: Failed host lookup: 'database.clicksocial.app' (OS Error: Name or service not known, errno = -2)

and was retrying every 10 seconds. So maybe that's why port 8080 wasn't reachable, and the health checks were failing...

After many hours waiting for DNS to propagate, it still hasn't. I know DNS changes can take 48 or even 72 hours, but usually with Google DNS, it's a matter of seconds or maybe a minute. So there may be an issue there (or maybe I'm not patient enough).

I went ahead and manually created the A records for each of the following in Google Domains, though. These changes don't seem to want to propagate from Cloud DNS to Google Domains. Any idea why? And if I manually added these, will it screw up auto-updates in the future?

$ gcloud dns record-sets list --zone=clicksocial
NAME                       TYPE  TTL    DATA
clicksocial.app.           NS    21600  ns-cloud-d1.googledomains.com.,ns-cloud-d2.googledomains.com.,ns-cloud-d3.googledomains.com.,ns-cloud-d4.googledomains.com.
clicksocial.app.           SOA   21600  ns-cloud-d1.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
api.clicksocial.app.       A     60     34.117.151.252
app.clicksocial.app.       A     60     34.102.129.203
database.clicksocial.app.  A     60     34.134.239.243
insights.clicksocial.app.  A     60     34.117.183.214
storage.clicksocial.app.   A     60     34.36.229.8

After manually adding these A records, the error changes to:

Failed to connect to the database. Retrying in 10 seconds. SocketException: Connection timed out, host: database.clicksocial.app, port: 5432

So does this mean that the API server is not part of serverpod-production-network, which is the managed network range that the Postgres instance allows connections from?
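
A quick way to check that hypothesis (instance name and zone are placeholders):

# Which VPC is the API server's NIC actually attached to?
# The output should end in .../networks/serverpod-production-network
$ gcloud compute instances describe <instance-name> --zone=<zone> \
      --format='value(networkInterfaces[].network)'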

Isakdl commented 1 week ago

I will try to answer as many of these as I can :)

(2). There were some problems with our old Dockerfile; we have made a new one here. Unfortunately it will not automatically update for you, but you can copy-paste the code into your project.

(3). Fixed today.

(4). Sounds like we are missing redis.googleapis.com in the enable-APIs section in Terraform. Will make a PR for that.

(5). I believe we are fully compatible with PostgreSQL 15 and 16; at least our test suite passes. As for Dart, you can pick any Dart version in the container you want, no problem. You get one from the template when creating a project, and then you are responsible for it.

(7). You should always be able to just run terraform apply again, assuming you have access to the Terraform state (file). Sometimes things bug out anyway, but when that happens it is mostly a bug within a Terraform provider.

(10). I have to double-check, but I believe the database domain is supposed to be for the internal network only.

(12). If you are creating a new database you will have to specify it in the production.yaml file and then run the server in production mode when applying the migration. (This is the intended workflow, anyway.)

(14). Yes, that sounds short; I'll have to verify what we are doing there.

(16). Question: did you manually add things in the GCP console for these? Terraform does not play nice when manual edits are mixed with Terraform setups.

Hmm, it looks like the DNS setup for the database domain might not be correct; I'll have to dive into the terraform scripts a bit to understand what is going on.

It very much looks like we can improve the documentation here; ideally, all you should have to do is run terraform apply and be up and running (after waiting a bit).

lukehutch commented 1 week ago

Thanks, I'll work through your responses soon.

I spent a whole day trying to solve the last couple of issues and couldn't figure it out. Any chance I could grab an hour of your time on a screenshare to work through it, in the name of making sure other users don't hit these issues in future?

Isakdl commented 1 week ago

For the database connection: try to use this domain instead:

database.private-${var.runmode}.${var.top_domain}

e.g.

database.private-production.clicksocial.app

I'll ping you on LinkedIn.

lukehutch commented 1 week ago

> I will try to answer as many of these as I can :)

Thanks! I was hoping the point-by-point might help you identify how to improve the docs...

> (2). There were some problems with our old Dockerfile; we have made a new one here. Unfortunately it will not automatically update for you, but you can copy-paste the code into your project.

Yes, I updated this previously.

> (4). Sounds like we are missing redis.googleapis.com in the enable-APIs section in Terraform. Will make a PR for that.

Thanks.

> (5). I believe we are fully compatible with PostgreSQL 15 and 16; at least our test suite passes. As for Dart, you can pick any Dart version in the container you want, no problem. You get one from the template when creating a project, and then you are responsible for it.

I have been running without issue on Postgres 16 for months now; there shouldn't be any problems. It's probably worth keeping the default version more recent, because Postgres has some big optimization changes in the pipeline.

> (7). You should always be able to just run terraform apply again, assuming you have access to the Terraform state (file). Sometimes things bug out anyway, but when that happens it is mostly a bug within a Terraform provider.

OK, but please comment on this in the docs.

> (10). I have to double-check, but I believe the database domain is supposed to be for the internal network only.
>
> (12). If you are creating a new database you will have to specify it in the production.yaml file and then run the server in production mode when applying the migration. (This is the intended workflow, anyway.)

Yes, that's what I'm doing, but I have to provide the database host there, and I had assumed I was supposed to use the domain that had an actual DNS entry registered. This needs to be better documented. In particular, you really need an overview in the docs of exactly what structure you are setting up: GFE load balancers, frontend servers, the database server (with read-only replicas?), endpoint servers (and what sort of elastic scaling they use), the Redis server... And there needs to be info about which machines are set up on which domains, which IPs are visible, and which endpoints are accessible to which machines -- the entire network topology needs to be spelled out. I tried and could not figure all this out.

> (16). Question: did you manually add things in the GCP console for these? Terraform does not play nice when manual edits are mixed with Terraform setups.
>
> Hmm, it looks like the DNS setup for the database domain might not be correct; I'll have to dive into the terraform scripts a bit to understand what is going on.

Thanks, please reach out if you figure this out, since I'm blocked right now!

lukehutch commented 1 week ago

A couple more...

(17) The Postgres server, unloaded, is consuming 10% CPU constantly...

(18) In addition to the API health monitor not being able to connect to port 8080 -- which eventually causes the API instance to be killed and restarted over and over -- the API server process keeps restarting every 10 seconds because it can't resolve the database server URL. These may actually be the same problem; I'm not sure. (I'm going to try the database domain you suggested.)

lukehutch commented 1 week ago

(19) terraform destroy fails on a couple of tasks...

╷
│ Error: Error when reading or editing ManagedZone: googleapi: Error 400: The resource named 'serverpod-production-private' cannot be deleted because it contains one or more 'resource records'., containerNotEmpty
│ 
│ 
╵
╷
│ Error: Error, failed to deleteuser postgres in instance serverpod-production-database: googleapi: Error 400: Invalid request: failed to delete user postgres: . role "postgres" cannot be dropped because some objects depend on it Details: owner of database click
│ 86 objects in database click., invalid
│ 
│ 
╵
lukehutch commented 1 week ago

(19) The GitHub deploy steps don't seem to update the running API server instance. I tried updating my passwords.yaml file to use the domain you suggested:

database.private-production.clicksocial.app

Then I updated the copy of passwords.yaml in the GitHub secrets and ran the deployment workflow. However, the running server never shut down and restarted, and it is still spitting out these errors every 25 seconds, with the old domain name:

startup-script: Failed to connect to the database. Retrying in 10 seconds. SocketException: Connection timed out, host: database.clicksocial.app, port: 5432

(Also, it's every 25 seconds, despite the error message saying it would retry in 10 seconds)

The old domain is in the tfstate as part of a managed zone -- does this need to be updated to the domain you suggested? I'm not sure if I'm supposed to touch anything in tfstate...

    {
      "module": "module.serverpod_production",
      "mode": "managed",
      "type": "google_dns_record_set",
      "name": "database",
      "provider": "provider[\"registry.terraform.io/hashicorp/google\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "id": "projects/clicksocial-app/managedZones/clicksocial/rrsets/database.clicksocial.app./A",
            "managed_zone": "clicksocial",
            "name": "database.clicksocial.app.",
            "project": "clicksocial-app",
            "routing_policy": [],
            "rrdatas": [
              <redacted>
            ],
            "ttl": 60,
            "type": "A"
          },
          "sensitive_attributes": [],
          "private": <redacted>,
          "dependencies": [
            "module.serverpod_production.google_compute_global_address.private-ip",
            "module.serverpod_production.google_compute_network.serverpod",
            "module.serverpod_production.google_service_networking_connection.private-vpc-connection",
            "module.serverpod_production.google_sql_database_instance.serverpod"
          ]
        }
      ]
    },

I tried terraform destroy then terraform apply after pushing the GitHub deployment, and that did not fix the problem...
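
One thing that might force the new container and secrets onto the VMs, since the group is managed -- hedged, with the group name as a placeholder:

# On the VM: confirm which image the running container came from
$ docker ps --format '{{.Image}}  {{.Status}}'

# Recreate every instance in the group from the current template
$ gcloud compute instance-groups managed list
$ gcloud compute instance-groups managed rolling-action replace \
      <group-name> --zone=<zone>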

It's really a shame that this is all so complex. I'm pulling out my hair about this. I'm wondering if it wouldn't be better to just launch my backend on a VPS service like VPSDime, without all this ridiculously complex Google machinery on top.

Isakdl commented 1 week ago

(17). Without anything connected to the db? Not sure what is going on there, but I would direct you to Google for help on that. All we do is create a database instance, so it is no different from creating one manually in the GUI.

(18). Yes, indeed. The server does not start if the database cannot be reached, so it will not listen for traffic on any port while it is stuck on this step. It would probably be a good idea to listen for health checks even when we cannot connect to the db -- but the health check should still fail, since the instance is not healthy if it is unable to reach the db.

(19.1). In the DNS, did you create entries manually? If you did, you would have to delete them manually too before running terraform destroy. The same logic applies to the database. I believe this is why you get these errors.

(19.2). The automatic deploy is not enabled by default in the deploy script; you can uncomment this job and it should reboot your instance with the new container.

[screenshot]

As for modifying the Terraform state file: you should not do that manually. You can sometimes delete things from the state, but then you should know what you are doing and why, so I recommend against it. The address I gave you should already exist in your state, but configured in the private VPC.

lukehutch commented 1 week ago

> (17). Without anything connected to the db? Not sure what is going on there, but I would direct you to Google for help on that. All we do is create a database instance, so it is no different from creating one manually in the GUI.

Yes, correct, nothing was connected to the db instance, and the CPU usage was a constant 10%. I couldn't figure out what could be causing that.

> (18). Yes, indeed. The server does not start if the database cannot be reached, so it will not listen for traffic on any port while it is stuck on this step. It would probably be a good idea to listen for health checks even when we cannot connect to the db -- but the health check should still fail, since the instance is not healthy if it is unable to reach the db.

Makes sense. But changing the database domain in the API server to the one you suggested didn't fix the problem, because the GitHub deployment step wasn't updating the API server code for some reason (the API server was still reporting that it was trying to connect to the database.clicksocial.app domain, even after the change).

> (19.1). In the DNS, did you create entries manually? If you did, you would have to delete them manually too before running terraform destroy. The same logic applies to the database. I believe this is why you get these errors.

I waited for the DNS entries to propagate from Cloud DNS to Google DNS, and they didn't; so yes, I created the entries manually, using the IP address values that the gcloud dns record-sets list --zone=clicksocial command returned. This allowed the API server to actually look up the domain name, but it still couldn't connect. I deleted these manual entries again, because they didn't fix the problem.

Also, the gcloud dns record-sets list --zone=clicksocial command did not provide an IP address for the domain database.private-production.clicksocial.app.
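
Presumably that's because it lives in the private managed zone rather than the public one -- a guess, but listing that zone should show it:

$ gcloud dns record-sets list --zone=serverpod-production-private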

> (19.2). The automatic deploy is not enabled by default in the deploy script; you can uncomment this job and it should reboot your instance with the new container.

I don't want a new version of the server pushed to production on every GitHub check-in; that would be disruptive to the running servers, since I have one GitHub repo for both the frontend and the backend.

What I was saying is that my manual initiation of the deployment action on GitHub is failing: it showed as successful on GitHub, but the code changes never made it into the API server VM.

lukehutch commented 1 week ago

> (17). Without anything connected to the db? Not sure what is going on there, but I would direct you to Google for help on that. All we do is create a database instance, so it is no different from creating one manually in the GUI.

> Yes, correct, nothing was connected to the db instance, and the CPU usage was a constant 10%. I couldn't figure out what could be causing that.

A screenshot of this...

[screenshot]

This is in a brand new database, after terraform destroy then terraform apply...

There is nothing useful in the logs. But it looks like the activity could have something to do with the cloudsqladmin database...

[screenshot]

Something is fetching 70 rows per second, even though there aren't really any rows in the database:

[screenshot]

However I enabled Query Insights, and it does not show any queries being processed:

[screenshot]

And I'm not sure about this one, but it looks like transaction IDs are being leaked:

[screenshot]