status-im / infra-eth-cluster

Infrastructure for Status-go fleets
https://github.com/status-im/status-go

Google Cloud prod hosts unavailable at times #47

Closed jakubgs closed 2 years ago

jakubgs commented 2 years ago

I've been seeing some flapping and unavailability from Google Cloud hosts on the prod fleet.

image

The errors are timeouts:

Error waiting for mailserver response package=status-go/cmd/node-canary error="timed out waiting for mailserver response"
jakubgs commented 2 years ago

If we look at the average load on mail-01.gc-us-central1-a.eth.prod, we can see gradual growth over the last 30 days:

image

Considering the instance has just one core, that's quite a lot: a load average above 1 on a single-core host means tasks are queuing for CPU or stuck in uninterruptible I/O wait.

jakubgs commented 2 years ago

This is also interesting: we used to have a lot of disk utilization, then it died down for a while, and now we're back:

image

To nearly constant 100% utilization.

jakubgs commented 2 years ago

And mail-02.gc-us-central1-a.eth.prod sees pretty much constant 100% utilization:

image

And the same goes for mail-03.gc-us-central1-a.eth.prod.

jakubgs commented 2 years ago

Digital Ocean hosts don't have this issue, although they have seen some disk utilization spikes as well:

image

jakubgs commented 2 years ago

The bandwidth used on DO hosts isn't huge; it hovers around ~150 MB/s for reads:

image

While on GC the bandwidth of reads on the disk is almost an order of magnitude lower:

image
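
The same numbers can be spot-checked directly on a host with iostat from the sysstat package, which reports per-device read/write MB/s and %util (illustrative command, not part of the original investigation):

 > iostat -xm 5 sdb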

jakubgs commented 2 years ago

By default we use the pd-balanced type for the data volume:

variable "data_vol_type" {
  description = "Type of the extra data volume."
  type        = string
  default     = "pd-balanced"
  /* Use: gcloud compute disk-types list */
}

https://github.com/status-im/infra-tf-google-cloud/blob/85077c7503d9ea1044747808b7d485ddc88b2bc8/variables.tf#L35-L40

Here are the available options:

 > gcloud compute disk-types list | grep us-central1-a
local-ssd    us-central1-a              375GB-375GB
pd-balanced  us-central1-a              10GB-65536GB
pd-extreme   us-central1-a              500GB-65536GB
pd-ssd       us-central1-a              10GB-65536GB
pd-standard  us-central1-a              10GB-65536GB
jakubgs commented 2 years ago

We can see the difference here: https://cloud.google.com/compute/docs/disks#introduction


The following table shows maximum sustained IOPS for zonal persistent disks:

                         Zonal standard PD  Zonal balanced PD  Zonal SSD PD  Zonal extreme PD  Zonal SSD PD multi-writer mode
IOPS per GB              1.5                6                  30            30
Read IOPS per instance   7,500*             80,000*            100,000*      120,000*          100,000*
Write IOPS per instance  15,000*            80,000*            100,000*      120,000*          100,000*

The following table shows maximum sustained throughput for zonal persistent disks:

                                      Zonal standard PD  Zonal balanced PD  Zonal SSD PD  Zonal extreme PD  Zonal SSD PD multi-writer mode
Throughput per GB (MB/s)              0.12               0.28               0.48          0.48
Read throughput per instance (MB/s)   1,200*             1,200*             1,200*        2,200**           1,200**
Write throughput per instance (MB/s)  400**              1,200*             1,200*        2,200**           1,200**
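
To put those numbers in perspective: for a hypothetical 100 GB data volume (the actual size of our data disks isn't quoted here), pd-balanced tops out around 6 × 100 = 600 IOPS and 0.28 × 100 = 28 MB/s, while pd-ssd allows roughly 30 × 100 = 3,000 IOPS and 0.48 × 100 = 48 MB/s, always capped by the per-instance limits above (and GCP further reduces those limits on instances with few vCPUs).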
jakubgs commented 2 years ago

The cost difference isn't huge:

Type                         Price (monthly in USD)
Standard provisioned space   $0.040 per GB
SSD provisioned space        $0.170 per GB
Balanced provisioned space   $0.100 per GB
Extreme provisioned space    $0.125 per GB
Extreme provisioned IOPS     $0.065 per IOPS provisioned

https://cloud.google.com/compute/disks-image-pricing#disk

jakubgs commented 2 years ago

I've made the data volume type parametrizable in the multi-provider role:

I'm going to try using pd-ssd type for the data volume.
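
For reference, a quick way to confirm what type a given data disk actually ends up with (illustrative command; the disk name matches the naming used in the Terraform output below):

 > gcloud compute disks describe data-mail-01-gc-us-central1-a-eth-prod --zone=us-central1-a --format='value(type)'

It prints the disk-type URL, which should end in pd-balanced before the change and pd-ssd after.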

jakubgs commented 2 years ago

Unfortunately changing the volume type causes it to be replaced:

  # module.mail.module.gc-us-central1-a[0].google_compute_disk.host["mail-02.gc-us-central1-a.eth.prod"] must be replaced
-/+ resource "google_compute_disk" "host" {
      ~ creation_timestamp        = "2020-12-07T01:55:58.156-08:00" -> (known after apply)
      ~ id                        = "projects/russia-servers/zones/us-central1-a/disks/data-mail-02-gc-us-central1-a-eth-prod" -> (known after apply)
      ~ label_fingerprint         = "42WmSpB8rSM=" -> (known after apply)
      - labels                    = {} -> null
      ~ last_attach_timestamp     = "2020-12-07T01:56:11.880-08:00" -> (known after apply)
      + last_detach_timestamp     = (known after apply)
        name                      = "data-mail-02-gc-us-central1-a-eth-prod"
      ~ physical_block_size_bytes = 4096 -> (known after apply)
      ~ project                   = "russia-servers" -> (known after apply)
      - provisioned_iops          = 0 -> null
      ~ self_link                 = "https://www.googleapis.com/compute/v1/projects/russia-servers/zones/us-central1-a/disks/data-mail-02-gc-us-central1-a-eth-prod" -> (known after apply)
      + source_image_id           = (known after apply)
      + source_snapshot_id        = (known after apply)
      ~ type                      = "pd-balanced" -> "pd-ssd" # forces replacement
      ~ users                     = [
          - "https://www.googleapis.com/compute/v1/projects/russia-servers/zones/us-central1-a/instances/mail-02-gc-us-central1-a-eth-prod",
        ] -> (known after apply)
        # (2 unchanged attributes hidden)
    }

Which is pretty sad, because on AWS you can change volume types on the fly; it just changes their limits.
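
For comparison, the equivalent change on AWS is a single in-place API call (hypothetical volume ID):

 > aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3

The volume keeps its data and just picks up the new performance limits.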

jakubgs commented 2 years ago

What's even more annoying is that attaching it is part of the google_compute_instance resource:

  # module.mail.module.gc-us-central1-a[0].google_compute_instance.host["mail-01.gc-us-central1-a.eth.prod"] will be updated in-place
  ~ resource "google_compute_instance" "host" {
        id                        = "projects/russia-servers/zones/us-central1-a/instances/mail-01-gc-us-central1-a-eth-prod"
        name                      = "mail-01-gc-us-central1-a-eth-prod"
        tags                      = [
            "eth",
            "mail",
            "mail-eth-prod",
            "prod",
        ]
        # (18 unchanged attributes hidden)

      + attached_disk {
          + device_name = "data-mail-01-gc-us-central1-a-eth-prod"
          + mode        = "READ_WRITE"
          + source      = "https://www.googleapis.com/compute/v1/projects/russia-servers/zones/us-central1-a/disks/data-mail-01-gc-us-central1-a-eth-prod"
        }

        # (3 unchanged blocks hidden)
    }

But targeting it with --target still pulls in the volumes for other instances, so we can't replace them one at a time with Terraform.
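
For the record, the attempt looked roughly like this (sketch; the resource address is the one from the plan output above):

 > terraform apply -target='module.mail.module.gc-us-central1-a[0].google_compute_instance.host["mail-01.gc-us-central1-a.eth.prod"]'

Even with that, the plan still wanted to touch the data volumes of the other hosts.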

jakubgs commented 2 years ago

I managed to attach the drive manually through the Google Cloud Web Console.

image
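
For completeness, the CLI equivalent of that console action would look something like this (illustrative; I actually did it through the Web Console):

 > gcloud compute instances attach-disk mail-01-gc-us-central1-a-eth-prod --disk=data-mail-01-gc-us-central1-a-eth-prod --device-name=data-mail-01-gc-us-central1-a-eth-prod --zone=us-central1-a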

jakubgs commented 2 years ago

Performance results for pd-balanced:

 > sudo hdparm -tT /dev/sdb
/dev/sdb:
 Timing cached reads:   13810 MB in  1.99 seconds = 6936.54 MB/sec
 Timing buffered disk reads: 440 MB in  3.00 seconds = 146.56 MB/sec

And performance for pd-ssd:

 > sudo hdparm -tT /dev/sdb
/dev/sdb:
 Timing cached reads:   14468 MB in  1.99 seconds = 7267.27 MB/sec
 Timing buffered disk reads: 740 MB in  3.00 seconds = 246.55 MB/sec
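
Keep in mind hdparm -t measures sequential buffered reads, while the PostgreSQL workload on these hosts is largely random I/O, so a random-read benchmark would be more representative. A sketch of one (fio invocation is illustrative, not something run as part of this issue):

 > sudo fio --name=randread --filename=/dev/sdb --readonly --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=16 --runtime=30 --time_based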
jakubgs commented 2 years ago

This disk usage looks better, but not dramatically better:

image

image

But load average looks much better:

image

jakubgs commented 2 years ago

Changed the volume type on all 3 prod hosts:

We'll see what that does in the long term.

jakubgs commented 2 years ago

Looks better than before, but still, the periodic spikes to 100% are bad:

mail-01.gc-us-central1-a.eth.prod

image

mail-02.gc-us-central1-a.eth.prod

image

mail-03.gc-us-central1-a.eth.prod

image


For some reason mail-03 is barely used. It's weird.

jakubgs commented 2 years ago

The traffic we see on all hosts is roughly the same, so this I/O is quite weird:

mail-01.gc-us-central1-a.eth.prod

image

mail-02.gc-us-central1-a.eth.prod

image

mail-03.gc-us-central1-a.eth.prod

image

jakubgs commented 2 years ago

We can see that mail-01 is receiving the normal amount of envelopes:

image

jakubgs commented 2 years ago

Oh, but look at this:

mail-01.gc-us-central1-a.eth.prod

image

mail-02.gc-us-central1-a.eth.prod

image

mail-03.gc-us-central1-a.eth.prod

image


This is weird: mail-01 and mail-02 count archived envelopes in the millions, while mail-03 counts them in the thousands. Why?

The shape of the graphs is weird too.

jakubgs commented 2 years ago

The mailserver_archived_envelopes_total metric is updated in two places for PostgreSQL:

func (i *PostgresDB) updateArchivedEnvelopesCount() {
    if count, err := i.envelopesCount(); err != nil {
        log.Warn("db query for envelopes count failed", "err", err)
    } else {
        archivedEnvelopesGauge.WithLabelValues(i.name).Set(float64(count))
    }
}

https://github.com/status-im/status-go/blob/530f3c7a/mailserver/mailserver_db_postgres.go#L83-L89

    archivedEnvelopesGauge.WithLabelValues(i.name).Inc()
    archivedEnvelopeSizeMeter.WithLabelValues(i.name).Observe(
        float64(waku.EnvelopeHeaderLength + env.Size()))

https://github.com/status-im/status-go/blob/530f3c7a/mailserver/mailserver_db_postgres.go#L283

Based on the shape I'd say mail-03's metrics come from calling Inc(), while the other two come from Set(). Set() overwrites the gauge with the full row count from the database, whereas Inc() only bumps the in-memory value by one per archived envelope, so an Inc()-only gauge starts again from zero after every restart.

jakubgs commented 2 years ago

I tried to do some debugging with @Samyoul and we managed to narrow down the issue to a flurry of SQL queries that happen periodically. This query is created by the PostgresDB.BuildIterator method here: https://github.com/status-im/status-go/blob/8191f24ef3bfed320db7efa9e57f5f682fb861b2/mailserver/mailserver_db_postgres.go#L128-L164
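
A server-side way to capture these queries without patching status-go would be to enable slow statement logging in PostgreSQL itself, e.g. (sketch; assumes direct superuser psql access to the mailserver database):

 > psql -U postgres -c 'ALTER SYSTEM SET log_min_duration_statement = 250'
 > psql -U postgres -c 'SELECT pg_reload_conf()'

which makes Postgres log every statement taking longer than 250 ms.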

But while trying to add some more debugging to reveal the SQL query parameters we saw something interesting:

image

The highlighted part is when the develop-based debug Docker image was running, which clearly fixed the spikes issue. This is even more visible if we look at just the read delays:

image

But now, after returning to the 10-month-old deploy-prod image, the node is stuck at startup on:

INFO [07-26|12:22:38.760] Connecting to postgres database

And unresponsive.

jakubgs commented 2 years ago

While working on debugging this I fixed a few things and also added query metrics:

jakubgs commented 2 years ago

I've decided to try a fleet upgrade, and I've tagged a v0.104.0 release of status-go: https://github.com/status-im/status-go/releases/tag/v0.104.0

So far so good.

image

I tested messaging and history between mobile devices and it works fine. Desktop doesn't seem to see messages from mobile, but the same is true on eth.prod, so it's not a regression.

jakubgs commented 2 years ago

The nodes are looking quite healthy after the upgrade:

image

image

But part of that is definitely due to purging the database.

jakubgs commented 2 years ago

What's interesting is that the nodes show these semi-regular queries (every 5-7 minutes) for full history and full filter:

image

The gap in the middle was due to me blocking all DevP2P traffic on the firewall, so the queries are not self-inflicted.
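
For reference, blocking devp2p at the host level could be done with something along these lines (sketch only; it assumes the default devp2p port 30303 and is not the exact rule I used):

 > sudo iptables -I INPUT -p tcp --dport 30303 -j DROP
 > sudo iptables -I INPUT -p udp --dport 30303 -j DROP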

jakubgs commented 2 years ago

I have pushed v0.104.0 to eth.staging. So far so good:

image

jakubgs commented 2 years ago

Nice drop-off in CPU load and memory usage:

image

jakubgs commented 2 years ago

I think the 30-day CPU usage graph shows a really favorable picture:

image

jakubgs commented 2 years ago

Looks like after 21 days of running the v0.104.0 release on eth.staging, the Test Team found no issues. Both Mobile and Desktop were tested.

I will manually deploy v0.104.0 to all *-01 nodes on eth.prod as a start and give it a few days.

jakubgs commented 2 years ago

I've deployed the deploy-staging image to 6 hosts: node-01 and mail-01 from all 3 DCs:

 > a mail -o -a '/docker/statusd-mail/rpc.sh admin_nodeInfo | jq -r .result.name' 
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15

Once I confirm no issues were caused by this I'll do the rest of the fleet next week.

jakubgs commented 2 years ago

The mail-01 hosts look good, especially the Google Cloud ones:

mail-01.gc-us-central1-a.eth.prod

image

image

mail-02.gc-us-central1-a.eth.prod

image

image


The CPU load and disk utilization metrics certainly look promising, although we can still see some spikes showing up.

jakubgs commented 2 years ago

The I/O spikes appear to have no correlation with query metrics:

image

jakubgs commented 2 years ago

Upgraded *-02 nodes:

 > a mail -o -a '/docker/statusd-mail/rpc.sh admin_nodeInfo | jq -r .result.name' 
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
jakubgs commented 2 years ago

As far as I can tell everything is working fine:

image

So I will be completing the upgrade today.

jakubgs commented 2 years ago

All nodes have been upgraded:

 > a whisper -o -a '/docker/statusd-whisper/rpc.sh admin_nodeInfo | jq -r .result.name'
node-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-04.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-05.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-06.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-07.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-08.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-04.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-05.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-06.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-07.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-08.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-04.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-05.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-06.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-07.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-08.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4

 > a mail -o -a '/docker/statusd-mail/rpc.sh admin_nodeInfo | jq -r .result.name'       
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
jakubgs commented 2 years ago

Definitely an improvement in disk usage across the whole fleet:

image

Same for load average:

image

Pretty good.

jakubgs commented 2 years ago

The I/O on Alibaba Cloud hosts definitely looks better:

image

Google Cloud hosts look slightly better, but the spikes still appear:

image

I think we can close this... for now.