If we look at the `mail-01.gc-us-central1-a.eth.prod` average load we can see a gradual growth over the last 30 days:

Considering the instance has just one core, that's quite a lot.
This is also interesting: we used to have a lot of disk utilization, then it died down for a while, and now we're back to nearly constant 100% utilization:
And `mail-02.gc-us-central1-a.eth.prod` sees pretty much constant 100% utilization:
And the same goes for `mail-03.gc-us-central1-a.eth.prod`.
Digital Ocean hosts don't have this issue, although they have seen some disk utilization spikes as well:
The bandwidth used on DO hosts isn't huge; it hovers around ~150 MB/s of reads:

While on GC the bandwidth of reads on the disk is almost an order of magnitude lower:
By default we use the `pd-balanced` type of data volume:
variable "data_vol_type" {
description = "Type of the extra data volume."
type = string
default = "pd-balanced"
/* Use: gcloud compute disk-types list */
}
Here are the available options:
```
> gcloud compute disk-types list | grep us-central1-a
local-ssd    us-central1-a  375GB-375GB
pd-balanced  us-central1-a  10GB-65536GB
pd-extreme   us-central1-a  500GB-65536GB
pd-ssd       us-central1-a  10GB-65536GB
pd-standard  us-central1-a  10GB-65536GB
```
We can see the difference here: https://cloud.google.com/compute/docs/disks#introduction
The following table shows maximum sustained IOPS for zonal persistent disks:
| | Zonal standard PD | Zonal balanced PD | Zonal SSD PD | Zonal extreme PD | Zonal SSD PD multi-writer mode |
|---|---|---|---|---|---|
| IOPS per GB | 1.5 | 6 | 30 | – | 30 |
| Read IOPS per instance | 7,500* | 80,000* | 100,000* | 120,000* | 100,000* |
| Write IOPS per instance | 15,000* | 80,000* | 100,000* | 120,000* | 100,000* |
The following table shows maximum sustained throughput for zonal persistent disks:
| | Zonal standard PD | Zonal balanced PD | Zonal SSD PD | Zonal extreme PD | Zonal SSD PD multi-writer mode |
|---|---|---|---|---|---|
| Throughput per GB (MB/s) | 0.12 | 0.28 | 0.48 | – | 0.48 |
| Read throughput per instance (MB/s) | 1,200* | 1,200* | 1,200* | 2,200** | 1,200** |
| Write throughput per instance (MB/s) | 400** | 1,200* | 1,200* | 2,200** | 1,200** |
The cost difference isn't huge:
Type | Price (monthly in USD) |
---|---|
Standard provisioned space | $0.040 per GB |
SSD provisioned space | $0.170 per GB |
Balanced provisioned space | $0.100 per GB |
Extreme provisioned space | $0.125 per GB |
Extreme provisioned IOPS | $0.065 per IOPS provisioned |
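For a rough sense of scale, taking a hypothetical 500 GB data volume (our actual volume size may differ): `pd-balanced` provisions 6 × 500 = 3,000 IOPS and 0.28 × 500 = 140 MB/s for about $50/month, while `pd-ssd` provisions 30 × 500 = 15,000 IOPS and 0.48 × 500 = 240 MB/s for about $85/month.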
I've made the data volume type parametrizable in the multi-provider role:
I'm going to try using the `pd-ssd` type for the data volume.
Unfortunately, changing the volume type causes it to be replaced:
```
# module.mail.module.gc-us-central1-a[0].google_compute_disk.host["mail-02.gc-us-central1-a.eth.prod"] must be replaced
-/+ resource "google_compute_disk" "host" {
~ creation_timestamp = "2020-12-07T01:55:58.156-08:00" -> (known after apply)
~ id = "projects/russia-servers/zones/us-central1-a/disks/data-mail-02-gc-us-central1-a-eth-prod" -> (known after apply)
~ label_fingerprint = "42WmSpB8rSM=" -> (known after apply)
- labels = {} -> null
~ last_attach_timestamp = "2020-12-07T01:56:11.880-08:00" -> (known after apply)
+ last_detach_timestamp = (known after apply)
name = "data-mail-02-gc-us-central1-a-eth-prod"
~ physical_block_size_bytes = 4096 -> (known after apply)
~ project = "russia-servers" -> (known after apply)
- provisioned_iops = 0 -> null
~ self_link = "https://www.googleapis.com/compute/v1/projects/russia-servers/zones/us-central1-a/disks/data-mail-02-gc-us-central1-a-eth-prod" -> (known after apply)
+ source_image_id = (known after apply)
+ source_snapshot_id = (known after apply)
~ type = "pd-balanced" -> "pd-ssd" # forces replacement
~ users = [
- "https://www.googleapis.com/compute/v1/projects/russia-servers/zones/us-central1-a/instances/mail-02-gc-us-central1-a-eth-prod",
] -> (known after apply)
# (2 unchanged attributes hidden)
}
```
Which is pretty sad, because on AWS you can change volume types on the fly; it just changes their limits.
What's even more annoying is that attaching it is part of the `google_compute_instance` resource:
```
# module.mail.module.gc-us-central1-a[0].google_compute_instance.host["mail-01.gc-us-central1-a.eth.prod"] will be updated in-place
~ resource "google_compute_instance" "host" {
id = "projects/russia-servers/zones/us-central1-a/instances/mail-01-gc-us-central1-a-eth-prod"
name = "mail-01-gc-us-central1-a-eth-prod"
tags = [
"eth",
"mail",
"mail-eth-prod",
"prod",
]
# (18 unchanged attributes hidden)
+ attached_disk {
+ device_name = "data-mail-01-gc-us-central1-a-eth-prod"
+ mode = "READ_WRITE"
+ source = "https://www.googleapis.com/compute/v1/projects/russia-servers/zones/us-central1-a/disks/data-mail-01-gc-us-central1-a-eth-prod"
}
# (3 unchanged blocks hidden)
}
```
But targeting it with `--target` still pulls in the volumes for other instances, so we can't replace them one at a time with Terraform.
I managed to attach the drive manually through the Google Cloud Web Console.
Performance results for `pd-balanced`:
```
> sudo hdparm -tT /dev/sdb
/dev/sdb:
Timing cached reads: 13810 MB in 1.99 seconds = 6936.54 MB/sec
Timing buffered disk reads: 440 MB in 3.00 seconds = 146.56 MB/sec
```
And performance for `pd-ssd`:
```
> sudo hdparm -tT /dev/sdb
/dev/sdb:
Timing cached reads: 14468 MB in 1.99 seconds = 7267.27 MB/sec
Timing buffered disk reads: 740 MB in 3.00 seconds = 246.55 MB/sec
```
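That's roughly a 68% improvement in sequential buffered reads (146.56 → 246.55 MB/s), which lines up with the 0.28 vs 0.48 MB/s per GB throughput figures from the table above.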
This disk usage looks better, but not by much:
But load average looks much better:
Changed the volume type on all 3 prod hosts:
We'll see what that does in the long term.
Looks better than before, but still, the periodic spikes to 100% are bad:
mail-01.gc-us-central1-a.eth.prod
mail-02.gc-us-central1-a.eth.prod
mail-03.gc-us-central1-a.eth.prod
For some reason `mail-03` is barely used. It's weird.
The traffic we see on all hosts is roughly the same, so this I/O is quite weird:
mail-01.gc-us-central1-a.eth.prod
mail-02.gc-us-central1-a.eth.prod
mail-03.gc-us-central1-a.eth.prod
We can see that `mail-01` is receiving the normal amount of envelopes:
Oh, but look at this:
mail-01.gc-us-central1-a.eth.prod
mail-02.gc-us-central1-a.eth.prod
mail-03.gc-us-central1-a.eth.prod
This is weird. `mail-01` and `mail-02` count archived envelopes in the millions, while `mail-03` counts them in the thousands. Why?
The shape of the graphs is weird too.
The `mailserver_archived_envelopes_total` metric is updated in two places for PostgreSQL:
```go
func (i *PostgresDB) updateArchivedEnvelopesCount() {
	if count, err := i.envelopesCount(); err != nil {
		log.Warn("db query for envelopes count failed", "err", err)
	} else {
		archivedEnvelopesGauge.WithLabelValues(i.name).Set(float64(count))
	}
}
```
https://github.com/status-im/status-go/blob/530f3c7a/mailserver/mailserver_db_postgres.go#L83-L89
```go
archivedEnvelopesGauge.WithLabelValues(i.name).Inc()
archivedEnvelopeSizeMeter.WithLabelValues(i.name).Observe(
	float64(waku.EnvelopeHeaderLength + env.Size()))
```
https://github.com/status-im/status-go/blob/530f3c7a/mailserver/mailserver_db_postgres.go#L283
Based on the shape I'd say the `mail-03` metrics come from calling `Inc()`, while for the other ones they come from `Set()`.
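For illustration only, here is a minimal sketch of the two update patterns, assuming the standard Prometheus Go client (the names below are made up, not the status-go code). A gauge that is periodically `Set()` from a `COUNT(*)` query reflects the full table size, while a gauge that is only `Inc()`'d per archived envelope starts from zero on every restart, which would also explain why `mail-03` reports thousands instead of millions:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical gauge mirroring mailserver_archived_envelopes_total.
var archivedEnvelopes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "archived_envelopes_total"},
	[]string{"db"},
)

// Pattern 1: periodically Set() the gauge from a COUNT(*) query.
// The value tracks the real table size, only moves in steps, and is
// restored after a restart because it is re-read from the database.
func refreshFromCount(name string, count func() (int64, error)) {
	for range time.Tick(time.Minute) {
		if c, err := count(); err == nil {
			archivedEnvelopes.WithLabelValues(name).Set(float64(c))
		}
	}
}

// Pattern 2: Inc() the gauge on every archived envelope.
// The value grows smoothly with traffic, but starts from zero after
// every process restart, so it drifts away from the real row count.
func onEnvelopeArchived(name string) {
	archivedEnvelopes.WithLabelValues(name).Inc()
}

func main() {
	prometheus.MustRegister(archivedEnvelopes)
	go refreshFromCount("postgres", func() (int64, error) { return 42, nil })
	onEnvelopeArchived("postgres")
	time.Sleep(2 * time.Minute)
}
```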
I tried to do some debugging with @Samyoul and we managed to narrow down the issue to a flurry of SQL queries that happen periodically. This query is created by the `PostgresDB.BuildIterator` method here:
https://github.com/status-im/status-go/blob/8191f24ef3bfed320db7efa9e57f5f682fb861b2/mailserver/mailserver_db_postgres.go#L128-L164
But while trying to add some more debugging to reveal the SQL query parameters we saw something interesting:
The highlighted part is the `develop`-based debug Docker image running, which clearly fixed the spikes issue. This is even more visible if we look at just the read delays:
But now, after returning to the 10-month-old `deploy-prod` image, the node is stuck on startup at:
```
INFO [07-26|12:22:38.760] Connecting to postgres database
```
And it's unresponsive.
While working on debugging this I fixed a few things and also added query metrics:
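For context, here is a minimal sketch of what query-duration metrics of this kind could look like with the Prometheus Go client; the metric and helper names are hypothetical and not the actual status-go changes:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram for mailserver query durations; the real
// metric added to status-go may be named and labelled differently.
var queryDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "mailserver_query_duration_seconds",
		Help:    "Duration of mailserver DB queries.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"db", "query"},
)

// timeQuery wraps a DB call and records how long it took.
func timeQuery(db, query string, fn func() error) error {
	start := time.Now()
	err := fn()
	queryDuration.WithLabelValues(db, query).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	prometheus.MustRegister(queryDuration)
	_ = timeQuery("postgres", "BuildIterator", func() error {
		time.Sleep(50 * time.Millisecond) // stand-in for the real SQL query
		return nil
	})
}
```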
I've decided to try a fleet upgrade, and I've tagged a `v0.104.0` release of `status-go`:
https://github.com/status-im/status-go/releases/tag/v0.104.0
So far so good.
I tested messaging and history between mobile devices and it works fine.
Desktop doesn't seem to see messages from mobile, but the same is true on `eth.prod`, so it's not a regression.
The nodes are looking quite healthy after the upgrade:
But part of that is definitely due to purging the database.
What's interesting is that the nodes show these semi-regular - every 5-7 minutes - queries for full history and full filter:
The gap in the middle was due to me blocking all DevP2P traffic on the firewall, so the queries are not self-inflicted.
I have pushed `v0.104.0` to `eth.staging`. So far so good:
Nice drop-off on CPU load and memory usage:
I think the 30-day CPU usage graphs show a really favorable picture:
Looks like after 21 days of using the `v0.104.0` release on `eth.staging`, no issues were found by the Test Team.
Both Mobile and Desktop were tested and no problems were found.
I will manually deploy `v0.104.0` to all `01` nodes on `eth.prod` as a start, and give it a few days.
I've deployed the `deploy-staging` image to 6 hosts: `node-01` and `mail-01` from all 3 DCs:
```
> a mail -o -a '/docker/statusd-mail/rpc.sh admin_nodeInfo | jq -r .result.name'
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
```
Once I confirm no issues were caused by this I'll do the rest of the fleet next week.
The `mail-01` hosts look good, especially the Google Cloud ones:
mail-01.gc-us-central1-a.eth.prod
mail-02.gc-us-central1-a.eth.prod
The CPU load and disk utilization metrics certainly look promising, although we can still see some spikes showing up.
The IO spikes appear to have no correlation with query metrics:
Upgraded the `*-02` nodes:
```
> a mail -o -a '/docker/statusd-mail/rpc.sh admin_nodeInfo | jq -r .result.name'
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.89.2/linux-amd64/go1.13.15
```
As far as I can tell everything is working fine:
So I will be completing the upgrade today.
All nodes have been upgraded:
```
> a whisper -o -a '/docker/statusd-whisper/rpc.sh admin_nodeInfo | jq -r .result.name'
node-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-04.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-05.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-06.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-07.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-08.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-04.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-05.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-06.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-07.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-08.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-04.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-05.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-06.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-07.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
node-08.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
> a mail -o -a '/docker/statusd-mail/rpc.sh admin_nodeInfo | jq -r .result.name'
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) Statusd/v0.104.0/linux-amd64/go1.18.4
```
Definitely an improvement in terms of disk usage for the whole fleet:
Same for load average:
Pretty good.
The I/O on Alibaba Cloud hosts definitely looks better:
Google Cloud hosts look slightly better, but the spikes still appear:
I think we can close this... for now.
I've been seeing some flapping and unavailability from Google Cloud hosts on the prod fleet.
The errors are timeouts: