Some improvements to our PostgreSQL HA role:

- infra-role-postgres-ha#8fcbe906 - defaults: fix variable name for DB UID
- infra-role-postgres-ha#bbdb294e - databases: use initdb folder to create DBs and users
- infra-role-postgres-ha#a38d2c70 - config: add admin.sh wrapper script
- infra-role-postgres-ha#35568e2e - hba: wait for DB port to be available

More fixes for the Postgres role:

- infra-role-postgres-ha#716732b1 - hba: use docker_compose task to restart
- infra-role-postgres-ha#8381b623 - backup.sh: move to scripts dir
- infra-role-postgres-ha#a96ac903 - admin.sh: quote extra arguments to fix use of -e
- infra-role-postgres-ha#97f525e9 - consul: add health.sh to check replica status

Even more Postgres improvements:

- infra-role-postgres-ha#1ee52ded - meta: add missing infra roles to dependencies
- infra-role-postgres-ha#1ff92d5f - hba: make optional when there's no replica
- infra-role-postgres-ha#6b296be1 - consul: add handling DB without replica to health.sh

Here are the infra-status changes:

- infra-status#19b15072 - nodes.tf: rename to node.tf, and module to node
- infra-status#482bb74f - node.tf: rename group to status-node
- infra-status#8acf2ace - add single db-01.do-ams3.status.test host
- infra-status#b277afa7 - requirements: use full names of infra roles
- infra-status#f7c4b8a7 - outputs.tf: add to print list of hosts after TF run
- infra-status#10437a6c - status-db: add PostgreSQL DB configuration

The db-01.do-ams3.status.test host is up and running. The status.prod fleet won't be touched for a while.
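For context, the admin.sh wrapper added in the commits above works roughly like this (a minimal sketch, not the role's exact script; the container name postgres-ha is an assumption):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the admin.sh wrapper: run psql as the superuser
# inside the PostgreSQL container. The container name is an assumption.
set -e
# Quoting "$@" matters here; the later "quote extra arguments" fix addresses
# exactly this, so options like -e or -c 'SELECT 1;' survive word-splitting.
exec docker exec -it postgres-ha psql --user postgres "$@"
```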
I also had to enable the -d:postgres flag for the nwaku build:

- nwaku#7a376f59 - chore: Adding -d:postgres flag when creating a Docker image for release and PRs (#2076)

It was enabled by default but the job didn't pick it up.
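For reference, this is roughly how the flag is passed when building locally (a hedged example; the NIMFLAGS variable follows nwaku's Makefile conventions, so verify against the repo):

```bash
# Hedged example: compile the store node with Postgres support enabled.
make wakunode2 NIMFLAGS="-d:postgres"
```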
I tried configuring the node on node-01.do-ams3.status.test
to use the DB but it's failing with:
ERR 2023-11-23 12:37:19.449+00:00 4/7 Mounting protocols failed topics="wakunode main" tid=1 file=wakunode2.nim:89
error="failed to mount waku archive protocol: error in mountArchive: failed execution of retention policy: failed to get Page size: "
No idea what that's about; looks like a bug in master. I will try to build the v0.22.0 release.
I've built v0.22.0 but it has the same issue. Building v0.21.0 did resolve it though. Opened an issue:
The shards.test fleet is running aeb77a3e, which is current master, but doesn't have this problem:
admin@store-01.do-ams3.shards.test:~ % d
CONTAINER ID NAMES IMAGE CREATED STATUS
920badb37699 nim-waku-store wakuorg/nwaku:deploy-shards-test 20 hours ago Up 20 hours (healthy)
admin@store-01.do-ams3.shards.test:~ % d inspect wakuorg/nwaku:deploy-shards-test | grep commit
"commit": "aeb77a3e",
Turns out:
We have a bug related to the "size" retention policy that doesn't work if the database doesn't exist. In order to allow the node to start, let's change to either the "time" or "capacity" retention policies:
capacity:20000000
Indeed, I can confirm that setting a retention policy not based on size does fix the issue at startup:
admin@node-01.do-ams3.status.test:~ % d
CONTAINER ID NAMES IMAGE CREATED STATUS
3d6868ff0701 nim-waku wakuorg/nwaku:deploy-status-test 2 minutes ago Up 2 minutes (healthy)
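For reference, the workaround boils down to swapping the node's retention flag from the size-based policy to a capacity-based one (a hedged sketch; the flag names follow the nwaku CLI, and the value comes from the quote above):

```bash
# Hedged sketch: the relevant wakunode2 arguments after the workaround.
wakunode2 \
  --store-message-db-url="postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku" \
  --store-message-retention-policy="capacity:20000000"
```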
infra-status#e6646aaa
- status-node: drop use of broken size retention policyNotes from call we had today with John, Hanno, Andrea, and Ivan:
status.test
and if it works we'll do the same for status.prod
.shards.test
fleet setup with more than one waku node per DB in DC solves this problem.
shards.dogfood
fleet to eliminate variables, as a stable reference point.
All nodes in the status.test fleet are currently using the single PostgreSQL database:
> a status-node -a 'grep db-url /docker/nim-waku/docker-compose.yml | sed "s/:.*@/:PASSWORD@/"'
node-01.do-ams3.status.test | CHANGED | rc=0 >>
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.gc-us-central1-a.status.test | CHANGED | rc=0 >>
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.ac-cn-hongkong-c.status.test | CHANGED | rc=0 >>
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
@Ivansete-status you can start your research into the performance of this layout. I will also be monitoring metrics for this fleet.
I changed the default pubsub topic used by the status.test fleet so that we don't spam the status.prod fleet:

from: /waku/2/default-waku/proto
to: /waku/2/default-waku-test/proto
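A quick way to verify the change across the fleet, mirroring the ad-hoc ansible check used earlier (assuming the topic is set in the compose file):

```bash
# Hedged check: confirm the new pubsub topic on every status-node host.
a status-node -a 'grep pubsub-topic /docker/nim-waku/docker-compose.yml'
```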
We need to create a v0.22.1 version that at least contains the following commit: https://github.com/waku-org/nwaku/commit/aeb77a3e7595b5ca6f5db15c78503c1c0e01ee5b

With the current version used by status.test, rows can't be added to the database due to:
ERR 2023-11-28 18:36:44.089+01:00 failed to insert message topics="waku archive" tid=197497 file=archive.nim:111 err="failed to insert into database: ERROR: null value in column \"messagehash\" of relation \"messages\" violates not-null constraint
You killed the DB storage dude:
It's full:
admin@db-01.do-ams3.status.test:~ % df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/sda 40G 38G 0 100% /data
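To confirm what is eating the volume, something like this does the trick:

```bash
# Find the biggest consumers on the full volume.
sudo du -xsh /data/* 2>/dev/null | sort -rh | head -5
```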
The cause is obvious:
Weekly Update

- Migrated the status.test fleet to PostgreSQL with one single database. The report with results is: https://www.notion.so/Migrate-status-test-to-PostgreSQL-e108b89fd9d34de2be13d10f42c92185

The database host is full again:
admin@db-01.do-ams3.status.test:~ % df -h /docker
Filesystem Size Used Avail Use% Mounted on
/dev/sda 40G 38G 0 100% /docker
And all nodes are unavailable, due to inability to access the DB:
ERR 2023-12-04 10:52:49.012+00:00 4/7 Mounting protocols failed
topics="wakunode main" tid=1 file=wakunode2.nim:89
error="failed to setup archive driver: error creating table: error in createMessageTable: connRes.isErr in query: failed to stablish a new connection: exception opening new connection: connection to server at \"db-01.do-ams3.status.test.wg\" (10.10.0.199), port 5432 failed: FATAL: the database system is not yet accepting connections\nDETAIL: Consistent recovery state has not been yet reached.\n"
Gone in ~9 hours on the 2nd:
@jakubgs - I'll review the retention policy to address the database size issues.
On the other hand, the study concludes that latency doesn't play a big role in the performance issues, because the three nodes (AMS, HK, US) behave equally regarding Store timing and inserts (more details at https://www.notion.so/Migrate-status-test-to-PostgreSQL-e108b89fd9d34de2be13d10f42c92185).
The major bottleneck is the current db-01.do-ams3.status.test configuration (DigitalOcean s-1vcpu-2gb).
Considering we didn't see much swapping, I don't think memory is the main bottleneck.

I have increased the host size to s-2vcpu-2gb, adding one vCPU but not increasing memory size:

infra-status#38540fd1 - status.test: upgrade DB instance to s-2vcpu-2gb

Let's see what that does.
The performance improved after increasing the database machine to s-2vcpu-2gb.
Looks to me like the latency of queries in non-Amsterdam DCs is very bad, if I'm reading this correctly.
Most queries are in 2.5s and 5s buckets. That's not really acceptable for an interactive application.
I don't think this benchmark is useful, since it appears you are just overwhelming the host with the most traffic you can generate:

All you're testing this way is "if I throw everything I can at the host, will it work?" No, it won't; that's not a useful learning.

We want to see what the performance looks like with low traffic, normal traffic, and then traffic on the edge. Just generating the maximum amount possible and then saying "look, it was overwhelmed" is not useful. A proper benchmark should show how a system behaves under different levels of stress: low, normal, high (or more).
What you're doing is like testing a home water installation by pumping water into it at 50 Bar and then being surprised it bursts.
Also, remember that the performance of queries isn't only dependent on the DB instance performance, but also on waku nodes.
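To illustrate, a staged benchmark would ramp through levels instead of firing maximum load at once (a rough sketch; run_store_query is a hypothetical stand-in for the actual Store client):

```bash
# Hedged sketch: hold low / normal / high request rates for 5 minutes each,
# rather than generating the maximum load possible from the start.
for rate in 1 10 50; do   # requests per second per stage
  echo "stage: ${rate} req/s"
  end=$(( SECONDS + 300 ))
  while (( SECONDS < end )); do
    run_store_query &     # hypothetical Store client invocation
    sleep "$(awk "BEGIN { print 1/${rate} }")"
  done
  wait
done
```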
Thanks so much for the brilliant insights! I'm performing a more realistic test now, considering the current use of the status.prod fleet.

With that, I'll configure the waku-store-request-generator to have 10 users performing 1 Store req/second, which turns into ~600 Store req/minute.

On the other hand, and considering the above image, I'll configure a nwaku node to publish one message every 0.06 seconds, aiming to generate traffic of ~1000 msgs/minute with an average size.
With the configuration described in my previous comment, it is evident that the Store / Insert times are affected by the distance between the datacenters and the database. Notice that the performance is better with s-2vcpu-2gb, but the network distance has a big impact.
The TERM signals you've been seeing are leftovers of a restart service that was not cleaned up:
~/work/infra-role-nim-waku master
> g sl | grep restart
a93a85b 2023-02-15 14:21:20 +0100 Jakub Sokołowski > drop temporary container restarts
686ee11 2022-10-11 09:26:12 +0200 Jakub Sokołowski > docker: restart every 6 hours instead
b22e070 2022-10-06 12:20:07 +0200 Jakub Sokołowski > docker: add temporary restart every 12 hours
admin@node-01.do-ams3.status.test:~ % s cat restart-nim-waku.timer
# /etc/systemd/system/restart-nim-waku.timer
[Unit]
After=multi-user.target
[Timer]
Persistent=yes
OnCalendar=00/06:00
RandomizedDelaySec=11600
[Install]
WantedBy=timers.target
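Removal on each host amounts to roughly the following (a hedged sketch; a matching restart-nim-waku.service unit is assumed to exist alongside the timer):

```bash
# Stop and clean up the leftover restart timer and its service unit.
sudo systemctl disable --now restart-nim-waku.timer
sudo rm -f /etc/systemd/system/restart-nim-waku.{timer,service}
sudo systemctl daemon-reload
```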
I have now removed it from all status.test hosts.

I also removed it from node-01.gc-us-central1-a.status.prod, to see if we can remove it from status.prod without issues.
Originally this was a temporary "solution" to socket leaks:

infra-role-nim-waku#b22e0705 - docker: add temporary restart every 12 hours

Bumped the DB node to s-6vcpu-16gb, but this is just a temporary change to do benchmarking:

infra-status#4d77e063 - status.test: upgrade DB instance to s-6vcpu-16gb

The DigitalOcean prices are not practical long-term, and if we need big DB hosts we're going to have to start using physical hosts to save money. But I'm happy to leave this host as-is till new year's and then downsize it, or replace it with a physical one.
A new test has been launched after re-configuring the database with the following settings (generated with https://pgtune.leopard.in.ua/):
-- DB Version: 15
-- OS Type: linux
-- DB Type: web
-- Total Memory (RAM): 16 GB
-- CPUs num: 6
-- Data Storage: ssd
ALTER SYSTEM SET max_connections = '200';
ALTER SYSTEM SET shared_buffers = '4GB';
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET checkpoint_completion_target = '0.9';
ALTER SYSTEM SET wal_buffers = '16MB';
ALTER SYSTEM SET default_statistics_target = '100';
ALTER SYSTEM SET random_page_cost = '1.1';
ALTER SYSTEM SET effective_io_concurrency = '200';
ALTER SYSTEM SET work_mem = '6990kB';
ALTER SYSTEM SET huge_pages = 'off';
ALTER SYSTEM SET min_wal_size = '1GB';
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET max_worker_processes = '6';
ALTER SYSTEM SET max_parallel_workers_per_gather = '3';
ALTER SYSTEM SET max_parallel_workers = '6';
ALTER SYSTEM SET max_parallel_maintenance_workers = '3';
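Worth noting: ALTER SYSTEM only writes these values to postgresql.auto.conf, and several of them need more than a reload to take effect (a hedged sketch; the container name is an assumption):

```bash
# Reloadable settings (work_mem, random_page_cost, etc.) apply on reload:
docker exec -i postgres-ha psql -U postgres -c 'SELECT pg_reload_conf();'
# shared_buffers, max_connections, wal_buffers, max_worker_processes and
# huge_pages require a full server restart:
docker restart postgres-ha
```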
After upgrading to a more powerful database machine, we can still see low Store / Archive performance on both the US and HK nodes:
(cc @jakubgs)
Yes, I think now we can clearly conclude that a single DB used from different data centers is not a viable solution.

Our options are:

- Check whether the shards.test fleet setup with multiple store nodes per DB doesn't solve this issue.

Thanks for researching this. I am going on holiday break till the 2nd of January, so I will not be working on this for a while. If help is needed please ask Anton or Alexis, but please don't make major changes to fleets without my input.
Since the benchmarks revealed that the single-DB cross-DC setup does not work, I'm upgrading the layout to use multiple DBs:

infra-status#02b75ec6 - switch to multi-DB layout with one DB per DC

The new setup has been deployed and works:

Will do status.prod tomorrow.
I have switched status.prod to use multiple hosts with PostgreSQL:

infra-status#fe58d7f6 - workspaces.tf: fix instance type for GC DB hosts
infra-status#ff28064f - status: add DB hosts for prod fleet

There's an issue with DB initialization. Will debug tomorrow.
Found the issue: lack of stage in the DB hostname:

infra-status#e8b5ff25 - status-node: fix missing stage for DB hostname

Had to also build a fresh deploy-status-prod-trace image:

https://ci.infra.status.im/job/nim-waku/job/deploy-status-prod-trace/3/

But for some reason it was just restarting without an error, so I manually switched it to a normal one.

The fleet is back up and working:
The SQLite database is unmaintainable due to being single-threaded, blocking queries when pruning, and vacuuming requiring twice the size of the DB to complete. We need to introduce a PostgreSQL DB to this fleet gradually, to eventually reflect the same setup as the infra-shards fleet:

- Deploy a single PostgreSQL database for the status.test fleet to measure performance and latency.
- Do the same for status.prod if performance is satisfactory.
- Migrate the status.test fleet.
- Migrate the status.prod fleet to multiple Postgres instances with database sync.