status-im / infra-status-legacy

Infrastructure for old Status fleet
https://github.com/status-im/nim-waku

Switch Status fleet to use PostgreSQL DB #37

Closed: jakubgs closed this issue 8 months ago

jakubgs commented 10 months ago

The SQLite database is unmaintainable: it is single-threaded, queries block while pruning runs, and vacuuming requires twice the size of the DB in free disk space to complete.

We need to introduce a PostgreSQL DB to this fleet gradually, to eventually mirror the setup of the infra-shards fleet.

  1. Introduce a single database to status.test fleet to measure performance and latency.
  2. Introduce a single database to status.prod if performance is satisfactory.
  3. Develop a way to synchronize multiple Postgres instances across data centers on status.test fleet.
  4. Switch status.prod fleet to multiple Postgres instances with database sync.
  5. Eventually switch to synchronization managed by Waku nodes themselves.
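
In practice, step 1 boils down to changing the store backend URL in each node's nim-waku docker-compose file from the local SQLite file to a PostgreSQL connection string. A rough sketch of the difference (the exact SQLite URL form is an assumption; the Postgres URL follows the fleet's WireGuard naming scheme):

# before: local SQLite file (exact URL form is an assumption)
--store-message-db-url=sqlite://store.sqlite3
# after: shared PostgreSQL instance reachable over the WireGuard network
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku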
jakubgs commented 10 months ago

Some improvements to our PostgreSQL HA role:

jakubgs commented 10 months ago

More fixes for the Postgres role:

jakubgs commented 10 months ago

Even more Postgres improvements:

Here's infra-status changes:

The db-01.do-ams3.status.test host is up and running. The status.prod fleet won't be touched for a while.

jakubgs commented 10 months ago

I also had to enable the -d:postgres flag for the Postgres build.

It was enabled by default, but the build job didn't pick it up.
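
For reference, a minimal sketch of how the flag gets into the build, assuming the standard nwaku Makefile which forwards NIMFLAGS to the Nim compiler (the exact CI job invocation may differ):

# build wakunode2 with PostgreSQL support compiled in (sketch, not the exact CI command)
make wakunode2 NIMFLAGS="-d:postgres"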

jakubgs commented 10 months ago

I tried configuring the node on node-01.do-ams3.status.test to use the DB but it's failing with:

 ERR 2023-11-23 12:37:19.449+00:00 4/7 Mounting protocols failed              topics="wakunode main" tid=1 file=wakunode2.nim:89
    error="failed to mount waku archive protocol: error in mountArchive: failed execution of retention policy: failed to get Page size: "

No idea what that's about; it looks like a bug in master. I will try building the v0.22.0 release.

jakubgs commented 10 months ago

I've built v0.22.0 but it has the same issue. Building v0.21.0 did resolve it though. Opened an issue:

jakubgs commented 10 months ago

The shards.test fleet is running aeb77a3e which is current master but doesn't have this problem:

admin@store-01.do-ams3.shards.test:~ % d
CONTAINER ID   NAMES            IMAGE                              CREATED        STATUS
920badb37699   nim-waku-store   wakuorg/nwaku:deploy-shards-test   20 hours ago   Up 20 hours (healthy)

admin@store-01.do-ams3.shards.test:~ % d inspect wakuorg/nwaku:deploy-shards-test | grep commit
                "commit": "aeb77a3e",
jakubgs commented 10 months ago

Turns out:

We have a bug related to the "size" retention policy: it doesn't work if the database doesn't exist yet. To allow the node to start, let's switch to either the "time" or "capacity" retention policy:

capacity:20000000
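
On the node side this translates to the retention policy flag, assuming the standard nwaku option name (the time-based value below is only an illustration):

# keep at most 20M messages instead of limiting by database size
--store-message-retention-policy=capacity:20000000
# or keep messages for a fixed period, e.g. 30 days expressed in seconds
--store-message-retention-policy=time:2592000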
jakubgs commented 10 months ago

Indeed, I can confirm that a retention policy not based on size fixes the issue at startup:

admin@node-01.do-ams3.status.test:~ % d
CONTAINER ID   NAMES      IMAGE                              CREATED         STATUS
3d6868ff0701   nim-waku   wakuorg/nwaku:deploy-status-test   2 minutes ago   Up 2 minutes (healthy)
jakubgs commented 10 months ago

Notes from a call we had today with John, Hanno, Andrea, and Ivan:

jakubgs commented 10 months ago

All nodes in status.test fleet are currently using the single PostgreSQL database:

 > a status-node -a 'grep db-url /docker/nim-waku/docker-compose.yml | sed "s/:.*@/:PASSWORD@/"'
node-01.do-ams3.status.test | CHANGED | rc=0 >>
      --store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.gc-us-central1-a.status.test | CHANGED | rc=0 >>
      --store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.ac-cn-hongkong-c.status.test | CHANGED | rc=0 >>
      --store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku

@Ivansete-status you can start your research into the performance of this layout. I will also be monitoring metrics for this fleet.

Ivansete-status commented 10 months ago

I changed the default pubsub topic used by the status.test fleet so that we don't spam the status.prod fleet: from /waku/2/default-waku/proto to /waku/2/default-waku-test/proto.
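
For reference, a sketch of the flag carrying this setting, assuming the standard nwaku --pubsub-topic option in the fleet's docker-compose files:

# status.test now relays on its own default pubsub topic
--pubsub-topic=/waku/2/default-waku-test/proto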

Ivansete-status commented 10 months ago

We need to create a v0.22.1 version that at least contains the following commit: https://github.com/waku-org/nwaku/commit/aeb77a3e7595b5ca6f5db15c78503c1c0e01ee5b

With the current version used by status.test, rows can't be added to the database due to:

ERR 2023-11-28 18:36:44.089+01:00 failed to insert message                   topics="waku archive" tid=197497 file=archive.nim:111 err="failed to insert into database: ERROR:  null value in column \"messagehash\" of relation \"messages\" violates not-null constraint
jakubgs commented 9 months ago

You killed the DB storage, dude:

image

It's full:

admin@db-01.do-ams3.status.test:~ % df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda         40G   38G     0 100% /data
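
To confirm what is eating the space, a quick look at relation sizes on db-01 helps (a sketch; assumes running psql as the postgres user against the nim-waku database):

# list the largest relations; the messages table is the obvious suspect
psql -d nim-waku -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;"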
jakubgs commented 9 months ago

The cause is obvious:

image

Ivansete-status commented 9 months ago

Weekly Update

jakubgs commented 9 months ago

The database host is full again:

admin@db-01.do-ams3.status.test:~ % df -h /docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda         40G   38G     0 100% /docker

And all nodes are unavailable:

image

Due to inability to access the DB:

ERR 2023-12-04 10:52:49.012+00:00 4/7 Mounting protocols failed
  topics="wakunode main" tid=1 file=wakunode2.nim:89
  error="failed to setup archive driver: error creating table: error in createMessageTable: connRes.isErr in query: failed to stablish a new connection: exception opening new connection: connection to server at \"db-01.do-ams3.status.test.wg\" (10.10.0.199), port 5432 failed: FATAL:  the database system is not yet accepting connections\nDETAIL:  Consistent recovery state has not been yet reached.\n"
jakubgs commented 9 months ago

Gone in ~9 hours on the 2nd:

image

Ivansete-status commented 9 months ago

@jakubgs - I'll review the retention policy to address the database size issues.

On the other hand, the study concludes that latency doesn't play a big role in the performance issues, because the three nodes (AMS, HK, US) behave equally regarding Store timing and inserts (more details at https://www.notion.so/Migrate-status-test-to-PostgreSQL-e108b89fd9d34de2be13d10f42c92185).

image

image

The major bottleneck is the current db-01.do-ams3.status.test configuration (DigitalOcean s-1vcpu-2gb).

jakubgs commented 9 months ago

Considering we didn't see much swapping, I don't think memory is the main bottleneck:

image

I have increased the host size to s-2vcpu-2gb, adding one vCPU but not increasing memory:

Let's see what that does.

Ivansete-status commented 9 months ago

The performance improved after upgrading the database machine to s-2vcpu-2gb.

image image

jakubgs commented 9 months ago

Looks to me like the latency of queries in non-Amsterdam DCs is very bad, if I'm reading this correctly.

Most queries are in 2.5s and 5s buckets. That's not really acceptable for an interactive application.

jakubgs commented 9 months ago

I don't think this benchmark is useful, since it appears you are just overwhelming the host with the most traffic you can generate:

image

image

All you're testing this way is "if I throw everything I can at the host, will it work?" No, it won't, and that's not a useful learning.

We want to see what the performance looks like with low traffic, normal traffic, and then traffic at the edge of capacity; just generating the maximum amount possible and then saying "look, it was overwhelmed" is not useful. A proper benchmark should show how a system behaves under different levels of stress: low, normal, high (or more).

What you're doing is like testing a home water installation by pumping water into it at 50 Bar and then being surprised it bursts.
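
What I have in mind is a staged run rather than a single flood, roughly along these lines (a sketch only; run_store_queries is a hypothetical wrapper around whatever generator is used, taking a rate in requests per second and a duration per stage):

# hypothetical staged benchmark: low, normal, and high load, 10 minutes each
for rate in 1 10 50; do
    run_store_queries --rate "$rate" --duration 600 --out "store_rate_${rate}.csv"
done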

jakubgs commented 9 months ago

Also, remember that the performance of queries isn't only dependent on the DB instance performance, but also on the Waku nodes themselves.

Ivansete-status commented 9 months ago

Thanks so much for the brilliant insights! I'm performing a more realistic test now, considering the current usage of the status.prod fleet.

image

With that, I'll configure the waku-store-request-generator to have 10 users each performing 1 Store req / second, which turns into ~600 Store req / minute.

image

On the other hand, and considering the above image, I'll configure an nwaku node to publish one message every 0.06 seconds, aiming to generate traffic of ~1000 msgs / minute at an average message size.
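
As a sketch of the publishing side, assuming the node exposes the standard Waku REST relay API on its default port 8645 (the endpoint path, port, content topic, and payload below are all assumptions):

# publish roughly one message every 0.06 s (~1000 msgs / minute) to the test pubsub topic
while true; do
    curl -s -X POST "http://localhost:8645/relay/v1/messages/%2Fwaku%2F2%2Fdefault-waku-test%2Fproto" \
        -H "Content-Type: application/json" \
        -d '{"payload":"aGVsbG8gd29ybGQ=","contentTopic":"/benchmark/1/test/proto"}'
    sleep 0.06
done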

Ivansete-status commented 9 months ago

With the configuration described in my previous comment, it is evident that the Store / Insert times are affected by the distance between the data centers and the database. Notice that the performance is better with s-2vcpu-2gb, but the network distance has a big impact.

image

image

jakubgs commented 9 months ago

The TERM signals you've been seeing are leftovers of a restart service that was not cleaned up:

~/work/infra-role-nim-waku master
 > g sl | grep restart
a93a85b  2023-02-15 14:21:20 +0100 Jakub Sokołowski     > drop temporary container restarts 
686ee11  2022-10-11 09:26:12 +0200 Jakub Sokołowski     > docker: restart every 6 hours instead 
b22e070  2022-10-06 12:20:07 +0200 Jakub Sokołowski     > docker: add temporary restart every 12 hours 
admin@node-01.do-ams3.status.test:~ % s cat restart-nim-waku.timer  
# /etc/systemd/system/restart-nim-waku.timer
[Unit]
After=multi-user.target

[Timer]
Persistent=yes
OnCalendar=00/06:00
RandomizedDelaySec=11600

[Install]
WantedBy=timers.target

I have now removed it from all status.test hosts.
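
For the record, the removal itself is just stopping the timer and deleting its unit files (a sketch of the manual steps; the paired .service unit is an assumption, and in practice this was done via the Ansible role):

# stop the leftover restart timer and remove its unit files
sudo systemctl disable --now restart-nim-waku.timer
sudo rm -f /etc/systemd/system/restart-nim-waku.timer /etc/systemd/system/restart-nim-waku.service
sudo systemctl daemon-reload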

jakubgs commented 9 months ago

Also removed from node-01.gc-us-central1-a.status.prod to see if we can remove it from status.prod without issues.

Originally this was a temporary "solution" to socket leaks:

jakubgs commented 9 months ago

Bumped the DB node to s-6vcpu-16gb, but this is just a temporary change to do benchmarking:

The DigitalOcean prices are not practical long-term, and if we need big DB hosts we're going to have to start using physical hosts to save money. But I'm happy to leave this host as is till the new year and then downsize it, or replace it with a physical one.

Ivansete-status commented 9 months ago

A new test was launched by re-configuring the database with the following settings (generated with https://pgtune.leopard.in.ua/):

-- DB Version: 15
-- OS Type: linux
-- DB Type: web
-- Total Memory (RAM): 16 GB
-- CPUs num: 6
-- Data Storage: ssd

ALTER SYSTEM SET
 max_connections = '200';
ALTER SYSTEM SET
 shared_buffers = '4GB';
ALTER SYSTEM SET
 effective_cache_size = '12GB';
ALTER SYSTEM SET
 maintenance_work_mem = '1GB';
ALTER SYSTEM SET
 checkpoint_completion_target = '0.9';
ALTER SYSTEM SET
 wal_buffers = '16MB';
ALTER SYSTEM SET
 default_statistics_target = '100';
ALTER SYSTEM SET
 random_page_cost = '1.1';
ALTER SYSTEM SET
 effective_io_concurrency = '200';
ALTER SYSTEM SET
 work_mem = '6990kB';
ALTER SYSTEM SET
 huge_pages = 'off';
ALTER SYSTEM SET
 min_wal_size = '1GB';
ALTER SYSTEM SET
 max_wal_size = '4GB';
ALTER SYSTEM SET
 max_worker_processes = '6';
ALTER SYSTEM SET
 max_parallel_workers_per_gather = '3';
ALTER SYSTEM SET
 max_parallel_workers = '6';
ALTER SYSTEM SET
 max_parallel_maintenance_workers = '3';
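
Note that ALTER SYSTEM only writes these values to postgresql.auto.conf; applying them requires a reload, and shared_buffers, huge_pages, and max_worker_processes additionally need a full restart. A sketch (how Postgres gets restarted depends on how the role runs it):

# reload what can be reloaded without a restart
psql -c "SELECT pg_reload_conf();"
# shared_buffers, huge_pages and max_worker_processes only take effect after a restart
sudo systemctl restart postgresql    # or restart the Postgres container, depending on the deployment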
Ivansete-status commented 9 months ago

After upgrading to a more powerful database machine, we can still see low Store / Archive performance on both the US and HK nodes:

image

( cc @jakubgs )

jakubgs commented 9 months ago

Yes, I think we can now clearly conclude that a single DB used from different data centers is not a viable solution.

Our options are:

Thanks for researching this. I am going on holiday break till the 2nd of January, so I will not be working on this for a while. If help is needed, please ask Anton or Alexis, but please don't make major changes to the fleets without my input.

jakubgs commented 8 months ago

Since the benchmarks revealed that the single-DB cross-DC setup does not work, I'm upgrading the layout to use multiple DBs:
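
The intended layout is one Postgres host per data center, with each node pointing at its local instance over WireGuard (a sketch; the db-01 hostnames for the US and HK DCs follow the existing naming pattern and are assumptions):

# each node uses a DB host in its own DC instead of the single Amsterdam instance
node-01.do-ams3.status.test           -> postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.gc-us-central1-a.status.test  -> postgres:PASSWORD@db-01.gc-us-central1-a.status.test.wg:5432/nim-waku
node-01.ac-cn-hongkong-c.status.test  -> postgres:PASSWORD@db-01.ac-cn-hongkong-c.status.test.wg:5432/nim-waku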

The new setup has been deployed and works:

image

Will do status.prod tomorrow.

jakubgs commented 8 months ago

I have switched status.prod to use multiple hosts with PostgreSQL:

There's an issue with DB initialization. Will debug tomorrow.

jakubgs commented 8 months ago

Found the issue: the fleet stage was missing from the DB hostname:
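
In other words, the configured db-url pointed at a hostname that was missing the fleet stage. A sketch of the corrected form, following the naming used on status.test (the exact status.prod DB hostname is an assumption):

# corrected URL with the fleet stage included in the hostname (exact host is an assumption)
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.prod.wg:5432/nim-waku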

I also had to build a fresh deploy-status-prod-trace image: https://ci.infra.status.im/job/nim-waku/job/deploy-status-prod-trace/3/ But for some reason it was just restarting without an error, so I manually switched it to a normal one.

The fleet is back up and working:

image