status-im / infra-status-legacy

Infrastructure for old Status fleet
https://github.com/status-im/nim-waku

Switch Status fleet to use PostgreSQL DB #37

Closed: jakubgs closed this issue 8 months ago

jakubgs commented 10 months ago

The SQLite database is unmaintainable: it is single-threaded, queries block while pruning runs, and vacuuming requires twice the size of the DB in free disk space to complete.

We need to introduce a PostgreSQL DB to this fleet gradually, to eventually mirror the setup of the infra-shards fleet.

  1. Introduce a single database to status.test fleet to measure performance and latency.
  2. Introduce a single database to status.prod if performance is satisfactory.
  3. Develop a way to synchronize multiple Postgres instances across data centers on status.test fleet.
  4. Switch status.prod fleet to multiple Postgres instances with database sync.
  5. Eventually switch to synchronization managed by Waku nodes themselves.
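
In practice, step 1 boils down to changing the store backend URL in each node's nim-waku docker-compose file from the local SQLite file to a PostgreSQL connection string. A rough sketch of the difference (the exact SQLite URL form is an assumption; the Postgres URL follows the fleet's WireGuard naming scheme):

# before: local SQLite file (exact URL form is an assumption)
--store-message-db-url=sqlite://store.sqlite3
# after: shared PostgreSQL instance reachable over the WireGuard network
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku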
jakubgs commented 10 months ago

Some improvements to our PostgreSQL HA role:

jakubgs commented 10 months ago

More fixes for the Postgres role:

jakubgs commented 10 months ago

Even more Postgres improvements:

Here's infra-status changes:

The db-01.do-ams3.status.test host is up and running. The status.prod fleet won't be touched for a while.

jakubgs commented 10 months ago

I also had to enable the -d:postgres flag for the Postgres build.

It was enabled by default, but the build job didn't pick it up.
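
For reference, a minimal sketch of how the flag gets into the build, assuming the standard nwaku Makefile which forwards NIMFLAGS to the Nim compiler (the exact CI job invocation may differ):

# build wakunode2 with PostgreSQL support compiled in (sketch, not the exact CI command)
make wakunode2 NIMFLAGS="-d:postgres"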

jakubgs commented 10 months ago

I tried configuring the node on node-01.do-ams3.status.test to use the DB but it's failing with:

 ERR 2023-11-23 12:37:19.449+00:00 4/7 Mounting protocols failed              topics="wakunode main" tid=1 file=wakunode2.nim:89
    error="failed to mount waku archive protocol: error in mountArchive: failed execution of retention policy: failed to get Page size: "

No idea what that's about; it looks like a bug in master. I will try building the v0.22.0 release.

jakubgs commented 10 months ago

I've built v0.22.0 but it has the same issue. Building v0.21.0 did resolve it though. Opened an issue:

jakubgs commented 10 months ago

The shards.test fleet is running aeb77a3e which is current master but doesn't have this problem:

admin@store-01.do-ams3.shards.test:~ % d
CONTAINER ID   NAMES            IMAGE                              CREATED        STATUS
920badb37699   nim-waku-store   wakuorg/nwaku:deploy-shards-test   20 hours ago   Up 20 hours (healthy)

admin@store-01.do-ams3.shards.test:~ % d inspect wakuorg/nwaku:deploy-shards-test | grep commit
                "commit": "aeb77a3e",
jakubgs commented 10 months ago

Turns out:

We have a bug related to the "size" retention policy: it doesn't work if the database doesn't exist yet. To allow the node to start, let's switch to either the "time" or "capacity" retention policy:

capacity:20000000
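
On the node side this translates to the retention policy flag, assuming the standard nwaku option name (the time-based value below is only an illustration):

# keep at most 20M messages instead of limiting by database size
--store-message-retention-policy=capacity:20000000
# or keep messages for a fixed period, e.g. 30 days expressed in seconds
--store-message-retention-policy=time:2592000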
jakubgs commented 10 months ago

Indeed, I can confirm that a retention policy not based on size fixes the issue at startup:

admin@node-01.do-ams3.status.test:~ % d
CONTAINER ID   NAMES      IMAGE                              CREATED         STATUS
3d6868ff0701   nim-waku   wakuorg/nwaku:deploy-status-test   2 minutes ago   Up 2 minutes (healthy)
jakubgs commented 10 months ago

Notes from a call we had today with John, Hanno, Andrea, and Ivan:

jakubgs commented 10 months ago

All nodes in status.test fleet are currently using the single PostgreSQL database:

 > a status-node -a 'grep db-url /docker/nim-waku/docker-compose.yml | sed "s/:.*@/:PASSWORD@/"'
node-01.do-ams3.status.test | CHANGED | rc=0 >>
      --store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.gc-us-central1-a.status.test | CHANGED | rc=0 >>
      --store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.ac-cn-hongkong-c.status.test | CHANGED | rc=0 >>
      --store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku

@Ivansete-status you can start your research into the performance of this layout. I will also be monitoring metrics for this fleet.

Ivansete-status commented 10 months ago

I changed the default pubsub topic used by the status.test fleet so that we don't spam the status.prod fleet: from /waku/2/default-waku/proto to /waku/2/default-waku-test/proto.
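
For reference, a sketch of the flag carrying this setting, assuming the standard nwaku --pubsub-topic option in the fleet's docker-compose files:

# status.test now relays on its own default pubsub topic
--pubsub-topic=/waku/2/default-waku-test/proto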

Ivansete-status commented 10 months ago

We need to create a v0.22.1 version that at least contains the following commit: https://github.com/waku-org/nwaku/commit/aeb77a3e7595b5ca6f5db15c78503c1c0e01ee5b

With the current version used by status.test, rows can't be added to the database due to:

ERR 2023-11-28 18:36:44.089+01:00 failed to insert message                   topics="waku archive" tid=197497 file=archive.nim:111 err="failed to insert into database: ERROR:  null value in column \"messagehash\" of relation \"messages\" violates not-null constraint
jakubgs commented 9 months ago

You killed the DB storage, dude:

image

It's full:

admin@db-01.do-ams3.status.test:~ % df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda         40G   38G     0 100% /data
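
To confirm what is eating the space, a quick look at relation sizes on db-01 helps (a sketch; assumes running psql as the postgres user against the nim-waku database):

# list the largest relations; the messages table is the obvious suspect
psql -d nim-waku -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 5;"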
jakubgs commented 9 months ago

The cause is obvious:

image

Ivansete-status commented 9 months ago

Weekly Update

jakubgs commented 9 months ago

The database host is full again:

admin@db-01.do-ams3.status.test:~ % df -h /docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda         40G   38G     0 100% /docker

And all nodes are unavailable:

image

Due to inability to access the DB:

ERR 2023-12-04 10:52:49.012+00:00 4/7 Mounting protocols failed
  topics="wakunode main" tid=1 file=wakunode2.nim:89
  error="failed to setup archive driver: error creating table: error in createMessageTable: connRes.isErr in query: failed to stablish a new connection: exception opening new connection: connection to server at \"db-01.do-ams3.status.test.wg\" (10.10.0.199), port 5432 failed: FATAL:  the database system is not yet accepting connections\nDETAIL:  Consistent recovery state has not been yet reached.\n"
jakubgs commented 9 months ago

Gone in ~9 hours on the 2nd:

image

Ivansete-status commented 9 months ago

@jakubgs - I'll review the retention policy to address the database size issues.

On the other hand, the study concludes that latency doesn't play a big role in the performance issues, because the three nodes (AMS, HK, US) behave equally regarding Store timing and inserts (more details at https://www.notion.so/Migrate-status-test-to-PostgreSQL-e108b89fd9d34de2be13d10f42c92185).

image

image

The major bottleneck is the current db-01.do-ams3.status.test configuration (DigitalOcean s-1vcpu-2gb).

jakubgs commented 9 months ago

Considering we didn't see much swapping, I don't think memory is the main bottleneck:

image

I have increased the host size to s-2vcpu-2gb, adding one vCPU but not increasing memory:

Let's see what that does.

Ivansete-status commented 9 months ago

The performance improved after upgrading the database machine to s-2vcpu-2gb.

image image

jakubgs commented 9 months ago

Looks to me like the latency of queries in non-Amsterdam DCs is very bad, if I'm reading this correctly.

Most queries are in 2.5s and 5s buckets. That's not really acceptable for an interactive application.

jakubgs commented 9 months ago

I don't think this benchmark is useful, since it appears you are just overwhelming the host with the most traffic you can generate:

image

image

All you're testing this way is "if I throw everything I can at the host, will it work?" No, it won't, and that's not a useful learning.

We want to see what the performance looks like with low traffic, normal traffic, and then traffic at the edge of capacity; just generating the maximum amount possible and then saying "look, it was overwhelmed" is not useful. A proper benchmark should show how a system behaves under different levels of stress: low, normal, high (or more).

What you're doing is like testing a home water installation by pumping water into it at 50 Bar and then being surprised it bursts.
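
What I have in mind is a staged run rather than a single flood, roughly along these lines (a sketch only; run_store_queries is a hypothetical wrapper around whatever generator is used, taking a rate in requests per second and a duration per stage):

# hypothetical staged benchmark: low, normal, and high load, 10 minutes each
for rate in 1 10 50; do
    run_store_queries --rate "$rate" --duration 600 --out "store_rate_${rate}.csv"
done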

jakubgs commented 9 months ago

Also, remember that the performance of queries isn't only dependent on the DB instance performance, but also on the Waku nodes themselves.

Ivansete-status commented 9 months ago

Thanks so much for the brilliant insights! I'm performing a more realistic test now, considering the current usage of the status.prod fleet.

image

With that, I'll configure the waku-store-request-generator to have 10 users each performing 1 Store req / second, which turns into ~600 Store req / minute.

image

On the other hand, and considering the above image, I'll configure an nwaku node to publish one message every 0.06 seconds, aiming to generate traffic of ~1000 msgs / minute at an average message size.
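
As a sketch of the publishing side, assuming the node exposes the standard Waku REST relay API on its default port 8645 (the endpoint path, port, content topic, and payload below are all assumptions):

# publish roughly one message every 0.06 s (~1000 msgs / minute) to the test pubsub topic
while true; do
    curl -s -X POST "http://localhost:8645/relay/v1/messages/%2Fwaku%2F2%2Fdefault-waku-test%2Fproto" \
        -H "Content-Type: application/json" \
        -d '{"payload":"aGVsbG8gd29ybGQ=","contentTopic":"/benchmark/1/test/proto"}'
    sleep 0.06
done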

Ivansete-status commented 9 months ago

With the configuration described in my previous comment, it is evident that the Store / Insert times are affected by the distance between the data centers and the database. Notice that the performance is better with s-2vcpu-2gb, but the network distance has a big impact.

image

image

jakubgs commented 9 months ago

The TERM signals you've been seeing are leftovers of a restart service that was not cleaned up:

~/work/infra-role-nim-waku master
 > g sl | grep restart
a93a85b  2023-02-15 14:21:20 +0100 Jakub Sokołowski     > drop temporary container restarts 
686ee11  2022-10-11 09:26:12 +0200 Jakub Sokołowski     > docker: restart every 6 hours instead 
b22e070  2022-10-06 12:20:07 +0200 Jakub Sokołowski     > docker: add temporary restart every 12 hours 
admin@node-01.do-ams3.status.test:~ % s cat restart-nim-waku.timer  
# /etc/systemd/system/restart-nim-waku.timer
[Unit]
After=multi-user.target

[Timer]
Persistent=yes
OnCalendar=00/06:00
RandomizedDelaySec=11600

[Install]
WantedBy=timers.target

I have now removed it from all status.test hosts.
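
For the record, the removal itself is just stopping the timer and deleting its unit files (a sketch of the manual steps; the paired .service unit is an assumption, and in practice this was done via the Ansible role):

# stop the leftover restart timer and remove its unit files
sudo systemctl disable --now restart-nim-waku.timer
sudo rm -f /etc/systemd/system/restart-nim-waku.timer /etc/systemd/system/restart-nim-waku.service
sudo systemctl daemon-reload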

jakubgs commented 9 months ago

Also removed from node-01.gc-us-central1-a.status.prod to see if we can remove it from status.prod without issues.

Originally this was a temporary "solution" to socket leaks:

jakubgs commented 9 months ago

Bumped the DB node to s-6vcpu-16gb, but this is just a temporary change to do benchmarking:

The DigitalOcean prices are not practical long-term, and if we need big DB hosts we're going to have to start using physical hosts to save money. But I'm happy to leave this host as is till the new year and then downsize it, or replace it with a physical one.

Ivansete-status commented 9 months ago

A new test was launched by re-configuring the database with the following settings (generated with https://pgtune.leopard.in.ua/):

-- DB Version: 15
-- OS Type: linux
-- DB Type: web
-- Total Memory (RAM): 16 GB
-- CPUs num: 6
-- Data Storage: ssd

ALTER SYSTEM SET
 max_connections = '200';
ALTER SYSTEM SET
 shared_buffers = '4GB';
ALTER SYSTEM SET
 effective_cache_size = '12GB';
ALTER SYSTEM SET
 maintenance_work_mem = '1GB';
ALTER SYSTEM SET
 checkpoint_completion_target = '0.9';
ALTER SYSTEM SET
 wal_buffers = '16MB';
ALTER SYSTEM SET
 default_statistics_target = '100';
ALTER SYSTEM SET
 random_page_cost = '1.1';
ALTER SYSTEM SET
 effective_io_concurrency = '200';
ALTER SYSTEM SET
 work_mem = '6990kB';
ALTER SYSTEM SET
 huge_pages = 'off';
ALTER SYSTEM SET
 min_wal_size = '1GB';
ALTER SYSTEM SET
 max_wal_size = '4GB';
ALTER SYSTEM SET
 max_worker_processes = '6';
ALTER SYSTEM SET
 max_parallel_workers_per_gather = '3';
ALTER SYSTEM SET
 max_parallel_workers = '6';
ALTER SYSTEM SET
 max_parallel_maintenance_workers = '3';
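
Note that ALTER SYSTEM only writes these values to postgresql.auto.conf; applying them requires a reload, and shared_buffers, huge_pages, and max_worker_processes additionally need a full restart. A sketch (how Postgres gets restarted depends on how the role runs it):

# reload what can be reloaded without a restart
psql -c "SELECT pg_reload_conf();"
# shared_buffers, huge_pages and max_worker_processes only take effect after a restart
sudo systemctl restart postgresql    # or restart the Postgres container, depending on the deployment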
Ivansete-status commented 9 months ago

After upgrading to a more powerful database machine, we can still see low Store / Archive performance on both the US and HK nodes:

image

( cc @jakubgs )

jakubgs commented 9 months ago

Yes, I think we can now clearly conclude that a single DB used from different data centers is not a viable solution.

Our options are:

Thanks for researching this. I am going on holiday break till the 2nd of January, so I will not be working on this for a while. If help is needed, please ask Anton or Alexis, but please don't make major changes to the fleets without my input.

jakubgs commented 8 months ago

Since the benchmarks revealed that the single-DB cross-DC setup does not work, I'm upgrading the layout to use multiple DBs:
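
The intended layout is one Postgres host per data center, with each node pointing at its local instance over WireGuard (a sketch; the db-01 hostnames for the US and HK DCs follow the existing naming pattern and are assumptions):

# each node uses a DB host in its own DC instead of the single Amsterdam instance
node-01.do-ams3.status.test           -> postgres:PASSWORD@db-01.do-ams3.status.test.wg:5432/nim-waku
node-01.gc-us-central1-a.status.test  -> postgres:PASSWORD@db-01.gc-us-central1-a.status.test.wg:5432/nim-waku
node-01.ac-cn-hongkong-c.status.test  -> postgres:PASSWORD@db-01.ac-cn-hongkong-c.status.test.wg:5432/nim-waku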

The new setup has been deployed and works:

image

Will do status.prod tomorrow.

jakubgs commented 8 months ago

I have switched status.prod to use multiple hosts with PostgreSQL:

There's an issue with DB initialization. Will debug tomorrow.

jakubgs commented 8 months ago

Found the issue: the fleet stage was missing from the DB hostname:
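
In other words, the configured db-url pointed at a hostname that was missing the fleet stage. A sketch of the corrected form, following the naming used on status.test (the exact status.prod DB hostname is an assumption):

# corrected URL with the fleet stage included in the hostname (exact host is an assumption)
--store-message-db-url=postgres:PASSWORD@db-01.do-ams3.status.prod.wg:5432/nim-waku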

I also had to build a fresh deploy-status-prod-trace image: https://ci.infra.status.im/job/nim-waku/job/deploy-status-prod-trace/3/ But for some reason it was just restarting without an error, so I manually switched it to a normal one.

The fleet is back up and working:

image