Periods of high API response time with Citus

Description

There was a period during which all REST APIs had high response times (on the order of seconds).

CPU load was very uneven across the three Citus worker nodes: two used about 2.5 CPU cores each, while shard2 used 7.6 and hit its resource limit. The logs also show database connections for user mirror_rest being established and torn down very frequently. Most of those sessions appear to be short-lived, which is suspicious.

The number of mirror_rest connections reported by pg_stat_activity on the two coordinator nodes was low, around 15 and 19, respectively.

There are many error logs like the following in pgbouncer:

2024-07-10 19:53:42.574 UTC [852006] LOG C-0x213e630: mirror_node/mirror_rest@127.0.0.1:53838 login attempt: db=mirror_node user=mirror_rest tls=no
2024-07-10 19:53:42.575 UTC [852006] LOG C-0x213e630: mirror_node/mirror_rest@127.0.0.1:53838 closing because: password authentication failed (age=0s)
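
To gauge how widespread these failures are, the "closing because" lines could be tallied per user and reason. The following is a minimal sketch that assumes the log layout matches the two lines above exactly; the helper name `count_close_reasons` and the regular expression are illustrative and not part of pgbouncer or the mirror node.

```python
import re
from collections import Counter

# Matches pgbouncer "closing because" lines in the layout shown in this report.
# Assumption: other pgbouncer versions or log settings may format lines differently.
CLOSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) UTC \[\d+\] LOG "
    r"C-0x[0-9a-f]+: (?P<db>[^/]+)/(?P<user>[^@]+)@(?P<addr>[^:]+):(?P<port>\d+) "
    r"closing because: (?P<reason>.+?) \(age=(?P<age>\d+)s\)$"
)


def count_close_reasons(lines):
    """Count client disconnects grouped by (user, reason)."""
    counts = Counter()
    for line in lines:
        match = CLOSE_RE.match(line.strip())
        if match:
            counts[(match["user"], match["reason"])] += 1
    return counts


if __name__ == "__main__":
    # The two log lines quoted in this report; only the second is a disconnect.
    sample = [
        "2024-07-10 19:53:42.574 UTC [852006] LOG C-0x213e630: "
        "mirror_node/mirror_rest@127.0.0.1:53838 login attempt: "
        "db=mirror_node user=mirror_rest tls=no",
        "2024-07-10 19:53:42.575 UTC [852006] LOG C-0x213e630: "
        "mirror_node/mirror_rest@127.0.0.1:53838 closing because: "
        "password authentication failed (age=0s)",
    ]
    for (user, reason), n in count_close_reasons(sample).items():
        print(f"{user}: {reason} x{n}")
```

Running the same function over a full pgbouncer log would show whether the authentication failures are limited to mirror_rest and how they compare with other disconnect reasons.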

pgbouncer stats might give us more insight; however, they are hard to obtain.
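
If the pgbouncer admin console is reachable, pool state could be pulled with `SHOW POOLS` roughly as sketched below. This is an assumption-heavy sketch, not how the deployment is known to be configured: the host, port 6432, admin user name, and password are placeholders, the connecting user would need to be listed in pgbouncer's `admin_users` or `stats_users`, and the `psycopg2` package is assumed to be installed. Both function names are hypothetical.

```python
def pools_with_waiting_clients(pool_rows):
    """Return (database, user, cl_waiting) for pools that have queued clients.

    pool_rows is a list of dict-like rows as returned by SHOW POOLS.
    """
    return [
        (row["database"], row["user"], row["cl_waiting"])
        for row in pool_rows
        if row["cl_waiting"] > 0
    ]


def fetch_show_pools(host="127.0.0.1", port=6432, admin_user="pgbouncer", password=""):
    """Run SHOW POOLS on the pgbouncer admin console (connection values are placeholders)."""
    # Imported lazily so the helper above works without psycopg2 installed.
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        host=host, port=port, dbname="pgbouncer", user=admin_user, password=password
    )
    # The admin console does not accept transaction blocks, so use autocommit.
    conn.autocommit = True
    try:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute("SHOW POOLS")
            return cur.fetchall()
    finally:
        conn.close()
```

Calling `pools_with_waiting_clients(fetch_show_pools())` during an incident would list the pools where clients are queued for a server connection, which may help tell pooler saturation apart from slow queries on the workers.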

In the end, restarting the pgbouncer container seemed to fix the issue, though the root cause is still unknown.

Grafana dashboard

Steps to reproduce

Check the description

Additional context

No response

Hedera network

other

Version

v0.110.0-SNAPSHOT

Operating system

None

hashgraph / hedera-mirror-node