hashgraph / hedera-mirror-node

Hedera Mirror Node archives data from consensus nodes and serves it via an API
Apache License 2.0

Periods of high API response time with Citus #8750

Closed: xin-hedera closed 2 months ago

xin-hedera commented 3 months ago

Description

There was a period during which all REST APIs had high response times (on the order of seconds).

CPU load was very uneven across the three Citus worker nodes: two were using about 2.5 CPU cores each, while shard2 was at 7.6 and hitting its resource limit. The logs also showed connections for user mirror_rest being established and torn down very frequently. Most of those sessions appear to be short-lived and look suspicious.

The number of mirror_rest connections reported by pg_stat_activity on the two coordinator nodes was low, around 15 and 19, respectively.

There are many error logs like the following in pgbouncer:

2024-07-10 19:53:42.574 UTC [852006] LOG C-0x213e630: mirror_node/mirror_rest@127.0.0.1:53838 login attempt: db=mirror_node user=mirror_rest tls=no
2024-07-10 19:53:42.575 UTC [852006] LOG C-0x213e630: mirror_node/mirror_rest@127.0.0.1:53838 closing because: password authentication failed (age=0s)
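
To see how often such disconnects occur, the pgbouncer log can be tallied per minute, user, and close reason. The following is a minimal sketch, not something used in this investigation: the regular expression is modelled only on the two sample lines above, and `count_close_reasons` is a hypothetical helper name. Other pgbouncer versions or log settings may format lines differently.

```python
import re
from collections import Counter

# Pattern modelled on the sample "closing because" line above (an assumption;
# adjust it if your pgbouncer log format differs).
CLOSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \S+ \[\d+\] LOG "
    r"C-0x[0-9a-f]+: (?P<db>[^/\s]+)/(?P<user>[^@\s]+)@(?P<addr>\S+) "
    r"closing because: (?P<reason>.+?) \(age=(?P<age>\d+)s\)$"
)


def count_close_reasons(lines):
    """Count client disconnects keyed by (minute, user, reason)."""
    counts = Counter()
    for line in lines:
        match = CLOSE_RE.match(line.strip())
        if match is None:
            continue  # not a "closing because" line, e.g. a login attempt
        minute = match.group("ts")[:16]  # truncate the timestamp to YYYY-MM-DD HH:MM
        counts[(minute, match.group("user"), match.group("reason"))] += 1
    return counts
```

Feeding it the lines of a log file (for example `count_close_reasons(open(path))`, with `path` pointing at wherever the pgbouncer log is written) and printing `counts.most_common(10)` would show whether the authentication failures cluster in the slow period.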

pgbouncer stats might give more insight; however, they are hard to get.
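
One possible way to collect those stats is pgbouncer's admin console, which accepts `SHOW POOLS` when connected to the virtual `pgbouncer` database. The sketch below is an assumption about how this could be done, not what was run here: it presumes the `psycopg2` package is installed, that the connecting user is listed in pgbouncer's `admin_users` or `stats_users`, and that the DSN is supplied by the operator. `fetch_pools` and `saturated_pools` are hypothetical helper names.

```python
def fetch_pools(dsn):
    """Run SHOW POOLS on the pgbouncer admin console and return rows as dicts."""
    import psycopg2  # third-party; imported lazily so saturated_pools works without it

    conn = psycopg2.connect(dsn)
    # The admin console does not support transactions, so avoid psycopg2's implicit BEGIN.
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW POOLS")
            columns = [col[0] for col in cur.description]
            return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        conn.close()


def saturated_pools(pools):
    """Return pools that currently have clients waiting for a server connection."""
    return [pool for pool in pools if int(pool.get("cl_waiting", 0)) > 0]
```

A call might look like `saturated_pools(fetch_pools("host=127.0.0.1 port=6432 dbname=pgbouncer user=stats_user"))`, where the host, port, and user are placeholders. A non-empty result would indicate clients queuing inside pgbouncer rather than at the database.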

In the end, restarting the pgbouncer container seemed to fix the issue, though the root cause is still unknown.

Grafana dashboard

Steps to reproduce

Check the description

Additional context

No response

Hedera network

other

Version

v0.110.0-SNAPSHOT

Operating system

None

xin-hedera commented 2 months ago

We have made the following changes to improve the performance of the Citus cluster:

We have not observed the issue at the same or higher TPS with all of the above measures in place.