[Bug] High subgraph query throughput causes excessive traffic on the database primary node

Bug report

Hello,

We have deployed TheGraph internally and are using it as the blockchain data source for some of our services. It works great, thanks a lot for building such a great product! Lately, we've been scaling up our infrastructure to answer more queries. A part of this effort included adding a read replica to lower the load on our primary database instance (the goal is to have all query nodes get data from the read replicas, so that the primary instance can focus on indexing tasks). Configuration went pretty smoothly after reading the docs and we are indeed seeing some traffic targeting our read replicas. However, we're still seeing surprisingly high usage on our primary database instance, and this usage increases as we get more queries (while we only expect read replicas to be impacted since this is querying data that does not need to perform any writes).

Some analysis of the issue with Performance Insights reveals an elevated rate of queries on the primary instance that seem to correlate with incoming queries given the high throughput. We think these should be targeted to read replicas to scale up properly. Here are some queries along with their rate and aggregated CPU usage metric (10% means the primary instance spends 10% of all its CPU time processing queries like this).

CPU %	Rate (calls/s)	SQL
42	268.64	`SELECT "chain1"."blocks"."number", coalesce(data->'block'->>'timestamp', data->>'timestamp') FROM "chain1"."blocks" WHERE "chain1"."blocks"."hash" = $1 LIMIT $2`
11	1348.71	`SELECT 1`
6	268.64	SELECT EXISTS (SELECT "subgraphs"."subgraph_error"."vid", "subgraphs"."subgraph_error"."id", "subgraphs"."subgraph_error"."subgraph_id", "subgraphs"."subgraph_error"."message", "subgraphs"."subgraph_error"."block_hash", "subgraphs"."subgraph_error"."handler", "subgraphs"."subgraph_error"."deterministic", "subgraphs"."subgraph_error"."block_range" FROM "subgraphs"."subgraph_error" WHERE "subgraphs"."subgraph_error"."subgraph_id" = $1 AND "subgraphs"."subgraph_error"."deterministic" AND block_range @> $2)
4	537.27	`SELECT "subgraphs"."subgraph_deployment"."deployment", "subgraphs"."subgraph_deployment"."reorg_count", "subgraphs"."subgraph_deployment"."max_reorg_depth", "subgraphs"."subgraph_deployment"."latest_ethereum_block_number", "subgraphs"."subgraph_deployment"."latest_ethereum_block_hash", "subgraphs"."subgraph_deployment"."earliest_block_number" FROM "subgraphs"."subgraph_deployment" WHERE "subgraphs"."subgraph_deployment"."deployment" = $1 LIMIT $2`
2	268.65	`SELECT "subgraphs"."subgraph_version"."deployment" FROM ("subgraphs"."subgraph_version" INNER JOIN "subgraphs"."subgraph" ON "subgraphs"."subgraph"."current_version" = "subgraphs"."subgraph_version"."id") WHERE "subgraphs"."subgraph"."name" = $1 LIMIT $2`

For reference, at the time of the measurement, we were processing approximately 270 reqs/s.

Besides directing these queries to the database read replicas, caching some of their results for a few seconds seems sensible as well and should not have a noticeable impact on the behaviour of thegraph.

Thanks in advance for your help

Relevant log output

No response

IPFS hash

No response

Subgraph name or link to explorer

No response

Some information to help us out

[ ] Tick this box if this bug is caused by a regression found in the latest release.
[ ] Tick this box if this bug is specific to the hosted service.
[X] I have searched the issue tracker to make sure this issue is not a duplicate.

OS information

Deployed version: 0.33.0 Linux

graphprotocol / graph-node