graphprotocol / support

Community support for the Hosted Service of The Graph
https://thegraph.com/

500-block lag (after 20 hours) when using the graph-node Docker image to index a private PoA Ethereum blockchain with 1-second blocks #28

Closed. 4tal closed this issue 1 year ago.

4tal commented 5 years ago

Reporting this after a short discussion with Leo.

After deploying graph-node to a private setup, it works fine for around ~20 hours, and then memory and CPU start to rise until it reaches a point where it starts to fall behind the currently mined block. For example, if the current block is 1000, the last indexed block will be 400 (600 blocks behind); after 3 minutes the chain will be at ~1180 and the index at ~780, and it continues like that with constant lag and does not recover.

  1. DB - AWS RDS, t2.micro (can handle 60 connections)
  2. Docker image - DockerHub/graph-node.
  3. Private PoA Ethereum blockchain that produces a block every second.

Attached in the following link: Summary in Evernote. It contains:

  1. Image showing the increase in memory/CPU and the instant collapse.
  2. attempt_chain_head_update time.
  3. Log image where you can see the point of failure when blocks start to pile up behind.
  4. Log file from after the failure.

The graph-node works perfectly for 20 hours and then this happens.

Leo gave me a tip to run DELETE FROM ethereum_blocks; and it did make my graph-node recover and index as expected.
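
For reference, a minimal way to run that workaround from the command line might look like this (the endpoint, user, and database name are placeholders for your own setup, not values from this thread):

  psql -h <your-rds-endpoint> -U <db-user> -d <graph-node-db> \
       -c 'DELETE FROM ethereum_blocks;'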

4tal commented 5 years ago

The attached note also includes the RPC and subgraph specifications.

leoyvens commented 5 years ago

Thanks for the detailed report! @lutter since you've benchmarked attempt_chain_head_update before, what's your take on this issue?

4tal commented 5 years ago

After executing DELETE FROM ethereum_blocks it recovers for ~30 hours, and then it starts to lag again until the next execution. Recovery time (indexing without lag) is ~90 seconds from command execution.

lutter commented 5 years ago

If I did my math right, 30 hours is about 100,000 blocks (30 hours × 3,600 seconds/hour ≈ 108,000 blocks at one block per second); that's not that many. But seeing that this is on a t2.micro, the first thing to try would be a larger instance with more memory and more sustained IOPS.

For comparison, in the hosted service, we have about 12M rows in ethereum_blocks, and are seeing attempt_chain_head_update times of 2.5ms on average, and 1.1s maximum.

It would be useful to look at pg_stat_statements after running for a while to see what the query timings look like. Connect to your database with psql -x and run

  select query,  calls, max_time, mean_time, rows
  from pg_stat_statements
 where query like '%attempt_chain_head_update%';

If that gives you an error, you'll have to create the pg_stat_statements extension and wait for a while for it to gather statistics.
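
For completeness, creating the extension is a one-liner, with the caveat that pg_stat_statements must also be listed in shared_preload_libraries before it collects anything (on RDS that parameter is set through the DB parameter group and needs an instance reboot):

  -- requires pg_stat_statements in shared_preload_libraries
  -- (configured via the parameter group on RDS, then reboot the instance)
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;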

You can also get a list of the slowest queries with

select query,  calls, max_time, mean_time, rows
  from pg_stat_statements
 order by total_time desc
 limit 10;

But beefing up your EC2 instance would be the first thing I would try.

lutter commented 5 years ago

Oh wait... I misread your setup and didn't see that you are on RDS, so changing your EC2 instance won't do anything. I am not familiar with RDS, but my advice is still to increase IOPS and possibly database memory.

4tal commented 5 years ago

This is from the normal bot channel (which posts updates about graph-node indexing status): image

This is from the alerts; as you can see, every 8-14 hours it executes the DELETE FROM ethereum_blocks - image

I'll try to play with the DB configuration later or maybe next week. Please update if you have any other ideas.

4tal commented 5 years ago

Another update: I've switched to t2.small from t3.micro - more CPU, memory, connections, read IOPS, and write IOPS.

The new graph-node setup worked fine for 4 days, running 2 subgraphs.

Before switching to the t2 setup it used to fail every 12 hours; my recovery solution was to automatically clear the cache (ethereum_blocks) when a lag was detected, and then it worked for another 12 hours, roughly as sketched below.
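
The watchdog does something like the following sketch (connection strings, the RPC endpoint, and the lag threshold are placeholders, and it assumes the highest number in ethereum_blocks is a fair proxy for how far the chain head has fallen behind; only the DELETE itself is the exact command from this thread):

  #!/bin/bash
  # Hypothetical lag watchdog: compares the chain head reported by the RPC
  # node with the highest block number graph-node has cached, and clears
  # the block cache when the gap exceeds a threshold.
  RPC_URL="http://localhost:8545"                      # placeholder RPC endpoint
  DB_URL="postgresql://user:pass@rds-endpoint/graph"   # placeholder connection string
  THRESHOLD=500

  head_hex=$(curl -s -X POST "$RPC_URL" \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    | sed -E 's/.*"result":"0x([0-9a-fA-F]+)".*/\1/')
  chain_head=$((16#$head_hex))

  # column name taken from graph-node's block cache table; adjust if your schema differs
  cached_head=$(psql "$DB_URL" -t -A -c 'SELECT COALESCE(MAX(number), 0) FROM ethereum_blocks;')

  if [ $((chain_head - cached_head)) -gt "$THRESHOLD" ]; then
    echo "Lagging by $((chain_head - cached_head)) blocks, clearing ethereum_blocks"
    psql "$DB_URL" -c 'DELETE FROM ethereum_blocks;'
  fi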

The new setup survived for 4 straight days, until 5 hours ago. Then it started to lag and the automatic cache clearing didn't solve it.

T2 DB state for the past 4 days: image

lutter commented 5 years ago

Both t2.small and t3.micro are listed as 'burstable' in the RDS docs; could it be that they are being throttled because you went over some quota? I would try something like an m5.large, especially since it is described as 'EBS optimized', which I assume means it can sustain IOPS better.

4tal commented 5 years ago

Will try to upgrade it tonight, and will update here.

4tal commented 5 years ago

image

1 - Shows the previous crash of the graph-node when we used the t2.small (burstable) instance for the graph-node service. Then I restarted the service, and the CPU reset and started to accumulate again.

2 - Switched the DB instance to m5.large (EBS optimized, not burstable) - it seems to accumulate CPU consumption faster than before.

Will update if it stabilizes or continues to climb.

4tal commented 5 years ago

image

lutter commented 5 years ago

Seems like there's something else going on, too. Is there any way that I could log into your db and have a look at what pg_stat_statements says about slow queries?

4tal commented 5 years ago

Sure, please check the query results, and if that's not enough we'll get you in :) image

lutter commented 5 years ago

Nothing really jumps out at me from these times, other than that there are multiple entries for the same queries. Are you running several graph-node instances against this DB, or did you drop and recreate the graph-node database on that instance?

The next thing to try would be another indexing run with a fresh DB; before starting the indexing, execute select pg_stat_statements_reset(); to delete old statistics. Also, could you set log_min_duration_statement to 500 in your database config? That will log any queries that take longer than 500ms.
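
Concretely, those two steps might look like this (on RDS, log_min_duration_statement is normally changed through the DB parameter group; ALTER SYSTEM is shown only for a self-managed Postgres):

  -- wipe old statistics before the fresh indexing run
  SELECT pg_stat_statements_reset();

  -- log any statement slower than 500 ms (self-managed Postgres;
  -- on RDS, set log_min_duration_statement = 500 in the parameter group)
  ALTER SYSTEM SET log_min_duration_statement = 500;
  SELECT pg_reload_conf();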

When things are lagging again, can you send me the result of the above pg_stat_statements sorted by max_time desc and the postgresql.log that was produced during the run?
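
For concreteness, the sorted variant of the earlier query would be something like:

  select query, calls, max_time, mean_time, rows
    from pg_stat_statements
   order by max_time desc
   limit 10;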

4tal commented 5 years ago

Yep, I'm running two indexing-node instances, one for staging and one for testing, and both DBs are on that one RDS instance. Should I split them into two RDS instances? Is that a problem?

lutter commented 5 years ago

In general, it should be fine, but since we don't know what's causing your issue, it's better if we can look at it in isolation.