The attached note also includes the RPC and subgraph specifications.
Thanks for the detailed report! @lutter since you've benchmarked attempt_chain_head_update
before, what's your take on this issue?
After executing DELETE FROM ethereum_blocks it recovers for ~30 hours, and then it starts to lag again until the next execution - recovery time (indexing without lagging) is ~90 seconds from command execution.
If I did my math right, 30 hours is about 100,000 blocks; that's not that many. But seeing how this is on a t2.micro, the first thing to try would be to use a larger instance with more memory and more sustained IOPS.
For comparison, in the hosted service, we have about 12M rows in ethereum_blocks, and are seeing attempt_chain_head_update times of 2.5ms on average, and 1.1s maximum.
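For a rough comparison on your side, a query along these lines (plain Postgres catalog helpers, nothing graph-node specific) should show how big your ethereum_blocks table currently is:
-- Approximate row count and on-disk size of the block cache table.
-- reltuples is a planner estimate, which is much cheaper than count(*) on a big table.
select relname,
       reltuples::bigint as approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) as total_size
from pg_class
where relname = 'ethereum_blocks';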
It would be useful to look at pg_stat_statements after running for a while to see what the query timings look like. Connect to your database with psql -x and run
select query, calls, max_time, mean_time, rows
from pg_stat_statements
where query like '%attempt_chain_head_update%';
If that gives you an error, you'll have to create the pg_stat_statements extension and wait for a while for it to gather statistics.
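Creating the extension is typically a one-liner per database; note that pg_stat_statements also needs to be listed in shared_preload_libraries, which on RDS is configured through the DB parameter group rather than postgresql.conf:
-- Enable query statistics collection for this database.
create extension if not exists pg_stat_statements;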
You can also get a list of the slowest queries with
select query, calls, max_time, mean_time, rows
from pg_stat_statements
order by total_time desc
limit 10;
But beefing up your EC2 instance would be the first thing I would try.
Oh wait .. I misread your setup, and didn't see that you are on RDS. So changing your instance won't do anything. I am not familiar with RDS, but my advice is still to increase IOPS and possibly database memory.
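One rough way to tell whether more memory would help (standard Postgres statistics, not RDS-specific) is the buffer cache hit ratio for the graph-node database; as a rule of thumb, anything well below ~0.99 suggests the working set does not fit in memory:
-- Cache hit ratio for the current database; blks_hit are reads served from shared buffers.
select datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) as cache_hit_ratio
from pg_stat_database
where datname = current_database();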
This is from the normal bot channel (updates about the graph-node indexing status):
This is from the alerts; as you can see, every 8-14 hours it executes the DELETE FROM ethereum_blocks -
I'll try to play with the DB configuration later, or maybe next week. Please update if you have any other ideas.
Another update: I've switched to t2.small from t3.micro - more CPU, memory, connections, read IOPS, and write IOPS.
The new graph-node setup worked fine for 4 days, running 2 subgraphs.
Before switching to the t2 setup, it used to fail every 12 hours; my recovery solution was to automatically clear the cache (ethereum_blocks) whenever it detected a lag, and then it worked for another 12 hours.
The new setup survived for 4 straight days, until 5 hours ago. Then it started to lag, and the automatic cache clearing didn't solve it.
T2 DB state for the past 4 days:
Both t2.small and t3.micro are mentioned as 'burstable' in the RDS docs; could it be that they are being throttled because you went over some quota? I would try with something like an m5.large, especially since it says it's 'EBS optimized' which I assume means that it can sustain IOPS better.
Will try to upgrade it tonight, and will update here.
1 - Shows the previous crash of the graph-node when we used the t2.small (burstable) instance for the graph-node service. Then I restarted the service, and the CPU reset and started to accumulate again.
2 - Switched the DB instance to m5.large (EBS optimized + not burstable) - it seems that it accumulates CPU consumption faster than before.
Will update if it stabilizes or continues to climb.
Seems like there's something else going on, too. Is there any way that I could log into your db and have a look at what pg_stat_statements says about slow queries?
Sure, please check the query results, and if that's not enough we'll get you in :)
Nothing really jumps out at me from these times, other than that there are multiple entries for the same queries. Are you running several graph-node instances against this DB, or did you drop and recreate the graph-node database on that instance?
The next thing to look at would be to do another run of indexing with a fresh DB; before starting the indexing, execute select pg_stat_statements_reset(); to delete old statistics. Also, could you set log_min_duration_statement to 500 in your database config? That will log any queries that take longer than 500ms.
When things are lagging again, can you send me the result of the above pg_stat_statements query sorted by max_time desc, and the postgresql.log that was produced during the run?
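Concretely, something along these lines should do it (on RDS, log_min_duration_statement has to be set through the parameter group; ALTER SYSTEM only works on self-managed Postgres):
-- Clear old statistics right before the fresh indexing run.
select pg_stat_statements_reset();
-- Log every statement slower than 500ms (self-managed Postgres; on RDS use the parameter group).
alter system set log_min_duration_statement = 500;
select pg_reload_conf();
-- Once the lag shows up again, capture the worst offenders by maximum time.
select query, calls, max_time, mean_time, rows
from pg_stat_statements
order by max_time desc
limit 10;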
Yep, I'm running two indexing node instances, one for staging and one for testing, and the DBs are both on that one RDS instance. Should I split them into two RDS instances? Is that a problem?
In general, it should be fine, but since we don't know what's causing your issue, it's better if we can look at it in isolation.
Reporting after a short discussion with Leo.
After deploying graph-node to a private setup, it works fine for around ~20 hours, and then the memory and CPU start to rise until they reach a certain point where it starts to fall behind the current mined block. For example, if the current block is 1000, the last indexed block will be 400, i.e. 600 blocks behind; after 3 minutes the current block will be ~1180 and the index will be on ~780, and it will continue like that with a constant lag and will not recover.
A summary is attached at the following link: Summary in Evernote
The graph-node works perfectly for 20 hours, and then it happens.
Leo gave me a tip to run
DELETE FROM ethereum_blocks;
and it did make my graph-node recover and index as expected.