Open sync-by-unito[bot] opened 1 year ago
➤ James Chien commented:
The Blockscout devs suggested upgrading to the latest backend version ( https://github.com/blockscout/blockscout/issues/8888 ). I checked the related code ( https://github.com/blockscout/blockscout/blob/1c1d143c2975487722d102e888d3c3c930458b38/apps/ethereum_jsonrpc/lib/ethereum_jsonrpc/geth.ex#L278-L293 ) and it seems the latest version did add handling for the STOP call type, which might be the root cause of the issue.
However, there are a lot of changes in the newer version. I tried to include only the related changes, but more issues came with that. Currently I'm working on upgrading to the newer version.
➤ James Chien commented:
Similar errors also happen on the upstream master version, with a completely new DB setup. Need to investigate further.
backend | {"time":"2023-11-28T05:48:42.717Z","severity":"info","message":"Postgrex.Protocol (#PID<0.4188.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.5197.0> exited","metadata":{}}
➤ Bofu Chen commented:
James Chien Is my understanding correct?
➤ James Chien commented:
Bofu Chen
Probably not, because after the STOP call type update was done (the warning no longer appears), the situation is still the same.
I also can't get the new version of the explorer to run at the moment; it hits the same problem.
Still looking into where the database disconnect happens.
➤ Bofu Chen commented:
James Chien If not (do you mean #1 doesn't hold?), is it that both the old and new explorers are hitting the database disconnect for some unknown reason?
Random idea: could it be a compatibility problem caused by the instance automatically upgrading PostgreSQL?
➤ James Chien commented:
Bofu Chen Yes
It's probably not the Postgres version; the current Postgres runs under docker compose, so there is no auto-upgrade issue.
Also, psql can connect fine, and restarting the DB didn't solve the problem. My current guess is that the Blockscout backend hits an exception after doing something JSON-RPC related, and there is no corresponding error logging, so all we see is the Postgres disconnect error. I'm looking into that part now.
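For reference, one way to see what each backend connection was last doing around a disconnect is a minimal pg_stat_activity snapshot (a sketch; the database name blockscout is taken from the logs above, nothing else Blockscout-specific):
-- assumes PostgreSQL 9.6+ and the 'blockscout' database from the logs above
SELECT pid, state, wait_event_type, wait_event,
       state_change, left(query, 100) AS last_query
FROM pg_stat_activity
WHERE datname = 'blockscout'
ORDER BY state_change;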
➤ Bofu Chen commented:
ChatGPT's input ( https://chat.openai.com/share/235095fe-3068-4e88-8079-fad2d5cfd826 ) for your reference, for whatever it's worth, though it doesn't look very helpful.
➤ Olga commented:
James Chien I found this log in the database:
2023-11-28 08:25:48.012 UTC [39233] ERROR: canceling statement due to user request
2023-11-28 08:25:48.012 UTC [39233] STATEMENT: SELECT DISTINCT ON (f1."number") f1."number" FROM "blocks" AS b0 RIGHT OUTER JOIN (
SELECT distinct b1.number
FROM generate_series(($1)::integer, ($2)::integer) AS b1(number)
WHERE NOT EXISTS
(SELECT 1 FROM blocks b2 WHERE b2.number=b1.number AND b2.consensus)
ORDER BY b1.number DESC
)
AS f1 ON b0."number" = f1."number" ORDER BY f1."number"
2023-11-28 08:26:00.098 UTC [39402] ERROR: canceling statement due to user request
2023-11-28 08:26:00.098 UTC [39402] STATEMENT: SELECT DISTINCT ON (f1."number") f1."number" FROM "blocks" AS b0 RIGHT OUTER JOIN (
SELECT distinct b1.number
FROM generate_series(($1)::integer, ($2)::integer) AS b1(number)
WHERE NOT EXISTS
(SELECT 1 FROM blocks b2 WHERE b2.number=b1.number AND b2.consensus)
ORDER BY b1.number DESC
)
AS f1 ON b0."number" = f1."number" ORDER BY f1."number"
It seems like the data is growing, and the SQL query is taking too much time to run.
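One way to sanity-check the growth is to look at the size of the blocks table directly (a rough sketch; assumes the standard Blockscout blocks table in the public schema):
-- assumes the standard Blockscout schema with a public.blocks table
SELECT pg_size_pretty(pg_total_relation_size('blocks')) AS blocks_size,
       count(*) AS block_rows,
       max(number) AS max_block
FROM blocks;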
➤ James Chien commented:
Olga I noticed this log too, but the frequency of this error does not match the database disconnect error, and the query doesn't actually seem to take that long:
Unique (cost=29975.84..29976.84 rows=200 width=4) (actual time=254.137..254.140 rows=0 loops=1)
  ->  Sort (cost=29975.84..29976.34 rows=200 width=4) (actual time=254.136..254.139 rows=0 loops=1)
        Sort Key: b1.number
        Sort Method: quicksort Memory: 25kB
        ->  Nested Loop Left Join (cost=29112.12..29968.20 rows=200 width=4) (actual time=254.131..254.133 rows=0 loops=1)
              ->  Sort (cost=29111.70..29112.20 rows=200 width=4) (actual time=254.130..254.132 rows=0 loops=1)
                    Sort Key: b1.number DESC
                    Sort Method: quicksort Memory: 25kB
                    ->  HashAggregate (cost=29102.05..29104.05 rows=200 width=4) (actual time=254.126..254.128 rows=0 loops=1)
                          Group Key: b1.number
                          Batches: 1 Memory Usage: 40kB
                          ->  Hash Anti Join (cost=18106.55..28660.99 rows=176424 width=4) (actual time=254.122..254.123 rows=0 loops=1)
                                Hash Cond: (b1.number = b2.number)
                                ->  Function Scan on generate_series b1 (cost=0.00..3528.49 rows=352849 width=4) (actual time=27.607..58.443 rows=352849 loops=1)
                                ->  Hash (cost=12316.93..12316.93 rows=352849 width=8) (actual time=93.922..93.923 rows=352849 loops=1)
                                      Buckets: 131072 Batches: 8 Memory Usage: 2754kB
                                      ->  Index Only Scan using one_consensus_block_at_height on blocks b2 (cost=0.42..12316.93 rows=352849 width=8) (actual time=0.031..39.602 rows=352849 loops=1)
                                            Heap Fetches: 4064
              ->  Index Only Scan using blocks_number_index on blocks b0 (cost=0.42..4.26 rows=1 width=8) (never executed)
                    Index Cond: (number = b1.number)
                    Heap Fetches: 0
Planning Time: 0.262 ms
Execution Time: 255.480 ms
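For reference, a plan like the one above can be reproduced with something like the following; the 0 and 352848 bounds are hypothetical placeholders for the $1/$2 parameters, picked only to roughly match the row counts shown in the plan:
-- 0 and 352848 are hypothetical placeholders for the $1/$2 parameters
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (f1."number") f1."number"
FROM "blocks" AS b0
RIGHT OUTER JOIN (
  SELECT DISTINCT b1.number
  FROM generate_series(0, 352848) AS b1(number)
  WHERE NOT EXISTS
    (SELECT 1 FROM blocks b2 WHERE b2.number = b1.number AND b2.consensus)
  ORDER BY b1.number DESC
) AS f1 ON b0."number" = f1."number"
ORDER BY f1."number";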
➤ James Chien commented:
Lots of idle sessions in pg_stat_activity:
datid | 16384
datname | blockscout
pid | 50048
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr | 172.31.85.175
client_hostname |
client_port | 34048
backend_start | 2023-11-28 09:19:31.612246+00
xact_start |
query_start | 2023-11-28 09:20:51.30981+00
state_change | 2023-11-28 09:20:51.309846+00
wait_event_type | Client
wait_event | ClientRead
state | idle
backend_xid |
backend_xmin |
query_id |
query | SELECT b0."number", b0."hash" FROM "blocks" AS b0 WHERE ((b0."consensus" = TRUE) AND b0."number" = ANY($1))
backend_type | client backend
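To quantify this, a quick summary by state plus the longest-idle sessions (same pg_stat_activity view as above; nothing Blockscout-specific assumed beyond the database name):
-- count sessions per state / wait event for the blockscout database
SELECT state, wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE datname = 'blockscout'
GROUP BY state, wait_event
ORDER BY sessions DESC;

-- longest-idle sessions and the last statement each one ran
SELECT pid, now() - state_change AS idle_for, left(query, 80) AS last_query
FROM pg_stat_activity
WHERE datname = 'blockscout' AND state = 'idle'
ORDER BY idle_for DESC
LIMIT 10;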
➤ James Chien commented:
I think the mainnet websocket endpoint is down. I'm checking this.
websockets.exceptions.InvalidStatusCode: server rejected WebSocket connection: HTTP 502
➤ James Chien commented:
Bofu Chen Can you help check whether it is normal that there is no avalanche-go process running on numbers-mainnet-validator-1 (10.128.0.9)?
I think this is what makes the websocket endpoint on mainnet invalid. The explorer needs that endpoint to get new transactions.
In nginx, the ws endpoint is proxied with proxy_pass http://validator/ext/bc/2PDRxzc6jMbZSTLb3sufkVszgQc2jtDnYZGtDTAAfom1CTwPsE/ws;
which points to the validator instance I mentioned above. The HTTP RPC endpoint and the archive node endpoint use the archive node instead of the validator-1 instance, so only the websocket endpoint is dead.
➤ James Chien commented:
Bofu Chen I found that the archive_nodes also support a websocket endpoint, so I modified the nginx config on mainnetrpc to use the archive node for the websocket endpoint. The explorer is updating new transactions again now.
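A quick way to double-check that indexing is actually catching up again (a sketch; assumes the standard Blockscout blocks table with number, timestamp, and consensus columns):
-- assumes the standard Blockscout blocks schema; latest indexed block and its time
SELECT max(number) AS latest_block,
       max("timestamp") AS latest_block_time
FROM blocks
WHERE consensus;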
➤ Bofu Chen commented:
James Chien val-m1 is not working, and I'm restarting it
➤ Bofu Chen commented:
It's correct to retrieve the onchain info from the archive node. We will keep the validators focused on validating transactions.
➤ Bofu Chen commented:
Although I increased the mainnet explorer instance's CPU cores from 2 to 4 ( https://app.asana.com/0/0/1206800502333768/f ), the explorer is still unstable, with these 2 issues:
Tammy Yang I am reopening the task directly; please help add this task to the next sprint.
The Numbers Mainnet (Jade) Numbers Explorer has stopped getting new transactions for over 24 hours.
blockscout | 2023-11-27T03:52:37.436 application=db_connection [info] Postgrex.Protocol (#PID<0.4124.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.11230.53> exited
blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4159.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22279.53> exited
blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4132.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22284.53> exited
blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4121.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.11168.53> exited
blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4112.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22288.53> exited
It seems there are a lot of DB connection errors in the app, but the explorer instance can connect to the DB with psql and redis-cli without a problem.
Might be related to
┆Issue is synchronized with this Asana task by Unito ┆Created By: James Chien