numbersprotocol / numbers-network

Mainnet explorer not updating #77

Open sync-by-unito[bot] opened 9 months ago

sync-by-unito[bot] commented 9 months ago

Numbers Mainnet (Jade) Numbers Explorer has stopped getting new transactions for over 24 hours.

blockscout | 2023-11-27T03:52:37.436 application=db_connection [info] Postgrex.Protocol (#PID<0.4124.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.11230.53> exited

blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4159.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22279.53> exited

blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4132.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22284.53> exited

blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4121.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.11168.53> exited

blockscout | 2023-11-27T03:52:37.448 application=db_connection [info] Postgrex.Protocol (#PID<0.4112.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22288.53> exited

There appear to be many DB connection errors in the app, but the explorer instance can connect to the database with psql and to Redis with redis-cli without a problem.

Might be related to these warnings:


blockscout               | 2023-11-27T04:19:44.619 application=ethereum_jsonrpc fetcher=internal_transaction count=2 [warning] Call from a callTracer with an unknown type: %{"from" => "0x0000000000000000000000000000000000000000", "gas" => "0x0", "gasUsed" => "0xf4240", "input" => "0x", "type" => "STOP"}

blockscout               | 2023-11-27T04:19:44.658 application=ethereum_jsonrpc fetcher=internal_transaction count=2 [warning] Call from a callTracer with an unknown type: %{"from" => "0x0000000000000000000000000000000000000000", "gas" => "0x0", "gasUsed" => "0x2dc6c0", "input" => "0x", "type" => "STOP"}

┆Issue is synchronized with this Asana task by Unito ┆Created By: James Chien

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

A Blockscout dev suggested upgrading to the latest backend version ( https://github.com/blockscout/blockscout/issues/8888 ). I checked the related code ( https://github.com/blockscout/blockscout/blob/1c1d143c2975487722d102e888d3c3c930458b38/apps/ethereum_jsonrpc/lib/ethereum_jsonrpc/geth.ex#L278-L293 ), and it seems the latest version did add handling for the STOP call type; the missing handling in our version might be the root cause of the issue.

However, there are a lot of changes in the newer version. I tried to include only the related changes, but that brought in more issues. I'm currently trying to upgrade to the newer version.

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Similar errors also happen in the upstream master version with a completely new DB setup. Need to investigate further.

backend | {"time":"2023-11-28T05:48:42.717Z","severity":"info","message":"Postgrex.Protocol (#PID<0.4188.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.5197.0> exited","metadata":{}}

sync-by-unito[bot] commented 9 months ago

➤ Bofu Chen commented:

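James Chien Is my understanding correct?

  1. The original explorer code cannot handle the STOP call type, but mainnet emitted a STOP call, so the whole explorer broke (and cannot recover?)
  2. The new explorer code fixes #1, but the fix comes with too many changes, making it hard to extract the STOP call fix on its own
  3. The new explorer code currently does not work properly
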
sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Bofu Chen

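Probably not, because after applying the STOP call type update (the warning no longer appears) the situation is still the same.

I can't get the new version of the explorer running either; it hits the same problem.

I'm still looking at where the database disconnect happens.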

sync-by-unito[bot] commented 9 months ago

➤ Bofu Chen commented:

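James Chien If not (do you mean #1 does not hold?), is it that both the old and new explorers hit database disconnects for an unknown reason?

Random idea: could it be a compatibility issue caused by the instance automatically upgrading PostgreSQL?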

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Bofu Chen Yes

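It's probably not the Postgres version. Postgres is currently run by docker compose, so there is no automatic upgrade issue.

Also, psql can connect normally, and restarting the DB did not solve the problem. My current guess is that the blockscout backend hits an exception after performing some jsonrpc-related operation, and there is no corresponding error logging, so we only see the postgres disconnect error. I'm looking into this part now.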

sync-by-unito[bot] commented 9 months ago

➤ Bofu Chen commented:

ChatGPT's input ( https://chat.openai.com/share/235095fe-3068-4e88-8079-fad2d5cfd826 ) for your reference, although it doesn't look very helpful.

sync-by-unito[bot] commented 9 months ago

➤ Olga commented:

James Chien I found this log in the database:


2023-11-28 08:25:48.012 UTC [39233] ERROR:  canceling statement due to user request
2023-11-28 08:25:48.012 UTC [39233] STATEMENT:  SELECT DISTINCT ON (f1."number") f1."number" FROM "blocks" AS b0 RIGHT OUTER JOIN (
    SELECT distinct b1.number
    FROM generate_series(($1)::integer, ($2)::integer) AS b1(number)
    WHERE NOT EXISTS
      (SELECT 1 FROM blocks b2 WHERE b2.number=b1.number AND b2.consensus)
    ORDER BY b1.number DESC
  )
  AS f1 ON b0."number" = f1."number" ORDER BY f1."number"

2023-11-28 08:26:00.098 UTC [39402] ERROR:  canceling statement due to user request
2023-11-28 08:26:00.098 UTC [39402] STATEMENT:  SELECT DISTINCT ON (f1."number") f1."number" FROM "blocks" AS b0 RIGHT OUTER JOIN (
    SELECT distinct b1.number
    FROM generate_series(($1)::integer, ($2)::integer) AS b1(number)
    WHERE NOT EXISTS
      (SELECT 1 FROM blocks b2 WHERE b2.number=b1.number AND b2.consensus)
    ORDER BY b1.number DESC
  )
  AS f1 ON b0."number" = f1."number" ORDER BY f1."number"

It seems like the data is growing, and the SQL query is taking too much time to run.
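
For readers unfamiliar with the statement in the log above: it appears to list the block numbers in a range that do not yet have a consensus block, in ascending order. A minimal Python sketch of the same set difference follows. The function name and the in-memory inputs are hypothetical stand-ins for the `blocks` table and the `$1`/`$2` parameters; this is not Blockscout code.

```python
from typing import Iterable, List


def missing_block_numbers(first: int, last: int,
                          consensus_numbers: Iterable[int]) -> List[int]:
    """Return block numbers in [first, last] that have no consensus block.

    Mirrors the logged query: generate_series(first, last) is inclusive on
    both ends, and rows are kept only when no consensus block exists for
    that number. The result is sorted ascending, like the outer ORDER BY.
    """
    have = set(consensus_numbers)  # numbers already indexed with consensus = true
    return [n for n in range(first, last + 1) if n not in have]


if __name__ == "__main__":
    # Hypothetical data: blocks 3 and 5 have not been indexed yet.
    print(missing_block_numbers(1, 5, [1, 2, 4]))
```

When every block in the range is already indexed, the result is empty, which matches a plan that returns zero rows.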

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Olga I noticed this log too, but the frequency of this error does not match that of the database disconnect errors, and the query doesn't really seem to take long:

Unique  (cost=29975.84..29976.84 rows=200 width=4) (actual time=254.137..254.140 rows=0 loops=1)
  ->  Sort  (cost=29975.84..29976.34 rows=200 width=4) (actual time=254.136..254.139 rows=0 loops=1)
        Sort Key: b1.number
        Sort Method: quicksort  Memory: 25kB
        ->  Nested Loop Left Join  (cost=29112.12..29968.20 rows=200 width=4) (actual time=254.131..254.133 rows=0 loops=1)
              ->  Sort  (cost=29111.70..29112.20 rows=200 width=4) (actual time=254.130..254.132 rows=0 loops=1)
                    Sort Key: b1.number DESC
                    Sort Method: quicksort  Memory: 25kB
                    ->  HashAggregate  (cost=29102.05..29104.05 rows=200 width=4) (actual time=254.126..254.128 rows=0 loops=1)
                          Group Key: b1.number
                          Batches: 1  Memory Usage: 40kB
                          ->  Hash Anti Join  (cost=18106.55..28660.99 rows=176424 width=4) (actual time=254.122..254.123 rows=0 loops=1)
                                Hash Cond: (b1.number = b2.number)
                                ->  Function Scan on generate_series b1  (cost=0.00..3528.49 rows=352849 width=4) (actual time=27.607..58.443 rows=352849 loops=1)
                                ->  Hash  (cost=12316.93..12316.93 rows=352849 width=8) (actual time=93.922..93.923 rows=352849 loops=1)
                                      Buckets: 131072  Batches: 8  Memory Usage: 2754kB
                                      ->  Index Only Scan using one_consensus_block_at_height on blocks b2  (cost=0.42..12316.93 rows=352849 width=8) (actual time=0.031..39.602 rows=352849 loops=1)
                                            Heap Fetches: 4064
              ->  Index Only Scan using blocks_number_index on blocks b0  (cost=0.42..4.26 rows=1 width=8) (never executed)
                    Index Cond: (number = b1.number)
                    Heap Fetches: 0
Planning Time: 0.262 ms
Execution Time: 255.480 ms

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Lots of idle connections (example row from pg_stat_activity):

datid            | 16384
datname          | blockscout
pid              | 50048
leader_pid       |
usesysid         | 10
usename          | postgres
application_name |
client_addr      | 172.31.85.175
client_hostname  |
client_port      | 34048
backend_start    | 2023-11-28 09:19:31.612246+00
xact_start       |
query_start      | 2023-11-28 09:20:51.30981+00
state_change     | 2023-11-28 09:20:51.309846+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query_id         |
query            | SELECT b0."number", b0."hash" FROM "blocks" AS b0 WHERE ((b0."consensus" = TRUE) AND b0."number" = ANY($1))
backend_type     | client backend

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

I think mainnet websocket endpoint is down. I'm checking this.

websockets.exceptions.InvalidStatusCode: server rejected WebSocket connection: HTTP 502
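
A check like the one above can also be done with only the Python standard library, by sending the WebSocket opening handshake and reading the HTTP status code. The sketch below is one possible way to do it, not the script used in this thread. The host `rpc.example.org` and path `/ws` are placeholders, and all function names are hypothetical. A healthy endpoint should answer 101 (Switching Protocols); a 502 normally means the proxy could not reach its upstream.

```python
import base64
import os
import socket
import ssl


def build_handshake(host: str, path: str) -> bytes:
    """Build a minimal WebSocket (RFC 6455) opening handshake request."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")  # random 16-byte nonce
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status(response: bytes) -> int:
    """Return the HTTP status code from the first line of a response."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    return int(status_line.split(" ", 2)[1])


def probe(host: str, path: str, port: int = 443,
          use_tls: bool = True, timeout: float = 10.0) -> int:
    """Send the handshake and return the status code the server replies with."""
    request = build_handshake(host, path)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        if use_tls:
            context = ssl.create_default_context()
            with context.wrap_socket(raw, server_hostname=host) as tls:
                tls.sendall(request)
                return parse_status(tls.recv(4096))
        raw.sendall(request)
        return parse_status(raw.recv(4096))


if __name__ == "__main__":
    # Placeholder endpoint; substitute the real RPC host and ws path.
    print(probe("rpc.example.org", "/ws"))
```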

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Bofu Chen Can you help check whether it is normal that no avalanche-go process is running on numbers-mainnet-validator-1 (10.128.0.9)?

I think this is what makes the mainnet websocket endpoint unavailable. The explorer requires that endpoint to get new transactions.

In nginx, the ws endpoint is proxied with proxy_pass http://validator/ext/bc/2PDRxzc6jMbZSTLb3sufkVszgQc2jtDnYZGtDTAAfom1CTwPsE/ws;

which points at the validator instance I mentioned above. The http rpc endpoint and the archive node endpoint use the archive node instead of the validator-1 instance, so only the websocket endpoint is dead.

sync-by-unito[bot] commented 9 months ago

➤ James Chien commented:

Bofu Chen I found that the archive_nodes also support the websocket endpoint, so I modified the nginx config on mainnetrpc to use the archive node for the websocket endpoint. The explorer has now resumed picking up new transactions.

sync-by-unito[bot] commented 9 months ago

➤ Bofu Chen commented:

James Chien val-m1 is not working, and I'm restarting it

sync-by-unito[bot] commented 9 months ago

➤ Bofu Chen commented:

It's correct to retrieve the onchain info from the archive node. We will keep the validators focused on validating transactions.

sync-by-unito[bot] commented 6 months ago

➤ Bofu Chen commented:

Although I increased the mainnet explorer instance's CPU cores from 2 to 4 ( https://app.asana.com/0/0/1206800502333768/f ), the explorer is still unstable, with two issues:

  1. The explorer cannot sync onchain data as fast as Routescan.
  2. The explorer website is inaccessible now (I will make it work first but need James to figure out the root cause).

Tammy Yang I have reopened the task directly; please help add it to the next sprint.