krhougs opened this issue 2 years ago (status: Open)
Logs on the Polkadot.js client:
2022-04-19 08:24:24 RPC-CORE: getFinalizedHead(): BlockHash:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":46,"level":50,"msg":"Failed to update target: Error: No response received from RPC endpoint in 30s\n at WsProvider.value (/opt/app/node_modules/@polkadot/rpc-provider/ws/index.cjs:382:32)\n at Timeout._onTimeout (/opt/app/node_modules/@polkadot/rpc-provider/ws/index.cjs:428:90)\n at listOnTimeout (node:internal/timers:559:17)\n at processTimers (node:internal/timers:502:7)","time":"2022-04-19T08:24:24.632Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12318998,"msg":"Fetched parent block.","time":"2022-04-19T08:24:24.862Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12318999,"msg":"Fetched parent block.","time":"2022-04-19T08:24:24.984Z","v":0}
2022-04-19 08:24:29 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:24:29 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:24:29 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319000,"msg":"Fetched parent block.","time":"2022-04-19T08:24:31.136Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 399, sentPerSecAvg: 79.8 }","time":"2022-04-19T08:24:34.304Z","v":0}
2022-04-19 08:24:37 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 772, sentPerSecAvg: 154.4 }","time":"2022-04-19T08:24:39.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319001,"msg":"Fetched parent block.","time":"2022-04-19T08:24:40.248Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 975, sentPerSecAvg: 195 }","time":"2022-04-19T08:24:44.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 837, sentPerSecAvg: 167.4 }","time":"2022-04-19T08:24:49.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319002,"msg":"Fetched parent block.","time":"2022-04-19T08:24:49.366Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319003,"msg":"Fetched parent block.","time":"2022-04-19T08:24:52.476Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 825, sentPerSecAvg: 165 }","time":"2022-04-19T08:24:54.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 907, sentPerSecAvg: 181.4 }","time":"2022-04-19T08:24:59.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319004,"msg":"Fetched parent block.","time":"2022-04-19T08:25:01.615Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 839, sentPerSecAvg: 167.8 }","time":"2022-04-19T08:25:04.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 636, sentPerSecAvg: 127.2 }","time":"2022-04-19T08:25:09.305Z","v":0}
2022-04-19 08:25:09 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:25:09 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:25:09 RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
Do you have any more information about when this happens? Is there something happening on the parachain? Can you reproduce this locally? Does it always reproduce at the same block?
To mine on Khala/Phala, there is a tool that synchronizes block data into the TEE enclave and verifies it there. This tool queries api.rpc.chain.getFinalizedHead to get the current finalized height in its synchronization loop.
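To make the failure mode concrete, here is a minimal sketch of such a polling loop with a timeout guard. This is not the actual runtime-bridge code: `getFinalizedHead` is an injected stand-in for `api.rpc.chain.getFinalizedHead` from @polkadot/api, so the loop logic can be shown (and tested) without a live node.

```javascript
// Sketch of a synchronization loop that polls the finalized head with a
// timeout guard, so a frozen JSON-RPC session fails fast instead of hanging.
// All names are illustrative; `getFinalizedHead` stands in for the real
// api.rpc.chain.getFinalizedHead call.

function withTimeout(promise, ms, label) {
  // Reject if `promise` does not settle within `ms` milliseconds.
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function syncLoop(
  getFinalizedHead,
  onHead,
  { intervalMs = 6000, timeoutMs = 30000, rounds = Infinity } = {}
) {
  for (let i = 0; i < rounds; i++) {
    const head = await withTimeout(getFinalizedHead(), timeoutMs, 'getFinalizedHead');
    onHead(head); // e.g. feed the finalized block hash to the enclave sync
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With a guard like this, a session freeze surfaces as a timeout rejection that the caller can handle, instead of the loop silently stalling.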
This issue was found by our mining community. Several miners are facing it, but unfortunately we cannot reproduce it in our development environment.
In this specific case (Phala-Network/runtime-bridge#83), the user started the tool when both the parent chain and the parachain were synced to the network height. Its JSON-RPC session froze while querying getFinalizedHead at 2022-04-19T08:23:29.530Z, when the Khala chain reached block 1503872.
Since I am not familiar with Substrate and Rust, I can only guess that querying getFinalizedHead while the parachain node is finalizing an incoming block causes a deadlock. Correct me if I'm wrong :)
As a workaround I can set up a watchdog for the JSON-RPC session's keepalive failures, but the issue still affects the reliability of the mining toolchain and the earnings of the miners.
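The watchdog workaround could look roughly like the following sketch. All names here are hypothetical: `checkHealth` stands in for a system_health call over the JSON-RPC session, and `reconnect` drops the frozen WebSocket and opens a new one.

```javascript
// Sketch of a keepalive watchdog: after `maxFailures` consecutive
// health-check timeouts, the session is treated as dead and reconnected.
// `checkHealth` and `reconnect` are injected, hypothetical callbacks.

function raceTimeout(promise, ms) {
  // Reject if `promise` does not settle within `ms` milliseconds.
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('health check timed out')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function keepaliveWatchdog(
  checkHealth,
  reconnect,
  { timeoutMs = 5000, maxFailures = 3, intervalMs = 10000, rounds = Infinity } = {}
) {
  let failures = 0;
  for (let i = 0; i < rounds; i++) {
    try {
      await raceTimeout(checkHealth(), timeoutMs);
      failures = 0; // any successful response resets the counter
    } catch (err) {
      failures += 1;
      if (failures >= maxFailures) {
        await reconnect(); // drop the frozen WebSocket, open a fresh one
        failures = 0;
      }
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Resetting the failure counter on any successful response avoids reconnect storms when the endpoint is merely slow rather than frozen.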
Hmm, so you don't have any logs from the node that you connected to?
Did you try to drop the WebSocket connection and start a new one?
My hunch is that you ran into https://github.com/paritytech/substrate/issues/10672
For this case we used RUST_LOG="trace,soketto=warn,jsonrpsee_ws_client=warn,remote-ext=warn,trie=warn,wasmtime_cranelift=warn" but found nothing valuable in the output.
Full log: https://github.com/Phala-Network/runtime-bridge/files/8511920/NODE.%2B.data_provider.zip
Hey again,
Thanks for the logs, but they only contain node logs from 2022-04-19 08:26, while the actual getFinalizedHead call was made around 2022-04-19 08:24. So can you provide the full logs from the node, please? It should be sufficient to run the RPC node with RUST_LOG="rpc=debug" to begin with.
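A possible invocation, for reference. Only the RUST_LOG value comes from the suggestion above; the binary name, chain flag, and log file are assumptions for illustration.

```shell
# Hypothetical: substitute your actual node binary and chain spec.
RUST_LOG="rpc=debug" ./khala-node --chain khala 2>&1 | tee node-rpc-debug.log
```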
@niklasad1 something I thought about: why isn't this "ping-pong" mechanism that polkadot-js uses baked into the protocol?
Good question. It should be "baked" into the WebSocket protocol, but when I checked polkadot-js it doesn't really use WebSocket ping/pong; instead, the warnings are printed when a call takes longer than some pre-defined timeout. I could be wrong, though.
soketto supports it, and ws-rs "should too".
@jacogr why not use the "ping-pong" protocol of WebSockets?
I honestly cannot recall why system_health is specifically used to keep the connection open in a ping/pong fashion; this is all from early 2018. All I know is that the server drops the connection if we don't specifically do it, i.e. if the server doesn't receive anything from the client inside a 60-second period, it drops the connection and assumes the client is dead... which is not that useful if you have subscriptions open.
(The timeout on specific calls is something completely different: it is per-call, where the client assumes something went wrong if a response to a query is not received inside a timeout window.)
I can certainly investigate WS-specific ping/pong; I'm not sure all the client libs have it either (there are, e.g., different usages for Node.js vs. the browser). I've never looked into it, so I cannot comment on server or client lib support, or whether it gets around the underlying drops.
(Although this doesn't look promising: https://stackoverflow.com/a/50883592 mentions that ping/pong is normally initiated by the server, not the client, and https://stackoverflow.com/a/10586583 mentions specifically that browser-side JS has no control over it, so it would rather seem like a server-side config.)
OK, we can add these as "server-side pings" once jsonrpsee lands in Substrate, if that works better. soketto supports it, but it needs a fix in jsonrpsee (which should be trivial, though).
Description of bug
In some cases, calling api.rpc.chain.getFinalizedHead from Polkadot.js against the parachain fails with no error message other than a timeout. After this, the whole JSON-RPC WebSocket session is frozen, and we get health warnings indicating that the keepalive requests received no response.
It only happens on the Khala parachain; the same RPC method works just fine on a Kusama node. Reproduced by the miner community.
Related:
Steps to reproduce
No response