paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0

Calling `api.rpc.chain.getFinalizedHead` freezes whole WebSocket JSONRPC session #11239

Open krhougs opened 2 years ago

krhougs commented 2 years ago

Description of bug

In some cases, calling api.rpc.chain.getFinalizedHead in Polkadot.js against the parachain fails with no error message other than a timeout. After this, the whole JSON-RPC WebSocket session is frozen, and we get health warnings indicating that the keepalive requests receive no response.

It only happens on the Khala parachain; the same RPC method works just fine on the Kusama node. This has been reproduced by the miner community.

Related:

Steps to reproduce

No response

krhougs commented 2 years ago

Logs on the Polkadot.js client:

2022-04-19 08:24:24        RPC-CORE: getFinalizedHead(): BlockHash:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":46,"level":50,"msg":"Failed to update target: Error: No response received from RPC endpoint in 30s\n    at WsProvider.value (/opt/app/node_modules/@polkadot/rpc-provider/ws/index.cjs:382:32)\n    at Timeout._onTimeout (/opt/app/node_modules/@polkadot/rpc-provider/ws/index.cjs:428:90)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)","time":"2022-04-19T08:24:24.632Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12318998,"msg":"Fetched parent block.","time":"2022-04-19T08:24:24.862Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12318999,"msg":"Fetched parent block.","time":"2022-04-19T08:24:24.984Z","v":0}
2022-04-19 08:24:29        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:24:29        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:24:29        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319000,"msg":"Fetched parent block.","time":"2022-04-19T08:24:31.136Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 399, sentPerSecAvg: 79.8 }","time":"2022-04-19T08:24:34.304Z","v":0}
2022-04-19 08:24:37        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 772, sentPerSecAvg: 154.4 }","time":"2022-04-19T08:24:39.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319001,"msg":"Fetched parent block.","time":"2022-04-19T08:24:40.248Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 975, sentPerSecAvg: 195 }","time":"2022-04-19T08:24:44.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 837, sentPerSecAvg: 167.4 }","time":"2022-04-19T08:24:49.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319002,"msg":"Fetched parent block.","time":"2022-04-19T08:24:49.366Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319003,"msg":"Fetched parent block.","time":"2022-04-19T08:24:52.476Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 825, sentPerSecAvg: 165 }","time":"2022-04-19T08:24:54.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 907, sentPerSecAvg: 181.4 }","time":"2022-04-19T08:24:59.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319004,"msg":"Fetched parent block.","time":"2022-04-19T08:25:01.615Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 839, sentPerSecAvg: 167.8 }","time":"2022-04-19T08:25:04.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 636, sentPerSecAvg: 127.2 }","time":"2022-04-19T08:25:09.305Z","v":0}
2022-04-19 08:25:09        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:25:09        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:25:09        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
bkchr commented 2 years ago

Do you have any more information about when this happens? Is there something happening on the parachain? Can you reproduce this locally? Can you always reproduce this for the same block?

krhougs commented 2 years ago

Background

To mine on Khala/Phala, there is a tool that synchronizes block data into the TEE enclave and verifies it there. In its synchronization loop, this tool queries api.rpc.chain.getFinalizedHead to get the current finalized height.
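
Below is a minimal sketch of that kind of loop, assuming a placeholder endpoint (wss://khala.example) and a fixed polling interval; it is not the actual runtime-bridge code:

```ts
import { ApiPromise, WsProvider } from '@polkadot/api';

async function main() {
  // Placeholder endpoint, not the one the real tool connects to.
  const provider = new WsProvider('wss://khala.example');
  const api = await ApiPromise.create({ provider });

  for (;;) {
    // This is the call that never gets a response in the reported case.
    const finalizedHash = await api.rpc.chain.getFinalizedHead();
    const finalizedHeader = await api.rpc.chain.getHeader(finalizedHash);
    console.log('finalized height:', finalizedHeader.number.toNumber());

    // Crude pacing for the sketch; the real tool drives this from its own sync state.
    await new Promise((resolve) => setTimeout(resolve, 6000));
  }
}

main().catch(console.error);
```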

The issue

This issue was found by our mining community. Several miners are facing it, but unfortunately we cannot reproduce it in our development environment.

In this specific case (Phala-Network/runtime-bridge#83), the user started the tool when both the parent chain and the parachain were synced to the network height. Its JSON-RPC session froze while querying getFinalizedHead at 2022-04-19T08:23:29.530Z, just as the Khala chain reached block 1503872.

Since I am not familiar with Substrate and Rust, I can only guess that querying getFinalizedHead while the parachain node is finalizing a newly arrived block causes a deadlock. Correct me if I'm wrong :)

As a workaround I can set a watchdog on the JSON-RPC session's keepalive failures, but it does affect the reliability of the mining toolchain and the miners' earnings.
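
For reference, here is a rough sketch of such a watchdog (hypothetical names and intervals, not the real implementation): it races api.rpc.system.health() against a deadline and reconnects when the deadline is hit.

```ts
import { ApiPromise } from '@polkadot/api';

const KEEPALIVE_TIMEOUT_MS = 30_000;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error('keepalive timed out')), ms)
    ),
  ]);
}

async function watchdog(api: ApiPromise, reconnect: () => Promise<void>) {
  for (;;) {
    try {
      // The same keepalive call that polkadot-js itself issues periodically.
      await withTimeout(api.rpc.system.health(), KEEPALIVE_TIMEOUT_MS);
    } catch {
      // The session is considered frozen: drop it and start a new connection.
      await api.disconnect();
      await reconnect();
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}
```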

niklasad1 commented 2 years ago

Hmm, so you don't have any logs from the node that you connected to?

Did you try to drop the WebSocket connection and start a new one?

My hunch is that you run in to https://github.com/paritytech/substrate/issues/10672

krhougs commented 2 years ago

For this case we used RUST_LOG="trace,soketto=warn,jsonrpsee_ws_client=warn,remote-ext=warn,trie=warn,wasmtime_cranelift=warn" but found nothing valuable in the output.

Full log: https://github.com/Phala-Network/runtime-bridge/files/8511920/NODE.%2B.data_provider.zip

niklasad1 commented 2 years ago

Hey again,

Thanks for the logs, but they only contain node logs from 2022-04-19 08:26 onwards, whereas the actual getFinalizedHead call was made around 2022-04-19 08:24. So can you please provide the full logs from the node?

It should be sufficient to run the RPC node with RUST_LOG="rpc=debug" to begin with.

bkchr commented 2 years ago

@niklasad1 something I thought about: why isn't this "ping-pong" mechanism that polkadot-js uses baked into the protocol?

niklasad1 commented 2 years ago

something I thought about: why isn't this "ping-pong" mechanism that polkadot-js uses baked into the protocol?

Good question. It should be "baked" into the WebSocket protocol, but when I checked polkadot-js it doesn't really use WebSocket ping/pong; instead, the warnings are printed when a call takes longer than some pre-defined timeout. But I could be wrong.

soketto supports it and ws-rs "should too"

bkchr commented 2 years ago

@jacogr why not use the "ping-pong" protocol of websockets?

jacogr commented 2 years ago

I honestly cannot recall why system_health specifically is used to keep the connection open in a ping/pong fashion. This is all from early 2018. All I know is that the server drops the connection if we don't do it, i.e. if the server doesn't receive anything from the client within a 60 second period it drops the connection and assumes the client is dead ...which is not that useful if you have subscriptions open....

(The timeout on specific calls is something completely different: it is on a per-call basis, where it assumes something went wrong if a response is not received for a query inside a timeout window.)

I can certainly investigate WS-specific ping/pong, though I'm not sure all the client libs support it (there are e.g. different usages for Node.js vs the browser). I've never looked into it, so I cannot comment on server or client lib support, nor whether it gets around the underlying drops.

(Although this doesn't look promising: this answer mentions that the ping/pong is normally initiated by the server, not the client - https://stackoverflow.com/a/50883592 - and this one says specifically that the JS browser side has no control over it - https://stackoverflow.com/a/10586583 - so it would rather seem to be a server-side concern.)
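
For illustration, a small sketch (placeholder endpoint, assuming the Node.js 'ws' package) of what client-initiated WebSocket ping/pong looks like in Node; the browser WebSocket API exposes no equivalent hooks, which is the Node.js-vs-browser difference mentioned above:

```ts
import WebSocket from 'ws';

// Placeholder endpoint for the sketch.
const ws = new WebSocket('wss://khala.example');

ws.on('open', () => {
  // Node's 'ws' lets the client send protocol-level pings...
  setInterval(() => ws.ping(), 30_000);
});

ws.on('pong', () => {
  // ...and observe the pongs coming back, which a browser client cannot do.
  console.log('pong received');
});
```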

niklasad1 commented 2 years ago

OK, we can add these as "server-side pings" once jsonrpsee lands in Substrate, if that works better. soketto supports it, but it needs a fix in jsonrpsee (should be trivial though).