paritytech / substrate-telemetry

Polkadot Telemetry service
GNU General Public License v3.0

Telemetry core: "Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)" #501

Open jsdw opened 1 year ago

jsdw commented 1 year ago

At some point recently, telemetry.polkadot.io went down with lots of errors like:

2022-09-30 10:33:26,536 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174701)
2022-09-30 10:33:26,538 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217267)
2022-09-30 10:33:26,905 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217346)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,202 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,204 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,680 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)
2022-09-30 10:33:29,421 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(216564)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
^C

Restarting the telemetry-core pod didn't help. Restarting the shards made things work again.

These errors imply that shards were sending information about nodes that the core knew nothing about.

Is there a chance that the core was restarted at some point (perhaps due to being out of memory or whatnot) and the shards didn't handle this properly, i.e. they never re-sent their node information to the new core?

Alternatively, is it possible that the connection between core and shards faltered and the core didn't properly clean up its internal state when this happened? (Right offhand I can't see anything that would drop all of the nodes in the core when a shard connection was lost).

The latter is also something that's a little harder to test locally (we'll have tested restarting shards and core plenty). Perhaps #497 also arose as a result of some connection issue like this that led to duplicates not being cleaned up?
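
To make the two hypotheses concrete, here is a minimal sketch of the kind of lookup table involved. It is not the actual `telemetry_core` code: `NodeIds`, `add_node`, `lookup` and `remove_shard` are hypothetical names, and `ConnId`/`ShardNodeId` are simple stand-ins modelled on what the log lines print.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the identifiers that appear in the log lines.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ConnId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardNodeId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(pub u64);

/// Hypothetical model of the core's lookup table (an assumption for
/// illustration, not the real aggregator state).
#[derive(Default)]
pub struct NodeIds {
    next_id: u64,
    by_shard: HashMap<(ConnId, ShardNodeId), NodeId>,
}

impl NodeIds {
    /// A shard announces a node; the core assigns it a global ID.
    /// Announcing the same node twice returns the existing ID.
    pub fn add_node(&mut self, conn: ConnId, local: ShardNodeId) -> NodeId {
        if let Some(id) = self.by_shard.get(&(conn, local)) {
            return *id;
        }
        let id = NodeId(self.next_id);
        self.next_id += 1;
        self.by_shard.insert((conn, local), id);
        id
    }

    /// A shard sends an update for a node it believes the core knows about.
    /// The `Err` case mirrors the message seen in the logs above.
    pub fn lookup(&self, conn: ConnId, local: ShardNodeId) -> Result<NodeId, String> {
        self.by_shard.get(&(conn, local)).copied().ok_or_else(|| {
            format!(
                "Cannot find ID for node with shard/connectionId of {:?}/{:?}",
                conn, local
            )
        })
    }

    /// Drop every node registered over `conn`, e.g. when that shard
    /// connection is lost. Returns how many entries were removed.
    pub fn remove_shard(&mut self, conn: ConnId) -> usize {
        let before = self.by_shard.len();
        self.by_shard.retain(|key, _| key.0 != conn);
        before - self.by_shard.len()
    }
}
```

In this model, a restarted core is just a fresh `NodeIds::default()`: any shard that keeps sending updates without calling `add_node` again hits the `Err` branch on every message, which would match the restart hypothesis. The second hypothesis is the mirror image: if something like `remove_shard` is never called when a connection drops, stale entries linger, and the core and shard can end up disagreeing about which nodes exist.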

jsdw commented 1 year ago

This might be resolved by https://github.com/paritytech/substrate-telemetry/pull/504