paritytech / polkadot

Polkadot Node Implementation
GNU General Public License v3.0
7.12k stars 1.58k forks source link

Incident: Kusama - Mangata Parachain halt for ~30 mins #6758

Closed goncer closed 1 year ago

goncer commented 1 year ago

We had a small incident and we can not guess the root cause of it. The incident happened between 2023-02-20 23:25:01 - 2023-02-20 23:41:10 CET. We had a chain halt during that time, and we do not understand why.

Attached are my collator logs if help! logs_incident_23_25_to_23_41.txt

Network: Kusama Parachain: Mangata - 2110

Tomen commented 1 year ago

possibly related: https://substrate.stackexchange.com/questions/7302/parachain-missed-blocks-due-to-relay-wrong-announcement

bkchr commented 1 year ago

Which Polkadot branch are you using?

eskimor commented 1 year ago

At that time there have been quite a few disputes happening. But looking at metrics, they seem to have been handled well. Load looks normal, ToFs are very good as well. So while this might be related, it is not obvious how.

eskimor commented 1 year ago

In the logs, worrying but before the incident time frame:

Block import error: Potential long-range attack: block not in finalized chain.

Also interesting, there are a couple of:

2023-02-20 23:32:51 [Parachain] Starting collation. relay_parent=0x870b20d669443a9d7b536a93b72f865c7d8ab5fd9d513f84a289f87d5f30a204 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:32:51 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:32:51 [Parachain] Starting collation. relay_parent=0x392d1c08565a2582d47efd089c4d01d2d4fd43409af6dc64c5e5ebe1ce4a6a12 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:32:51 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:32:51 [Parachain] Starting collation. relay_parent=0x6096bcce683be5b55ddd56b2b8325f7f96460dd18f07876130aa207f5f5d8c89 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f

so we seem to be having three forks, but even more interesting there is no PoV size message afterwards, which is supposed to be there if things go fine.

Seems like the only thing that could go wrong here without an immediate log (maybe higher level) is produce candidate.

There are PoV size messages much later, but way less than Starting collation, also unfortunately those message contain no relay parent, so it is not clear to which Starting collation message they actually belong.

We have up to 5 forks:

vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:37:01 [Parachain] Starting collation. relay_parent=0x91e7a8b2c4a608fd04fc950753be49421cda1c80e7a1055944ee6d51dbcde6b0 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:37:01 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:37:01 [Parachain] Starting collation. relay_parent=0x551a1ac9424c851b8322ba89017c631653c4f7492c000a2a5c07a5b7c056382f at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:37:01 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:37:01 [Parachain] Starting collation. relay_parent=0x493dbb6440bad814e8aae9be043803d8df760161a2514ac2045396d3b7db0ea8 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:37:01 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:37:01 [Parachain] Starting collation. relay_parent=0xa6d42a8b478fe2244ae9705984711b96b0715783431bbd51b3c7f60c649518c4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:37:01 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:37:01 [Parachain] Starting collation. relay_parent=0xa2231d7dea7fc6315a7b43d8bfcbe4c208ed5686e3f7f571229bd2439bd805b4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f

Oh no, it gets way worse later:

vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xb2f7e45040b1c62d2b78b8aab9197bf024b88479dd17e0ca5dba60cebf18178b at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x233c9479b58a1248d43452ceb33f272eb07af4ac7b7fcdcff39a9ec57df3dad6 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x578be6151c3a508e0932594f65780cd05388955b9162e5d56d53fbc823c5ad6c at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x1cb7b9376f67a0bb1000cf98b53f031baaf06d028062bceccd722761103e7fa6 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xe0a4369dd3df353a5b373945708a671945baf78e34b3c917883a9da435283b36 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x2f3109e1303ad9ed01d3072af730a6265010ccad3951737a51dc390d3c3a9025 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xfd638d2624d102736cf7f00520d33f2c35b16e4699646a6b985bd52e7b6e25bd at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x9459b5a6a20360d12b3ae1b4e122d28621c38feab8db94252f2edea1a025b83e at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xa3af93ada81f2de9bad3def1b556559844b7ef90edeed9ca00eed1df35efed65 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xe160b27054770359fa6591a16b5015d428f6e15f7848e6423021dcd25ccb15da at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x38ed944caa626a09ffaa6ac6b52c29bf95379f8a3e7e39ac90086ccaa18182d1 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xa38aff861116cbcfd5306950faf7b2cd7b7ea082a58113b80984c4eed620934c at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x0267673a31b6599db2df403b8785be4bda2be9d57aeb9dd3100b2a5ce8d52d9f at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xc35436d2c44fe20447227e8ccd01933bf6ac229bdd39d12aed7d7036cf0100e1 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x71019901e1ce57d43a3083d789d67781dc5229ad2a2f5103171c62b582a8cba6 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xd38ea10181bf483cad8877d0a81d4d2a315aca890df0c6682d1fd52949118084 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x5cd0aa50d1a8fa006ef42309cc94fc8c06107604c55d58daa9e54ec5a21da16e at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x4f76e18b42e541f39c4461df3043a715242595ea3f6fd6a5f6ed5fdf7497e6a8 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x28cf6a35a309cec168571def5d40f0c948255c49b355baeba668d777d858ddc4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xea99e8b9277102c41e7ebd9ff16737f65ed9480b59f7cab0091e48b098869330 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x17ac132cacfc8dea044c06492452d7491b3cfbce1c2309baccb6a64df5462c4d at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xe4697c1c50a002d6d82298833a63ed394bc4b2f864e43fa73c70b56fa4ef5936 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xa3902f0455c420a5a7b26a134491647fccf2069d71cc5188730b84f9344519a3 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x5acd0327c062cd3677ba97c40d566f02ce7ec901174a3746309f687d6502e554 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x835b89b2024a979bac7c8c722596718c0f26db54fdf6a3be16b914255dcdbc18 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xeeb52a8646d45a32b4af434701497945e78da19fc7c301e7067bb6688873e0af at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x281f3256ab9e06be51bc3d0231fbdc7c34dd33dfb343a1fe8189a87ca416aec4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x9e6f4e4f3c31080d250c692a42bccd1c41fcdc914fca8a90131c5a0cb4324ca6 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xfcb7f7f4d9730a64ae82f2cd2089e909bc15e0a2a293d210a7bb100945f8e4f8 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0xbc39467182d56de9cf5e442eb3a07cb1f3ed86925cdc6c0c9c0bba08470ca189 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x93e03e214c2fe2c3763c007c3fae9f8cd6282e40e9fbd5b23da311a863bc6c02 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x9b868ce6ba35c982e62d2d0a9a05dd98dda9b2d307971130df709e2881430877 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x2c5a770691b6c4e5fce5ff567b43df97af01382c467a2d4ade591548f9f92cba at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x41f94799ad98714cfea4d685146cd064f9f60120fe2453a2b4f8dd405a53b2c4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x4d8218fa289894309cf45ff9d5b39fb70de18138e8bece5d29cdb846b9e9af46 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] 💤 Idle (33 peers), best: #1774198 (0x25f8…d439), finalized #1774197 (0xa717…803c), ⬇ 0.2kiB/s ⬆ 0.2kiB/s
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x0e51ee0c4a6df478e30442dbb9b6bc0454dfecbe20c28bc9867b398ed2109029 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x2ca184db276e347860b055b0c044cbce4c26ebf83365d05cd119852bc649e652 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0xee021368094449541f09549cfc933ae72014473d3c9229c9c6104a17466b6580 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0xda724c9b0f4c5a5f8585d75f986c1dea03e3cb429faf14d9c6df69091769d18d at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0xd413f1ebd0336a80496caaa7aba3f75610523afaaeb004c2f6798baacdb175d4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0xc5387f414b38541cde566f6a86ed2a7d41ef1919a02bac0d78ccfed17368693a at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x84960cbca729d227235ee614f3167e9d19121feecdef1795839882cdc373ccef at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x0baec6904818ec2164e4e6625382dbf56527ba111c21274b5cbc37d78b661e54 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x7ab77d0a36e8f75a940284c5dace206d3bb86419467a0994fcf8b322e44efe98 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x1916e18e2bfa64e32d6ca8a3f81c6a1caff1e3794d54a2aca1e1624bf3aacd46 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x04ee8a3d71fbd3d895c6aa14372f0203c8554a0d3573ab77f63395db1d0d73ff at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x09cbf9c9fe2704ceb8573f612e4db2440e7351620a15caa528cbf9e5785249c4 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f
Feb 20 23:47:00 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:47:00 [Parachain] Starting collation. relay_parent=0x2300eadc572c3bb5cfcd28e6d8848b4d8202dc9e8e3c93315c7c7214609582ea at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f

Interestingly, those excessive forks are after it has been recovered already.

Also those Starting collation without a PoV size message afterwards happen before the incident as well. So it might be normal.

eskimor commented 1 year ago

How was CPU load on that collator during that time frame?

bkchr commented 1 year ago

so we seem to be having three forks, but even more interesting there is no PoV size message afterwards, which is supposed to be there if things go fine.

If they are using AURA and it isn't the turn of their collator (slot wise) there is nothing more being printed. So, this is completely fine.

Interestingly, those excessive forks are after it has been recovered already.

These are no forks.

Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x233c9479b58a1248d43452ceb33f272eb07af4ac7b7fcdcff39a9ec57df3dad6 at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f Feb 20 23:46:59 vmi990291.contaboserver.net node_22[1427]: 2023-02-20 23:46:59 [Parachain] Starting collation. relay_parent=0x578be6151c3a508e0932594f65780cd05388955b9162e5d56d53fbc823c5ad6c at=0xb4eb14035ed38e2090fda95f28d61334ae9e2ebe2e7bda027db108b08bfdae6f

These are block 16724196 and block 16724197 that where imported 13 minutes before! To me it looks like there is for whatever reason a burst of active leaves messages.

goncer commented 1 year ago

How was CPU load on that collator during that time frame?

Sorry, I do have not such info. But I think this is not about a collator machine, since all the collators stopped producing blocks. I am going to try asking to some others collators if they have such metrics.

We have those if helps: https://gleburvanov.grafana.net/dashboard/snapshot/AxkrPQJDs3OeNfj1Tdnp84p8jajxOtP4?orgId=0 https://gleburvanov.grafana.net/dashboard/snapshot/j3G5Fm3ZiALaeUVmlkCpY8R3PSKHdjbY?orgId=0

https://github.com/paritytech/polkadot/issues/6758#issuecomment-1439657320 Develop 0.9.31 image

Thanks for investigation @eskimor , @bkchr . :pray: Let us know if you find any conclusion :)

bLd75 commented 1 year ago

Here is a description of the same incident on Shiden side. https://substrate.stackexchange.com/questions/7302/parachain-missed-blocks-due-to-relay-wrong-announcement

eskimor commented 1 year ago

Collators are running 0.9.31? I think I have an idea what was going on. This release does not yet contain: https://github.com/paritytech/polkadot/pull/6440 - this issue is even worse on collators than on validators, because the accumulated rate limit will always be SEND_RATE_LIMIT * number of active disputes: On each and every leaf update.

Upgrading nodes to a more recent release will very likely fix the issue.

What I think what happened is:

  1. A significant amount of disputes has been raised: Screenshot from 2023-03-06 11-50-30
  2. We consider them active for 3minutes. Therefore disputes keep piling up for 3 minutes, then any older disputes will be cleared. Hence peak dispute volume is 2x dispute rate * 180s. Which amounts worst case to around 2*0.2 disputes/s*180s = 72 disputes.
  3. Send rate limit is 150ms, this means 72 disputes will result in each active leaf update to take around: 10.79 seconds.
  4. Signal channel size is 64, meaning accumulated delay in leaf processing can be up to: 10.79*64 = 11.5 minutes .. considering forks, it might become even higher, so the 13 minutes could easily be explained by this.

Most of the time dispute rate was around 0.15 which means each active leaf update is around 8.1 seconds, right below the 10s timeout. The worst case of 0.2 was the peak: It recovered afterwards. I believe this is the reason we did not see any subsystem unresponsive errors.

  1. Dispute rate went down and the system recovered, after working through the backlog.

TL;DR: With https://github.com/paritytech/polkadot/pull/6440 in place, this should no longer happen: Please upgrade your nodes.

I will also prioritize this one, because there is no reason those subsystems should be active on collator nodes at all.

bkchr commented 1 year ago

Ahh that is a nice explanation @eskimor! I also had seen in the logs that the delay between import a relay chain block and triggering a collation also increased over time, which should also perfectly match your explanation!

goncer commented 1 year ago

Many many thanks for the explanation @eskimor , @bkchr . Its on the roadmap to update to l&g polkadot version :) Thanks! :crossed_fingers: