paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.79k stars 646 forks source link

Parachain not producing blocks, is stalling finality #947

Open crystalin opened 3 years ago

crystalin commented 3 years ago

Finality on our internal relaychain is at 50 blocks behind our best block.

This is happening on a public parachain network (Moonbase Alphanet), using 10 validators for the relay (westend based) and ~30 external collators, with around 300+ external nodes.

As requested by @rphmeier , here are the logs of the validators (30 minutes) (the file contains all the validator logs and also all.log which is the aggregation of all the validators by time) alphanet-logs.zip

rphmeier commented 3 years ago

As seen in the logs, the validators aren't assigning to check parablocks. I suspect that keys aren't set correctly

crystalin commented 3 years ago

Providing more information while we are gathering more logs. Those validators are controlled by our ops, and no key rotation or no settings were changed recently (keys were set ~1 year ago). The only change was the deployment of client 0.9.9 and runtime upgrade to 9090 that was done few days before the incident.

The current 50 blocks finality lag hasn't resolved since it started (as opposed to the other finality lag bug that was discussed)

rphmeier commented 3 years ago

runtime upgrade to 9090 that was done few days before the incident

This could've done something.

The validator nodes aren't issuing assignments, so either they're not detecting that there are candidates in the blocks or they're not signing. https://github.com/paritytech/polkadot/blob/master/node/core/approval-voting/src/criteria.rs#L229 - do you see any way here they wouldn't have signed assignments with keys set?

Anyway, without proper logging we can't do anything but speculate.

crystalin commented 3 years ago

I added a "load_all_sessions" debug line and here is the output:

validator1 2021-09-16 21:58:00.054  INFO tokio-runtime-worker polkadot_node_subsystem_util::rolling_session_window: load_all_sessions: 2717 - SessionInfo { validators: [Public(94ba99e24f8894828efee3d1f249a946c3b7719e4e93c54125dd1d3925735055 (5FRiNmoi...)), Public(c24ea87a8adca8979b25de4c91f766a9470afa79f06a29baa661487d9db88013 (5GTUWiSX...)), Public(f0b7c17d36cce99cf38682f1be65153baeaad26bd5ef26f4d82c52926b4eaf08 (5HWKwGCV...)), Public(74ad69ea110e44718622cd00d3330d99333039170fa3c28918e986862a4e252d (5Ehgvgk1...)), Public(14142e35598a41c66b3d2c67c822c8ec4c1b9c5cce9ae29e47c806dfcfd0c204 (5CX2ov8t...)), Public(70d24081a2614caadb6b54a7e9a319eda90482e69128f410990b88b900130c5b (5EcdgHV8...)), Public(fe0b72e5b83fe7133f29390278d8aeb027ca8bcdf0e2d4b317ba4b8072621659 (5HooRN7J...)), Public(4ef6104e4ee58aac7af1359693644903972ef01ddd711b7c50bcf63b628ee728 (5DrEgssZ...)), Public(e2286c9ee6e090ac939ba3098ea2bf25c451128bef944f164df8ffe709e24809 (5HBEgqSK...)), Public(8a73809b840dd4ba9066226596531249bfc389417bf90f2ea41bac1f70e7d271 (5FCEmzon...))], discovery_keys: [Public(94ba99e24f8894828efee3d1f249a946c3b7719e4e93c54125dd1d3925735055 (5FRiNmoi...)), Public(c24ea87a8adca8979b25de4c91f766a9470afa79f06a29baa661487d9db88013 (5GTUWiSX...)), Public(f0b7c17d36cce99cf38682f1be65153baeaad26bd5ef26f4d82c52926b4eaf08 (5HWKwGCV...)), Public(74ad69ea110e44718622cd00d3330d99333039170fa3c28918e986862a4e252d (5Ehgvgk1...)), Public(14142e35598a41c66b3d2c67c822c8ec4c1b9c5cce9ae29e47c806dfcfd0c204 (5CX2ov8t...)), Public(70d24081a2614caadb6b54a7e9a319eda90482e69128f410990b88b900130c5b (5EcdgHV8...)), Public(fe0b72e5b83fe7133f29390278d8aeb027ca8bcdf0e2d4b317ba4b8072621659 (5HooRN7J...)), Public(4ef6104e4ee58aac7af1359693644903972ef01ddd711b7c50bcf63b628ee728 (5DrEgssZ...)), Public(e2286c9ee6e090ac939ba3098ea2bf25c451128bef944f164df8ffe709e24809 (5HBEgqSK...)), Public(8a73809b840dd4ba9066226596531249bfc389417bf90f2ea41bac1f70e7d271 (5FCEmzon...))], assignment_keys: [Public(94ba99e24f8894828efee3d1f249a946c3b7719e4e93c54125dd1d3925735055 (5FRiNmoi...)), Public(c24ea87a8adca8979b25de4c91f766a9470afa79f06a29baa661487d9db88013 (5GTUWiSX...)), Public(f0b7c17d36cce99cf38682f1be65153baeaad26bd5ef26f4d82c52926b4eaf08 (5HWKwGCV...)), Public(74ad69ea110e44718622cd00d3330d99333039170fa3c28918e986862a4e252d (5Ehgvgk1...)), Public(14142e35598a41c66b3d2c67c822c8ec4c1b9c5cce9ae29e47c806dfcfd0c204 (5CX2ov8t...)), Public(70d24081a2614caadb6b54a7e9a319eda90482e69128f410990b88b900130c5b (5EcdgHV8...)), Public(fe0b72e5b83fe7133f29390278d8aeb027ca8bcdf0e2d4b317ba4b8072621659 (5HooRN7J...)), Public(4ef6104e4ee58aac7af1359693644903972ef01ddd711b7c50bcf63b628ee728 (5DrEgssZ...)), Public(e2286c9ee6e090ac939ba3098ea2bf25c451128bef944f164df8ffe709e24809 (5HBEgqSK...)), Public(8a73809b840dd4ba9066226596531249bfc389417bf90f2ea41bac1f70e7d271 (5FCEmzon...))], validator_groups: [[ValidatorIndex(0), ValidatorIndex(1), ValidatorIndex(2), ValidatorIndex(3), ValidatorIndex(4)], [ValidatorIndex(5), ValidatorIndex(6), ValidatorIndex(7), ValidatorIndex(8), ValidatorIndex(9)]], n_cores: 2, zeroth_delay_tranche_width: 0, relay_vrf_modulo_samples: 0, n_delay_tranches: 25, no_show_slots: 2, needed_approvals: 2 }

I haven't verified them all but they seem to match the keys we expect (like 5GTUWiSX...)

Here are the logs of the validator1, enabling trace for parachain::approval-voting: validator1.log

crystalin commented 3 years ago

I just saw that our relay also has a 2nd parachain that has never produced blocks. Could that be the reason ?

crystalin commented 3 years ago

Hum that was it ... :( So an auction was set in order to test it long time ago and it was schedule to start on lease 4 of our relaychain and this one started few days ago which launch this parachain without any code in it.

So I didn't know but 1 parachain can prevent the finality on the relay. Sorry for the waste of time :(

eskimor commented 3 years ago

So I didn't know but 1 parachain can prevent the finality on the relay.

As @ordian pointed out, that sounds very much like a bug that should be fixed.

crystalin commented 3 years ago

Happy to help more in that case, let me know how we can help

rphmeier commented 3 years ago

Hm, there are two weird things going on.

  1. is that the invalid Wasm was accepted by the backers.
  2. is that no assignments were produced

Is the Wasm blob literally an empty file?

As I can see from exploring the parachain state, only parablocks from para 1000 were ever included. Was this the malicious parachain, or was it a different ID?

The only logs I see related to execution are about code hash 0x0da8bdc863da75349f268f648f8f48e6d1031dc6492e294f932a125376e19321 which I've verified is the current code for para 1000, which I presume is your Moonbeam parachain. I don't have all the details about your setup, but it seems like there's no collators for the other (null) para, nor does it ever get backed. Really not sure why this would throw things off, but yeah, definitely worth investigating.

rphmeier commented 3 years ago

https://github.com/polkadot-js/api/issues/3967 I tried to look in further but am blocked (presumably) by this.

crystalin commented 3 years ago

@rphmeier I confirm that the main parachain (1000) is moonriver and producing correctly. The other parachain (2003) didn't have any collator. The wasm might have been empty yes, as it was for a test