paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Fix flaky zombienet test `zombienet-substrate-0002-validators-warp-sync` #5974

Open pepoviola opened 1 week ago

pepoviola commented 1 week ago

Examples (yesterday, 10 of 78 runs failed):

https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7528201
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7527201
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7526808
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7526157
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7523749
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7519971
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7518972
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7518884
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7518515
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7517649

cc: @paritytech/sdk-node

michalkucharczyk commented 1 week ago

@pepoviola

I see this test internally is using substrate_beefy_best_block - why is it a beefy block and not just a best block? Test is just using best block: https://github.com/paritytech/polkadot-sdk/blob/6c2b46f9620fd147e63b151320cc145551fd2c18/substrate/zombienet/0001-basic-warp-sync/test-warp-sync.zndsl#L17-L19

And it seems that this number is not being synced fast enough:

2024-10-07T16:20:55.002Z zombie::network-node using comparator isAtLeast for 10797, 56687
2024-10-07T16:20:55.002Z zombie::network-node [alice] Current value: 10797 for metric substrate_beefy_best_block, keep trying...
2024-10-07T16:20:56.002Z zombie::network-node [alice] Fetching metrics - q: 174  time:  Mon Oct 07 2024 16:20:56 GMT+0000 (Coordinated Universal Time)
2024-10-07T16:20:56.026Z zombie::network-node returning for: substrate_beefy_best_block from ns: _raw
2024-10-07T16:20:56.026Z zombie::network-node returning: 10997
2024-10-07T16:20:56.026Z zombie::network-node using comparator isAtLeast for 10997, 56687
2024-10-07T16:20:56.026Z zombie::network-node [alice] Current value: 10997 for metric substrate_beefy_best_block, keep trying...
2024-10-07T16:20:57.027Z zombie::network-node [alice] Fetching metrics - q: 175  time:  Mon Oct 07 2024 16:20:57 GMT+0000 (Coordinated Universal Time)
2024-10-07T16:20:57.048Z zombie::network-node returning for: substrate_beefy_best_block from ns: _raw
2024-10-07T16:20:57.048Z zombie::network-node returning: 11197
2024-10-07T16:20:57.048Z zombie::network-node using comparator isAtLeast for 11197, 56687
2024-10-07T16:20:57.048Z zombie::network-node [alice] Current value: 11197 for metric substrate_beefy_best_block, keep trying...
     Error:  
         [alice] Timeout(180), "getting desired metric value 56687 within 180 secs".

At a quick glance, it looks like a beefy sync issue?
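To put a rough number on "not being synced fast enough", here is a back-of-the-envelope estimate from the three samples in the log above. It is only a sketch: it assumes the metric keeps advancing at the same rate seen in those samples, and it uses the timestamp of each "Current value" / "returning" log line as the sample time, which is an approximation. The helper names (`parse`, `estimate`) are made up for illustration and are not part of zombienet.

```python
from datetime import datetime

# (log timestamp, substrate_beefy_best_block value), copied from the log above.
samples = [
    ("2024-10-07T16:20:55.002Z", 10797),
    ("2024-10-07T16:20:56.026Z", 10997),
    ("2024-10-07T16:20:57.048Z", 11197),
]
TARGET = 56687        # desired metric value from the error message
TIMEOUT_SECS = 180    # timeout from the error message


def parse(ts: str) -> datetime:
    """Parse the ISO-like timestamps used in the zombienet log."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def estimate(samples, target):
    """Return (blocks per second, seconds still needed to reach target).

    Assumes a constant rate between the first and last sample.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    elapsed = (parse(t1) - parse(t0)).total_seconds()
    rate = (v1 - v0) / elapsed
    remaining_secs = (target - v1) / rate
    return rate, remaining_secs


rate, remaining_secs = estimate(samples, TARGET)
print(f"rate: {rate:.1f} blocks/s")
print(f"time still needed from last sample: {remaining_secs:.0f} s")
print(f"longer than the whole timeout: {remaining_secs > TIMEOUT_SECS}")
```

Under these assumptions the metric moves at a bit under 200 blocks per second, so the remaining gap alone would take longer than the full 180 second window. This says nothing about why the rate is what it is; it only shows how far the last sampled value was from the target.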

pepoviola commented 1 week ago

> I see this test internally is using substrate_beefy_best_block - why is it a beefy block and not just a best block? Test is just using best block:

Thanks for pointing this out, @michalkucharczyk. I think this is a bug. I will fix it and re-check this test.

Thanks for the help!

pepoviola commented 1 week ago

Hi @michalkucharczyk, I double-checked, and the lines you mention are from another test (0001). This issue is for 0002, which is set to check that metric:

https://github.com/paritytech/polkadot-sdk/blob/6c2b46f9620fd147e63b151320cc145551fd2c18/substrate/zombienet/0002-validators-warp-sync/test-validators-warp-sync.zndsl#L34-L38

Thanks!

michalkucharczyk commented 1 week ago

Oh, sorry.

So it looks like 180s is not enough for beefy to sync the best block. @serban300, what are your thoughts on this? Do you think it could be a regression?

serban300 commented 1 week ago

I don't think it's a regression, but I would have to look more closely. I'll look into it in the next few days.