paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.81k stars 659 forks source link

Investigate `peregrine` syncing issue when using the `polkadot-parachain` binary #4787

Open serban300 opened 3 months ago

serban300 commented 3 months ago

The polkadot-parachain bin should be able to run custom parachains that satisfy some assumptions, mainly using the aura consensus.

peregrine should be such a parachain but it doesn't work with polkadot-parachain. More details here: https://forum.polkadot.network/t/polkadot-parachain-omni-node-gathering-ideas-and-feedback/7823/18

serban300 commented 3 months ago

I'm seeing a lot of errors that go like:

2024-06-13 12:47:28.738 DEBUG tokio-runtime-worker sync: [Parachain] Substream opened for 12D3KooWALJtiCZzcUPVsCa5f5egGfQyFhPY67kKosDw95bJqK7M, handshake [4, 145, 221, 91, 0, 0, 0, 0, 0, 117, 190, 65, 105, 246, 208, 200, 8, 87, 59, 131, 14, 183, 20, 248, 133, 39, 127, 124, 43, 225, 152, 61, 108, 70, 116, 173, 12, 114, 110, 252, 129, 160, 198, 227, 186, 195, 130, 179, 22, 166, 139, 202, 113, 65, 175, 31, 186, 80, 114, 7, 89, 76, 118, 16, 118, 132, 124, 227, 88, 174, 237, 204, 33]    
2024-06-13 12:47:28.738 TRACE tokio-runtime-worker sync: [Parachain] New peer 12D3KooWALJtiCZzcUPVsCa5f5egGfQyFhPY67kKosDw95bJqK7M [4, 145, 221, 91, 0, 0, 0, 0, 0, 117, 190, 65, 105, 246, 208, 200, 8, 87, 59, 131, 14, 183, 20, 248, 133, 39, 127, 124, 43, 225, 152, 61, 108, 70, 116, 173, 12, 114, 110, 252, 129, 160, 198, 227, 186, 195, 130, 179, 22, 166, 139, 202, 113, 65, 175, 31, 186, 80, 114, 7, 89, 76, 118, 16, 118, 132, 124, 227, 88, 174, 237, 204, 33]    
2024-06-13 12:47:28.738 TRACE tokio-runtime-worker sync: [Parachain] Validate handshake for 12D3KooWALJtiCZzcUPVsCa5f5egGfQyFhPY67kKosDw95bJqK7M    
2024-06-13 12:47:28.738 DEBUG tokio-runtime-worker sync: [Parachain] Failed to decode handshake for 12D3KooWALJtiCZzcUPVsCa5f5egGfQyFhPY67kKosDw95bJqK7M: Error { cause: None, desc: "Input buffer has still data left after decoding!" }    
2024-06-13 12:47:28.738 DEBUG tokio-runtime-worker sync: [Parachain] `SyncingEngine` rejected 12D3KooWALJtiCZzcUPVsCa5f5egGfQyFhPY67kKosDw95bJqK7M    
2024-06-13 12:47:28.738 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWALJtiCZzcUPVsCa5f5egGfQyFhPY67kKosDw95bJqK7M does not exist in `SyncingEngine`

Not sure if they are related to the issue so far

serban300 commented 3 months ago

The problem is that peregrine uses pub type BlockNumber = u64 while polkadot-parachain uses u32

I recompiled polkadot-parachain with BlockNumber = u32 and peregrine syncs. I didn't do a full sync, but at least it's syncing:

[Parachain] ⚙️  Syncing 164.2 bps, target=#6020845 (3 peers), best: #50380 (0xed2a…13e1), finalized #0 (0xa0c6…cc21), ⬇ 405.7kiB/s ⬆ 0.3kiB/s 
Polkadot-Forum commented 3 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-parachain-omni-node-gathering-ideas-and-feedback/7823/20

ntn-x2 commented 3 months ago

We've hit this issue of a fixed BlockNumber definition in a lot of different places, last in the XCM emulator tests as well. Please do let me know if I can support in any way to get this fixed, as we would love to get rid of our node binary, eventually.

kianenigma commented 3 months ago

I think foremost you should migrate the blocknumber in your current node and runtime to be u32, and then solving the rest would be easier.

Within the runtime, you'd have to see all the places where you might have stored a block number in the state, and migrate it over.

Although, assuming no blocknumber is above u32::max, decoding a u64 into a u32 is the same.

bkchr commented 3 months ago

Although, assuming no blocknumber is above u32::max, decoding a u64 into a u32 is the same.

This is not true. U32 is only 4 byte, while u64 is 8 byte. This can very easily break decoding of other stuff.

ntn-x2 commented 3 months ago

I think foremost you should migrate the blocknumber in your current node and runtime to be u32, and then solving the rest would be easier.

But why is that? Why can't we simply use a u64? The requirement for the BlockNumber trait is

pub trait BlockNumber:
    Member
    + MaybeSerializeDeserialize
    + MaybeFromStr
    + Debug
    + sp_std::hash::Hash
    + Copy
    + MaybeDisplay
    + AtLeast32BitUnsigned
    + Into<U256>
    + TryFrom<U256>
    + Default
    + TypeInfo
    + MaxEncodedLen
    + FullCodec
kianenigma commented 3 months ago

But why is that? Why can't we simply use a u64

I suspect the early versions of the omni-node will not be configurable will use u32 block number as it is more commmonly used.

kianenigma commented 3 months ago

Although, assuming no blocknumber is above u32::max, decoding a u64 into a u32 is the same.

This is not true. U32 is only 4 byte, while u64 is 8 byte. This can very easily break decoding of other stuff.

True, my advice is very much a shutgun as it only applies if you decode and not decode_all, but I cannot know if all underlying code is using the former.

ntn-x2 commented 3 months ago

@kianenigma I see, so perhaps we just identified one more requirement for the upcoming omni-node 😄

bkchr commented 3 months ago

True, my advice is very much a shutgun as it only applies if you decode and not decode_all, but I cannot know if all underlying code is using the former.

struct Something<BlockNumber> {
     number: BlockNumber,
     whatever: Vec<u8>,
}

If you change the block number there, the decoding will be completely broken. This is not related to decode/decode_all.

But why is that? Why can't we simply use a u64? The requirement for the BlockNumber trait is

The point being that every interaction between the node and the runtime, plus any usage of block number in the node requires to know the type of the blocknumber. It is generic over the block number, but only generic at compile time and not generic at runtime.

ntn-x2 commented 3 months ago

The point being that every interaction between the node and the runtime, plus any usage of block number in the node requires to know the type of the blocknumber. It is generic over the block number, but only generic at compile time and not generic at runtime.

@bkchr I am not sure I understand 100% what you are saying, sorry 😀 Are you saying that it's not possible to know at compile time which block number type to use for a given runtime? Or does your point apply to a generic node that must run all runtimes, regardless of their block number type?

bkchr commented 3 months ago

Or does your point apply to a generic node that must run all runtimes, regardless of their block number type?

My comment applies to this.

kianenigma commented 1 month ago

Last step before we can close this on our side: https://github.com/KILTprotocol/kilt-node/issues/716

ntn-x2 commented 1 month ago

Amazing. We will check it out. Thanks a lot @kianenigma!