Increase para block inclusion reliability

eskimor commented 2 weeks ago

Having produced parachain blocks retracted is, at the very least, detrimental to the throughput of the chain, but it also harms user and developer experience:

Worst case, a block gets retracted, the transaction becomes invalid, and the user has to issue a new one.
Having a transaction in a block is already some proof of validity, under the assumption that we trust collators.
Having a transaction in a block which got backed (off-chain) provides an even higher level of security.

In general the block will move up in assurance over time, but if it happens frequently that a block just gets discarded after all, the benefit of this property vanishes and one actually has to wait for definite finality, which takes the longest.

The following is an unordered list of things that can cause a parachain block to not make it, along with possible solutions.

Speculative Availability

Give availability more time, to enhance likelihood of cores getting freed on time:

[ ] Factor out provisioner logic to get backable candidates
[ ] Reuse that code in availability-distribution to already fetch scheduled cores and then backable candidates from prospective parachains
[ ] Add fetch tasks for these (with the leaf it was scheduled in)
[ ] Profit
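
The task list above could be sketched roughly like this. It is only an illustration under stated assumptions: `FetchTask`, `speculative_fetch_tasks`, and the plain `HashMap` standing in for prospective parachains are all hypothetical names, not the actual subsystem API.

```rust
use std::collections::HashMap;

// Hypothetical simplified types; the real ones live in the node primitives.
type CoreIndex = u32;
type ParaId = u32;
type Hash = &'static str;

/// A speculative chunk fetch, tagged with the leaf the core was scheduled in.
#[derive(Debug, PartialEq)]
struct FetchTask {
    core: CoreIndex,
    candidate: Hash,
    leaf: Hash,
}

/// For every scheduled core, look up the backable candidates of the assigned
/// parachain and create one fetch task per candidate.
/// `backable` is an assumed stand-in for a prospective parachains query.
fn speculative_fetch_tasks(
    leaf: Hash,
    scheduled: &[(CoreIndex, ParaId)],
    backable: &HashMap<ParaId, Vec<Hash>>,
) -> Vec<FetchTask> {
    let mut tasks = Vec::new();
    for &(core, para) in scheduled {
        // Cores whose parachain has no backable candidate yield no task.
        if let Some(candidates) = backable.get(&para) {
            for &candidate in candidates {
                tasks.push(FetchTask { core, candidate, leaf });
            }
        }
    }
    tasks
}
```

The point of tagging each task with its leaf is that a task can be dropped again once that leaf goes out of view.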

Immunity to relay chain forks

Either:

Build on slightly older relay parents

Simple
Can be made pretty robust, if we choose relay parents which have been finalized already.
Also provides some resilience against relay chain reorgs.
But: Latency with regards to message processing is added.

Build on all forks

No additional latency
More resources on the collator (likely fine, as block building is mostly single core, hence additional cores are free)
More complexity: One does not only need to track one fork, but multiple.
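
For the first option, picking an older relay parent could look roughly like the sketch below. `RelayBlock`, `choose_relay_parent`, `newest_finalized`, and the `offset` parameter are hypothetical names for illustration, not existing collator code.

```rust
/// Hypothetical, simplified view of a relay chain block.
#[derive(Debug, Clone, PartialEq)]
struct RelayBlock {
    number: u32,
    hash: &'static str,
}

/// Pick the block `offset` blocks behind the best block.
/// `ancestry` is assumed to be ordered oldest to newest, best block last.
/// If fewer than `offset` ancestors are known, the oldest known block is used.
fn choose_relay_parent(ancestry: &[RelayBlock], offset: usize) -> Option<&RelayBlock> {
    let best_index = ancestry.len().checked_sub(1)?;
    ancestry.get(best_index.saturating_sub(offset))
}

/// Alternative: pick the newest block that is already finalized.
/// Returns `None` if no known block is finalized yet.
fn newest_finalized(ancestry: &[RelayBlock], finalized_number: u32) -> Option<&RelayBlock> {
    ancestry.iter().rev().find(|b| b.number <= finalized_number)
}
```

A larger `offset` buys more robustness against forks at the price of more message processing latency, which is the trade-off listed above.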

Avoid relay parents becoming obsolete

Allow relay parents that survive longer than the claim queue length. Then the runtime would still accept those candidates if in the current claim queue the parachain still has assignments on the core. This way, if e.g. a block producer does not produce a block, the parachain would merely slow down a bit, but not get its blocks discarded.
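
As a rough sketch of the proposed acceptance rule (not the actual runtime check): `accepts_lenient`, `max_relay_parent_age`, and the slice standing in for the core's claim queue are assumptions made for illustration.

```rust
type ParaId = u32;

/// Proposed rule, simplified: a candidate is accepted if its relay parent is
/// not older than `max_relay_parent_age` (which may exceed the claim queue
/// length) and the parachain still has an assignment in the core's current
/// claim queue.
fn accepts_lenient(
    para: ParaId,
    relay_parent_age: usize,
    max_relay_parent_age: usize,
    claim_queue: &[ParaId],
) -> bool {
    relay_parent_age <= max_relay_parent_age && claim_queue.contains(&para)
}
```

With a claim queue of length 3 and `max_relay_parent_age` set to 8, a candidate whose relay parent is 5 blocks old would still be accepted as long as its parachain is in the queue.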

Session boundaries

Even with above optimizations, session boundaries would still make relay parents obsolete. A simple fix would be for collators to anticipate the session change and stop producing candidates that would end up getting backed in the last block of the session.
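
A minimal sketch of that collator-side check, assuming fixed-length sessions and a known delay between a relay parent and the block its candidate gets backed in. `backed_in_last_session_block` and all of its parameters are hypothetical.

```rust
/// Returns true if a candidate built on `relay_parent` is expected to be
/// backed in the last block of the current session, in which case the
/// collator would skip producing it.
///
/// Assumptions: the session covers blocks
/// `session_start..session_start + session_length`, and the candidate is
/// backed on chain at `relay_parent + backing_delay`.
fn backed_in_last_session_block(
    relay_parent: u32,
    backing_delay: u32,
    session_start: u32,
    session_length: u32,
) -> bool {
    debug_assert!(session_length > 0);
    let last_block = session_start + session_length - 1;
    relay_parent + backing_delay == last_block
}
```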

Core Changes

The "Avoid relay parents becoming obsolete" approach above would not work if the parachain still has a core assigned, but a different one than before. This is not easy to fix in the current design; luckily, it should also have very little impact:

Chains which need high levels of reliability, should aim to keep their core mappings stable.
With the above, you only run into issues if a block producer messes up exactly at the rare occasion when you change your core mapping. Given that we have pretty solid 6s block times, the chances of this happening seem acceptable.

Reliable Collator Protocol

We want to make validator - collator connections as reliable as possible to ensure produced blocks also end up getting validated in a timely manner.

sandreim commented 2 weeks ago

Speculative Availability

This is a good solution if at some point we discover that 1.5 seconds is not enough time for availability. I'd expect 10MB PoVs could add some pressure here. Running subsystem benchmark numbers with some realistic latencies should give a hint.

Immunity to relay chain forks

Either:

Build on slightly older relay parents

Simple

Can be made pretty robust, if we choose relay parents which have been finalized already.

Finality can slow down and then this strategy doesn't work.

Also provides some resilience against relay chain reorgs.

But: Latency with regards to message processing is added.

I'd expect Sassafras to fix this, but until then we need something to alleviate things a bit. I think the relay parent choice should be more dynamic so it can optimize for either throughput or latency, depending on the state of the tx pool and relay chain messaging.

Build on all forks

No additional latency

More resources on the collator (likely fine, as block building is mostly single core, hence additional cores are free)

More complexity: One does not only need to track one fork, but multiple.

I think this is what we were doing until the slot-based collator. Beefier collators (more cores) should make this solution a lower hanging fruit.

Avoid relay parents becoming obsolete

Allow relay parents that survive longer than the claim queue length. Then the runtime would still accept those candidates if in the current claim queue the parachain still has assignments on the core. This way, if e.g. a block producer does not produce a block, the parachain would merely slow down a bit, but not get its blocks discarded.

We are planning to use the same value for the max ancestry and the claim queue length. I don't really see a point in allowing relay parents to survive longer. If we do that, why not also have the same scheduling lookahead?

Session boundaries

Even with above optimizations, session boundaries would still make relay parents obsolete. A simple fix would be for collators to anticipate the session change and stop producing candidates that would end up getting backed in the last block of the session.

💯

Reliable Collator Protocol

We want to make validator - collator connections as reliable as possible to ensure produced blocks also end up getting validated in a timely manner.

I think this will have the most impact on block times in general.

sandreim commented 2 weeks ago

Another one that makes sense to have on this list and a lower hanging fruit:

Currently for availability we actually have more time, but we are starting the bitfield signing task and timer as soon as we import a relay chain block. If we imported that block very early, we have more than 1.5s to fetch chunks, and the PRE_PROPOSE_TIMEOUT provisioner timeout can also be higher than 2s. We'd just have to compute when we imported the block relative to the next slot.
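
Roughly, the computation could look like the sketch below. `BASE_DELAY` reuses the 1.5s figure from above; `availability_budget`, `import_offset`, and `reserve` are hypothetical names, and the policy of never going below the current 1.5s is an assumption.

```rust
use std::time::Duration;

/// The fixed 1.5s availability window discussed above.
const BASE_DELAY: Duration = Duration::from_millis(1500);

/// Time we could spend fetching chunks before signing the bitfield.
///
/// `import_offset` is how far into the current slot the block was imported,
/// `reserve` is an assumed margin kept free before the next slot starts.
/// The result never drops below `BASE_DELAY`.
fn availability_budget(
    import_offset: Duration,
    slot_duration: Duration,
    reserve: Duration,
) -> Duration {
    let until_next_slot = slot_duration.saturating_sub(import_offset);
    until_next_slot.saturating_sub(reserve).max(BASE_DELAY)
}
```

With a 6s slot, a block imported 1s into the slot, and a 2s reserve, this yields a 3s budget; a block imported 5s into the slot falls back to the 1.5s minimum.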

bkchr commented 2 weeks ago

Even with above optimizations, session boundaries would still make relay parents obsolete. A simple fix would be for collators to anticipate the session change and stop producing candidates that would end up getting backed in the last block of the session.

If the underlying validator set doesn't change, we should completely stop invalidating candidates on a session change. Or is there any proper reason?

Currently for availability we actually have more time, but we are starting the bitfield signing task and timer as soon as we import a relay chain block. If we imported that block very early, we have more than 1.5s to fetch chunks, and the PRE_PROPOSE_TIMEOUT provisioner timeout can also be higher than 2s. We'd just have to compute when we imported the block relative to the next slot.

https://github.com/paritytech/polkadot/pull/5484#discussion_r872538269 🙈

eskimor commented 2 weeks ago

If the underlying validator set doesn't change, we should completely stop invalidating candidates on a session change. Or is there any proper reason?

Mostly implementation complexity. @rphmeier decided back then that it was not worth it for now. Worth checking again though, things have changed a lot.

rphmeier commented 2 weeks ago

My reasoning at the time was that session changes affect only a tiny proportion of blocks. Session changes happen only once every several hours, and a session spans thousands of blocks. So we'd be chasing something like 0.1% efficiency.
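
As a back-of-the-envelope check of that order of magnitude: the 4 hour session length below is an assumption for illustration, the 6s block time is the one mentioned earlier in the thread, and `affected_fraction` is a hypothetical helper.

```rust
/// Fraction of relay chain blocks affected by a session change, assuming
/// `affected_blocks` blocks are lost per session.
fn affected_fraction(session_hours: u32, block_time_secs: u32, affected_blocks: u32) -> f64 {
    let blocks_per_session = session_hours * 3600 / block_time_secs;
    affected_blocks as f64 / blocks_per_session as f64
}
```

With 4 hour sessions and 6s blocks there are 2400 blocks per session, so losing one block per session is about 0.04% and losing two is about 0.08%.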

More resources on the collator (likely fine, as block building is mostly single core, hence additional cores are free)

worth noting that collation is bottlenecked on IOPS, not CPU, so building on all forks might work until parachains actually are under load and then stop working altogether.

maybe things have changed, but AFAIK slow availability shouldn't cause a parachain block to get retracted. it should just become available more slowly. is the 1 minute availability timeout still a thing?

eskimor commented 2 weeks ago

maybe things have changed, but AFAIK slow availability shouldn't cause a parachain block to get retracted. it should just become available more slowly. is the 1 minute availability timeout still a thing?

The issue is that it delays follow-up blocks, up to the point where their relay parent might have gone out of scope. (Fixable by being more lenient with accepted relay parents.)
