paritytech / polkadot-sdk

prospective-parachains: allow chained candidates coming out-of-order #3541

Closed: alindima closed this issue 5 months ago

alindima commented 8 months ago

Assume the latest backed candidate of a parachain is A. With elastic-scaling enabled, one collator builds candidate B and another collator candidate C.

There's currently no restriction on the order in which these collations will be advertised and supplied to the soon-to-be block author. If prospective-parachains learns of candidate C before candidate B, candidate C will be discarded.

For elastic-scaling to properly work, we need to allow some number of disjoint trees to form, even if they don't build directly on top of the latest backed candidate.

The alternative is collator cooperation and hoping that network latency/availability does not defeat that.

sandreim commented 8 months ago

Sounds like it would create additional complexity in prospective-parachains or the collator client. Alternatively, we could buffer them in candidate-backing and batch-send them once we have the proper parent, but I'm not sure if this is better.

alindima commented 8 months ago

Sounds like it would create additional complexity in prospective-parachains or the collator client. Alternatively, we could buffer them in candidate-backing and batch-send them once we have the proper parent, but I'm not sure if this is better.

That's a good idea. candidate-backing is less complex than prospective-parachains, so adding this logic there wouldn't hurt as much.

burdges commented 8 months ago

With elastic-scaling enabled, one collator builds candidate B and another collator candidate C.

Although we never liked the restriction, we envisioned that initially B and C must come from the same collator.

If prospective-parachains learns of candidate C before candidate B, candidate C will be discarded.

We could have candidate C provide a Merkle proof into the header of B, or the whole header of B.

If the parachain uses sufficiently parallel algorithms, then yes, B and C could be built without knowledge of one another. We could consider this "tree" scaling, instead of "elastic" scaling. An approach goes: parachains have short forks of "helper" blocks, maybe based off some common "master" parachain parent. A future "master" parachain block could reintegrate these forks by proving their state changes mesh correctly. All this feels like "future work" though.

eskimor commented 8 months ago

Although we never liked the restriction, we envisioned that initially B and C must come from the same collator.

That does not change anything, because that same collator would still send the collations to different backing groups.

We just need to be more flexible and accept an "unconnected" fork, while being careful not to open up a spam vector.

eskimor commented 7 months ago

Given the added complexity due to elastic scaling (and also more potential spam vectors), it would be good if we could drop support for parachain forks.

Note that we are not talking about relay chain forks here; it is clear that there will be a parachain fork for each relay chain fork. We are not concerned about those, only about multiple parachain forks for a single relay chain fork.

We are pretty sure that

(a) parachains have no good reason to fork, and (b) they are also not doing it in practice.

It would be good to have (b) verified, hence I would suggest adding a metric to prospective-parachains recording the number of forks we see in the wild for parachains.

The result should be improved validator performance, less complexity, and fewer spam vectors, at the cost of forky parachains taking a performance hit.

Thinking about this some more, I actually don't think we need metrics here:

  1. Without asynchronous backing, forks don't matter. We just pick one anyway and they are not able to build upon each other.
  2. With asynchronous backing, if we dropped the "wrong" fork, then we might invalidate a complete prospective chain that is currently being built. In that case the parachain would actually take a performance hit. But we have not launched asynchronous backing yet, and if this turns out to be an issue, we should fix it on the parachain side rather than waste validator resources resolving an issue at the wrong level.

eskimor commented 7 months ago

By not allowing forks, we have not yet solved the problem of "unconnected" collations though. If they are not connected, they could be anything. Nodes which are not even collators could just provide us old collations they found, and we would have to believe this is an unconnected fork and accept it.

To resolve this, I would suggest for MVP to introduce the notion of extended validation:

  1. We accept unconnected collations and validate them with regards to the provided head data, but we delay distribution of statements until we have received statements for the missing links up to the last included candidate.
  2. Then we distribute our statements.

The idea is that the validation of a candidate is only considered successful once we have seen statements for its dependencies. This still allows for parallel validation of candidates, but sequences the results, adding a bit of delay, which should not matter much with asynchronous backing.

Consequences: For the second and third cores in elastic scaling there will be no PoV distribution between validators. The collator needs to provide the collation to sufficiently many validators itself.

Reasoning

If we consider having the full chain as part of a successful validation, then elastic scaling does not worsen the situation for validators. Yes, nodes can mess with us, but they could equally well just provide an invalid collation. It is arguably easier to provide invalid, resource-consuming collations now, but by making the completion of the chain part of the validation, the issue stays isolated on the validator, which can react to garbage candidates by dropping the reputation of the peer; especially with #616, validators will have proper means for that.

If we instead distributed statements earlier, then garbage candidates would no longer be an isolated problem: other validators would also already have invested work which then needs to be discarded. These validators could then not meaningfully "punish" anybody.

Outlook

With proper dependency tracking we could distribute statements early, iff the dependency (CandidateHash of the parent) was not only part of the descriptor, but part of the CandidateCommitments. Then the collator can simply provide us the missing descriptors and we can verify the chain - knowing it is valid if our candidate is valid. This improves the situation dramatically, as we can then be sure that the collation is fully valid and not e.g. some historic one. The only attack left is obviously to never provide those dependencies.
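A rough sketch of that chain check, assuming a hypothetical parent CandidateHash carried in the commitments; the types and the candidate-to-parent map below are illustrative placeholders, not the real polkadot-node primitives:

```rust
use std::collections::HashMap;

type CandidateHash = [u8; 32];

/// Walk parent links from `candidate_parent` back to `last_included`, using
/// the candidate -> parent links the collator supplied. Returns false if a
/// link is missing or the chain exceeds `max_depth`.
fn chain_is_complete(
    candidate_parent: CandidateHash,
    last_included: CandidateHash,
    provided_parents: &HashMap<CandidateHash, CandidateHash>,
    max_depth: usize,
) -> bool {
    let mut current = candidate_parent;
    for _ in 0..=max_depth {
        if current == last_included {
            return true;
        }
        match provided_parents.get(&current) {
            Some(parent) => current = *parent,
            // The one attack left: the dependency is simply never provided.
            None => return false,
        }
    }
    false
}
```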

We could make it so that those unconnected collations can still be put into a block: they won't be enacted, but the backers could still get rewards. Thus a misbehaving collator would again only harm the parachain (and it has to be a collator, as the provided collation was valid), and we can therefore again leave taking care of those nasty nodes to the parachain.

Implementation

  1. We check for dependencies being present; until they are, we don't yet fully consider a candidate valid - no notifications will be sent. We can, for maximal streamlining, already sign the statements, but we need to buffer them until we have seen the dependencies (see the sketch after this list).
  2. For maximum throughput we need some notification on incoming dependencies. This could be done by registering a oneshot sender in prospective-parachains.
  3. We need an additional timeout, similar to the one for normal validation: if we don't receive the dependencies in a reasonable timeframe, we drop the statements. A reasonable timeout would be in the ballpark of a slot, i.e. <= 6s.
  4. Raciness: even if we only distribute after we have received statements for the dependencies, this does not mean that all other nodes will also receive them in order. I would assume that this is handled by statement-distribution, but I have not checked.
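A minimal sketch of steps 1 and 3, with the oneshot notification from step 2 reduced to a direct call; all type names are placeholders, not the actual candidate-backing types:

```rust
// Buffer statements we have already signed until we have seen statements for
// their dependencies, and drop them if the dependencies never arrive in time.
use std::collections::HashMap;
use std::time::{Duration, Instant};

type CandidateHash = [u8; 32];
struct SignedStatement; // stand-in for a real signed backing statement

const DEPENDENCY_TIMEOUT: Duration = Duration::from_secs(6); // ballpark of a slot

#[derive(Default)]
struct PendingStatements {
    // signed but not yet distributable, keyed by the missing dependency
    by_missing_dep: HashMap<CandidateHash, Vec<(SignedStatement, Instant)>>,
}

impl PendingStatements {
    // Step 1: sign eagerly, but buffer until the dependency shows up.
    fn buffer(&mut self, missing_dep: CandidateHash, statement: SignedStatement) {
        self.by_missing_dep
            .entry(missing_dep)
            .or_default()
            .push((statement, Instant::now()));
    }

    // Step 2 (simplified): called when we see statements for `dep`; returns
    // the buffered statements now safe to hand to statement-distribution.
    fn on_dependency_seen(&mut self, dep: &CandidateHash) -> Vec<SignedStatement> {
        self.by_missing_dep
            .remove(dep)
            .unwrap_or_default()
            .into_iter()
            .map(|(statement, _)| statement)
            .collect()
    }

    // Step 3: drop anything whose dependencies never arrived in time.
    fn prune_expired(&mut self) {
        let now = Instant::now();
        self.by_missing_dep.retain(|_, pending| {
            pending.retain(|(_, since)| now.duration_since(*since) < DEPENDENCY_TIMEOUT);
            !pending.is_empty()
        });
    }
}
```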
ordian commented 7 months ago

This should work: if the current validation succeeds, then for any actual blockchain, making these intermediate head data up should not be possible. If the chain were not an actual blockchain and making up that data were possible, then the validators are still fine: we assume it cannot be made up, and if it can, it is a liveness problem of the parachain that can be fixed by the parachain itself (e.g. just be a proper chain).

Not sure I follow which problem we're trying to solve. Malicious collators trying to waste resources of validators? They can always add an invalid parent head data and never reveal the grandparent candidate. But they could also just send an invalid candidate and create a new identity after we disconnect them. I guess it's a question of which candidates/collators we prioritize. Not sure if sending the full chain of head data helps here though. Could you elaborate on how it helps exactly?

What about parachains that genuinely have forks, for example PoW parachains? In addition to that, do we want to support non-blockchain based parachains in the future, e.g. DAG-based "chains"?

eskimor commented 7 months ago

Not sure I follow which problem we're trying to solve. Malicious collators trying to waste resources of validators? They can always add an invalid parent head data and never reveal the grandparent candidate. But they could also just send an invalid candidate and create a new identity after we disconnect them. I guess it's a question of which candidates/collators we prioritize. Not sure if sending the full chain of head data helps here though. Could you elaborate on how it helps exactly?

Completely reworked the comment. For the new identity #616 should help.

What about parachains that genuinely have forks, for example PoW parachains? In addition to that, do we want to support non-blockchain based parachains in the future, e.g. DAG-based "chains"?

I don't think that PoW parachains make any sense, and if they do, they would have poor performance anyway. All that ignoring forks does is reduce the parachain's performance if there are forks ... also, 6s block times for a proof-of-work parachain seem unrealistic anyway.

We force them to be a chain ... for our intents and purposes. This does not limit flexibility in any way. In particular for the final version, all we need is that candidates form a chain ... given that candidates include the relay parent, this is almost guaranteed just because of this requirement. ParaDags could e.g. add a sequence number to ensure this in all cases (for candidates based on the same relay parent).

burdges commented 7 months ago

PoW parachains could be ignored for many reasons, including them being too slow for elastic scaling, but more because they cannot really contact backing groups or do off-chain messaging in sensible ways.

We accept unconnected collations and validate them with regards to the provided head data, but we delay distribution of statements until we have received statements for the missing links up to the last included candidate.

There is an issue that backing groups share the candidates internally first, and only later with other validators, so you want to avoid this internal sharing for spam reasons. But if your solution works, then why not make collators do this anyways? It's worse latency with slow collators, I guess?

We've a race for reporting attestations on-chain under your proposal: we've many groups that've validated candidates, but none of them know if they can gossip their candidates. We need maybe 1/4 second before each being gossiped triggers the next to be gossiped, so we'll only manage 4 elastic scaling candidates per extra second after the validation time.

They won't be enacted, but the backers could still get rewards

We should never pay backers until finality anyways, when we'd pay approval checkers too, but I suppose internally valid candidates would be approved, even though they never fit into the parachain. That's fine.

We force them to be a chain ... for our intents and purposes

I do think there is some value in approving dangling candidates that never fit into any chain. We do not have to permit this for elastic scaling, but it really does provide a gateway for parachains where the collators have not really figured everything out themselves yet.

As I've said elsewhere, I'd favor an "asynchronous accumulation" replacement for the accumulate function in the CoreJam proposal. It'd just permit these dangling "helper" blocks, which then future parachain blocks could integrate into the parachain using the parachain's own logic, not bound by the execution time of code on the relay chain.

At a high level, the big trade-off for dangling blocks is that we're limiting how parachains can handle their own economics: VCs, etc. cannot so easily subsidise collators. It's maybe fine if elastic scaling works with VC funding, but then CoreJam or whatever winds up much less friendly to VC funding. I dunno...

Ideas:

If headers were hashed like H(prefix_including_parent_hash, H(rest_of_header)) then we can share prefix_including_parent_hash more cheaply, without worrying about all the header bloat we've unfortunately accumulated in rest_of_header.
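A small sketch of that split, with `hash` standing in for whatever header hasher the chain uses (e.g. BLAKE2b-256 in Substrate); nothing here reflects the actual header encoding:

```rust
// The inner hash commits to the bulky remainder of the header; the outer hash
// binds it to the cheap-to-share prefix containing the parent hash, so the
// parent link can be verified without shipping the whole header.
fn header_hash(
    prefix_including_parent_hash: &[u8],
    rest_of_header: &[u8],
    hash: impl Fn(&[u8]) -> [u8; 32],
) -> [u8; 32] {
    let inner = hash(rest_of_header);
    let mut outer_input =
        Vec::with_capacity(prefix_including_parent_hash.len() + inner.len());
    outer_input.extend_from_slice(prefix_including_parent_hash);
    outer_input.extend_from_slice(&inner);
    hash(&outer_input)
}
```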

I've not followed coretime closely enough, but can we now be sure that the relay chain gets paid for blocks which get approved but never fit into some chain? At least this must surely be doable.

sandreim commented 7 months ago

Not sure I follow which problem we're trying to solve. Malicious collators trying to waste resources of validators? They can always add an invalid parent head data and never reveal the grandparent candidate. But they could also just send an invalid candidate and create a new identity after we disconnect them. I guess it's a question of which candidates/collators we prioritize. Not sure if sending the full chain of head data helps here though. Could you elaborate on how it helps exactly?

Completely reworked the comment. For the new identity #616 should help.

What about parachains that genuinely have forks, for example PoW parachains? In addition to that, do we want to support non-blockchain based parachains in the future, e.g. DAG-based "chains"?

I don't think that PoW parachains make any sense, and if they do, they would have poor performance anyway. All that ignoring forks does is reduce the parachain's performance if there are forks ... also, 6s block times for a proof-of-work parachain seem unrealistic anyway.

We force them to be a chain ... for our intents and purposes. This does not limit flexibility in any way. In particular for the final version, all we need is that candidates form a chain ... given that candidates include the relay parent, this is almost guaranteed just because of this requirement. ParaDags could e.g. add a sequence number to ensure this in all cases (for candidates based on the same relay parent).

I'd like to propose something in between. Let's accept a bounded number of forks per depth but at the same time we disallow cycles and force DAGs to include some nonce. This should remove complexity and make the prospective parachains subsystem interface more straightforward, as we can't have the same candidate at multiple depths and/or relay parents. Parachains which are too forkful take a performance hit ("too" is something we can be lenient about if there is a real need, but we should be very strict at first) and this is fine as discussed above.

That being said, to tackle the issue at hand, I'd go for a simpler approach in this lighter prospective parachains subsystem by introducing a new mapping unconnected_candidates_by_parent_head: HashMap<ParaHeadHash, HashSet<CandidateHash>>. We'd prune elements in this mapping as candidates get connected into the fragment tree, or as they go out of view.
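A minimal sketch of that mapping and its pruning, with placeholder hash types and the rest of the prospective-parachains state elided:

```rust
use std::collections::{HashMap, HashSet};

type ParaHeadHash = [u8; 32];
type CandidateHash = [u8; 32];

#[derive(Default)]
struct UnconnectedCandidates {
    by_parent_head: HashMap<ParaHeadHash, HashSet<CandidateHash>>,
}

impl UnconnectedCandidates {
    fn insert(&mut self, parent_head: ParaHeadHash, candidate: CandidateHash) {
        self.by_parent_head.entry(parent_head).or_default().insert(candidate);
    }

    // Once a backed candidate produces `new_head`, any buffered children that
    // build on that head can be connected to the fragment tree.
    fn take_children_of(&mut self, new_head: &ParaHeadHash) -> HashSet<CandidateHash> {
        self.by_parent_head.remove(new_head).unwrap_or_default()
    }

    // Drop candidates that went out of view (e.g. their relay parent left the
    // allowed range), keeping the map bounded.
    fn retain_in_view(&mut self, in_view: impl Fn(&CandidateHash) -> bool) {
        self.by_parent_head.retain(|_, children| {
            children.retain(|c| in_view(c));
            !children.is_empty()
        });
    }
}
```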

WDYT ?

eskimor commented 7 months ago

Just realized: there is no risk of random nodes messing with us with historic candidates, as we are still limiting the allowed relay parents. So all valid candidates, even unconnected ones, must have been produced recently by an actual collator.

alindima commented 7 months ago

Let's accept a bounded number of forks per depth but at the same time we disallow cycles and force DAGs to include some nonce.

How could we even enforce that we disallow cycles? Just ignore them and hope that they won't blast out into bricking the relay chain? I don't think we have feasible alternatives.

sandreim commented 7 months ago

Yeah, IMO for the general case we really cannot, but for what is in scope of the prospective parachains at any point in time we could:

* not allow same candidate at multiple depths

* enforce no duplicate para heads in chained candidates we pass to runtime

Additionally it might be good to only allow the introduction of a candidate to the fragment tree once.

eskimor commented 7 months ago

We cannot disallow them right now, but we don't need to guarantee that a parachain having cycles works. So we can drop support in that sense (if it gains us something significant); we still need validators not to crash if people do it anyway ;-)

eskimor commented 7 months ago

Yeah, IMO for the general case we really cannot, but for what is in scope of the prospective parachains at any point in time we could:

* not allow same candidate at multiple depths

* enforce no duplicate para heads in chained candidates we pass to runtime

Additionally it might be good to only allow the introduction of a candidate to the fragment tree once.

Sounds reasonable. Just dropping duplicates for whatever we have in view seems sensible, if it makes things easier.

alindima commented 7 months ago

I have some updates from my prospective-parachains immersion.

As this issue says, we've been under the impression that, if candidate B (which builds on top of candidate A) is supplied to a backing group before the group finds out (via statement-distribution) about candidate A, candidate B will be discarded. According to the logic in prospective-parachains, this is correct.

Based on this, I've been rewriting prospective-parachains to allow for a number of "orphan" candidates (still a draft). While doing this, I simplified it by not allowing parachain forks and cycles.

I just discovered though that this is already somewhat tackled in collator-protocol. On the validator side, before fetching an advertised collation, it asks prospective-parachains whether this new candidate can be seconded (i.e. whether it builds on a leaf of the fragment tree). If it can't, it won't fetch the collation yet. It will be buffered and retried on a leaf update or when another candidate is backed.

This fixes the problem but hurts pipelining. Candidate B will not be validated and backed until Candidate A is backed and the statements reach the other backing group.

We can therefore go ahead with testing on a testnet and see how big of an issue this is. In the meantime, I'll get on with the changes.

In local zombienet tests, this is not an issue yet because the network is optimal and validation times are low.