Open mrcnski opened 1 year ago
Yeah, indeed - `RealOverseerGen` is mostly an artifact of pre-launch days. Improving the configurability of the Polkadot node with better options for which subsystems to run would be welcome. We have some code to dynamically disable certain subsystems, but it's not done very cleanly, and it'd be nice to have a higher standard of code in the startup logic.
One thing that should definitely be expunged from the codebase is this offensive pattern of having subsystems disable themselves internally according to configuration. This defies all logic and should be removed as soon as possible.
As subsystems often need to be disabled not individually, but together, I propose this refactoring strategy, which avoids duplicate `OverseerGen` implementations:

1. Introduce a `MaybeDisabled<T>` wrapper type which wraps subsystems with the either-spawned-or-dummy logic.
2. Use `MaybeDisabled<T>` in the overseer builder.
3. Introduce an `OptionalRequirements` struct with a builder pattern:
```rust
struct OptionalRequirements {
    // Private fields for backwards compatibility.
    // Some suggestions without digging:
    execute_pvfs: bool,
    build_collations: bool,
    collator_protocol_mode: CollatorProtocolMode, // i.e. Sending or Receiving
    // other conditionals
}

// Some baked-in defaults for things like validators, Polkadot full nodes,
// collators, parachain full nodes, minimal, etc.
struct OptionalRequirementsBuilder {
    // builder pattern
}
```
The optional requirements should be as backwards-compatible as possible; i.e. they should encapsulate all the optional tasks a node running on Polkadot should be able to do.
This does shift some of the burden of thought onto collation libraries like Cumulus, but well-defined presets should help with this.
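For illustration, a minimal sketch of how the proposed `MaybeDisabled<T>` wrapper could look. The `DummySubsystem` here is a stand-in unit type, and the constructor shape is an assumption, not the actual overseer API:

```rust
// Stand-in for the real DummySubsystem type.
struct DummySubsystem;

/// Either a real, spawned subsystem or a dummy placeholder.
enum MaybeDisabled<T> {
    Enabled(T),
    Disabled(DummySubsystem),
}

impl<T> MaybeDisabled<T> {
    /// Wrap a subsystem, constructing the real one only when `enabled`.
    fn new(enabled: bool, make: impl FnOnce() -> T) -> Self {
        if enabled {
            MaybeDisabled::Enabled(make())
        } else {
            MaybeDisabled::Disabled(DummySubsystem)
        }
    }

    fn is_enabled(&self) -> bool {
        matches!(self, MaybeDisabled::Enabled(_))
    }
}

// Illustrative subsystem type.
struct PvfChecker;

fn main() {
    // e.g. on a collator, the PVF checker is disabled and never constructed.
    let checker = MaybeDisabled::new(false, || PvfChecker);
    assert!(!checker.is_enabled());
}
```

The overseer builder could then accept `MaybeDisabled<T>` for each optional subsystem, keeping the either-spawned-or-dummy decision in one place instead of inside each subsystem.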
Sounds like a good idea. @skunert already looked into this for the minimal node work and can provide some feedback here. Some of his work on this: https://github.com/paritytech/orchestra/pull/41 & https://github.com/paritytech/orchestra/issues/32
For reference, my original approach was to get rid of the unneeded subsystems altogether, not just replace them with a `DummySubsystem`. I was expecting this to work since the collator-side subsystems basically form a group that should not exchange messages with the other ones. I introduced the option to disable subsystems in the overseer via `#[cfg]` flags in https://github.com/paritytech/orchestra/pull/41. The main reason I did not pursue this further was that with the coming monorepo, all features in the workspace would be unified: we could not have differently configured overseers in Cumulus vs. Polkadot.
The approach described in this issue looks good to me, even though the goal is a bit different. IIUC we are always spawning the same number of subsystems, but e.g. on the collator we replace disabled ones with a `DummySubsystem`.
Feature flags are going to be less flexible than requirements passed in at runtime, especially if we consider Polkadot-node-as-a-library.
As found by @alexggh, collators seem to connect to validators over the validation `PeerSet`, which seems quite wasteful, because they only receive gossip messages, but don't forward them. I suspect it has to do with open slots in the validation peerset config and the libp2p discovery mechanism: https://github.com/paritytech/polkadot/blob/d74c60b6489f31f437edb1a19de0d0f6b8ad432e/node/network/protocol/src/peer_set.rs#L79-L94
We could have a different network config for collators as well: https://github.com/paritytech/polkadot/blob/d74c60b6489f31f437edb1a19de0d0f6b8ad432e/node/service/src/lib.rs#L852
This is intentional. We need the validator peerset to recover PoVs. If we don't want them to be connected there, we would probably need to introduce an extra peerset for this (as normal full nodes also enable this).
Availability recovery doesn't use gossip, but a req/resp protocol with `IfDisconnected::TryConnect`: https://github.com/paritytech/polkadot/blob/046c43b97e46bc843fab277a497504275a5f5407/node/network/availability-recovery/src/lib.rs#L454-L459
We're talking about gossip messages, such as approvals, assignments, and bitfields, which are of no concern to collators.
Having open slots on validators seems fine, but not on collators.
> Availability recovery doesn't use gossip, but a req/resp protocol with `IfDisconnected::TryConnect`:
Yeah I know, I thought you would still need to allow in the peerset, but my understanding was wrong ;)
> As found by @alexggh, collators seem to connect to validators over the validation `PeerSet`, which seems quite wasteful, because they only receive gossip messages, but don't forward them.
This tick collator node on Versi seems to be sending out messages over the validation peer set. This seems like gossip to me.
> As found by @alexggh, collators seem to connect to validators over the validation `PeerSet`, which seems quite wasteful, because they only receive gossip messages, but don't forward them.
Looked a bit more, and I think we are having two small independent problems:

1. Collators connect over the validation `PeerSet`, which I think @ordian is saying is not really needed.
2. Validators that are part of the active authorities set for the session send their random-sampling assignments to anyone connected over the validation `PeerSet`. I don't think that is correct, because collators and validators that are not part of the active authorities set won't have a topology, so they won't gossip the assignments according to the topology. However, they do take some random sample and will gossip the assignments based on that, but that's like 4 samples and that's what you see here. I don't think we should rely on that, since they aren't doing any consensus work.
> This tick collator node on Versi seems to be sending out messages over the validation peer set. This seems like gossip to me.
In my opinion, 1) should be fixed as part of this refactoring, but independent of that I think we should still enforce 2) and make sure the random sampling is taken from the active authorities for the given session, something along these lines: https://github.com/paritytech/polkadot/pull/7554/commits/3ef84bf4df18c55c397a155bf7ba4e4b710bc98f
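To illustrate the idea behind 2): filter the candidate set down to active session authorities before taking the random sample. This is a toy sketch, not the actual approval-distribution code; `random_sample_candidates` and the `u64` peer ids are made-up stand-ins:

```rust
use std::collections::HashSet;

/// Keep only peers that are active authorities in the current session,
/// so the random gossip sample can never land on a collator or an
/// inactive validator.
fn random_sample_candidates(
    connected: &[u64],
    active_authorities: &HashSet<u64>,
) -> Vec<u64> {
    connected
        .iter()
        .copied()
        .filter(|peer| active_authorities.contains(peer))
        .collect()
}

fn main() {
    let active: HashSet<u64> = [1, 2, 3].into_iter().collect();
    // Peers 10 and 11 stand in for collators / inactive validators:
    // they are excluded from the sampling pool.
    let candidates = random_sample_candidates(&[1, 10, 2, 11], &active);
    assert_eq!(candidates, vec![1, 2]);
}
```

The actual sampling would then draw its (roughly 4) targets from `candidates` only, matching the restriction proposed in the linked commit.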
> Availability recovery doesn't use gossip, but a req/resp protocol with `IfDisconnected::TryConnect`:

> We're talking about gossip messages, such as approvals, assignments, bitfields, which are no concern to collators.
> Having open slots on validators seems fine, but not on collators.
Are these requests expensive to process, or is there any validation w.r.t. whom they're received from? The syncing protocol has a problem I would like to get rid of, in which anybody can send `sc-network-sync` a block request and it's always processed, regardless of whether the peer is actually part of the accepted syncing peers or not. This means that anybody who can connect over TCP to the node can start spamming requests, and at least for syncing it's relatively easy to craft unique requests, which is a potential attack vector.
> Are these requests expensive to process, or is there any validation w.r.t. whom they're received from?
Loading the data from disk is probably the most expensive part of these requests. There is also no validation, as anyone should be able to make these requests.
@altonen they are open to any peers, but restricted in the total number of requests: https://github.com/paritytech/polkadot/blob/d74c60b6489f31f437edb1a19de0d0f6b8ad432e/node/network/protocol/src/request_response/mod.rs#L226-L234
and in request sizes: https://github.com/paritytech/polkadot/blob/d74c60b6489f31f437edb1a19de0d0f6b8ad432e/node/network/protocol/src/request_response/mod.rs#L160-L168
If Substrate makes use of this limited queue in a fair fashion (e.g. one node can't occupy all request slots), then we don't need to worry about a potential attack vector here.
@bkchr @ordian
Fair enough. There is no fairness (at least not currently) in who gets to occupy the queue, and it's filled FIFO. For syncing it was relatively easy to demonstrate the issue with the current request-response implementation: steal attention from the target and get other connected nodes disconnected due to request timeouts.
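To make the fairness concern concrete, here is a toy sketch of a per-peer quota on an inbound request queue. This is not the `sc-network` implementation; `FairQueue` and its limits are illustrative, showing one way a cap could stop a single peer from filling the whole FIFO queue:

```rust
use std::collections::{HashMap, VecDeque};

/// A bounded FIFO request queue where no single peer may occupy more
/// than `max_per_peer` slots at once.
struct FairQueue {
    queue: VecDeque<(u64, String)>, // (peer id, request payload)
    per_peer: HashMap<u64, usize>,  // slots currently held per peer
    max_total: usize,
    max_per_peer: usize,
}

impl FairQueue {
    fn new(max_total: usize, max_per_peer: usize) -> Self {
        Self {
            queue: VecDeque::new(),
            per_peer: HashMap::new(),
            max_total,
            max_per_peer,
        }
    }

    /// Accept a request only if neither the total nor the per-peer
    /// quota is exhausted; returns whether it was enqueued.
    fn push(&mut self, peer: u64, request: String) -> bool {
        let used = self.per_peer.get(&peer).copied().unwrap_or(0);
        if self.queue.len() >= self.max_total || used >= self.max_per_peer {
            return false;
        }
        self.queue.push_back((peer, request));
        *self.per_peer.entry(peer).or_insert(0) += 1;
        true
    }

    /// Dequeue the oldest request, releasing its peer's slot.
    fn pop(&mut self) -> Option<(u64, String)> {
        let (peer, request) = self.queue.pop_front()?;
        if let Some(n) = self.per_peer.get_mut(&peer) {
            *n -= 1;
        }
        Some((peer, request))
    }
}

fn main() {
    let mut q = FairQueue::new(4, 2);
    assert!(q.push(1, "a".into()));
    assert!(q.push(1, "b".into()));
    // The spamming peer's third request is rejected...
    assert!(!q.push(1, "c".into()));
    // ...while another peer still gets a slot.
    assert!(q.push(2, "d".into()));
    assert_eq!(q.pop().map(|(p, _)| p), Some(1));
}
```

With a pure FIFO and only a total cap, the first pushes win; a per-peer quota like this keeps slots available for other peers even under spam.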
Currently we include all subsystems in `RealOverseerGen`, regardless of whether they are needed or not. A couple of them are only initialized based on passed-in parameters. E.g. there is a conditional for whether pvf-checker is enabled, and if not we spawn a `DummySubsystem`, which is a bit clumsy.

## Proposal
We could instead nix `RealOverseerGen` and have three new structs that implement `OverseerGen`:

- `ValidatorOverseerGen`
- `CollatorOverseerGen`
- `FullNodeOverseerGen`
Then for each we only initialize the subsystems that are relevant to that kind of node.
Edit: in the future there can be nodes that are both validators and collators. See https://github.com/paritytech/polkadot/pull/7566#discussion_r1279753180. Perhaps a capabilities-based approach would be better, or a builder pattern, e.g. `new().is_validator(true).is_collator(true)`.

I'm not too familiar with the code, so looking for feedback. cc @ordian
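A rough sketch of what such a capabilities builder might look like. All names (`Capabilities`, `CapabilitiesBuilder`, `needs_pvf_workers`) are illustrative, not the actual `polkadot-service` API:

```rust
/// Which roles this node performs; a node may be both.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Capabilities {
    validator: bool,
    collator: bool,
}

#[derive(Default)]
struct CapabilitiesBuilder {
    caps: Capabilities,
}

impl CapabilitiesBuilder {
    fn new() -> Self {
        Self::default()
    }

    fn is_validator(mut self, v: bool) -> Self {
        self.caps.validator = v;
        self
    }

    fn is_collator(mut self, c: bool) -> Self {
        self.caps.collator = c;
        self
    }

    fn build(self) -> Capabilities {
        self.caps
    }
}

impl Capabilities {
    /// Only validators need the PVF execution subsystems / workers.
    fn needs_pvf_workers(&self) -> bool {
        self.validator
    }
}

fn main() {
    // A node that is both a validator and a collator.
    let caps = CapabilitiesBuilder::new()
        .is_validator(true)
        .is_collator(true)
        .build();
    assert!(caps.needs_pvf_workers());
}
```

Unlike the three fixed `*OverseerGen` structs, this composes naturally when roles overlap, and presets (validator, collator, minimal, ...) can simply be builder shorthands.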
## Followup
After this has been done, we can move the worker binary check (where we call `determine_workers_paths`) into `CandidateValidation::new`. This would ensure that the workers are only required for nodes that actually need them (validators).