probe-lab / go-kademlia

Generic Go Kademlia implementation
Other
17 stars 4 forks source link

Add bootstrap state machine #90

Closed iand closed 1 year ago

iand commented 1 year ago

This adds a state machine for running the bootstrap query described in https://github.com/plprobelab/go-kademlia/issues/45. The state machine is very simple (it runs a query that attempts to find the self node), but it drives the design of the coordinator in a number of ways:

There are still some ugly parts which I may be able to fix within the scope of this PR:

Fixes #47

dennis-tra commented 1 year ago

Priority is simple: the coordinator first attempts to advance the bootstrap state machine and only if it is idle, indicating the bootstrap has no further work, will it proceed to advance the query pool state machine.

I think it should be possible to already start queries as soon as we have connected to the first peer. It'll probably not be the fastest query but it'll resolve and will be faster than waiting for the whole bootstrap process to finish.

What should happen if the bootstrap is waiting for a message response but the caller attempts to start a user query?

IMO this should result in an error for the caller. So the coordinator would process that message normally, determine that it cannot start the query, and return an error to the caller.

However, I see your point and that, in the general case, it might be necessary to delay events until a certain state is reached.

I think having separate queues undermines the hierarchy of the hierarchical state machine. Sub-components can enqueue events with sub-states, by-passing the parent state machines that might want to react to or track these events as well. I think it's easier to reason about the control flow if data always flows from the root to the leaves and cannot skip any intermediate states.

iand commented 1 year ago

I think having separate queues undermines the hierarchy of the hierarchical state machine. Sub-components can enqueue events with sub-states, by-passing the parent state machines that might want to react to or track these events as well. I think it's easier to reason about the control flow if data always flows from the root to the leaves and cannot skip any intermediate states.

I don't understand this point. The coordinator is coordinating state machines and is translating asynchronous behaviour into synchronous. That async to sync interface needs to have some kind of queueing. Each hierarchical state machine is synchronous and there is no enqueing of substates, only the coordinator is feeding async events to the top of each state machine hierarchy.

iand commented 1 year ago

I think it should be possible to already start queries as soon as we have connected to the first peer. It'll probably not be the fastest query but it'll resolve and will be faster than waiting for the whole bootstrap process to finish.

I discuss this a bit in the design issue #45 and expand in this comment. If we're only connected to a bootstrap node then queries will hit that bootstrap node far more than is usual and the bootstrap node has to deal with that for every node that starts up. We should at least wait until the routing table is better populated before allowing queries (or we could throttle them based on bootstrap progress).

iand commented 1 year ago

What should happen if the bootstrap is waiting for a message response but the caller attempts to start a user query?

IMO this should result in an error for the caller. So the coordinator would process that message normally, determine that it cannot start the query, and return an error to the caller.

That only works in this specific case. A more general example is what happens if a non-bootstrap query response is received as an event while the bootstrap is waiting for its own query response? There is no place to return an error in that case.

In general the coordinator will also be advancing other state machines like the routing table probe/refresh, exploring etc. These will essentially be background tasks and should be given the chance to progress and potentially change state, which needs to be returned by the coordinator. Any incoming event needs to be queued in that case.

The rust-libp2p design handles this by calling methods to mutate state on receipt of an inbound event. So if a message response is received it calls the equivalent of onMessageReceived on pool, which looks up the query and calls onMessageReceived on the query which then calls onMessageReceived on the iterator which looks up the responding node and changes its state.

iand commented 1 year ago

@dennis-tra I moved the event queues out of the state machines and into the coordinator. After some thought I ended up agreeing with your point that this is the responsibility of the coordinator, and it allows state machines to be composed in a simple way