Add state machine coordinator and query state machine

iand commented 1 year ago

This change promotes the state machine logic from the statemachine example to the main codeline.

The main additions are:

a query implementation using a deterministic hierarchical state machine
a coordinator that presents an event-oriented interface for users to interact with the query state machine

The intention is to add further state machines to the coordinator to implement other features such as routing table refresh.

See the state machine rationale and design document.

dennis-tra commented 1 year ago

Some things I noticed:

the inboundEvents/outboundEvents channels should be unbuffered.
I think it would be nicer if the coordinator had not a single outboundEvents channel but return channels that the caller then subscribes to on a per-request basis. E.g., FindNode would return a channel. Right now, there's this indirection through the event loop of the IpfsDht. The coordinator would handle the I/O to the caller (the design also states the coordinator does the I/O).
- The IpfsDht wouldn't need to keep track of ongoing queries and their "waiters."
We discussed this, and you had good reasons to keep it as it is, but I felt again that it would be great if the caller wouldn't need to generate a query ID. This can be handled by either the coordinator/pool?
Stopping queries doesn't cancel the requests. It just marks them as finished. The resources should be cleaned up.
the NodeIter.Each pattern breaks the state machine pattern. Alternatively, an iter could implement a Next() and Reset() methods or just a method that returns the list of nodes directly. That way, the query could just normally iterate over them, and you wouldn't need to capture local variables in a closure.
what's the heartbeat for?
why queries and queriesIndex on QueryPool?

iand commented 1 year ago

the inboundEvents/outboundEvents channels should be unbuffered.

A small outbound buffer can smooth over temporary differences in read/send speed and it still blocks if the reader falls behind. See below as well.

Inbound could be unbuffered.

I think it would be nicer if the coordinator had not a single outboundEvents channel but return channels that the caller then subscribes to on a per-request basis. E.g., FindNode would return a channel. Right now, there's this indirection through the event loop of the IpfsDht. The coordinator would handle the I/O to the caller (the design also states the coordinator does the I/O). The IpfsDht wouldn't need to keep track of ongoing queries and their "waiters."

This gives the coordinator a lot more state to manage. It would still need a general purpose outbound event channel to notify events that are not for specific requests (e.g. routing table updates. rust libp2p also emits events to notify of unroutable peers).

I can see how it would be easier to work with but what should the coordinator do if it cannot send to one of the channels? Should it block? It can't skip sending the message and waiting in a new goroutine isn't viable if we want to keep resource utilization deterministic.

Of course it faces the same problem with a single outbound queue, however the usage pattern there makes it less likely that the caller will stop reading from the channel. If they do then nothing makes progress in their application.

We discussed this, and you had good reasons to keep it as it is, but I felt again that it would be great if the caller wouldn't need to generate a query ID. This can be handled by either the coordinator/pool?

Is there a good case for random identifiers over meaningful ones?

Stopping queries doesn't cancel the requests. It just marks them as finished. The resources should be cleaned up.

The requests will run to completion and no more will be issued. What cancelation mechanism would be used?

the NodeIter.Each pattern breaks the state machine pattern. Alternatively, an iter could implement a Next() and Reset() methods or just a method that returns the list of nodes directly. That way, the query could just normally iterate over them, and you wouldn't need to capture local variables in a closure.

An iter is just a list of nodes that can be enumerated in a specific order. Its not a state machine. Originally it was but it added much more complexity with almost no new value. Each state was duplicated almost one to one in the query. Adding Next would require it to hold iteration state. Each is a straightforward way to enumerate the list.

what's the heartbeat for?

To allow the state machine to make progress for time-sensitive states such as timeouts. If all queries had stalled requests then zero progress could be made since there would be no incoming event to trigger a state change.

why queries and queriesIndex on QueryPool?

Queries must be processed in order so this maintains an ordering while allowing random lookup for responses.

iand commented 1 year ago

@dennis-tra and I discussed this PR on a call.

We're going to open an issue for cancelling outbound requests cleanly.

Also discussed tighter integration with the action/scheduler/queue system already in place. We should be able to unify the two models but I tried and failed a couple of times during development of the original example. We have approach that should work and Dennis is going to take a look.

dennis-tra commented 1 year ago

Created issue for canceling outbound requests. Gave it a try to unify both, but also ran into issues.

dennis-tra commented 1 year ago

I had another thought about the IpfsDht consuming events in a single loop and I'm still not convinced this is the right way to do it. Don't we effectively move from a single event loop to two event loops?

My alternative proposal was to let StartQuery return a channel scoped to the query instead (I'm also not super convinced that this is great).

You had several arguments in favor of a single loop reading the outboundEvents channel. The strongest one I think is that it's less likely for implementers to not drain the channel. In an ephemeral query-scoped events channel implementers could forget to drain it.

Some things that are floating around in my mind in favour of query-scoped channel:

I think API ergonomic-wise this is easier to implement - less cognitive load by passing the events around
The API contract could state to always drain the channel until it's closed by go-kademlia
I'm also concerned with performance (it might be a little early for this) but passing events in a loop between three channels feels odd.

probe-lab / go-kademlia

Add state machine coordinator and query state machine #74