Open iand opened 1 year ago
Thanks for the proposal! I think that's important to figure out.
Having SpanContext() trace.SpanContext
on the BehaviourEvent
sounds good to me!
What I would want us to be able to do is that the user can cancel the context that they passed into, for example, routing.Routing
's FindPeer
method, and then only return from that method after all in-flight requests + associated go-routines have stopped and returned. This is to avoid dangling goroutines.
I don't have an idea how this could work without storing contexts somewhere.
I think your proposal works well for query timeouts/deadlines, but we would need a different mechanism for externally triggered cancellations with the above "no dangling goroutines" requirement, right?
What I would want us to be able to do is that the user can cancel the context that they passed into, for example, routing.Routing's FindPeer method, and then only return from that method after all in-flight requests + associated go-routines have stopped and returned. This is to avoid dangling goroutines.
Cancellation should be handled by the caller sending an EventStopQuery
. They can set up a context with a timeout or cancel function that sends this event if the query hasn't already terminated.
Stopping in-flight requests at the libp2p level needs to be handled in the router, perhaps by adding a CancelRequest method. The query knows which queries it is waiting on and could emit a "cancel request" event for each one when it is stopping.
When an incoming event is queued by a behaviour's
Notify
method the context supplied in the method call is also queued alongside the event and reused when the event is actioned byPerform
. This was primarily intended to preserve the tracing context for the event so that the event and its consequent outbound events could be traced through the system of coordinator, behaviours and state machines. Secondarily it was intended to allow context cancellation to be effective through the state machines. (Originally the context would be checked for cancellation before actioning the associated event but this has been lost through refactorings)However these goals can only be attained if the context is consistently preserved everywhere. Currently the coordinator uses its own independent context when dispatching events between state machines and the events emitted by a behaviour's
Perform
method are done so without their associated context.Additionally this storing of the context can be harmful if the context is used for an event generated as a side effect, such as a rouuting notification that adds a node to the include queue. This should have its own independent context that is not subject to the parent context's cancellation.
We should remove the storage of context and use a different mechanism to carry tracing and cancellation metadata.
Proposal
Tracing
Extend BehaviourEvent to have a
SpanContext
method:A
SpanContext
holds the trace id, span id and other tracing flags that should be associated with the event. See spancontext in the specification.Each outbound event that is generated as a direct result of actioning an inbound event should copy the
SpanContext
to the new event. Functions that process an event should use theSpanContext
, for example:When an event is submitted to the coordinator's
Notify
method (from an external source or as a result of calling a helper method likeCoordinator.Bootstrap
) anSpanContext
should be created that using a method liketrace.SpanContextFromContext
.Cancellation/Deadlines
Events that initiate queries (
EventStartFindCloserQuery
,EventStartMessageQuery
) and broadcasts (EventStartBroadcast
) should include a Deadline field that can be used to specify a deadline for the query. The query state machines should use this to terminate the query once it has passed its deadline and the relevantwaitForQuery
orwaitForBroadcast
functions can use to create a context with an appropriate deadline.Events that initiate outbound network requests (
EventOutboundGetCloserNodes
andEventOutboundSendMessage
) should also carry a deadline, inherited from the query that ultimately generated the request event.