probe-lab / zikade

A Go implementation of the libp2p Kademlia DHT specification
Other
8 stars 3 forks source link

Replace stored contexts with explicit tracing and cancellation metadata #67

Open iand opened 9 months ago

iand commented 9 months ago

When an incoming event is queued by a behaviour's Notify method the context supplied in the method call is also queued alongside the event and reused when the event is actioned by Perform. This was primarily intended to preserve the tracing context for the event so that the event and its consequent outbound events could be traced through the system of coordinator, behaviours and state machines. Secondarily it was intended to allow context cancellation to be effective through the state machines. (Originally the context would be checked for cancellation before actioning the associated event but this has been lost through refactorings)

However these goals can only be attained if the context is consistently preserved everywhere. Currently the coordinator uses its own independent context when dispatching events between state machines and the events emitted by a behaviour's Perform method are done so without their associated context.

Additionally this storing of the context can be harmful if the context is used for an event generated as a side effect, such as a rouuting notification that adds a node to the include queue. This should have its own independent context that is not subject to the parent context's cancellation.

We should remove the storage of context and use a different mechanism to carry tracing and cancellation metadata.

Proposal

Tracing

Extend BehaviourEvent to have a SpanContext method:

// SpanContext returns tracing information associated with the event.
SpanContext() trace.SpanContext

A SpanContext holds the trace id, span id and other tracing flags that should be associated with the event. See spancontext in the specification.

Each outbound event that is generated as a direct result of actioning an inbound event should copy the SpanContext to the new event. Functions that process an event should use the SpanContext, for example:

ctx, span := c.tele.Tracer.Start(trace.ContextWithSpanContext(ctx, ev.SpanContext()), "Coordinator.AddNodes")
defer span.End()

When an event is submitted to the coordinator's Notify method (from an external source or as a result of calling a helper method like Coordinator.Bootstrap) an SpanContext should be created that using a method like trace.SpanContextFromContext.

Cancellation/Deadlines

Events that initiate queries (EventStartFindCloserQuery, EventStartMessageQuery) and broadcasts (EventStartBroadcast) should include a Deadline field that can be used to specify a deadline for the query. The query state machines should use this to terminate the query once it has passed its deadline and the relevant waitForQuery or waitForBroadcast functions can use to create a context with an appropriate deadline.

Events that initiate outbound network requests (EventOutboundGetCloserNodes and EventOutboundSendMessage) should also carry a deadline, inherited from the query that ultimately generated the request event.

dennis-tra commented 9 months ago

Thanks for the proposal! I think that's important to figure out.

Tracing

Having SpanContext() trace.SpanContext on the BehaviourEvent sounds good to me!

Cancellation

What I would want us to be able to do is that the user can cancel the context that they passed into, for example, routing.Routing's FindPeer method, and then only return from that method after all in-flight requests + associated go-routines have stopped and returned. This is to avoid dangling goroutines.

I don't have an idea how this could work without storing contexts somewhere.

I think your proposal works well for query timeouts/deadlines, but we would need a different mechanism for externally triggered cancellations with the above "no dangling goroutines" requirement, right?

iand commented 9 months ago

What I would want us to be able to do is that the user can cancel the context that they passed into, for example, routing.Routing's FindPeer method, and then only return from that method after all in-flight requests + associated go-routines have stopped and returned. This is to avoid dangling goroutines.

Cancellation should be handled by the caller sending an EventStopQuery. They can set up a context with a timeout or cancel function that sends this event if the query hasn't already terminated.

Stopping in-flight requests at the libp2p level needs to be handled in the router, perhaps by adding a CancelRequest method. The query knows which queries it is waiting on and could emit a "cancel request" event for each one when it is stopping.