sociomantic-tsunami / swarm

Asynchronous client/node framework library
Boost Software License 1.0
14 stars 26 forks source link

`ISuspendable.resume` implementations must not cause context switching #370

Closed mihails-strasuns closed 5 years ago

mihails-strasuns commented 5 years ago

Ocean temporary workaround - https://github.com/sociomantic-tsunami/ocean/pull/658

Currently neo client suspendables (contrary to old ones) may start processing some data immediately when resume is called. This violates expectations of both throttler (tries to resume several suspendables in a simple loop) and task scheduler (expect no context switching when handling termination hooks).

Implementation of ISuspendable should instead mark itself as resumed but delay any actual processing until control is returned to the event loop.

gavin-norman-sociomantic commented 5 years ago

task scheduler (expect no context switching when handling termination hooks).

I do wonder whether this problem is isolated to the suspendable implementation, or whether that's just the concrete example that has shown this up. Everything in swarm is constantly switching fibers without epoll cycles.

mihails-strasuns commented 5 years ago

Right now there is a very limited set of termination hooks and suspendable throttling is the only one that can cause trouble afaics. Outside of termination hooks it is generally fine though practice has shown one can easily screw things up when managing fibers this way :)

gavin-norman-sociomantic commented 5 years ago

I actually don't really understand what the connection with termination hooks is. Could you give (or link to) a brief summary of the overall problem?

mihails-strasuns commented 5 years ago

Background: termination hooks are bunch of delegates that run after main task run method finishes. They are used to inject various actions that should semantically happen after task finishes - most common examples being delayedResume on another task inside await implementation or recycling itself task that comes from a TaskPool.

There are two important properties of termination hooks that limit how those can be used:

1) There is no well-defined order of how those may run. Adding new hooks and running them happens in LIFO order but any code can add a new hook at any moment - it is a public API. 2) While running hooks worker fiber is in a somewhat weird state - it is neither busy running a task (because it is already considered finished), nor free to take another one. Being able to query scheduler stats and/or task state during this moment will likely lead to confusing/wrong data.

Because of that scheduler documentation puts a hard restriction on termination hooks - those must never cause a context switch, ensuring that it is not possible to observe any ordering issues or inconsistent state from any other code.

Now the key thing here is that suspendable throttler has to call throttling method twice for each task - when it is about to start (to check if suspend is needed) and when it finished (to check if resuming is needed). Latter behavior is also naturally done as a termination hook.

With old swarm it worked just fine because resuming a client meant nothing but registering epoll event causing no other immediate effects. And considering throttler API also expected that (because it expected to be able resume multiple suspendables in a plain foreach at once) I felt safe to assume it doesn't violate scheduler requirements.

With swarm neo this is violated as immediate processing of request may result in another task spawned from callback. Reason why we noticed it is that after some scheduler refactoring recycling hook ended up being moved to happen before throttling hook - and apps started to hit situation where task is attempted to be reused from task pool again despite its termination hooks are still running. For now quick workaround was to hard-code recyling hook in ocean to be special and always happen last - but it is both dirty and unreliable.

Relevant links:

1) Termination hook logic: https://github.com/sociomantic-tsunami/ocean/blob/v4.x.x/src/ocean/task/Task.d#L465-L482 2) Throttler logic: https://github.com/sociomantic-tsunami/ocean/blob/v4.x.x/src/ocean/io/model/ISuspendableThrottler.d#L227 3) Hack to workaround swarm neo: https://github.com/sociomantic-tsunami/ocean/blob/v4.x.x/src/ocean/task/TaskPool.d#L121-L133

mihails-strasuns commented 5 years ago

Ping

gavin-norman-sociomantic commented 5 years ago

This will be addressed in #369.