radicle-dev / radicle-link

The second iteration of the Radicle code collaboration protocol.

librad: task cancellation #650

Closed kim closed 3 years ago

kim commented 3 years ago

This is the (hopefully) last attempt to ensure orderly shutdown of spawned tasks. It naturally touches a lot of files, but fret not: the relevant pieces are mainly in librad/src/executor.rs and librad-test/src/rad/testnet.rs.

Problem Overview

Task cancellation is ill-defined (ahem, an open problem) in contemporary async Rust runtimes, particularly tokio: when the JoinHandle to a spawned task is dropped, the task becomes detached and there is no way to join or abort it.

The task itself does not get notified of runtime shutdown, so it needs to ensure it stops by some other means -- by default, the runtime blocks on the completion of any outstanding tasks when it shuts down.

Even if we manage to call abort on a JoinHandle, this will not ensure prompt cancellation: the task could currently be scheduled to run, and there is no way for the runtime to preempt it. That is, even after aborting a task, we need to poll the JoinHandle at least once more, or, if the handle is lost, wait for the entire runtime to terminate.

Lastly, aborting a task does not mean that any JoinHandles the task is await-ing will be aborted, too -- instead, they are going to become detached.

All of this is arguably error-prone, but not so much of a problem for a program with a single entry point, which shuts down whenever the "main loop" it spawns terminates. It is a problem, however, when the program wishes to restart such a main loop, but only after the previous instance has fully released all resources.

Solution Overview

We define a thin wrapper around the current runtime, executor::Spawn, which shall be the single means of spawning tasks throughout the library. This type is not publicly exported (we'll see why later).

Spawn adopts the JoinHandle semantics of async-std: when the handle is dropped, the task gets aborted. It is possible to "detach" a task, in which case Spawn internally keeps track of it. When Spawn::shutdown is called, it is no longer possible to spawn new tasks (detached or not), the outstanding detached tasks get aborted, and the call blocks until all outstanding tasks have terminated.

This does not present a user-visible task supervision framework, but allows the stateful objects of the library to keep track of their own tasks, and block on their completion when dropped. It is not necessary for library maintainers to define their tasks in a particular cancellation-aware way.

Subtleties

We currently have two such stateful objects: one being the protocol stack, and the other a handle to it (net::Peer). It is possible for the handle to outlive the protocol stack (and continue to communicate with it in case it gets restarted). The handle may, however, spawn tasks itself, as it currently manages the storage pool. For these reasons, we allocate an independent Spawn for each of them -- which means that this is all internal state management, and the library user never gets to see any of those Spawn instances.

For all of this to work, it is crucial that all librad code is executed in an async context (i.e. inside Runtime::block_on). This will usually behave as expected in single-entry main() programs, but requires managing the runtime lifecycle in the multi-peer integration tests: since they hold on to temporary files, it is advisable to wait for complete task shutdown (that is, runtime termination) before removing them.

Note that morally, these tests should allocate a separate runtime per logical peer, in order to approximate the real-world multi-process behaviour. This would, however, require considerable boilerplate for test writers to ensure operations are scheduled onto the right peer's event loop, so we opted for a single runtime per test instead.

Collaterals

2029473319c961e326d774683acda49d2106f274 is reverted in the process, as it would subvert the task-tracking introduced here.

The logging verbosity of the test suite has been reduced in CI, as the amount of log output has exceeded a useful volume.

Signed-off-by: Kim Altintop kim@monadic.xyz

kim commented 3 years ago

The infuriating part of this is that it is apparently not possible to wake our background future from a Drop impl for Spawn, even when keeping track of a Waker. I have no explanation as to why this is different from calling shutdown() explicitly, other than Drop being allowed to change evaluation order (contradicting the reference).

kim commented 3 years ago

If you have time, could you maybe run this in a few different ways trying to see if it breaks (that is, fails to release resources eventually)? I can observe that the tests now shut down orderly, but haven’t gotten around to killing off some network interfaces f.ex.

I’ll see if I can figure out the wake-on-drop issue at a later point, but I think this might be a win overall already.

FintanH commented 3 years ago

I used what @geigerzaehler was using for stressing resource usage: https://archlinux.org/packages/community/x86_64/stress/.

Running on master, I was able to reproduce the errors where git couldn't find temporary directories:

May 05 16:05:59.407 ERROR librad::git::refs: error=failed to create temporary file '/run/user/1001/.tmptNWOKP/git/objects/tmp_object_git2_kG0REN': No such file or directory; class=Os (2)
May 05 16:05:59.426 ERROR librad::git::replication::project: error=failed to create temporary file '/run/user/1001/.tmptNWOKP/git/objects/tmp_object_git2_kG0REN': No such file or directory; class=Os (2)
May 05 16:05:59.426 ERROR librad::git::replication::project: error=failed to create temporary file '/run/user/1001/.tmptNWOKP/git/objects/tmp_object_git2_kG0REN': No such file or directory; class=Os (2)
May 05 16:05:59.430 ERROR librad::git::replication: error=failed to create temporary file '/run/user/1001/.tmptNWOKP/git/objects/tmp_object_git2_kG0REN': No such file or directory; class=Os (2)

I then switched to kim/cancel-culture and ran the test suite 3 times without any failures.

kim commented 3 years ago

Unfortunately, a number of the daemon tests fail to terminate after rebasing.