Closed davidbarsky closed 10 months ago
Thanks @davidbarsky for capturing this. I have written out a few more details on how we could integrate tokio-rs/simulation with Tokio.
tokio-rs/simulation is an initial POC bringing deterministic simulation testing support to tokio 0.2.0-alpha.6.
The way it currently works is by providing an `Environment` trait to users. Applications are made generic over the `Environment` trait.
#[async_trait]
pub trait Environment: Unpin + Sized + Clone + Send + 'static {
    type TcpStream: TcpStream + Send + 'static + Unpin;
    type TcpListener: TcpListener + Send + 'static + Unpin;

    /// Spawn a task on the runtime provided by this [`Environment`].
    fn spawn<F>(&self, future: F)
    where
        F: Future<Output = ()> + Send + 'static;

    /// Return the time now according to the executor.
    fn now(&self) -> time::Instant;

    /// Returns a delay future which completes after the provided instant.
    fn delay(&self, deadline: time::Instant) -> tokio_timer::Delay;

    /// Returns a delay future which completes at some time from now.
    fn delay_from(&self, from_now: time::Duration) -> tokio_timer::Delay {
        let now = self.now();
        self.delay(now + from_now)
    }

    /// Creates a timeout future which will execute T until the timeout elapses.
    fn timeout<T>(&self, value: T, timeout: time::Duration) -> tokio_timer::Timeout<T>;

    /// Binds and returns a listener which can be used to listen for new connections.
    async fn bind<A>(&self, addr: A) -> io::Result<Self::TcpListener>
    where
        A: Into<net::SocketAddr> + Send + Sync;

    /// Connects to the specified addr, returning a [`TcpStream`] which can be
    /// used to send and receive bytes.
    ///
    /// [`TcpStream`]:`TcpStream`
    async fn connect<A>(&self, addr: A) -> io::Result<Self::TcpStream>
    where
        A: Into<net::SocketAddr> + Send + Sync;
}
The desired outcome of this is an application which is generic over sources of nondeterminism, such as parallel scheduling, network and time. This works in practice for simple applications, and existing libraries such as Hyper which allow parameterization of the executor and `TcpListener`/`TcpStream` types can be adapted to use the `Environment` trait.
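As a rough illustration of what "generic over sources of nondeterminism" means, here is a minimal sketch that abstracts only the clock. It is not the real `Environment` trait: it is synchronous, it uses only the standard library, and `SimEnvironment`, `advance` and `lease_expired` are hypothetical names made up for this example.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical, trimmed-down stand-in for the `Environment` trait:
/// it abstracts only the clock and is synchronous for brevity.
pub trait Environment {
    /// Time elapsed since the environment was created.
    fn now(&self) -> Duration;
}

/// Deterministic environment: time only moves when the test says so.
pub struct SimEnvironment {
    now: Cell<Duration>,
}

impl SimEnvironment {
    pub fn new() -> Self {
        SimEnvironment { now: Cell::new(Duration::from_secs(0)) }
    }

    /// Moves the simulated clock forward by `by`.
    pub fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Environment for SimEnvironment {
    fn now(&self) -> Duration {
        self.now.get()
    }
}

/// Application logic written once, generic over the environment.
pub fn lease_expired<E: Environment>(env: &E, granted_at: Duration, ttl: Duration) -> bool {
    env.now() >= granted_at + ttl
}

fn main() {
    let env = SimEnvironment::new();
    let granted_at = env.now();
    let ttl = Duration::from_secs(30);
    assert!(!lease_expired(&env, granted_at, ttl));
    env.advance(Duration::from_secs(31));
    assert!(lease_expired(&env, granted_at, ttl));
}
```

The same `lease_expired` function could be driven by a wall-clock implementation in production, which is the property the `Environment` approach is after.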
However, while this can be made to work in some cases, there are 3rd party libraries for which this is not possible. It is also cumbersome to have to pass the `Environment` trait through the entire application. Lastly, changes to the public API in tokio 0.2.0 have removed some hooks which tokio-rs/simulation relied upon, motivating this proposal.
We propose adding simulation support directly to Tokio via a feature flag. This proposal assumes the feature flag will be the existing `test-util` flag, but that can be decided separately.
There are types which will need to be mocked when the feature flag is enabled.
When the feature flag is enabled, all runtime creation will use the `basic_scheduler`. The remote queue of the basic scheduler will be effectively disabled as scheduling tasks from remote threads would introduce nondeterminism. We could either log a warning or panic if scheduling from a remote thread was attempted.
By disabling the remote queue, it appears that the `basic_scheduler` will execute tasks in a deterministic order between program runs.
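To make the determinism claim concrete, here is a minimal sketch of a single-threaded executor with one local FIFO queue and no remote queue. It is a toy built on the standard library only, not Tokio's `basic_scheduler`; `run_all`, `YieldOnce` and `trace_once` are hypothetical names, and the executor simply re-polls pending tasks instead of honoring wakeups.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing: this toy scheduler re-polls every pending task.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Future that returns `Pending` once, then completes (a cooperative yield).
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Polls tasks from a single local FIFO queue until all of them complete.
fn run_all(mut queue: VecDeque<Task>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while let Some(mut task) = queue.pop_front() {
        if task.as_mut().poll(&mut cx).is_pending() {
            // Not done yet: go to the back of the queue.
            queue.push_back(task);
        }
    }
}

/// Runs three tasks and returns the order in which their steps executed.
fn trace_once() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut queue: VecDeque<Task> = VecDeque::new();
    for id in 0..3 {
        let log = Rc::clone(&log);
        queue.push_back(Box::pin(async move {
            log.borrow_mut().push(format!("task {} step 1", id));
            YieldOnce(false).await;
            log.borrow_mut().push(format!("task {} step 2", id));
        }));
    }
    run_all(queue);
    let trace = log.borrow().clone();
    trace
}

fn main() {
    // With a single queue and no cross-thread scheduling, every run
    // interleaves the tasks identically.
    assert_eq!(trace_once(), trace_once());
}
```

Because nothing outside the thread can push onto the queue, the interleaving depends only on the order in which tasks were enqueued.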
Each spawned task will be wrapped in a top-level task which will provide metadata about the task, and allow for manipulating the work done by the task. Metadata includes things like a logical location (datacenter/rack/machine), as well as hostname and IP address. On each `poll` invocation, the task metadata will be set on a well-known thread-local for other components to pick up. The task metadata can be used by fault injection, networking and disk IO to uniquely identify the task.
Tasks which are spawned via `tokio::spawn` will automatically inherit the parent's task metadata.
The task metadata will be optional to begin with, but in the future we would like to support users specifying it explicitly when setting up a simulation run.
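One way the thread-local handoff could look is sketched below, using only the standard library. `TaskMetadata`, `poll_with_metadata` and `inherited_metadata_for_child` are hypothetical names for illustration, and the closure stands in for a single `poll` of the wrapped task.

```rust
use std::cell::RefCell;

/// Hypothetical metadata attached to a simulated task.
#[derive(Clone, Debug, PartialEq)]
pub struct TaskMetadata {
    pub datacenter: String,
    pub rack: String,
    pub hostname: String,
}

thread_local! {
    /// Metadata of the task currently being polled on this thread, if any.
    static CURRENT: RefCell<Option<TaskMetadata>> = RefCell::new(None);
}

/// Returns a copy of the metadata for the task being polled right now.
pub fn current_metadata() -> Option<TaskMetadata> {
    CURRENT.with(|c| c.borrow().clone())
}

/// Stand-in for one `poll` of a wrapped task: installs `meta` for the
/// duration of `poll`, then restores whatever was set before.
pub fn poll_with_metadata<R>(meta: TaskMetadata, poll: impl FnOnce() -> R) -> R {
    let previous = CURRENT.with(|c| c.borrow_mut().replace(meta));
    let result = poll();
    CURRENT.with(|c| *c.borrow_mut() = previous);
    result
}

/// A child "spawned" during a poll captures the parent's metadata.
pub fn inherited_metadata_for_child() -> Option<TaskMetadata> {
    current_metadata()
}

fn main() {
    let meta = TaskMetadata {
        datacenter: "us-east-1a".to_string(),
        rack: "rack-1".to_string(),
        hostname: "server-1.svc.cluster.local".to_string(),
    };
    assert_eq!(current_metadata(), None);
    let child = poll_with_metadata(meta.clone(), inherited_metadata_for_child);
    assert_eq!(child, Some(meta));
    // Outside of a poll, no metadata is visible.
    assert_eq!(current_metadata(), None);
}
```

A real wrapper would also need to restore the previous value if the poll panics; that is omitted here to keep the sketch short.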
#[tokio::simulation]
async fn simulation(seed: u64) {
    tokio::simulation::spawn_process(
        "us-east-1a", "rack-1", "server-1.svc.cluster.local", spawn_server()
    );
    tokio::simulation::spawn_process(
        "us-east-1b", "rack-2", "server-2.svc.cluster.local", spawn_server()
    );
    tokio::simulation::spawn_process(
        "us-east-1c", "rack-3", "server-3.svc.cluster.local", spawn_server()
    );
    tokio::spawn(tokio::simulation::clog_connections("us-east-1c", "us-east-1b"));
    verify_linearizability().await;
}
The mock clock will be advanced automatically by the executor whenever `Park::park_timeout` is called. It will be advanced for exactly the `std::time::Duration` passed to `Park::park_timeout`. This provides two desirable properties for simulation testing.
`Timeout/Delay`. It's guaranteed that all tasks waiting on earlier timeouts execute before tasks which are waiting on later timeouts.
When the feature flag is enabled, the `TcpStream` and `TcpListener` types will be swapped out for types which register themselves with a process-global in-memory networking implementation. The existing `TcpStream` will wrap something akin to this `SimulatedTcpStream` type, and the existing `TcpListener` will return `TcpStream`s which wrap the `SimulatedTcpStream` type.
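Returning briefly to the mock clock described earlier: the following sketch shows how advancing the clock by exactly the parked duration fires timers in deadline order without waiting in real time. It uses only the standard library; `MockTimer` and its methods are hypothetical, and `park_timeout` here is a stand-in for the executor's call to `Park::park_timeout`, not Tokio's implementation.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::Duration;

/// Hypothetical mock clock plus timer queue.
pub struct MockTimer {
    now: Duration,
    /// Min-heap of (deadline, task id): earliest deadline first.
    timers: BinaryHeap<Reverse<(Duration, u32)>>,
}

impl MockTimer {
    pub fn new() -> Self {
        MockTimer { now: Duration::from_secs(0), timers: BinaryHeap::new() }
    }

    pub fn now(&self) -> Duration {
        self.now
    }

    /// Registers a timer for `task` that fires `from_now` after the current time.
    pub fn delay_from(&mut self, task: u32, from_now: Duration) {
        self.timers.push(Reverse((self.now + from_now, task)));
    }

    /// Stand-in for `park_timeout`: instead of sleeping, jump the clock
    /// forward by exactly `timeout`.
    pub fn park_timeout(&mut self, timeout: Duration) {
        self.now += timeout;
    }

    /// Parks until each next deadline and records (fire time, task id).
    pub fn run(&mut self) -> Vec<(Duration, u32)> {
        let mut fired = Vec::new();
        while let Some(Reverse((deadline, _))) = self.timers.peek().copied() {
            self.park_timeout(deadline.saturating_sub(self.now));
            // Fire every timer whose deadline has now been reached.
            while let Some(Reverse((d, task))) = self.timers.peek().copied() {
                if d > self.now {
                    break;
                }
                self.timers.pop();
                fired.push((self.now, task));
            }
        }
        fired
    }
}

fn main() {
    let mut timer = MockTimer::new();
    timer.delay_from(1, Duration::from_secs(30));
    timer.delay_from(2, Duration::from_secs(10));
    let fired = timer.run();
    // Task 2 (earlier deadline) fires before task 1, and no real time passes.
    assert_eq!(fired, vec![(Duration::from_secs(10), 2), (Duration::from_secs(30), 1)]);
}
```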
The `ToSocketAddrs` implementations which are gated behind the DNS feature flag will be swapped out to use the global in-memory networking implementation for DNS lookups. DNS will be derived from the task metadata.
pub struct SimulatedTcpStream {
    tx: mpsc::Sender<Bytes>,
    rx: mpsc::Receiver<Bytes>,
    staged: Option<Bytes>,
    shutdown: bool,
    local_addr: net::SocketAddr,
    peer_addr: net::SocketAddr,
}
`TcpListener::bind` will register a new connection queue with the global networking object, allowing subsequent `TcpStream::connect` calls to enqueue connection objects. `TcpListener::bind` will use the task metadata when binding, allowing for DNS resolution to occur.
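The bind/connect flow above could be sketched roughly as follows. This is a simplified, synchronous illustration using `std::sync::mpsc` rather than async channels; `Network` and `SimStream` are hypothetical names, and the registry is a plain value here instead of process-global state.

```rust
use std::collections::HashMap;
use std::io;
use std::net::SocketAddr;
use std::sync::mpsc::{channel, Receiver, Sender};

/// One end of a simulated connection: bytes go out on `tx`, come in on `rx`.
pub struct SimStream {
    pub tx: Sender<Vec<u8>>,
    pub rx: Receiver<Vec<u8>>,
    pub local_addr: SocketAddr,
    pub peer_addr: SocketAddr,
}

/// Hypothetical in-memory network: maps bound addresses to connection queues.
#[derive(Default)]
pub struct Network {
    listeners: HashMap<SocketAddr, Sender<SimStream>>,
}

impl Network {
    /// Registers a connection queue for `addr`; the receiver acts as the listener.
    pub fn bind(&mut self, addr: SocketAddr) -> io::Result<Receiver<SimStream>> {
        if self.listeners.contains_key(&addr) {
            return Err(io::Error::new(io::ErrorKind::AddrInUse, "address in use"));
        }
        let (tx, rx) = channel();
        self.listeners.insert(addr, tx);
        Ok(rx)
    }

    /// Enqueues the server half on the listener's queue and returns the client half.
    pub fn connect(&mut self, from: SocketAddr, to: SocketAddr) -> io::Result<SimStream> {
        let refused = || io::Error::new(io::ErrorKind::ConnectionRefused, "connection refused");
        let queue = self.listeners.get(&to).ok_or_else(refused)?;
        let (client_tx, server_rx) = channel();
        let (server_tx, client_rx) = channel();
        let server = SimStream { tx: server_tx, rx: server_rx, local_addr: to, peer_addr: from };
        queue.send(server).map_err(|_| refused())?;
        Ok(SimStream { tx: client_tx, rx: client_rx, local_addr: from, peer_addr: to })
    }
}

fn main() {
    let mut net = Network::default();
    let server_addr: SocketAddr = "10.0.0.1:8080".parse().unwrap();
    let client_addr: SocketAddr = "10.0.0.2:40000".parse().unwrap();

    let listener = net.bind(server_addr).unwrap();
    let client = net.connect(client_addr, server_addr).unwrap();
    let accepted = listener.recv().unwrap(); // "accept" the queued connection

    client.tx.send(b"ping".to_vec()).unwrap();
    assert_eq!(accepted.rx.recv().unwrap(), b"ping".to_vec());
    assert_eq!(accepted.peer_addr, client_addr);
}
```

In the proposal the addresses would come from the task metadata rather than being passed explicitly as they are in this sketch.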
Like networking, disk IO will be done using an in-memory mock filesystem. The mock filesystem will be namespaced by task metadata.
This is not needed initially, and has not been explored yet with tokio-rs/simulation. However, I'm including it in this proposal for completeness.
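Since this part is unexplored, the following is only a guess at the shape such a filesystem might take: an in-memory map keyed by a (hostname, path) pair so that each simulated machine sees its own files. `MockFs` and its methods are hypothetical, and the hostname stands in for the task metadata.

```rust
use std::collections::HashMap;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical in-memory filesystem, namespaced by the task's hostname.
#[derive(Default)]
pub struct MockFs {
    files: HashMap<(String, PathBuf), Vec<u8>>,
}

impl MockFs {
    /// Creates or overwrites `path` in the namespace of `host`.
    pub fn write(&mut self, host: &str, path: &Path, data: &[u8]) {
        self.files
            .insert((host.to_string(), path.to_path_buf()), data.to_vec());
    }

    /// Reads `path` as seen by `host`; other hosts' files are invisible.
    pub fn read(&self, host: &str, path: &Path) -> io::Result<Vec<u8>> {
        self.files
            .get(&(host.to_string(), path.to_path_buf()))
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such file"))
    }
}

fn main() {
    let mut fs = MockFs::default();
    let path = Path::new("/var/lib/app/state");
    fs.write("server-1", path, b"term=3");
    assert_eq!(fs.read("server-1", path).unwrap(), b"term=3".to_vec());
    // The same path on another simulated machine does not exist.
    assert!(fs.read("server-2", path).is_err());
}
```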
Beyond providing a deterministic environment for executing simulations, it's also desirable to expose an API for fault injection. Scheduling, networking and disk IO all depend on a process-global state store to back the simulated types they expose.
We can take advantage of this global state to also support an API for fault injection. Targeted faults can be injected based on the spawned task metadata by user tasks. This allows for users to write fault injectors as normal async tasks.
async fn swizzle_clog_fault() {
    loop {
        let connections: Vec<_> = tokio::simulation::get_connections();
        // Clog connections one at a time, 10s apart.
        for connection in &connections {
            connection.clog();
            warn!("clogging connection {:?}", connection);
            tokio::timer::delay_from(time::Duration::from_secs(10)).await;
        }
        warn!("waiting 30s before unclogging connections {:?}", connections);
        tokio::timer::delay_from(time::Duration::from_secs(30)).await;
        // Unclog them in the same order, again 10s apart.
        for connection in &connections {
            connection.unclog();
            warn!("unclogging connection {:?}", connection);
            tokio::timer::delay_from(time::Duration::from_secs(10)).await;
        }
    }
}
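The proposal does not define what `clog()` and `unclog()` do to a connection. One plausible reading, sketched here with the standard library only, is that a clogged link holds messages back and releases them in order once unclogged. `SimLink` and its methods are hypothetical.

```rust
use std::collections::VecDeque;

/// Hypothetical one-directional link that a fault injector can clog.
#[derive(Default)]
pub struct SimLink {
    clogged: bool,
    /// Messages sent while clogged, waiting to be released.
    held: VecDeque<Vec<u8>>,
    /// Messages that have reached the receiver, in arrival order.
    delivered: Vec<Vec<u8>>,
}

impl SimLink {
    /// Starts holding back all subsequently sent messages.
    pub fn clog(&mut self) {
        self.clogged = true;
    }

    /// Stops holding messages and releases the backlog in send order.
    pub fn unclog(&mut self) {
        self.clogged = false;
        self.delivered.extend(self.held.drain(..));
    }

    /// Delivers immediately, or queues the message if the link is clogged.
    pub fn send(&mut self, msg: &[u8]) {
        if self.clogged {
            self.held.push_back(msg.to_vec());
        } else {
            self.delivered.push(msg.to_vec());
        }
    }

    pub fn delivered(&self) -> &[Vec<u8>] {
        &self.delivered
    }
}

fn main() {
    let mut link = SimLink::default();
    link.send(b"a");
    link.clog();
    link.send(b"b"); // held back, not lost
    assert_eq!(link.delivered().len(), 1);
    link.unclog();
    assert_eq!(link.delivered().len(), 2);
}
```

Holding rather than dropping messages models a slow or partitioned network that later heals; a dropping variant would model packet loss instead.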
Some of the faults that will be supported:
`yield()` at the root of a task). This can be used to simulate slow machines or processes.
This is a fantastic write-up @gardnervickers! I am super excited to start seeing some of this stuff being used.
I think overall the proposal is very very solid. The one area I am a bit fuzzy on is how we integrate with tokio directly. It may make sense to explore providing an enumed TcpStream that matches the tokio api 1-1 that lives within the simulation crate. This way we could move away from the giant `Environment` trait and explore what it might look like to have some of this code live within tokio and enabled via a feature flag.
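For readers unfamiliar with the "enumed TcpStream" idea, here is a minimal synchronous sketch of it: one concrete type with a real and a simulated variant, exposing the same `Read`/`Write` surface. It wraps `std::net::TcpStream` rather than Tokio's async type to stay self-contained, and `SimulatedStream` is a hypothetical loopback buffer.

```rust
use std::collections::VecDeque;
use std::io::{self, Read, Write};

/// Hypothetical in-memory stand-in: bytes written are read back in order.
#[derive(Default)]
pub struct SimulatedStream {
    buf: VecDeque<u8>,
}

/// One concrete type, so callers need no generic `Environment` parameter.
pub enum TcpStream {
    Real(std::net::TcpStream),
    Simulated(SimulatedStream),
}

impl Read for TcpStream {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        match self {
            TcpStream::Real(s) => s.read(out),
            TcpStream::Simulated(s) => {
                // Copy as many buffered bytes as fit into `out`.
                let n = out.len().min(s.buf.len());
                for (slot, byte) in out.iter_mut().zip(s.buf.drain(..n)) {
                    *slot = byte;
                }
                Ok(n)
            }
        }
    }
}

impl Write for TcpStream {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        match self {
            TcpStream::Real(s) => s.write(data),
            TcpStream::Simulated(s) => {
                s.buf.extend(data.iter().copied());
                Ok(data.len())
            }
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        match self {
            TcpStream::Real(s) => s.flush(),
            TcpStream::Simulated(_) => Ok(()),
        }
    }
}

fn main() {
    let mut stream = TcpStream::Simulated(SimulatedStream::default());
    stream.write_all(b"hello").unwrap();
    let mut buf = [0u8; 5];
    stream.read_exact(&mut buf).unwrap();
    assert_eq!(&buf, b"hello");
}
```

The trade-off against the trait-based approach is a branch on every call in exchange for not threading a type parameter through the whole application.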
I can't speak to the details of the tokio integration, but in general this seems like a great direction. It certainly would be very valuable for tokio to have good support for in-memory testing (or every developer or protocol implementor would need to separately implement their own).
Will try to follow along!
Since simulation has been archived, I'll close this.
Despite the relatively limited use that tokio-rs/simulation has seen so far, @gardnervickers has been able to reproduce several complex bugs in H2 and Tonic. To make tokio-rs/simulation more accessible to more people, I think it's worth opening a discussion on re-exporting simulation's networking types from Tokio. I believe (and I might be misquoting @gardnervickers!) that the desired end state for this feature would enable a library like Hyper to be deterministically simulated just by changing which feature flags are enabled within Tokio.
To enable this, I think that Tokio will need to make the following changes:
`tokio::runtime::Builder::clock` to allow for setting of other hooks in Tokio so as to enable deterministic simulation.
`FaultyTcpStream` through a rename and re-export under the name of the non-simulated type (in this case, `tokio::net::TcpStream`) if a feature flag enabling `simulation` is toggled.
A discussion between @gardnervickers, @carllerche, and I covering how this would be implemented. It's slightly messy, I apologize about that.