Deterministic execution checks

tokio-rs / simulation

Framework for simulating distributed applications

96 stars 9 forks source link

Deterministic execution checks #2

Open gardnervickers opened 5 years ago

gardnervickers commented 5 years ago

In the StrangeLoop FoundationDB talk, Will Wilson mentions the need for re-running simulations with the same deterministic seed and comparing the execution in order to detect sources of nondeterminism which have sneaked into the application.

It would be good to support this, if even at a basic level. One idea that has been floated is to use Tracing to collect a globally ordered set of execution events. This could serve as the basis for comparison to ensure that for multiple runs of the same seed, execution histories are identical.

davidbarsky commented 4 years ago

I'm not sure that tracing is the ideal mechanism for tracing executions of simulations because tracing is primarily meant for collecting instrumentation data that can be arbitrarily queried over. This isn't to say that it can't be used for that, but I think pushing tracing towards this use-case might be more effort than getting each simulated component to insert an enum describing the operation to a global/lazy-static'd vec.

gardnervickers commented 4 years ago

@davidbarsky I think that makes sense, the reason I was drawn to using tracing was that I wanted to allow users to interleave their own application specific events into an execution history. I agree that it's less than ideal to force tracing to serve this purpose.

I think instead, we could achieve the same thing by using a global epoch which events can incorporate. Something like {timestep}-{epoch} where timestep is the mock time value. The reason we can't use mock time alone is that multiple events can fire in the span of a single timestep.

davidbarsky commented 4 years ago

We spoke on Discord about this, but to summarize: I think a priority queue of events will be sufficient. The events will be inserted into the priority queue during a simulation run. "Priority" of an event is a monotonic instant/integer that allows a user/this system to determine a total ordering of events during a given simulation.

This approach is outlined in greater detail here: https://lobste.rs/s/igiolo/learning_build_distributed_systems#c_nlpl7r

bIgBV commented 4 years ago

I think the same approach would work for not only network events but IO events as well, and will let us simulate storage failures.

A good implementation strategy would be to have each mock Async* object maintain it's own event priority queue. This can then be used with a global fault injector to drop events/delay them to induce failures.

thisismiller commented 4 years ago

The easy way to do this is to, right before the test ends, print/log the random number generator's next value. If two runs produce the same final random number, they they're very likely to have identical executions. FoundationDB calls this the unseed.