Reuse domains rather than spawning them for each parallel test

OlivierNicole commented 4 months ago

Given that domains are costly to spawn and join, and parallel tests in multicoretests do it many thousands of times, I wanted to look at it more closely. A quick perf profile on src/array/lin_tests.ml shows that more than 50 % of the time is spent in Domain.spawn. I wanted to try spawning domains only at the start of the test sequence and reusing them. Then I realised that this is exactly what Domainslib provides with domain pools!

This proof of concept PR simply replaces Domain.spawn with Domainslib.Task.async and Domain.join with Domainslib.Task.await. The approach implies to setup a domain pool and pass it to the tests. I have edited src/array/{lin,stm}_tests.ml to reflect this.

The results are rather encouraging: src/array/lin_tests.ml is cut down from about 4.5 s to 0.3 s! (With extremely far off statistical outliers due to bad luck in interleaving search.) However, this is a favourable case as the test’s commands are very cheap to run. The same comparison on my extensive Dynarray STM parallel test gives 24 s vs 12 s for the version with domain reuse, so merely a 2x speedup.

If depending on Domainslib is an issue, the mechanism can be reimplemented using atomics, mutexes and condition variables easily enough. (A reimplementation would likely squeeze some more performance out by not requiring a work-stealing queue.)

It looks like about 20 % of the time is now taken by Sys.cpu_relax, so further improvement may be possible still.

OlivierNicole commented 4 months ago

I removed the dependency to Domainslib to use Stdlib mutexes. Since @jmid mentioned in a discussion that he wasn’t fond of the user having to wrap the test runner in a call to Util.Domain_pair.run (fun pair -> ...), I also tried to reuse domains only for intra-test repetitions, as he suggested. I measured the following run times (s):

	`src/array/lin_tests.ml`	`src/array/stm_tests.ml`
main (c65989d)	4.796 s ± 2.306	2.640 s ± 5.203
Intra-test reuse of domains	2.543 s ± 0.751 (speedup 1.89)	2.272 s ± 1.925 (speedup 1.16)
Maximal reuse of domains	716.9 ms ± 725.0 (speedup 6.67)	1.270 s ± 1.004 (speedup 2.08)

@jmid also argued that spawning a pair of Domains for each pair of parallel command sequences allows to test them for a clean state. Although it’s true that it means an initially empty minor heap, I can’t think of tests where an initial empty minor heap is a better test context.

OlivierNicole commented 4 months ago

(With some advice and help from @fabbing to switch to mutexes!)

jmid commented 4 months ago

Thanks Olivier! Quick observation: why does this cause so many Failure("failed to allocate domain") errors? Overall, this should allocate less Domains - not more. Are there conditions under which the spawned Domains are not joined? (that could explain why several tests are running out of them)

OlivierNicole commented 4 months ago

Strange, it should indeed spawn strictly less domains.

Overall, this should allocate less Domains - not more. Are there conditions under which the spawned Domains are not joined?

Normally, no, the Util.Domain_pair.run function joins the domains before returning. I’ll try to take a look the next time I have some time on my hands.

jmid commented 4 months ago

Thanks again for this! :pray:

From your numbers it seems we should be able to get a decent speed-up, even if we just retain the pool to the property.
I think I found the issue with running out of Domains: the property raises a QCheck exception on failure to report the observations. When doing so, I believe the exception meant that the underlying Domain's wheren't joined in the end.

OlivierNicole commented 4 months ago

Well spotted, thanks!

There seems to be a lot of failures still, I’ll have a look at it on my spare time at some point…

jmid commented 4 months ago

I think I figured this one out too: the hint was STM tests with non-trivial precond triggering failed to allocate domain. Preconditions are checked before Domains running, and if the precondition fails, it is signaled to QCheck ... with an exception, thereby causing takedown not to run :shrug: With that in mind, it makes sense to run the precondition check first, before attempting pool init.

The opam install workflows will still fail, due to util.ml now requiring OCaml 5.

jmid commented 4 months ago

@jmid also argued that spawning a pair of Domains for each pair of parallel command sequences allows to test them for a clean state. Although it’s true that it means an initially empty minor heap, I can’t think of tests where an initial empty minor heap is a better test context.

There are a few, e.g., tests of Weak and Ephemeron where the state of the heap may affect the behaviour of a test run. Another example where Domain-reuse may bite is in tests of Domain.DLS, where the state is attached to the Domain itself.

I think we misunderstood eachother however: Generally, I'm fond of self-contained properties, as they are central to reproduce errors and have shrinking work well. Starting to depend on the state of reused Domains, challenges this, if the state of the underlying pool may start to influence things. From this POV it is preferable to decouple the testing of one triple from the next.

For end-users, and assuming the runtime is bug-free, Domain reuse may be less of a worry. However, as we are foremost concerned with testing and helping ensure the correctness of the OCaml 5 runtime, I'm thinking of whether Domain-reuse should be optional... :thinking:

OlivierNicole commented 3 months ago

I see, it makes sense to me too, then, to make domain re-use an option.

ocaml-multicore / multicoretests

Reuse domains rather than spawning them for each parallel test #457