pitest / pitest-junit5-plugin

JUnit 5 test framework support for Pitest
Apache License 2.0

Serialize tests during coverage calculation #91

Open Vampire opened 1 year ago

Vampire commented 1 year ago

Depends on #90
Fixes #73
Fixes hcoles/pitest#760

This generically fixes hcoles/pitest#760 and #73 for all platform engines, removing the Jupiter-specific work-around from #74 and serializing test execution during coverage calculation using locks. This way the tests can also properly run in parallel later on during mutant hunting.

I tested it extensively with the MCVE from hcoles/pitest#760 and also with the same MCVE ported to Spock with parallel test execution, and consistently got the correct result of 6 killed mutants across several hundred runs (the 7th mutant is immortal).

hcoles commented 1 year ago

@Vampire Do you have the spock version of the MCVE available anywhere?

Vampire commented 1 year ago

Sure, I expanded the given MCVE to add Spock: pitest-issue-760-with-spock.zip It currently points to snapshot versions for PIT and the junit5 plugin.

hcoles commented 1 year ago

@Vampire Just been taking another look at this.

Unfortunately I think the idea of stopping parallel execution during coverage, but allowing it during mutation testing, is flawed. It risks the killing test being misidentified, which could give incorrect results when history recording is enabled.

I'm also still uncomfortable about the large number of locks required to prevent parallel running. I think the current solution of automatically disabling parallel running by setting a junit configuration value is preferable. Although it doesn't work for Spock, this could presumably be rectified by adding the correct setting. In theory we'd have to do this for an open ended number of test runners, but in practice there are very few in use.
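
For reference, forcing such a configuration value looks roughly like this when building the launcher discovery request. The key junit.jupiter.execution.parallel.enabled is Jupiter's real parallel-execution switch; the surrounding class and method are only an illustration, not the plugin's actual code:

```java
import static org.junit.platform.engine.discovery.DiscoverySelectors.selectClass;

import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder;

class SequentialDiscoveryRequestSketch {

    // Illustration only: build a discovery request that forces Jupiter back to sequential
    // execution by overriding its configuration parameter; request-level parameters take
    // precedence over junit-platform.properties.
    static LauncherDiscoveryRequest sequentialRequestFor(Class<?> testClass) {
        return LauncherDiscoveryRequestBuilder.request()
                .selectors(selectClass(testClass))
                .configurationParameter("junit.jupiter.execution.parallel.enabled", "false")
                .build();
    }
}
```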

Vampire commented 1 year ago

Unfortunately I think the idea of stopping parallel execution during coverage, but allowing it during mutation testing, is flawed. It risks the killing test being misidentified, which could give incorrect results when history recording is enabled.

Would you mind elaborating where you see that risk? For the coverage calculation, definitely, as it is collected in static state so parallel execution is a killer. But for identifying the killing test you call TestUnit#execute with a ResultCollector that records the results of the tests. This ResultCollector is called in the JUnit5TestUnit from a JUnit platform TestExecutionListener, so whether the tests are run in parallel or sequentially should not make any difference.

You just have to make sure you properly handle the calls in the result collector in a thread-safe way. Indeed this is currently not the case, but from a cursory look I'd say changing the ArrayLists in the two Container implementations to a thread-safe collection will already be enough for that. You should probably do this anyway, as there could be other PIT engine plugins out there that call the collector from varying threads and not always from the same one.
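
For illustration, this is the kind of change meant; the container class and its fields here are made up, not the plugin's actual code:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical result container: with a plain ArrayList, concurrent notifications from
// several test threads can corrupt its state; a concurrent collection makes the adds safe
// without any extra locking.
class RecordingContainer {

    private final List<String> startedTests = new CopyOnWriteArrayList<>();
    private final List<String> failedTests = new CopyOnWriteArrayList<>();

    void testStarted(String testName) {
        startedTests.add(testName);
    }

    void testFailed(String testName) {
        failedTests.add(testName);
    }

    List<String> failedTests() {
        return List.copyOf(failedTests);
    }
}
```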

I'm also still uncomfortable about the large number of locks required to prevent parallel running.

And I'm still not sure what you mean.

For Jupiter tests you will have exactly 1 lock, the global lock, as there are no nodes with type CONTAINER_AND_TEST, so the situation seen with Spock cannot happen. So all tests use the same 1 global lock.

For Spock, there will be at most 2 locks locked at the same time. All features (i.e. test methods) are locked with the same 1 global lock. For each data-driven feature (i.e. the ones with a where: clause) there will be one additional lock, but only one of them will be locked at any time as the features are all serialized on the global lock. So actually I can change the code so that all iterations of data-driven methods use the same lock, so in the Spock case there will even be exactly 1 or 2 locks existing then.

It is just implemented the way it is to cover all platform engines someone could sanely come up with. But the number of locks is actually minimal and never a "large number".
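
To make the mechanism concrete, here is a rough sketch of the serialization idea as a JUnit Platform TestExecutionListener. It is not the PR's actual code, only the scheme described above with illustrative names, and it would only be registered during coverage calculation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

import org.junit.platform.engine.TestExecutionResult;
import org.junit.platform.launcher.TestExecutionListener;
import org.junit.platform.launcher.TestIdentifier;

// Sketch only: every TEST node acquires a permit before it starts and releases it when it
// finishes, so at most one test body runs at a time. A node that is both container and test
// (e.g. a Spock data-driven feature) registers an extra permit for its iterations, which is
// the "at most 2 locks at the same time" case described above.
class SerializingExecutionListener implements TestExecutionListener {

    private final Semaphore globalPermit = new Semaphore(1);
    private final Map<String, Semaphore> childPermits = new ConcurrentHashMap<>();

    @Override
    public void executionStarted(TestIdentifier id) {
        if (id.isTest()) {
            permitFor(id).acquireUninterruptibly();
        }
        if (id.isContainer() && id.isTest()) {
            // Children of this node (the iterations) will serialize on their own permit.
            childPermits.put(id.getUniqueId(), new Semaphore(1));
        }
    }

    @Override
    public void executionFinished(TestIdentifier id, TestExecutionResult result) {
        if (id.isContainer() && id.isTest()) {
            childPermits.remove(id.getUniqueId());
        }
        if (id.isTest()) {
            permitFor(id).release();
        }
    }

    private Semaphore permitFor(TestIdentifier id) {
        // Iterations whose parent registered a child permit use that permit; everything else
        // shares the single global permit.
        return id.getParentId().map(childPermits::get).orElse(globalPermit);
    }
}
```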

I think the current solution of automatically disabling parallel running by setting a junit configuration value is preferable.

Well, our opinions greatly differ here and I still hope I can convince you. :-) From what I have seen so far, there is no real reason to forbid parallel execution, except for during coverage calculation.

Although it doesn't work for Spock, this could presumably be rectified by adding the correct setting. In theory we'd have to do this for an open ended number of test runners, but in practice there are very few in use.

There is no setting you can set that overrides it for Spock, as far as I recall. There is the system property spock.parallel.enabled, but that is only used as a default if the Spock config file does not configure a value. If the Spock config file sets a value, it wins.

There is also no setting on the JUnit platform level I'm aware of that could disable this, because it is the test engine that decides whether to use a ForkJoinPoolHierarchicalTestExecutorService instance or a SameThreadHierarchicalTestExecutorService instance, and with which parallelism settings. If the parallelism setting is calculated from the available CPU cores, and you are running on a JVM that has the right -XX:... switch available, you could restrict the available CPU cores. But the parallelism could just as well be set to some fixed value, and then restricting the available CPU cores would be pointless. And even if the JUnit platform had or introduced a platform-level switch to disable parallel running, that would only help for those engines that actually use this facility. A test engine can practically do whatever it wants and use a completely separate implementation of its own for whatever reason.

I really do think that using the locks during coverage calculation and accepting parallel execution later on is the proper way to go. It is imho easier than always reacting to a new engine being brought up, first getting it to add an override setting, then using it, and so on, when this solution should just generally work for practically any sane implementation someone can come up with, without the need to do anything further.

The non-thread-safe parts that could fail, like the mentioned ArrayList or the other spots you already accepted my PRs for, should be fixed anyway, as you never know what a test engine does, or whether it will run all tests on the same thread even when running sequentially. There is for example right now a [bug / strange behavior / unexpected behavior / depends on PoV how you call it] that causes too many tests to run in parallel when parallel execution is enabled. The cause is in the platform code, so for example Jupiter and Spock are both affected the same way. The work-around for both is also the same: have an interceptor that does the actual test execution on a different thread / thread pool than the ForkJoinPool the test scheduling happens on.
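
For illustration, a minimal Jupiter-flavoured sketch of such an interceptor (the class and executor names are made up; the Spock variant would use a method interceptor instead, but the offloading idea is the same):

```java
import java.lang.reflect.Method;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.InvocationInterceptor;
import org.junit.jupiter.api.extension.ReflectiveInvocationContext;

// Illustration only: run the actual test body on a dedicated pool instead of the ForkJoinPool
// worker thread the platform schedules tests on, and rethrow any failure on the calling thread.
class OffloadingInterceptor implements InvocationInterceptor {

    private static final ExecutorService TEST_EXECUTOR = Executors.newCachedThreadPool();

    @Override
    public void interceptTestMethod(Invocation<Void> invocation,
                                    ReflectiveInvocationContext<Method> invocationContext,
                                    ExtensionContext extensionContext) throws Throwable {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        TEST_EXECUTOR.submit(() -> {
            try {
                invocation.proceed();
            } catch (Throwable t) {
                failure.set(t);
            }
        }).get(); // wait here so the test still appears synchronous to the engine
        if (failure.get() != null) {
            throw failure.get();
        }
    }
}
```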

This work-around will typically be used independently of whether parallel running is currently enabled or disabled. You put it in place because you enable parallel running, and even if you later disable parallel running temporarily for whatever reason (or even permanently, because by then you have forgotten you made this work-around), the work-around will still be in place. So if you now disable parallel running using Jupiter's settings switch, or one that might be introduced for Spock, and all test scheduling happens on the same thread, the test execution and thus the ResultCollector notifications will still happen on various other threads.

So all code that somehow communicates with the test execution or discovery phase needs to be made thread-safe where that has not already happened, or you get nasty multi-threading problems that will be hard to find and fix. That should most probably already be achieved by my past PRs, except for this one last location.

Vampire commented 1 year ago

So actually I can change the code so that all iterations of data-driven methods use the same lock, so in the Spock case there will even be exactly 1 or 2 locks existing then.

Nah, that's not as trivial as I thought while writing that sentence, because you need to recognize whether a parent is locked and which layer you are on.

Having looked over the implementation and what I imagined as a replacement, I think the current code is fine already. Again talking about the Spock case, there are actually at most 1 or 2 locks in existence at the same time with this code. All features are locked on the 1 global lock, so only one feature can actually start. For this started feature a potential serializer reference for its children is registered, but it is only lazily initialized if really needed.

Now it could happen that executionStarted is called for sibling features, or for child iterations. To differentiate between these cases we recorded the potential serializer reference. If it is called for a sibling, there is no entry for its parent and thus the root serializer is used (or, talking about the generic case, it could also be the same one, with the same result: the siblings are serialized). If it is instead called for a child, which can only be the case for the one started parent / feature, then the potential child serializer reference is found, lazily initialized and locked, so that at most one of the iterations can actually start.

This logic automatically works for any depth as long as you don't have a CONTAINER_AND_TEST with a CONTAINER child that has a TEST child. But this is what I meant by "sane implementations". I don't expect such a construct to appear, and if it appears it should be handled then, by searching not only for the getParentIdObject serializer, but recursively up the hierarchy until only the root serializer is left. But I would not add this most probably unnecessary complexity now, only on demand.

hcoles commented 1 year ago

You are correct, there should no longer be a problem misreporting the killing test if tests run in parallel. The minion used to send the name of a test back to the parent process before executing it, so the parent was guaranteed to receive it even if the test caused an infinite loop. This would not work if the tests ran in parallel, but I'd forgotten that this got ripped out because it proved too flaky. We now accept that we won't know which test triggered an infinite loop and all data is sent after execution.

The "lots of locks" comment is because looking at the code statically the number of locks is unbounded (we have two maps of them). The fact that we have only 1 or 2 locks at runtime is not obvious without intimate knowledge of how spock, jupiter, cucumber and the other test runners behave at runtime.

Given that tests cannot be run in parallel during the coverage stage, and the benefit of them being in parallel during the mutation analysis stage is unclear (pitest will only run one at a time, so the only possible parallelism is where a "unit" contains more than one test), the simplest, easiest-to-maintain solution would be to disable parallel running at the level of the test framework.

Unfortunately that doesn't seem to be an option.

So, the two possible ways forward are

  1. Merge this and allow parallel running
  2. Throw an error/log when it's detected and ask the user to disable it

My preference is always to keep things as simple as possible (i.e. 2), but if you're happy to maintain this going forward I'll go with 1 and merge it in.

Vampire commented 1 year ago

pitest will only run one at a time, so the only possible parallelism is where a "unit" contains more than one test

Which I hope to get changed with https://github.com/hcoles/pitest/issues/1223 once the JUnit platform launcher supports aborting a run gracefully. :-)

But for now it will speed things up a bit, at least for the hunting minion's test discovery and for atomic units with multiple tests.

Merge this and allow parallel running

❤️

if you're happy to maintain this going forward I'll go with 1 and merge it in.

Sure, no problem. To be honest, I will not follow the issues; I have enough on my plate to read already. But feel free to ping me anytime if an issue comes in regarding it and I'll try to take care of it as soon as my time allows. :-)

Vampire commented 1 year ago

Maybe it would make sense, though, to first cut a new pitest release and depend on it from this one, so that the result collector is thread-safe, which hopefully was the last unsafe part.

Vampire commented 1 year ago

Any idea when you will get a chance to cut a release of pitest @hcoles? I actually use a combination of this PR and #93. Two of the test specs in the project get treated atomically and are hitting exactly the result collector problem. 30 - 40 of the 338 mutants for which those tests are relevant are "killed" by RUN_ERROR due to a NullPointerException in the non-thread-safe result collector. With the latest master code, everything works fine.

hcoles commented 1 year ago

@Vampire 1.14.2 should be available from maven central now.

Vampire commented 1 year ago

Perfect, thank you very much. I also updated the PR here to depend on the new version.

Vampire commented 1 year ago

So, anything left to do here? :-)