sebastianbergmann / phpunit

The PHP Unit Testing framework.
https://phpunit.de/
BSD 3-Clause "New" or "Revised" License

Bring back `--repeat` CLI option #5718

Open sebastianbergmann opened 8 months ago

sebastianbergmann commented 8 months ago

IIRC, @epdenouden wants to look into this during next week's code sprint.

curkan commented 8 months ago

Yes please return

epdenouden commented 8 months ago

Returning does nobody any good when I'm still void

Nutickets commented 7 months ago

Hell to the yeah

ERuban commented 4 months ago

So then? 😀

Bilge commented 3 months ago

Want to see this backported for v10 also.

PerryRylance commented 1 month ago

We often use factories during testing, which in turn use Faker and friends. I reached for this option when running tests that deal with models that may or may not appear in a collection based on date. I understand we can restructure our tests (and that is what I was in the process of doing), but it would be useful to have this back.

benascbr commented 3 weeks ago

make phpunit great again!

sebastianbergmann commented 2 weeks ago

@theseer, @localheinz, @staabm, @Schrank, @sebastianheuer, @tesla91, and I discussed this issue during the PHPUnit Code Sprint in Munich today.

Way back when I originally implemented the old --repeat CLI option, my use case had nothing to do with uncovering flaky tests. It was purely about benchmarking. In hindsight, I consider it a mistake to have implemented this feature as benchmarking is out of scope for PHPUnit.

While working towards PHPUnit 10, there was a point in time when the --repeat CLI option was blocking the release. As the feature was not important to me, and I was not aware that others were relying on it for other use cases than the one I originally had in mind, I decided to remove it from PHPUnit 10.

Looking at the comments on this ticket as well as in similar tickets such as #5174 and #5717, the only use case we found was the uncovering of flaky tests. However, how exactly this use case should be handled remains unclear.

The only reasonable behaviour that we were able to come up with is this: when --repeat 10 is used, each test is run up to 10 times. If all 10 executions of the test are successful, then the test as a whole is considered a success. The first failure leads to 1) no further attempt of executing the test and 2) considering the test as a whole a failure. By the way, this is how Pest's ->repeat() feature works.
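For illustration, the proposed semantics could be sketched like this in Python-style pseudocode (the names here are invented for illustration, not actual PHPUnit internals):

```python
def run_repeated(test, repetitions):
    """Run a single test up to `repetitions` times.

    The test as a whole is a success only if every execution succeeds;
    the first failing execution aborts further attempts.
    """
    for _ in range(repetitions):
        if not test():  # test() returns True on success, False on failure
            return False  # first failure: no further attempts
    return True  # all repetitions passed

# A deterministic passing test passes as a whole:
assert run_repeated(lambda: True, 10) is True

# A flaky test is reported as a failure as soon as one execution fails:
flaky = iter([True, True, False, True]).__next__
assert run_repeated(flaky, 10) is False
```

Note that under these semantics a flaky test is only ever executed up to and including its first failure, never the full repetition count.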

The approach outlined above would be the only one we are willing to try to implement. However, we are not sure whether (all of) you who commented on / reacted to this ticket would expect this behaviour.

If this behaviour is viable for your use case: please give this comment a thumbs-up.

sanmai commented 2 weeks ago

So the only difference from the previous behavior is that the test will be aborted after the first failure, right?

To make sure I understand correctly, if I use --repeat 10, would either:

  1. Run the entire test suite 10 times in succession (like the old --repeat option)?
  2. Or, run each individual test up to 10 times, stopping at the first failure for that specific test?

Clarifying this would help ensure everyone is on the same page about the expected behavior.

sebastianbergmann commented 2 weeks ago

run each individual test up to 10 times, stopping at the first failure for that specific test

This.

Bilge commented 2 weeks ago

It did not stop at the first failure before, so it should not do so now, unless that option (stop on first failure) is specified.

You are correct that identifying flaky tests is a good and necessary use-case for this feature. Moreover, supporting that use-case can mean measuring what proportion of the time it is failing. To do that, I may want to run the test 100 times and count the failures. If it stops on the first failure, that is no longer possible.

jrfnl commented 2 weeks ago

@sebastianbergmann I actually have a different use case for which I was using the --repeat option: testing of built-in, function-level caching.

A certain library contains a number of functions often called with the same parameters from different entry-points within the same process. These functions will always return the same result with the same inputs, but getting to that result can have a performance impact. So these functions determine the result the first time they are run with each combination of parameters and then cache the result. The next time the function is called with the same parameters, the cached result is served.

The tests are run with --repeat 2 to safeguard that the caching doesn't cause any side-effects.
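To illustrate this use case, here is a minimal Python analogue of the pattern described above (the actual library is PHP; the function names here are invented for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_lookup(key):
    # The first call with a given key computes the result;
    # subsequent calls with the same key serve the cached value.
    return key.upper()

def test_expensive_lookup():
    # Repeating this test in the same process exercises both the cold
    # (computed) and warm (cached) code paths, which is what
    # `--repeat 2` was being used to safeguard.
    assert expensive_lookup("abc") == "ABC"

test_expensive_lookup()  # first run: cache miss
test_expensive_lookup()  # second run: cache hit, must behave identically
```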

Note: the proposed solution for re-implementing the feature would work for this use-case, so :+1: from me.

Bilge commented 2 weeks ago

The tests are run with --repeat 2 to safeguard that the caching doesn't cause any side-effects.

That doesn't seem like a valid use case for repeat. The SUT should just be called twice in that case. That is, it should not be necessary to invoke the test runner in a specific way in order to run the tests in the expected manner. To put it another way, --repeat should be used ad-hoc, and not baked in as part of a regular test run.

sanmai commented 2 weeks ago

Thanks for confirming. That clarifies the intended scope.

Now, another question arises: how will this handle tests with dependencies? If TestB depends on TestA, and TestB is repeated due to flakiness, wouldn't we need to re-execute TestA each time as well to ensure a clean state?

It seems like there are two main options to address this:

I believe addressing these dependency concerns is necessary to make the --repeat option useful for the widest range of scenarios.

sebastianbergmann commented 2 weeks ago

Now, another question arises: how will this handle tests with dependencies? If TestB depends on TestA, and TestB is repeated due to flakiness, wouldn't we need to re-execute TestA each time as well to ensure a clean state?

A test will not be repeated "due to flakiness".

When --repeat is used then all tests will be repeated as explained in https://github.com/sebastianbergmann/phpunit/issues/5718#issuecomment-2421955447. So when B depends on A and --repeat 10 is used then B will only be run when A was successfully run 10 times.
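A minimal Python sketch of this dependency handling (names are illustrative, not PHPUnit code):

```python
def run_suite(tests, depends_on, repetitions):
    """Run each test `repetitions` times in declaration order.

    `tests` maps a test name to a callable returning True/False;
    `depends_on` maps a test name to its prerequisite (or is absent).
    A test whose dependency did not pass all repetitions is skipped
    and counted as a failure.
    """
    results = {}
    for name, test in tests.items():
        dep = depends_on.get(name)
        if dep is not None and not results.get(dep, False):
            results[name] = False  # dependency failed: do not run
            continue
        # all() short-circuits at the first failure, as proposed:
        results[name] = all(test() for _ in range(repetitions))
    return results

# B depends on A; A passes all 10 repetitions, so B runs as well:
outcome = run_suite({"A": lambda: True, "B": lambda: True}, {"B": "A"}, 10)
assert outcome == {"A": True, "B": True}
```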

sebastianbergmann commented 2 weeks ago

You are correct that identifying flaky tests is a good and necessary use-case for this feature.

We identified it as a use case; we did not say that it is a use case we think must be supported.

It did not stop at the first failure before, so it should not do so now

That is your opinion, it is not ours. If we decide to bring --repeat back then it will work as described in https://github.com/sebastianbergmann/phpunit/issues/5718#issuecomment-2421955447.

sanmai commented 2 weeks ago

I see a potential problem with this sequential approach, especially with stateful tests.

Consider this scenario:

With the proposed --repeat 10, TestA runs 10 times, then TestB runs 10 times after that. However, TestB will likely fail after the first run because it's operating on an SUT that has already been modified by the previous TestB execution. This breaks the dependency chain and makes the repetition meaningless.

This effectively means that the --repeat option cannot be used with tests that alter the SUT's state, severely limiting its applicability. The previous implementation did not have these limitations.

Given the diverse needs and complexities surrounding test repetition, what stops the team from considering opening up the PHPUnit API to allow developers to implement their own repetition algorithms?

theseer commented 2 weeks ago

You are correct that identifying flaky tests is a good and necessary use-case for this feature. Moreover, supporting that use-case can mean measuring what proportion of the time it is failing. To do that, I may want to run the test 100 times and count the failures. If it stops on the first failure, that is no longer possible.

Correct. It won't be.

But, imho, that's not a problem at all: to test for flakiness, a single failure alongside a single success already proves the point. There is no informational benefit in continuing the test repetition for this aspect.

As it's already known to be flaky, the percentage / distribution of failures and successes is likely going to be flaky as well. There is nothing to gain from having a different distribution of numbers on every run, e.g. a changing number of failures.

Bilge commented 2 weeks ago

As it's already known to be flaky, the percentage / distribution of failures and successes is likely going to be flaky as well. There is nothing to gain from having a different distribution of numbers on every run, e.g. a changing number of failures.

It has hitherto proved to be an interesting statistic. If it is no longer to be supported, then the conclusion is to use PHPUnit 9 forever.

When a test is successfully identified as flaky, the next question should be how flaky? There's a big difference between 5% failure rate and 70%. This is good to know.
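The failure-rate measurement described here corresponds to the old semantics: run a fixed number of times without stopping, then count. A minimal Python sketch (illustrative only, not PHPUnit code):

```python
def failure_rate(test, runs):
    """Run `test` a fixed number of times without stopping at failures
    and report the observed failure proportion (old --repeat style)."""
    failures = sum(0 if test() else 1 for _ in range(runs))
    return failures / runs

# A test that fails on every third execution fails about a third of the time:
state = {"n": 0}
def sometimes_fails():
    state["n"] += 1
    return state["n"] % 3 != 0

assert abs(failure_rate(sometimes_fails, 99) - 1 / 3) < 0.01
```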

Bilge commented 2 weeks ago

You are correct that identifying flaky tests is a good and necessary use-case for this feature.

We identified it as a use case, we did not say that is a use case we think must be supported.

It did not stop at the first failure before, so it should not do so now

That is your opinion, it is not ours. If we decide to bring --repeat back then it will work as described in #5718 (comment).

Notwithstanding that it is already quite obnoxious to remove a working feature, removing a feature that has subsequently been identified as useful even beyond its originally intended purpose, just to spite people for using it for the "wrong" reasons, suggests that whoever "we" are constitutes a body of people dedicated to mismanagement of this project.

To reintroduce it with different semantics is almost as bad as not reintroducing it at all, especially when an option to stop on error already exists and could be designed to complement this feature.

bwoebi commented 2 weeks ago

@Bilge Could you please tone down your dissatisfaction? Mistakes happen, bad assumptions have been made, and they're now looking at rectifying that and understanding other users' use cases, to eventually reach a decision on how it should work. There's nothing wrong with at least exploring different options.

sebastianbergmann commented 2 weeks ago

Given the diverse needs and complexities surrounding test repetition, what stops the team from considered opening up the PHPUnit API to allow developers to implement their own repetition algorithms?

The means that wrappers around PHPUnit's test runner, such as paratest or paraunit, use should be sufficient to implement repeated test execution for flaky test detection/analysis outside of PHPUnit itself.

Nutickets commented 2 weeks ago

When a test is successfully identified as flaky, the next question should be how flaky? There's a big difference between 5% failure rate and 70%. This is good to know.

Good to know, definitely, sometimes. But I'd argue that having the ability to identify a flaky test is at least 80% of the utility, and if that last 20% demands a disproportionate amount of maintenance effort from the PHPUnit team, or they simply don't want to support it within their OSS repo, then we sure as heck don't have the right to start making demands!

sanmai commented 2 weeks ago

Wrappers around PHPUnit's test runner are not terribly efficient, because each run has to go through possibly expensive setup. That's what I have to use where I can't downgrade to 9, and in several cases it is, unfortunately, much slower than with the old --repeat option.

I'd say I loved and cherished the old --repeat option. Too bad we can't have it back.