Open sebastianbergmann opened 8 months ago
Yes please return
Returning does nobody any good when I'm still void
Hell to the yeah
So then? 😀
Want to see this backported for v10 also.
We often use factories during testing, which in turn user faker and friends. I reached for this when running tests that deal with models that may or may not appear in a collection based on date, I understand we can restructure our tests (and this is what I was in the process of doing) - but it would be useful to have this back.
make phpunit great again!
@theseer, @localheinz, @staabm, @Schrank, @sebastianheuer, @tesla91, and I discussed this issue during the PHPUnit Code Sprint in Munich today.
Way back when I originally implemented the old --repeat
CLI option, my use case had nothing to do with uncovering flaky tests. It was purely about benchmarking. In hindsight, I consider it a mistake to have implemented this feature as benchmarking is out of scope for PHPUnit.
While working towards PHPUnit 10, there was a point in time when the --repeat
CLI option was blocking the release. As the feature was not important to me, and I was not aware that others were relying on it for other use cases than the one I originally had in mind, I decided to remove it from PHPUnit 10.
Looking at the comments on this ticket as well as in similar tickets such as #5174 and #5717, the only use case we found was the uncovering of flaky tests. However, how exactly this use case should be handled remains unclear.
The only reasonable behaviour that we were able to come up with is this: when --repeat 10
is used, each test is run up to 10 times. If all 10 executions of the test are successful, then the test as a whole is considered a success. The first failure leads to 1) no further attempt of executing the test and 2) considering the test as a whole a failure. By the way, this is how Pest's ->repeat()
feature works.
The approach outlined above would be the only one we are willing to try to implement. However, we are not sure whether (all of) you who commented on / reacted to this ticket would expect this behaviour.
If this behaviour is viable for your use case: please give this comment a thumbs-up.
So the only difference from the previous behavior is that the test will be aborted after the first failure, right?
To make sure I understand correctly, if I use --repeat 10
, would either:
--repeat
option)?Clarifying this would help ensure everyone is on the same page about the expected behavior.
run each individual test up to 10 times, stopping at the first failure for that specific test
This.
It did not stop at the first failure before, so it should not do so now, unless that option (stop on first failure) is specified.
You are correct that identifying flaky tests is a good and necessary use-case for this feature. Moreover, supporting that use-case can mean benching what proportion of the time it is failing. To do that, I may want to run the test 100 times and count the failures. If it stops on first failure, that is not longer possible.
@sebastianbergmann I actually have a different usecase for which I was using the --repeat
option: testing of build-in, function level caching.
A certain library contains a number of functions often called with the same parameters from different entry-points within the same process. These functions will always return the same result with the same inputs, but getting to that result can have a performance impact. So these functions determine the result the first time they are run with each combination of parameters and then cache the result. The next time the function is called with the same parameters, the cached result is served.
The tests are run with --repeat 2
to safeguard that the caching doesn't cause any side-effects.
Note: the proposed solution for re-implementing the feature would work for this use-case, so :+1: from me.
The tests are run with
--repeat 2
to safeguard that the caching doesn't cause any side-effects.
That doesn't seem like a valid use case for repeat. The SUT should just be called twice in that case. That is, it should not be necessary to invoke the test runner in a specific way in order to run the tests in the expected manner. To put it another way, --repeat
should be used ad-hoc, and not baked in as part of a regular test run.
Thanks for confirming. That clarifies the intended scope.
Now, another question arises: how will this handle tests with dependencies? If TestB
depends on TestA
, and TestB
is repeated due to flakiness, wouldn't we need to re-execute TestA
each time as well to ensure a clean state?
It seems like there are two main options to address this:
I believe addressing these dependency concerns is necessary to make the --repeat
option useful for the widest range of scenarios.
Now, another question arises: how will this handle tests with dependencies? If
TestB
depends onTestA
, andTestB
is repeated due to flakiness, wouldn't we need to re-executeTestA
each time as well to ensure a clean state?
A test will not be repeated "due to flakiness".
When --repeat
is used then all tests will be repeated as explained in https://github.com/sebastianbergmann/phpunit/issues/5718#issuecomment-2421955447. So when B depends on A and --repeat 10
is used then B will only be run when A was successfully run 10 times.
You are correct that identifying flaky tests is a good and necessary use-case for this feature.
We identified it as a use case, we did not say that is a use case we think must be supported.
It did not stop at the first failure before, so it should not do so now
That is your opinion, it is not ours. If we decide to bring --repeat
back then it will work as described in https://github.com/sebastianbergmann/phpunit/issues/5718#issuecomment-2421955447.
I see a potential problem with this sequential approach, especially with stateful tests.
Consider this scenario:
TestA
modifies the SUT, leaving it in a specific state.TestB
depends on that state and further modifies the SUT during its execution.With the proposed --repeat 10
, TestA
runs 10 times, then TestB
runs 10 times after that. However, TestB
will likely fail after the first run because it's operating on an SUT that has already been modified by the previous TestB
execution. This breaks the dependency chain and makes the repetition meaningless.
This effectively means that the --repeat
option cannot be used with tests that alter the SUT's state, severely limiting its applicability. The previous implementation did not have these limitations.
Given the diverse needs and complexities surrounding test repetition, what stops the team from considered opening up the PHPUnit API to allow developers to implement their own repetition algorithms?
You are correct that identifying flaky tests is a good and necessary use-case for this feature. Moreover, supporting that use-case can mean benching what proportion of the time it is failing. To do that, I may want to run the test 100 times and count the failures. If it stops on first failure, that is not longer possible.
Correct. It won't be.
But, imho, that's not a problem at all: To test for flakiness, a single fail among a single success is already proving the point. There is no informational benefit in continuing the test repetition for this aspect.
As it's already know to be flaky, the percentage / distribution of fail and success is likely going to by flaky as well. There is nothing to gain by having a different distribution of numbers on every run, e.g. a changing number of fails.
As it's already know to be flaky, the percentage / distribution of fail and success is likely going to by flaky as well. There is nothing to gain by having a different distribution of numbers on every run, e.g. a changing number of fails.
It has already proved to be an interesting statistic hitherto. If it should no longer be supported then the conclusion is to use PHPUnit 9 forever.
When a test is successfully identified as flaky, the next question should be how flaky? There's a big difference between 5% failure rate and 70%. This is good to know.
You are correct that identifying flaky tests is a good and necessary use-case for this feature.
We identified it as a use case, we did not say that is a use case we think must be supported.
It did not stop at the first failure before, so it should not do so now
That is your opinion, it is not ours. If we decide to bring
--repeat
back then it will work as described in #5718 (comment).
Notwithstanding it is already quite obnoxious to remove a working feature, to remove a feature that has subsequently been identified as useful even beyond its originally intended purpose, just to spite people for using it for the "wrong" reasons sounds like whomsoever "we" are constitutes a body of people dedicated to mismanagement of this project.
To reintroduce it with different semantics is almost as bad as not reintroducing it at all, especially when an option to stop on error already exists and could be designed to compliment this feature.
@Bilge Could you please tone down your dissatisfaction. Mistakes happen, bad assumptions have been made, and they're now looking at rectifying and understanding other users use cases, to eventually reach a decision on how it should work. There's nothing wrong with at least exploring different options.
Given the diverse needs and complexities surrounding test repetition, what stops the team from considered opening up the PHPUnit API to allow developers to implement their own repetition algorithms?
The means that wrappers around PHPUnit's test runner such as paratest or paraunit use should be sufficient to implement repeated test execution for flaky test detection/analysis outside of PHPUnit itself.
When a test is successfully identified as flaky, the next question should be how flaky? There's a big difference between 5% failure rate and 70%. This is good to know.
Good to know, definitely sometimes, but I'd argue having the ability to identify a flaky test is at least 80% of the utility.. and if that last 20% demands a disproportionate amount of maintenance effort from the PHPUnit team, or they simply don't want to support it within their OSS repo then we sure as heck don't have the right to start making demands!
Wrappers around PHPUnit's test runner are not terribly efficient because each test will need to go through the possibly expensive setting up. That's what I have to use where I can't downgrade to 9, and in several cases it is way slower than with the old --repeat
option unfortunately.
I'd say I loved and cherish the old --repeat
option. Too bad we can't have it back.
IIRC, @epdenouden wants to look into this during next week's code sprint.