jerry-git / pytest-split

Pytest plugin which splits the test suite into equally sized "sub suites" based on test execution time.
https://jerry-git.github.io/pytest-split/
MIT License

duration_based_chunks algorithm produces empty groups, causing pytest to fail when collecting tests #26

Open alexforencich opened 3 years ago

alexforencich commented 3 years ago

Something in the most recent set of changes has broken the splitting. Previously I had been doing --splits 20 and all of the groups had at least one test. Now at least the last two groups are empty, which causes pytest to exit with error code 5 and GitHub Actions to kill all of the tests. Not sure what the precise cause is at the moment.

alexforencich commented 3 years ago

Yep, there is definitely something borked with duration_based_chunks: it is producing empty groups. On the flip side, it seems like a pretty useless algorithm anyway, since it's highly dependent on the test ordering. The least_duration algorithm seems much more sensible, and I think it can't produce empty groups so long as there are at least as many tests as there are groups.

jerry-git commented 3 years ago

Thanks for reporting @alexforencich, can you pinpoint which version started causing this? Or can you share the repo in which this is happening?

alexforencich commented 3 years ago

I have not attempted to figure out which version caused this, but presumably this was introduced when the duration_based_chunks algorithm was added.

Repo in question: https://github.com/alexforencich/verilog-pcie

Commit that caused the workflow run to fail: https://github.com/alexforencich/verilog-pcie/commit/ccc44d7dbbac1ae851536672795ce5431579e35c

Workflow run: https://github.com/alexforencich/verilog-pcie/runs/2861569844?check_suite_focus=true

jerry-git commented 3 years ago

Thanks @alexforencich! FYI @mbkroese

mbkroese commented 3 years ago

@alexforencich thanks for reporting. Is it possible to run your test suite with this version: 0.2.3? This is the version before the last one, and would help us pin down the problem.

alexforencich commented 3 years ago

Yes, 0.2.3 is also affected.

alexforencich commented 3 years ago

Did some more testing, 0.1.6 is OK, but 0.2.0 is broken.

mbkroese commented 3 years ago

This is a small example that reproduces the issue:

from collections import namedtuple

from pytest_split import algorithms

item = namedtuple('item', 'nodeid')

durations = {}
durations['a'] = 2313.7016449670773
durations['b'] = 46.880724348986405
durations['c'] = 2196.7077018650016
durations['d'] = 379.9717799640057
durations['e'] = 1476.3481151770247
durations['f'] = 979.7326026459923
durations['g'] = 1876.5443489580794
durations['h'] = 1.3951316330058035

items = [item(x) for x in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']]

groups = algorithms.duration_based_chunks(7, items, durations)
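To see the problem, the group sizes can be inspected (assuming the 0.2.x behaviour where each returned group is a plain list of items):

sizes = [len(group) for group in groups]
print(sizes)
# On the affected versions at least one size here is 0, which is what makes
# pytest exit with code 5 for that group.
assert 0 not in sizes  # fails on the affected versions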
alexforencich commented 3 years ago

The problem is that the algorithm is extremely greedy. Seems to me that this case is likely to happen if there are a bunch of short tests that are processed last. One, ah, stupid fix would be to move on to the next group if there are less than splits - group_idx items left to process. This would ensure that all of the groups have at least one test, but the algorithm is still extremely greedy so you would still end up with a very lopsided split.
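Roughly, as a standalone sketch (not the actual pytest-split code, and assuming plain node-id strings with a known duration for each), the chunker with that safeguard could look like:

from typing import Dict, List, Sequence


def duration_based_chunks_nonempty(splits: int, items: Sequence[str], durations: Dict[str, float]) -> List[List[str]]:
    if splits > len(items):
        raise ValueError("more groups requested than there are tests")
    time_per_group = sum(durations[i] for i in items) / splits
    groups: List[List[str]] = [[] for _ in range(splits)]
    group_idx, elapsed = 0, 0.0
    for pos, item in enumerate(items):
        remaining = len(items) - pos
        # Advance to the next group when the budget is spent, or when only as many
        # items remain as there are later groups still waiting for their first test.
        if group_idx < splits - 1 and groups[group_idx] and (
            elapsed >= time_per_group or remaining <= splits - group_idx - 1
        ):
            group_idx += 1
            elapsed = 0.0
        groups[group_idx].append(item)
        elapsed += durations[item]
    return groups

It's still the same greedy pass, so the split can be lopsided, but every group ends up with at least one test.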

jerry-git commented 3 years ago

Thanks for the investigation and sorry for pinging @mbkroese 😅 If it happened between 0.1.6 and 0.2.0 then it's most likely something in https://github.com/jerry-git/pytest-split/pull/14, FYI @sondrelg.

jerry-git commented 3 years ago

Yeah indeed sounds like we should have some mechanism which ensures that there's at least one test per group (and if splits > len(tests) -> error)

sondrelg commented 3 years ago

Whoops, that's unfortunate. I could probably add a test and make sure it works again sometime this week if needed 🙂

Just for the sake of my own curiosity @alexforencich, how small is your test sample size? I think I might have changed the behavior from stopping before a group exceeds the time-limit to stopping when it exceeds. My thinking was that front-loading tests a little bit makes sense 99% of the time, because at least in Github actions, containers seem to spin up sequentially; i.e., group 1 starts considerably earlier than group 20 (at least for me).

Then again, I guess if there are a very small number of tests this could create some weird behavior. Is that what's happening here?

alexforencich commented 3 years ago

The other algorithm (least_duration) is fine as it adds tests to the group with the smallest duration. So this means that if you have N splits, the first N tests get distributed across those splits. This is also a very simple greedy algorithm, but at least it's guaranteed to not produce any empty groups so long as you have at least as many tests as you have splits. I suppose what you could do if you have less tests than you have splits is just duplicate one of the tests in all of the remaining splits so pytest doesn't crash.

I think there are two parts to this: 1) fix duration_based_chunks, and 2) add some safeguard somewhere else to ensure that it's not possible to ever produce any empty groups so that pytest doesn't crash no matter what the splitting algorithm does (unless, of course, there are no tests at all).
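For comparison, the least_duration idea is simple enough to sketch in a few lines (hypothetical standalone code, not pytest-split's implementation): each test goes to whichever group currently has the smallest accumulated duration, so the first N tests necessarily land in N different groups.

import heapq
from typing import Dict, List, Sequence


def least_duration_chunks(splits: int, items: Sequence[str], durations: Dict[str, float]) -> List[List[str]]:
    # Min-heap of (accumulated duration, group index); the lightest group is always on top.
    heap = [(0.0, idx) for idx in range(splits)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(splits)]
    for item in items:
        total, idx = heapq.heappop(heap)
        groups[idx].append(item)
        heapq.heappush(heap, (total + durations[item], idx))
    return groups

Sorting the items longest-first before the loop usually gives a better balance, but even without it no group can stay empty as long as there are at least as many tests as splits.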

alexforencich commented 3 years ago

I want to say it's something like 170 tests split into 20 groups. But the spread of run-times is quite large, some of them run in a few seconds, some take 10 minutes. And that spread is probably what's screwing up the greedy algorithm that places tests into each group until it hits the average, then moves on to the next group. I think if you make a test case of, say, 10 tests of 1 second each, then 1 test of 100 seconds each, and ask for 10 groups, you'll get 1 group with all of the tests and 9 empty groups.
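That case can be plugged straight into the reproduction above (same hedged assumption about the 0.2.x algorithms API):

from collections import namedtuple

from pytest_split import algorithms

item = namedtuple('item', 'nodeid')

# 10 one-second tests followed by a single 100-second test, split into 10 groups.
durations = {f'fast_{i}': 1.0 for i in range(10)}
durations['slow'] = 100.0
items = [item(name) for name in durations]

groups = algorithms.duration_based_chunks(10, items, durations)
# The per-group average is 11 seconds, so a "fill until the average is exceeded"
# greedy pass is expected to put every test into the first group and leave the
# remaining nine groups empty.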

mbkroese commented 3 years ago

sorry for pinging @mbkroese

happy to help :)

sondrelg commented 3 years ago

Opened a quick PR. Please take a look when you have time 🙂

jgopel commented 1 year ago

Any update on this? I seem to be encountering this issue in 0.8.0.

sondrelg commented 1 year ago

This is the last thing to happen AFAIK

jgopel commented 1 year ago

Seems like the PR is abandoned - is there interest in reviving it?

alexforencich commented 1 year ago

Is using least_duration not an option for you?

jgopel commented 1 year ago

The issue I have is that the tests were designed such that there are optimizations around data loading, etc. that are set up in the first test of the suite to run and then consumed by all tests that follow. Using least_duration ends up with inconsistency in .test_durations because whether a test is the first test of a suite to run in a particular group is not deterministic. I also get better performance from grouping similar tests together for the same reason.

alexforencich commented 1 year ago

It almost sounds like you want to do something similar to what's detailed in the readme for nbval - shuffling things around to keep groups together and preserve the ordering. So perhaps even more restrictive than what the default algorithm is doing, since the default one I think will split things up.

jgopel commented 1 year ago

I had missed that part of the readme. That looks like pretty much exactly what I'd like to do in the ideal case. Is there a mode for that? Or is it automatically detected?

alexforencich commented 1 year ago

I suspect that it is linked to using nbval specifically. But it seems like it might be a useful mode to have in general.

Perhaps there is a general way to do this: instead of dealing with individual tests, handle groups of tests, with tests grouped according to some parameters, perhaps based on the file. The current behavior would just be putting each test in its own group.
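As a rough sketch of that idea (a hypothetical helper, not part of pytest-split, assuming pytest-style node ids of the form "tests/test_file.py::test_name"): bundle the tests by file, then place whole files into groups, heaviest file first.

from collections import defaultdict
from typing import Dict, List


def split_keeping_files_together(splits: int, items, durations: Dict[str, float]) -> List[list]:
    # Bundle tests by the file part of their node id and total each file's duration.
    by_file = defaultdict(list)
    for it in items:
        by_file[it.nodeid.split("::")[0]].append(it)
    file_durations = {
        path: sum(durations.get(it.nodeid, 0.0) for it in tests)
        for path, tests in by_file.items()
    }
    # Greedily place the heaviest file into the currently lightest group, so tests
    # from the same file are never separated.
    groups: List[list] = [[] for _ in range(splits)]
    totals = [0.0] * splits
    for path in sorted(by_file, key=file_durations.get, reverse=True):
        idx = totals.index(min(totals))
        groups[idx].extend(by_file[path])
        totals[idx] += file_durations[path]
    return groups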

jgopel commented 1 year ago

Are you imagining that the user could provide a predicate for custom splitting? How would that be specified and passed in?

alexforencich commented 1 year ago

Not sure, just tossing out some ideas. But a simple one could be an option to keep all of the tests defined in each file together.

jgopel commented 1 year ago

That makes sense to me as a starting point. Would there be an appetite to merge that if I were to do the work?

Prodian0013 commented 4 months ago

I ran into this same issue with duration_based_chunks: it produced empty groups ("0 selected" from pytest), so I had to switch to using least_duration.