jerry-git / pytest-split

Pytest plugin which splits the test suite to equally sized "sub suites" based on test execution time.
https://jerry-git.github.io/pytest-split/
MIT License

Split results in 0 selected tests, which leads to failed GitHub action #95

Open ArnauMunsOrenga opened 3 months ago

ArnauMunsOrenga commented 3 months ago

Hi,

We have integrated pytest-split as part of our CI pipeline and noticed the following behaviour.

We were running our tests in 7 splits, but after updating the .test_durations file (new tests had been added), the 7th split did not select any tests. This led to a failed CI run since all the tests were deselected (see below).

[screenshot: CI run failing because all tests were deselected in the 7th split]

Any idea why this could happen?

We tried to create more splits after this error was raised (increased from 7 to 10) and everything worked fine.

We would like to understand why the 7th split did not select any tests with --splits 7, while everything worked well with --splits 10.

Thanks in advance. Your package has become very useful in speeding up our CI/CD pipelines.

jerry-git commented 3 months ago

Interesting! Probably not an open source repo as you didn't include a link to a GHA run? If it's private, could you share the durations file? The test names can be anonymised. I'd be mainly interested in seeing what kind of duration values there are.

ArnauMunsOrenga commented 3 months ago

@jerry-git thanks for the quick reply. Indeed, the repo is not open source, so I can't share the GitHub Actions run.

These were the different splits: [screenshot of the per-split durations]

And this was the durations file for that run: tests_empty_split.txt

Most of the tests are fast, while there are only a few which take more than 5 seconds.

Thank you!

jerry-git commented 2 months ago

Thanks for the data! I think I know what's up. There are a couple of tests with 10+ second durations, while around 500 of the 608 tests take 0.01 seconds or less. If the split into 7 were optimal, each group would take around 17.92 seconds to run, assuming we ran all the tests listed in tests_empty_split.txt.

The duration-based chunks algorithm basically adds tests to a single group until the optimal time is reached (17.92 seconds in this case), then moves on to fill the next group. The tests are looped over in the same order in which pytest collects them (alphabetical order AFAIK). See the code here: https://github.com/jerry-git/pytest-split/blob/c7a32727fadb17494f9d5db009dee2cbbc94665e/src/pytest_split/algorithms.py#L109-L118.

I believe those longer tests coincidentally happen to land at the end of their groups. For example, the very first group could have 472 shorter tests with a total execution time just a hair below 17.92 seconds; then a test which takes 10+ seconds comes along and still gets added to the group. The fact that the estimated runtime for that first group is notably longer than 17.92 seconds is not taken into account while filling the next groups. At the end, this leads to the last group not having any tests at all.
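The mechanism described above can be sketched in a few lines. This is a simplified illustration of the greedy filling, not the actual pytest-split implementation; the function name and the toy durations are made up:

```python
def greedy_chunks(durations, splits):
    """Toy version of duration-based chunking: fill each group until the
    per-group time budget is reached, never rebalancing for overshoot."""
    time_per_group = sum(durations) / splits
    groups = [[] for _ in range(splits)]
    group_time = [0.0] * splits
    group_idx = 0
    for d in durations:
        # Move on only once the current group has reached its budget.
        if group_time[group_idx] >= time_per_group and group_idx < splits - 1:
            group_idx += 1
        groups[group_idx].append(d)
        group_time[group_idx] += d
    return groups

# Six fast tests, one 10-second test, two more fast tests, 3 splits:
# the slow test lands in a nearly full first group and blows its budget,
# so the remaining fast tests all fit in group 2 and group 3 stays empty.
print(greedy_chunks([0.01] * 6 + [10.0] + [0.01] * 2, 3))
```

The slow test inflating the first group far past the budget is exactly the "not taken into account while filling the next groups" problem.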

Here's a quickly hacked analysis of tests_empty_split.txt:

1: 466/608 estimated duration: 18.03s
2: 8/608 estimated duration: 19.52s
3: 8/608 estimated duration: 19.11s
4: 15/608 estimated duration: 26.18s
5: 9/608 estimated duration: 28.84s
6: 102/608 estimated duration: 13.74s
7: 0/608 estimated duration: 0.00s
In total would run 608/608
Avg test time 0.20627700692763154
Optimal time per group would be 17.91663145885714

If your test suite is robust enough to run the tests in semi-random order, the best option would be to use --splitting-algorithm least_duration. You can check the README for details about its behaviour, but it would lead to optimal splits in your case.
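For context, least_duration behaves roughly like a longest-processing-time greedy: process tests from longest to shortest recorded duration and always assign the next test to the group with the smallest accumulated time. A rough sketch under that assumption (not the actual pytest-split implementation):

```python
import heapq

def least_duration_sketch(durations, splits):
    """Longest-processing-time greedy: each test (longest first) goes to
    the group with the smallest accumulated duration so far."""
    # Heap of (accumulated time, group index); smallest total pops first.
    heap = [(0.0, i) for i in range(splits)]
    heapq.heapify(heap)
    groups = [[] for _ in range(splits)]
    for d in sorted(durations, reverse=True):
        total, i = heapq.heappop(heap)
        groups[i].append(d)
        heapq.heappush(heap, (total + d, i))
    return groups

# Same pathological input as before: the 10-second test gets its own
# group and the fast tests are spread evenly over the other two.
print(least_duration_sketch([0.01] * 6 + [10.0] + [0.01] * 2, 3))
```

Because assignment reorders tests across groups, this only works when the suite tolerates running tests in a different order than collection order, which is why the caveat above matters.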

I’ll see how I could improve the splits produced by the duration-based chunks algorithm. Taking the estimated total runtime of previous groups into account should be relatively low-hanging fruit for improving the algorithm's behaviour.
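One possible shape for that fix, as a sketch only (not a committed design and not pytest-split code): recompute the per-group budget from the time actually remaining after each group is closed, so one oversized group shrinks the budgets of the groups after it instead of starving them.

```python
def rebalanced_chunks(durations, splits):
    """Greedy chunking that recomputes the time budget for each group
    from the durations still unassigned. Assumes len(durations) >= splits.
    Hypothetical sketch, not the pytest-split implementation."""
    groups = [[] for _ in range(splits)]
    idx = 0
    remaining_time = sum(durations)
    for g in range(splits):
        # Budget = remaining work spread evenly over remaining groups.
        budget = remaining_time / (splits - g)
        group_time = 0.0
        # Always take at least one test so no group ends up empty.
        while idx < len(durations) and (
            not groups[g] or group_time + durations[idx] <= budget
        ):
            groups[g].append(durations[idx])
            group_time += durations[idx]
            idx += 1
        remaining_time -= group_time
    # Any float-rounding leftovers join the last group.
    groups[-1].extend(durations[idx:])
    return groups

# On the pathological input, every group now gets tests.
print(rebalanced_chunks([0.01] * 6 + [10.0] + [0.01] * 2, 3))
```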

ArnauMunsOrenga commented 2 months ago

Thanks for the support and the detailed explanation Jerry.

Indeed our test suite can run in random order, as there are no dependencies between tests. We have changed the --splitting-algorithm to least_duration and everything is working fine 🙂.

Moving forward, I believe this issue can be closed, unless you want to keep it open to track any change related to it.

Once again, thank you.

bronsonrudner commented 2 months ago

Hi @jerry-git. I encountered a similar issue. Perhaps an algorithm like the following would help; in particular, _get_minimum_split would be a more accurate way of determining the maximum group duration for test suites with a few chunky tests.

def split_tasks(tasks, n):
    """Split tasks into n contiguous non-empty sections such that the
    largest section sum is minimised."""
    max_bucket_size = _get_minimum_split(tasks, n)
    return list(_get_sections(tasks, n, max_bucket_size))

def _get_minimum_split(tasks, n):
    """Binary-search the smallest maximum section sum achievable when
    splitting tasks into at most n contiguous sections.

    Note: the integer midpoint assumes integer durations; for float
    durations, bisect with a tolerance instead.
    """
    left, right = max(tasks), sum(tasks)
    while left < right:
        mid = (left + right) // 2
        if _can_split(tasks, n, mid):
            right = mid  # feasible; try a smaller maximum
        else:
            left = mid + 1  # infeasible; the maximum must grow
    return left

def _can_split(tasks, n, max_sum):
    """Check whether tasks fit into at most n contiguous sections with
    no section sum exceeding max_sum."""
    current_sum = 0
    required_sections = 1
    for task in tasks:
        if current_sum + task > max_sum:
            # Start a new section with this task.
            required_sections += 1
            current_sum = task
            if required_sections > n:
                return False
        else:
            current_sum += task
    return True

def _get_sections(tasks, n, max_bucket_size):
    """Yield exactly n contiguous non-empty buckets, greedily filling
    each bucket up to max_bucket_size."""
    tasks = tasks[::-1]  # reverse so pop() consumes tasks in order
    current_sum = 0
    current_bucket = []
    num_buckets = 1
    while tasks:
        task = tasks.pop()
        if current_sum + task > max_bucket_size:
            yield current_bucket
            num_buckets += 1
            current_sum = task
            current_bucket = [task]
        else:
            current_bucket.append(task)
            current_sum += task
        # If the remaining tasks are only just enough to keep the
        # remaining buckets non-empty, close this bucket and hand out
        # the rest one per bucket.
        if num_buckets + len(tasks) == n:
            yield current_bucket
            break
    while tasks:
        yield [tasks.pop()]

# Example usage:
tasks = [7, 2, 5, 10, 8]
n = 2
buckets = split_tasks(tasks, n)
print(buckets)  # Output: [[7, 2, 5], [10, 8]]