pmgbergen / porepy

Python Simulation Tool for Fractured and Deformable Porous Media
GNU General Public License v3.0

Workflow for running slow tests #856

Closed jhabriel closed 1 year ago

jhabriel commented 1 year ago

Background

Currently, tests are run on every push and at merge time through the Pytest workflow (https://github.com/pmgbergen/porepy/actions/workflows/run-pytest.yml). Naturally, slow tests (e.g., integration tests for three-dimensional models and convergence tests) are not part of this workflow because they take too much time. To be precise, they are skipped, since they are explicitly marked with a decorator of the type @pytest.mark.skip or @pytest.mark.skipif.
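
For reference, a minimal illustration of how a slow test is currently excluded (the test name and reason are hypothetical, not taken from the code base):

```python
import pytest


# Hypothetical example of the current approach: an unconditional skip.
@pytest.mark.skip(reason="Too slow for the regular CI workflow.")
def test_integration_3d_model():
    ...
```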

The need for a workflow that runs all tests

At the moment, there is no workflow that checks whether the skipped tests actually pass. This means that the skipped tests are currently not in use, which is obviously suboptimal from a testing framework standpoint.

Proposed changes

Create a new workflow named Pytest-slow that runs all tests on a weekly basis (specifically, on Sundays at 00:00 GMT+2) and that can also be triggered manually (see https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow and https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch). The latter option is important, since new slow tests must pass before merging a PR. It will therefore be the responsibility of the assignee and the reviewers to verify that Pytest-slow completes without errors.
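
A minimal sketch of what such a workflow file could look like (the file name, the setup steps, and the --run-slow flag introduced below are assumptions, not the final implementation). Note that GitHub's cron schedules are interpreted in UTC, so Sunday 00:00 GMT+2 corresponds to Saturday 22:00 UTC:

```yaml
# Hypothetical file: .github/workflows/run-pytest-slow.yml
name: Pytest-slow

on:
  schedule:
    # GitHub cron runs in UTC; Saturday 22:00 UTC == Sunday 00:00 GMT+2.
    - cron: "0 22 * * 6"
  workflow_dispatch:  # allows triggering the workflow manually

jobs:
  slow-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install PorePy and test dependencies
        run: pip install -e .  # the actual install steps may differ
      - name: Run the full test suite, including slow tests
        run: pytest --run-slow
```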

Marking tests

We need a system to effectively mark tests as slow (in the future, there might be a need for other types of classification). There is no unique way to achieve this (in fact, there are at least three options, see: https://jwodder.github.io/kbits/posts/pytest-mark-off/). However, it seems that the best option is to create a custom mark (e.g., slow) and use the pytest_collection_modifyitems hook to add an additional pytest.mark.skip. This is done in a conftest.py file.
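
A minimal sketch of such a conftest.py, following the standard pytest pattern for opt-in markers (the flag name --run-slow and the skip reason are suggestions, not settled choices):

```python
# conftest.py
import pytest


def pytest_addoption(parser):
    # Command line flag that enables the slow tests.
    parser.addoption(
        "--run-slow",
        action="store_true",
        default=False,
        help="also run tests marked as slow",
    )


def pytest_configure(config):
    # Register the custom marker so pytest does not warn about it.
    config.addinivalue_line("markers", "slow: mark a test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-slow"):
        # --run-slow given: do not skip anything.
        return
    skip_slow = pytest.mark.skip(reason="needs the --run-slow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```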

How would this work in practice?

Mark tests as slow with the decorator @pytest.mark.slow. When one runs pytest, all tests with this decorator will be skipped. If one wants to run all tests, it is simply a matter of adding the --run-slow flag, i.e., pytest --run-slow.
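
For illustration, a hypothetical test marked as slow (the test name is made up):

```python
import pytest


@pytest.mark.slow
def test_convergence_rates_3d_model():
    # Expensive test; skipped unless pytest is invoked with --run-slow.
    ...
```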

Implementation steps

keileg commented 1 year ago

Thanks for the nice writeup, and for trying out task lists inside an issue.

Some thoughts:

It may also be wise to ask Ivar for opinions after the two of us are done discussing, but before implementing something.

jhabriel commented 1 year ago

Thanks for the input. Super valid questions, I will try to answer them below:

Of course. If you are satisfied with the answers, feel free to tag him on this issue to have his input.

jhabriel commented 1 year ago

@keileg, @OmarDuran and @IvarStefansson:

I would like to have your opinion on what constitutes a slow test, i.e., what the criterion should be for marking a test as slow based on the amount of time it takes to run. Please leave your thumbs up on one of the options below.

Also, if you have a concrete suggestion for the bullet point number 4 of https://github.com/pmgbergen/porepy/issues/856#issuecomment-1505369611, please let me know.

Thanks in advance.

IvarStefansson commented 1 year ago

I have two comments: 1) I think run time forms only part of the basis for choosing which tests to skip. Their importance to the test suite is also relevant. The importance criterion includes which part of the code the test covers, the suite's coverage of that part of the code, and possibly more. 2) I suggest a nuance to Eirik's check box: it should not apply to "trivial" PRs (only style, documentation, very basic maintenance, etc.).

jhabriel commented 1 year ago

I agree. Importance should definitely be considered too. That is, if a test qualifies as slow but is crucial for the test suite, I assume we will keep it. However, I see those as outliers rather than the norm.

Now, there might also be non-crucial fast tests that we might want to skip, as you say. My suggestion would therefore be to create a different marker for these types of tests, e.g., non_crucial.

That being said, we still need a criterion to classify a test as slow. And if the non_crucial marker is introduced, also a criterion to classify them as such.

IvarStefansson commented 1 year ago

I would hesitate to introduce too many labels for now. If we can get away with a single flag to indicate whether a test is run always or only on "full runs", that would probably be easier. Introducing more categories later is of course possible if the need arises.

jhabriel commented 1 year ago

I agree that introducing too many categories is a no-go. However, there are a few reasons why you would want to categorize skipped tests:

  1. I believe it is important to know why a test is being skipped, not only that it is being skipped.
  2. Categorizing skipped tests unlocks the possibility to only run those belonging to that category, without having to run the full test suite.
  3. For people outside of the core developer group, it will be more transparent to know when one can or has to skip a test.

To your point, I think there should be strict rules for categorizing tests.

jhabriel commented 1 year ago

Solved via #863.