matthewcarbone / GGCE

Numerically exact Green's functions for lattice polaron models, easily and efficiently
https://matthewcarbone.github.io/GGCE/
BSD 3-Clause "New" or "Revised" License

Fix PETSc tests in CI #54

Closed matthewcarbone closed 2 years ago

matthewcarbone commented 2 years ago

@Chiffafox looks like I've got almost everything figured out on ubuntu-latest. However, the PETSc tests seem to finish with 100% success, but then they kinda randomly fail at the end.

Would you mind verifying that these work locally for you? Here's the action in question: https://github.com/x94carbone/GGCE/actions/runs/3246363273/jobs/5325016321.

matthewcarbone commented 2 years ago

Yeah, I can't even get this to run locally, but that's probably my M1, not your tests!

(py3.9) > $ mpiexec -n 4 pytest -v -s --cov --cov-report xml --cov-append --with-mpi ggce/_tests/petsc/*.py                                               [±master ●]
======================================================================== test session starts =========================================================================
platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- /Users/mc/miniforge3/envs/py3.9/bin/python3.9
======================================================================== test session starts =========================================================================
platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- /Users/mc/miniforge3/envs/py3.9/bin/python3.9
cachedir: .pytest_cache
rootdir: /Users/mc/GitHub/GGCE
plugins: anyio-3.6.1, mpi-0.6, cov-3.0.0
collecting ... cachedir: .pytest_cache
rootdir: /Users/mc/GitHub/GGCE
plugins: anyio-3.6.1, mpi-0.6, cov-3.0.0
collecting ... ======================================================================== test session starts =========================================================================
platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- /Users/mc/miniforge3/envs/py3.9/bin/python3.9
======================================================================== test session starts =========================================================================
platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- /Users/mc/miniforge3/envs/py3.9/bin/python3.9
cachedir: .pytest_cache
rootdir: /Users/mc/GitHub/GGCE
plugins: anyio-3.6.1, mpi-0.6, cov-3.0.0
collecting ... cachedir: .pytest_cache
rootdir: /Users/mc/GitHub/GGCE
plugins: anyio-3.6.1, mpi-0.6, cov-3.0.0
collected 25 items
collected 25 items
collected 25 items

collected 25 items

ggce/_tests/petsc/test_hp_fint_petsc.py::test_zero_vs_tiny_T[p0]
ggce/_tests/petsc/test_hp_fint_petsc.py::test_zero_vs_tiny_T[p0]
ggce/_tests/petsc/test_hp_fint_petsc.py::test_zero_vs_tiny_T[p0] 2022-10-13 21:15:18 Predicted 4 generalized equations (agrees with analytic formula)
2022-10-13 21:15:18.562 ggce.engine.system:checkpoint:311 |WARNING   | root not provided to System - System checkpointing disabled
2022-10-13 21:15:18 Generated 7 total equations
2022-10-13 21:15:18.565 ggce.engine.system:checkpoint:311 |WARNING   | root not provided to System - System checkpointing disabled
2022-10-13 21:15:18 Closure checked and valid
2022-10-13 21:15:18.565 ggce.executors.solvers:__init__:65 |WARNING   | root not provided to Solver - Solver checkpointing disabled
2022-10-13 21:15:18.565 ggce.executors.petsc4py.base:__init__:86 |WARNING   | Only one brigade, no splitting required. Using original MPI_COMM.
2022-10-13 21:15:18 Matrices solved by the engine are being computed on the fly from the basis.
2022-10-13 21:15:18.565 ggce.executors.petsc4py.base:get_jobs_on_this_brigade:137 |WARNING   | Chunking jobs with COMM_WORLD_SIZE=1
2022-10-13 21:15:18 Predicted 28 generalized equations
2022-10-13 21:15:18.634 ggce.engine.system:checkpoint:311 |WARNING   | root not provided to System - System checkpointing disabled
2022-10-13 21:15:18 Generated 61 total equations
2022-10-13 21:15:18.677 ggce.engine.system:checkpoint:311 |WARNING   | root not provided to System - System checkpointing disabled
2022-10-13 21:15:18 Closure checked and valid
2022-10-13 21:15:18.680 ggce.executors.solvers:__init__:65 |WARNING   | root not provided to Solver - Solver checkpointing disabled
2022-10-13 21:15:18.680 ggce.executors.petsc4py.base:__init__:86 |WARNING   | Only one brigade, no splitting required. Using original MPI_COMM.
2022-10-13 21:15:18 Matrices solved by the engine are being computed on the fly from the basis.
2022-10-13 21:15:18.681 ggce.executors.petsc4py.base:get_jobs_on_this_brigade:137 |WARNING   | Chunking jobs with COMM_WORLD_SIZE=1
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
/Users/mc/miniforge3/envs/py3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/mc/miniforge3/envs/py3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/mc/miniforge3/envs/py3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/mc/miniforge3/envs/py3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
matthewcarbone commented 2 years ago

Should we be worried about this?

2022-10-13 21:15:18.565 ggce.executors.petsc4py.base:get_jobs_on_this_brigade:137 |WARNING   | Chunking jobs with COMM_WORLD_SIZE=1

Specifically the COMM_WORLD_SIZE=1 part?

Chiffafox commented 2 years ago

Hey Matt. On the first point: this test and all the other PETSc ones run fine for me locally. Every time I add a new test I run through all the previous ones; I also created a nopetsc conda environment so I can specifically test the skipping of the tests (so I don't make the same import mistakes as earlier). I suspect what's failing here is that PETSc is not being installed with the full configuration by the GitHub action.

Just checked out the action: we need to add some environment variables to make sure that when pip does its thing, it installs PETSc properly configured, i.e. with support for complex numbers and various auxiliary libraries like scalapack and mumps. Specifically, we need to add this line to the installation action:

export PETSC_CONFIGURE_OPTIONS="--with-scalar-type=complex --download-mumps --download-scalapack"

Once we have it, I think it should work. Am I allowed to edit the actions?
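(For reference, doing the same thing locally would look roughly like the following; just a sketch, since pip builds PETSc from source and the build takes a while.)

# set the configure options before pip builds PETSc from source
export PETSC_CONFIGURE_OPTIONS="--with-scalar-type=complex --download-mumps --download-scalapack"
# install PETSc and its Python bindings; the options above are picked up during the build
pip install petsc petsc4py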

EDIT: wait, I got mixed up between your local stack trace and the GitHub action. I am surprised that all the tests work, considering that the PETSc being installed does not have complex number support.

On the Github stacktrace from the action you linked, it seems that the pytest coverage code is failing, no?

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/coverage/sqldata.py", line 1106, in execute
INTERNALERROR>     return self.con.execute(sql, parameters)
INTERNALERROR> sqlite3.OperationalError: no such table: file
INTERNALERROR> 
INTERNALERROR> During handling of the above exception, another exception occurred:
INTERNALERROR> 
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/coverage/sqldata.py", line 1111, in execute
INTERNALERROR>     return self.con.execute(sql, parameters)
INTERNALERROR> sqlite3.OperationalError: no such table: file
Chiffafox commented 2 years ago

On the second question: yes, this is okay. Some of the tests will run PETSc with brigade_size = world_size, i.e. with all available ranks dedicated to the parallel solution of a single (k,w) point. In that case there is no need to chunk jobs, so it gives the warning. Looking at the two lines above the one you mentioned, there is the line saying that we are running with one brigade only. So chunking with COMM_WORLD_SIZE=1 is expected behaviour.

2022-10-13 21:15:18.565 ggce.executors.petsc4py.base:__init__:86 |WARNING   | Only one brigade, no splitting required. Using original MPI_COMM.
2022-10-13 21:15:18 Matrices solved by the engine are being computed on the fly from the basis.
2022-10-13 21:15:18.565 ggce.executors.petsc4py.base:get_jobs_on_this_brigade:137 |WARNING   | Chunking jobs with COMM_WORLD_SIZE=1
Chiffafox commented 2 years ago

I just checked, and if I run the tests locally with the command you gave above, mpiexec -n 4 pytest -v -s --cov --cov-report xml --cov-append --with-mpi ggce/_tests/petsc/*.py, it fails at the end with the same stacktrace as for you. But if I run without the --cov parts, i.e. mpiexec -n 4 pytest -v --with-mpi ggce/_tests/petsc/*.py, then it goes through fine. I'm not that familiar with the --cov things, but I will try to debug a bit locally.

Chiffafox commented 2 years ago

I wonder if --cov does not play nicely with MPI testing. If I run without mpiexec, then it skips the tests just fine and the full command does not error out; it is able to write coverage.xml and everything.

Chiffafox commented 2 years ago

Yup, looks like this is an active issue: pytest-dev/pytest-cov#237.

matthewcarbone commented 2 years ago

@Chiffafox Ok lots to unpack here:

First, I do include export PETSC_CONFIGURE_OPTIONS="--with-scalar-type=complex --download-mumps --download-scalapack" in the CI file:

    - name: Setup PETSc
      run: pip install petsc petsc4py
      env:
        PETSC_CONFIGURE_OPTIONS: "--with-scalar-type=complex --download-mumps --download-scalapack"

The PETSC_CONFIGURE_OPTIONS should be exposed as an environment variable during that step. That's why the tests are passing. This seems like another issue.
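If we want a quick sanity check that those options actually took effect on the runner, something like this could be added as a step (just a sketch; I'm assuming petsc4py exposes the scalar type as PETSc.ScalarType):

# should print a complex dtype (e.g. complex128) if --with-scalar-type=complex was honored
python -c "from petsc4py import PETSc; print(PETSc.ScalarType)"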

On the Github stacktrace from the action you linked, it seems that the pytest coverage code is failing, no?

I'm honestly not sure! 😁

I just checked, and if I run the tests locally with the command you gave above, mpiexec -n 4 pytest -v -s --cov --cov-report xml --cov-append --with-mpi ggce/_tests/petsc/*.py, it fails at the end with the same stacktrace as for you. But if I run without the --cov parts, i.e. mpiexec -n 4 pytest -v --with-mpi ggce/_tests/petsc/*.py, then it goes through fine. I'm not that familiar with the --cov things, but I will try to debug a bit locally.

Yup, looks like this is an active issue: https://github.com/pytest-dev/pytest-cov/issues/237.

That's really interesting... though as you say, this is an open issue. For now I guess we just can't upload coverage from those tests.

On the second question: yes, this is okay. Some of the tests will run PETSc with brigade_size = world_size, i.e. with all available ranks dedicated to the parallel solution of a single (k,w) point. In that case there is no need to chunk jobs, so it gives the warning. Looking at the two lines above the one you mentioned, there is the line saying that we are running with one brigade only. So chunking with COMM_WORLD_SIZE=1 is expected behaviour.

Sounds good.

Anyway, if you have the time, feel free to modify the CI file and push to master to trigger the runs. I might run out of compute time soon while the repo is still private, so just try not to overdo it!

matthewcarbone commented 2 years ago

Crazy question: can we run PETSc with WORLD_SIZE=1? Perhaps that will allow the PETSc test to run.

Chiffafox commented 2 years ago

Ah okay I missed the config export, sorry! 😃

I am now following a fix suggested by someone in that issue, using a setup.cfg for the coverage -- let's see if it works, and if it does, we can think about how to integrate it here. What I gathered from reading the issue and the linked threads is that, basically, the problem is that pytest is not MPI-aware, so at some point not all ranks have finished writing their coverage reports (or the reports are slightly different), and pytest fails with an sqlite read error when trying to combine them.

Crazy question: can we run PETSc with WORLD_SIZE=1? Perhaps that will allow the PETSc test to run.

We certainly can, but the point of most of the tests is to actually test PETSc running in the parallel (and double-parallel, system-saving, etc.) regime. At WORLD_SIZE=1 many of the tests would become redundant and wouldn't test that functionality.

Again, the PETSc tests themselves run okay, like you said above -- it's only the pytest coverage that fails in MPI mode, at the very last step of combining the reports.

Chiffafox commented 2 years ago

Okay, looks like the fix offered in that issue works for me locally. The fix is to create a setup.cfg file with

[coverage:run]
parallel = true

and then add --cov-config=setup.cfg to the list of arguments when running pytest. Namely, the new command would be

mpiexec -np 4 pytest -v -s --cov --cov-report xml --cov-append --cov-config=setup.cfg --with-mpi ggce/_tests/petsc/*.py

I will run this locally for all the tests to make sure it works. Then I was thinking of adding the petsc_test_setup.cfg file to the _tests/petsc folder and amending the pytest command in the Actions workflow accordingly. How does that sound?
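For concreteness, the amended Actions command could then look something like this (a sketch, assuming the config file ends up at ggce/_tests/petsc/petsc_test_setup.cfg):

# same invocation as before, but pointing pytest-cov at the parallel-enabled config
mpiexec -np 4 pytest -v -s --cov --cov-report xml --cov-append --cov-config=ggce/_tests/petsc/petsc_test_setup.cfg --with-mpi ggce/_tests/petsc/*.py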

Chiffafox commented 2 years ago

Okay, I think I found a way to do it. It seems to work for me locally. The serial part is unchanged, but to run the MPI and PETSc tests I will use the commands

mpiexec -n 2 coverage run --rcfile=ggce/_tests/mpi/setup_pytest_mpi.cfg -m pytest -v --with-mpi ggce/_tests/mpi/*.py

and

mpiexec -n 4 coverage run --rcfile=ggce/_tests/petsc/setup_pytest_petsc_mpi.cfg -m pytest -v --with-mpi ggce/_tests/petsc/*.py

respectively. This runs exactly the same thing as pytest --cov, except that it allows the different MPI ranks to explicitly write to different .coverage.rank###.timestamp files, which is threadsafe and raises no sqlite problems. Afterwards, I will add a step that combines the files and produces the final xml report

coverage combine && coverage xml

which I think should be easily uploadable by codecov. How does this sound? I've got a branch with this, plus the numpy warning ignore and the spectrum renaming; I will open a PR momentarily.
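(For the record, the rcfiles themselves would be minimal; roughly just coverage's parallel switch. Creating one would look something like this; the exact contents here are only a sketch:)

# create the coverage rcfile used above; a dedicated rcfile can use the plain [run] section
cat > ggce/_tests/petsc/setup_pytest_petsc_mpi.cfg <<'EOF'
[run]
# write one data file per process (rank) instead of sharing a single .coverage database
parallel = true
EOF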

matthewcarbone commented 2 years ago

This is pretty sick, yeah I'll check out the PR. Thank you!

matthewcarbone commented 2 years ago

@Chiffafox I think we're done with this, no?

Chiffafox commented 2 years ago

Yessir! Closing the issue now with the acceptance of PR #55.