pytest-dev / pytest-cov

Coverage plugin for pytest.
MIT License
1.75k stars 212 forks source link

Parallel (mpi4py) tests with coverage often lead to INTERNALERROR #237

Open EmilyBourne opened 5 years ago

EmilyBourne commented 5 years ago

When running tests in parallel for a program which uses mpi4py, often an INTERNALERROR is signalled. The error arises due to a corruption of the .coverage file. This seems to be due to a race condition (hence why the error does not always arise). More specifically I think it is due to concurrent write commands.

The command which gave this error was: mpirun -n 2 py.test pygyro/diagnostics/ -sxm parallel --cov=pygyro

The operating system is linux, ubuntu bionic beaver 18.04.1. I am using python 3.6.6.

The error generated is as follows:

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 293, in read_file
INTERNALERROR>     self.read_fileobj(f)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 271, in read_fileobj
INTERNALERROR>     data = self._read_raw_data(file_obj)

INTERNALERROR> Traceback (most recent call last):INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 314, in _read_raw_data

INTERNALERROR>     return json.load(file_obj)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 293, in read_file
INTERNALERROR>   File "/usr/lib/python3.6/json/__init__.py", line 299, in loadINTERNALERROR>     self.read_fileobj(f)

INTERNALERROR>     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 271, in read_fileobj
INTERNALERROR>     data = self._read_raw_data(file_obj)
INTERNALERROR>   File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 314, in _read_raw_dataINTERNALERROR>     return _default_decoder.decode(s)

INTERNALERROR>     return json.load(file_obj)
INTERNALERROR>   File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
INTERNALERROR>   File "/usr/lib/python3.6/json/__init__.py", line 299, in loadINTERNALERROR>     raise JSONDecodeError("Extra data", s, end)

INTERNALERROR>     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
INTERNALERROR> json.decoder.JSONDecodeError: Extra data: line 1 column 9466 (char 9465)
INTERNALERROR> 
INTERNALERROR>   File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
INTERNALERROR> During handling of the above exception, another exception occurred:INTERNALERROR>     return _default_decoder.decode(s)

INTERNALERROR> 
INTERNALERROR>   File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>     raise JSONDecodeError("Extra data", s, end)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/_pytest/main.py", line 184, in wrap_session
INTERNALERROR> json.decoder.JSONDecodeError: Extra data: line 1 column 9466 (char 9465)
INTERNALERROR>     session.exitstatus = doit(config, session) or 0INTERNALERROR> 

INTERNALERROR> During handling of the above exception, another exception occurred:
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/_pytest/main.py", line 224, in _mainINTERNALERROR> 

INTERNALERROR> Traceback (most recent call last):INTERNALERROR>     config.hook.pytest_runtestloop(session=session)

INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/_pytest/main.py", line 184, in wrap_session
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 284, in __call__
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/_pytest/main.py", line 224, in _main
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/manager.py", line 67, in _hookexec
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 284, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/manager.py", line 61, in <lambda>
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/manager.py", line 67, in _hookexec
INTERNALERROR>     firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/callers.py", line 203, in _multicall
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/manager.py", line 61, in <lambda>
INTERNALERROR>     gen.send(outcome)
INTERNALERROR>     firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pytest_cov/plugin.py", line 228, in pytest_runtestloop
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pluggy/callers.py", line 203, in _multicall
INTERNALERROR>     gen.send(outcome)
INTERNALERROR>     self.cov_controller.finish()
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pytest_cov/plugin.py", line 228, in pytest_runtestloop
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pytest_cov/engine.py", line 167, in finish
INTERNALERROR>     self.cov_controller.finish()
INTERNALERROR>     self.cov.stop()
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/pytest_cov/engine.py", line 167, in finish
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/control.py", line 677, in loadINTERNALERROR>     self.cov.stop()

INTERNALERROR>     self.data_files.read(self.data)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/control.py", line 677, in load
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 653, in readINTERNALERROR>     self.data_files.read(self.data)

INTERNALERROR>     data.read_file(self.filename)
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 653, in read
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 297, in read_fileINTERNALERROR>     data.read_file(self.filename)

INTERNALERROR>     filename, exc.__class__.__name__, exc,
INTERNALERROR>   File "/home/emily/.local/lib/python3.6/site-packages/coverage/data.py", line 297, in read_file
INTERNALERROR>     filename, exc.__class__.__name__, exc,
INTERNALERROR> coverage.misc.CoverageException: Couldn't read data from '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage': JSONDecodeError: Extra data: line 1 column 9466 (char 9465)
INTERNALERROR> coverage.misc.CoverageException: Couldn't read data from '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage': JSONDecodeError: Extra data: line 1 column 9466 (char 9465)

========================= 1731 passed in 17.66 seconds =========================
========================= 1731 passed in 17.66 seconds =========================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[29764,1],0]
  Exit code:    3
--------------------------------------------------------------------------
blueyed commented 5 years ago

INTERNALERROR> coverage.misc.CoverageException: Couldn't read data from '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage': JSONDecodeError: Extra data: line 1 column 9466 (char 9465)

This looks like you're using coverage.py 5, and there you're mixing the (new, default) SQLite database with the legacy JSON format.

Try using --cov-config to pass your coveragerc/config explicitly.

You can also try export COVERAGE_DEBUG=process,config to get info about coverage processes being started and which config they are using.

EmilyBourne commented 5 years ago

My version of coverage is 4.5.2.

I don't get this problem when running anything in serial so I don't think it should be a versionning issue.

The reason I think that it's a concurrent write problem is that I have occasionally seen things such as: :[]}},[]}} at the end of the file.

Using --cov-config doesn't make any difference. Is there anything I need to add to the coveragerc file?

Here is the output from export COVERAGE_DEBUG=process,config I can't see any problems

-- config ----------------------------------------------------
               _include: None
                  _omit: None
 attempted_config_files: .coveragerc
                 branch: False
            concurrency: None
           config_files: .coveragerc
            cover_pylib: False
              data_file: /home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage
                  debug: process
                         config
       disable_warnings: -none-
           exclude_list: #\s*(pragma|PRAGMA)[:\s]?\s*(no|NO)\s*(cover|COVER)
              extra_css: None
             fail_under: 0.0
               html_dir: htmlcov
             html_title: Coverage report
          ignore_errors: False
                   note: None
               parallel: False
    partial_always_list: while (True|1|False|0):
                         if (True|1|False|0):
           partial_list: #\s*(pragma|PRAGMA)[:\s]?\s*(no|NO)\s*(branch|BRANCH)
                  paths: {}
         plugin_options: {}
                plugins: -none-
              precision: 0
         report_include: None
            report_omit: *test*
                         pygyro/advection/*mod*
                         pygyro/initialisation/*mod*
                         pygyro/splines/*mod*
                         pygyro/advection/accelerated_advection_steps.py
                         pygyro/initialisation/initialiser_func.py
                         pygyro/splines/spline_eval_funcs.py
                         pygyro/tools/*
                         pygyro/utilities/*
            run_include: None
               run_omit: None
           show_missing: False
           skip_covered: False
                 source: pygyro
                  timid: False
             xml_output: current_coverage.xml
      xml_package_depth: 99
-- end -------------------------------------------------------
blueyed commented 5 years ago

With export COVERAGE_DEBUG=process,config you should see multiple outputs of the config, and I assume that some of them do not get your "config_files" right. What is your .coveragerc?

EmilyBourne commented 5 years ago

I have the multiple outputs but they all say the same thing: config_files: .coveragerc

My .coveragerc is not very complicated it only contains information about which files to omit and the name of the xml file if it is to be generated:

[report]
omit = *test*
       pygyro/advection/*mod*
       pygyro/initialisation/*mod*
       pygyro/splines/*mod*
       pygyro/advection/accelerated_advection_steps.py
       pygyro/initialisation/initialiser_func.py
       pygyro/splines/spline_eval_funcs.py
       pygyro/tools/*
       pygyro/utilities/*

[xml]
output = current_coverage.xml

I have just had another look at the doc and tried adding

[run]
parallel = True

to the file. This does not remove the error. I have also tried adding the "concurrency" argument too but I am not sure what to specify as the library. concurrency = mpi4py does not change the situation, concurrency = multiprocessing leads to a different problem :

WARNING: Failed to generate report: No data to report.

Coverage.py warning: Couldn't read data from '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage.emily-SATELLITE-Z30-A.31964.442210': FileNotFoundError: [Errno 2] No such file or directory: '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage.emily-SATELLITE-Z30-A.31964.442210'
EmilyBourne commented 5 years ago

Actually I seem to be getting this error even without the concurrency argument when I set parallel = True:

Coverage.py warning: Couldn't read data from '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage.emily-SATELLITE-Z30-A.22702.434517': CoverageException: Doesn't seem to be a coverage.py data file
blueyed commented 5 years ago

Coverage.py warning: Couldn't read data from '/home/emily/Documents/Cours_TUM/projet/Code/PyGyro/.coverage.emily-SATELLITE-Z30-A.22702.434517': CoverageException: Doesn't seem to be a coverage.py data file

How does this file look then?

With COVERAGE_DEBUG=process you should be seeing where this process (22702 or 434517) gets created. There are several other options for COVERAGE_DEBUG, e.g. "dataio" that could be helpful here to debug this. See https://github.com/nedbat/coveragepy/blob/cc12f4b7c40347b7297f7f6d938150bfde8c9ed5/doc/cmd.rst#diagnostics.

1313e commented 4 years ago

There is a very easy way to fix this: Don't write to the same coverage file from multiple cores. This can be done by putting the following line in the setup.cfg file (or whatever you use as the configuration file):

[coverage:run]
parallel = true

and run your tests with:

mpiexec -n 2 coverage run --rcfile=setup.cfg -m mpi4py -m pytest
coverage combine
coverage report -m

This will create a separate .coverage file for every core that executed the tests, and combine them afterward into a single report.

Note that this also makes it easier to test a package (for example) both in serial and in parallel, and combine all reports into a single one, by doing this:

coverage run --rcfile=setup.cfg -m pytest
mpiexec -n 2 coverage run --rcfile=setup.cfg -m mpi4py -m pytest
coverage combine
coverage report -m

Finally, also note that the console output when running pytest in MPI this way, will probably be messed up a little bit, as both processes try to write to the console simultaneously. This has no effect at all on the actual testing process itself, but may make it a bit harder to figure out what process printed what. As far as I know, there is no way to fix this without making pytest somehow MPI-aware.

JulienPalard commented 4 years ago

This can be related to https://github.com/nedbat/coveragepy/issues/883#issuecomment-650562896