tomMoral / dicodile

Experiments for "Distributed Convolutional Dictionary Learning (DiCoDiLe): Pattern Discovery in Large Images and Signals"
https://tommoral.github.io/dicodile/
BSD 3-Clause "New" or "Revised" License

Test mpi versions #20

Open hndgzkn opened 3 years ago

hndgzkn commented 3 years ago

Runs the tests:

- Tests with openmpi on ubuntu-18.04 fail due to #12.
- Tests with mpich on both ubuntu-18.04 and ubuntu-20.04 fail due to #19.

codecov[bot] commented 3 years ago

Codecov Report

Merging #20 (75a5004) into main (0aad2ea) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 75a5004 differs from pull request most recent head 909cdcf. Consider uploading reports for the commit 909cdcf to get more accurate results.

@@           Coverage Diff           @@
##             main      #20   +/-   ##
=======================================
  Coverage   74.29%   74.29%           
=======================================
  Files          41       41           
  Lines        2587     2587           
=======================================
  Hits         1922     1922           
  Misses        665      665           
| Flag | Coverage Δ |
| --- | --- |
| unittests | 74.29% <ø> (ø) |





tomMoral commented 3 years ago

I am not sure why it is still set to fail fast. Did you rebase on master? I saw you did, so I am not sure why the tests are stopped then.

It would be nice to have all these tests, potentially with xfail on the configurations known to cause problems?

hndgzkn commented 3 years ago

> I am not sure why it is still set to fail fast. ~~Did you rebase on master?~~ I saw you did, so I am not sure why the tests are stopped then.
>
> It would be nice to have all these tests, potentially with xfail on the configurations known to cause problems?

They are stopped because of a timeout.

The tests with mpich hang at some point due to #19, then wait until the maximum GitHub Actions timeout; when it is reached, they are cancelled.

hndgzkn commented 3 years ago

@tomMoral The main problem for the tests with mpich is that we need to run the tests with `mpiexec -np 1 pytest ..` due to https://github.com/pmodels/mpich/issues/4853. But when the tests are run with `mpiexec` (for both openmpi and mpich), there is a problem with stopping the spawned processes. I do not know how to release the resources started by MPI for the tests (the problem appears only when running the tests). Do you have any idea?
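
For reference, here is a minimal sketch (not dicodile's actual shutdown code) of the usual mpi4py pattern for releasing spawned workers from the parent side; `worker.py` and the process count are placeholders:

```python
import sys
from mpi4py import MPI

# Spawn worker processes; this returns an intercommunicator to the children.
intercomm = MPI.COMM_SELF.Spawn(sys.executable, args=["worker.py"], maxprocs=4)

# ... exchange work with the spawned processes here ...

# Shut the workers down: wait until every process reaches this point, then
# disconnect the intercommunicator so MPI can release the workers' resources.
intercomm.Barrier()
intercomm.Disconnect()
```

The open question is why this kind of teardown does not complete cleanly when pytest itself is launched through `mpiexec`.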

tomMoral commented 3 years ago

The problem seems to be in the init of MPI, with an issue on an argument, no?

It seems that the process hangs just before calling `dicodile/tests/test_dicodile.py::test_dicodile` (screenshot). I think one of the issues is that `from mpi4py import MPI` only returns once `MPI_Init` completes. This call is triggered by the import, so it is hard to think of a way to detect the failure if the call itself never returns.

One way to detect this would be to wrap the import between a `faulthandler.dump_traceback_later(timeout=120)` and a `faulthandler.cancel_dump_traceback_later()`, so that the process exits if the import hangs for more than 2 minutes, with info that might help with debugging.
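
A minimal sketch of what that wrapping could look like (the 120 s timeout and the `exit=True` flag are illustrative choices, not a final configuration):

```python
import faulthandler

# If the import hangs (e.g. MPI_Init never completes), dump the tracebacks of
# all threads after 2 minutes and exit, so the CI log contains debugging info
# instead of the job being killed by the GitHub Actions timeout.
faulthandler.dump_traceback_later(timeout=120, exit=True)
try:
    from mpi4py import MPI  # returns only once MPI_Init has completed
finally:
    # The import finished (or raised): cancel the pending traceback dump.
    faulthandler.cancel_dump_traceback_later()
```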

WDYT?

hndgzkn commented 3 years ago

> The problem seems to be in the init of MPI, with an issue on an argument, no?
>
> It seems that the process hangs just before calling `dicodile/tests/test_dicodile.py::test_dicodile` (screenshot). I think one of the issues is that `from mpi4py import MPI` only returns once `MPI_Init` completes. This call is triggered by the import, so it is hard to think of a way to detect the failure if the call itself never returns.
>
> One way to detect this would be to wrap the import between a `faulthandler.dump_traceback_later(timeout=120)` and a `faulthandler.cancel_dump_traceback_later()`, so that the process exits if the import hangs for more than 2 minutes, with info that might help with debugging.
>
> WDYT?

@tomMoral As far as I understand, this message is due to the Singleton feature not being implemented in mpich, see the mpich issue on GitHub.

Details are explained in #19.

I think with mpich we need to run the tests with:

`mpirun -np 1 --host localhost:16 pytest`

Note: actually we can use the same command for both mpich and openmpi. As the hostfile formats for mpich and openmpi are not the same, `--host localhost:16` avoids having to set a hostfile.

When we use the above command, I think the openmpi version should be able to stop the spawned processes properly. That makes me think that the code to stop the spawned processes might not be reliable.

hndgzkn commented 3 years ago

@tomMoral I tried using mpich with a very simple MPI program that spawns a number of processes (getting the hostfile from the environment) to see if the problem arises from the dicodile code.
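
Roughly, such a program looks like this (a sketch only: `prog.py`, `worker.py`, the `hostfile` info key and the `MPI_HOSTFILE` variable are illustrative, not the exact script I used):

```python
# prog.py -- spawn a few worker processes, optionally passing a hostfile taken
# from the environment, then shut them down again.
import os
import sys
from mpi4py import MPI

info = MPI.Info.Create()
hostfile = os.environ.get("MPI_HOSTFILE")
if hostfile is not None:
    # The "hostfile" info key is understood by Open MPI; other MPI
    # implementations may expect a different mechanism.
    info.Set("hostfile", hostfile)

intercomm = MPI.COMM_SELF.Spawn(
    sys.executable, args=["worker.py"], maxprocs=4, info=info
)
intercomm.Barrier()
intercomm.Disconnect()
print("spawn/disconnect completed")
```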

With openmpi I can run the program as:

`python prog.py`

If I do the same with mpich, I get the above error, i.e. the unrecognized `pmi_args` argument. I need to run it as:

`mpirun -np 1 python prog.py`

I think this is really due to the Singleton feature not being implemented in mpich.

I propose to change the testing command to

`mpirun -np 1 --host localhost:16 python -m pytest`

and fix the hanging problem and other possible problems afterwards.

WDYT?