ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Finalize hangs when using -mca mpi_ft_detector_thread true #55

Closed abouteiller closed 3 years ago

abouteiller commented 4 years ago

Original report by Alexander Hölzl (Bitbucket: [Alexander Hölzl](https://bitbucket.org/Alexander Hölzl)).


ULFM seems to deadlock in MPI_Finalize when the MCA parameter mpi_ft_detector_thread is enabled.
The version I tested is ulfm2 4.0.2u1.

I have observed this error on my private computer as well as the Leibniz Supercomputing Center’s
CoolMuc2 Linux Cluster.

I have attached a program that reproduces the bug, but it is basically enough to initialize MPI and
call MPI_Finalize immediately afterwards.

If this is an error on my side I apologize, but I have not found any other information regarding this behavior.

abouteiller commented 3 years ago

Original comment by Bitbucket user (Bitbucket: wangh0a, GitHub: wangh0a).


Hi Alexander,

I encountered the same thing. Did you find a clean solution? Even worse, when I disable mpi_ft_detector_thread, I always get false positives.

abouteiller commented 3 years ago

Original comment by Bitbucket user (Bitbucket: wangh0a, GitHub: wangh0a).


Well, actually I found a workaround: use --mca mpi_ft_detector false.

Can someone clarify: if I set --mca mpi_ft_detector false, what do --mca mpi_ft_detector_timeout and --mca mpi_ft_detector_period do? Are they simply ignored?

abouteiller commented 3 years ago

Original comment by Alexander Hölzl (Bitbucket: [Alexander Hölzl](https://bitbucket.org/Alexander Hölzl)).


Hey, sorry for the late answer.

I needed ULFM for my bachelor’s thesis, and after I finished it I stopped checking this repo.

The runtime options for the ft_detector are explained in more detail here: https://fault-tolerance.org/2019/11/18/ulfm-4-0-2u1/#Run-time_tuning_knobs

If you set the --mca mpi_ft_detector option to false, you essentially switch off ULFM’s failure detection, so the timeout and period options should be ignored. In that case, though, using ULFM doesn’t really make sense, since error handling is the main reason to use it.

For my bachelor’s thesis I had to write a wrapper for MPI that introduced transparent failure tolerance using the PMPI profiling interface, which lets you override the normal MPI functions with your own implementations. Usually you would use it to inject some calls to profiling code and then just call the normal MPI functions, but as I said, I used it for failure tolerance.

That also allowed me to solve this problem in a very inelegant way: I basically replaced the call to MPI_Finalize with my own version of MPI_Finalize that used MPI_Allreduce on a special communicator. I forgot how it works exactly, but if you need it I can look it up.

I hope that helps you at least a bit, but it seems that the development of ULFM has stopped, so I’m not sure if this bug will ever get fixed.

Edit: Forget what I said; there have been new commits this year, so sorry for my remark about the bug not getting fixed :(