upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

comm corruption during comm_update #24

Closed by upperwal 6 years ago

upperwal commented 6 years ago

comm_update and the MPI_* functions use the same comm at the same time. While comm_update is updating the comms, an MPI_* call can pick up a comm that is only partially updated (corrupt). This results in a segmentation fault.

upperwal commented 6 years ago

Possible solution: synchronise comm_update and the MPI_* functions so that they never access a comm concurrently.

upperwal commented 6 years ago

Access to the comm is now synchronised wherever it is used.

In MPI_* functions:
https://github.com/upperwal/EntangledMPI/blob/db65b91021dac7c87d1252ad90d0d7d17eeb182f/src/mpi/init.c#L541
https://github.com/upperwal/EntangledMPI/blob/db65b91021dac7c87d1252ad90d0d7d17eeb182f/src/mpi/init.c#L590

In update_comm:
https://github.com/upperwal/EntangledMPI/blob/db65b91021dac7c87d1252ad90d0d7d17eeb182f/src/mpi/comm.c#L143
https://github.com/upperwal/EntangledMPI/blob/db65b91021dac7c87d1252ad90d0d7d17eeb182f/src/mpi/comm.c#L180