upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Improving MPI_ANY_SOURCE algo #34

Open upperwal opened 6 years ago

upperwal commented 6 years ago

https://github.com/upperwal/EntangledMPI/blob/11bf509dadf063dc6a37eb234fdb62b0a2e7e859/src/mpi/init.c#L1090-L1097

This is not a good implementation. It only supports 2 nodes (1 compute and 1 replica) and should be more generic.

One way to make it generic is to post a single PMPI_Irecv with MPI_ANY_SOURCE and spawn a new thread that calls PMPI_Wait on that request. Once the request completes, the thread can determine the number of nodes in the sending job ("n") from status.MPI_SOURCE and the rank2job array, post the remaining "n - 1" PMPI_Irecv calls, and exit. The rest will be handled by the MPI_Wait implementation.
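A rough sketch of the bookkeeping step only, i.e. what the helper thread could do once PMPI_Wait returns. It assumes `rank2job[r]` holds the job id of world rank `r`, which may not match the actual layout in this repo, and `job_peers` is a made-up helper name. The MPI calls are left out so the lookup can be checked on its own.

```c
#include <assert.h>
#include <stddef.h>

/*
 * ASSUMPTION: rank2job[r] is the job id that world rank r belongs to.
 * The real rank2job structure in EntangledMPI may be laid out differently.
 *
 * Given the rank that matched the MPI_ANY_SOURCE receive (status.MPI_SOURCE),
 * collect the other ranks of the same job into peers[].
 * Returns the number of such ranks ("n - 1"), or -1 if source is out of range.
 * At most max_peers entries are written to peers[].
 */
static int job_peers(const int *rank2job, int world_size, int source,
                     int *peers, int max_peers)
{
    assert(rank2job != NULL && peers != NULL);

    if (source < 0 || source >= world_size)
        return -1;

    int job = rank2job[source];
    int count = 0;

    for (int r = 0; r < world_size; r++) {
        /* Skip the rank that already matched, and ranks of other jobs. */
        if (r == source || rank2job[r] != job)
            continue;
        if (count < max_peers)
            peers[count] = r;   /* a PMPI_Irecv from rank r would be posted here */
        count++;
    }
    return count;
}
```

With `rank2job = {0, 0, 1, 1, 1}` and a message matched from rank 3, this yields ranks 2 and 4, so the thread would post two more PMPI_Irecv calls, one per returned rank, before exiting.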