upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Rank retains previous value across checkpoints #27

Open upperwal opened 6 years ago

upperwal commented 6 years ago

Rank (or any other variable in user space [stack]) will retain its value across checkpointing, so after a restart it can still hold the value from the previous job.

Ex:

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

...
...

printf("rank: %d\n", rank);
This could print the rank the process had in the previous job, before checkpointing.

To counter this problem, either the replication map should be retained [all update bits set to 1], or the value from MPI_Comm_rank should be consumed immediately or fetched again whenever it is required.
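The "fetch again whenever required" option can be illustrated with a small simulation. This is only a sketch and does not use MPI or EntangledMPI: `runtime_rank`, `query_rank` and `simulate_restart` are hypothetical stand-ins, with `query_rank` playing the role of `MPI_Comm_rank(MPI_COMM_WORLD, &rank)` and `simulate_restart` modelling a restart in which the stack is restored unchanged while the process is assigned a different rank.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's view of this process:
 * the rank it holds in the *current* job. */
static int runtime_rank = 0;

/* Hypothetical stand-in for MPI_Comm_rank(MPI_COMM_WORLD, &rank). */
static void query_rank(int *rank)
{
    assert(rank != NULL);
    *rank = runtime_rank;
}

/* Models a restart from a checkpoint: user stack variables are left
 * exactly as they were, but the process now holds a different rank. */
static void simulate_restart(int new_rank)
{
    runtime_rank = new_rank;
}

/* Problem case: the rank is queried once and the cached stack value
 * is used after the restart, so the old rank is returned. */
int cached_rank_after_restart(int old_rank, int new_rank)
{
    int rank;
    runtime_rank = old_rank;
    query_rank(&rank);
    simulate_restart(new_rank);   /* 'rank' on the stack is untouched */
    return rank;                  /* stale: still old_rank */
}

/* Workaround: query the rank again at the point of use, so the value
 * reflects the current job rather than the checkpointed one. */
int requeried_rank_after_restart(int old_rank, int new_rank)
{
    int rank;
    runtime_rank = old_rank;
    query_rank(&rank);
    simulate_restart(new_rank);
    query_rank(&rank);            /* refresh before use */
    return rank;                  /* current: new_rank */
}
```

With an old rank of 3 and a new rank of 1, `cached_rank_after_restart(3, 1)` returns 3 while `requeried_rank_after_restart(3, 1)` returns 1. In a real program the refresh would be an extra `MPI_Comm_rank` call placed after every point where a checkpoint restore can occur.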