@dahongli I believe this issue is what you are working on
Thanks for testing the problem. This could be related to some corner case of the hybrid 2pc algorithm. I can't work on it today, but I can take a look during the weekend.
Since there aren't many ranks, can you attach gdb to each of them and print the seq_num and target_seq_num maps? I believe they are unordered maps, so you will probably need to define a print function for them and call that function from gdb.
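A minimal sketch of such a print helper follows. It is not taken from MANA: the names `seq_map_t`, `format_seq_map`, and `dump_seq_map` are hypothetical, and the key and value types (32-bit communicator id, 64-bit sequence number) are assumptions that should be adjusted to match the real declarations in seq_num.cpp.

```cpp
#include <cstdio>
#include <sstream>
#include <string>
#include <unordered_map>

// Assumed types: 32-bit communicator id mapped to a 64-bit sequence number.
using seq_map_t = std::unordered_map<unsigned int, unsigned long>;

// Build a readable dump of one map: a header line, then one line per entry.
std::string format_seq_map(const char *name, const seq_map_t &m) {
  std::ostringstream out;
  out << name << ": size=" << m.size() << '\n';
  for (const auto &kv : m) {
    out << "  comm_id=" << kv.first << " seq=" << kv.second << '\n';
  }
  return out.str();
}

// Non-inline and extern "C" so the symbol has an unmangled name that is
// easy to call from the debugger. Writes the dump to stderr.
extern "C" void dump_seq_map(const char *name, const seq_map_t *m) {
  std::fputs(format_seq_map(name, *m).c_str(), stderr);
}
```

With the helper compiled into the binary, something like `call dump_seq_map("seq_num", &seq_num)` at the gdb prompt of each rank should print the map, assuming `seq_num` is visible in the current scope.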
Rank 0:
seq_num:
elem[0]->left: $1 = 1140850688
elem[0]->right: $2 = 4421956594040832
Map size = 1
target_start_triv_barrier:
Map size = 0
target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1
Rank 1:
seq_num:
elem[0]->right: $2 = 4421952299073536
Map size = 1
target_start_triv_barrier:
Map size = 0
target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1
Rank 2:
seq_num:
elem[0]->right: $2 = 4421943709138944
Map size = 1
target_start_triv_barrier:
Map size = 0
target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1
Rank 3:
seq_num:
elem[0]->left: $1 = 1140850688
elem[0]->right: $2 = 4421939414171648
Map size = 1
target_start_triv_barrier:
Map size = 0
target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1
This data was taken from the first preSuspendBarrier after sending the checkpoint command.
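One observation on the numbers above, offered tentatively: every `right` value in seq_num is an exact multiple of 2^32, and the `left` value 1140850688 is 0x44000000, which matches the MPI_COMM_WORLD handle in MPICH-derived implementations. The sketch below splits a 64-bit value into its two 32-bit halves; the helper names are hypothetical. Whether the upper half is the real sequence counter, or the print helper read the value at the wrong width or offset, is an assumption that would need checking against the map's declared types.

```cpp
#include <cstdint>

// Upper 32 bits of a 64-bit value.
constexpr std::uint32_t high32(std::uint64_t v) {
  return static_cast<std::uint32_t>(v >> 32);
}

// Lower 32 bits of a 64-bit value.
constexpr std::uint32_t low32(std::uint64_t v) {
  return static_cast<std::uint32_t>(v & 0xffffffffu);
}

// The seq_num "right" values reported for ranks 0 to 3.
constexpr std::uint64_t reported[4] = {
    4421956594040832ULL, 4421952299073536ULL,
    4421943709138944ULL, 4421939414171648ULL};
```

Applied to the reported values, the upper halves are 1029567, 1029566, 1029564, and 1029563 for ranks 0 to 3, and every lower half is 0. If the upper halves are the actual counters, the ranks are within four collective calls of each other.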
Good afternoon!
Sorry for jumping in, but I was facing the same issue with iPIC3D. I have been reading the seq_num.cpp file and it seems to me that check_seq_nums() should be:
    int check_seq_nums() {
      unsigned int comm_id;
      unsigned int seq;
      int target_reached = 0;
      for (comm_seq_pair_t pair : seq_num) {
        comm_id = pair.first;
        seq = pair.second;
        if (target_start_triv_barrier[comm_id] < seq_num[comm_id]) {
          target_reached = 1;
          break;
        }
      }
      return target_reached;
    }
Then the following code fragment works fine.
    if (ckpt_pending && check_seq_nums()) {
      current_phase = STOP_BEFORE_CS;
      while (!freepass && ckpt_pending);
      freepass = false;
      current_phase = IN_CS;
    }
If a single comm has reached the target we stop before proceeding to the critical section. It seems that what was happening is that ranks were allowed to enter the CS even though it is not what they were meant to do. I have been running iPIC3D with this modification for about an hour and no error has been raised. Before, the error appeared quite frequently.
Best, Marc
I have tested this change with the Allgather test case I identified above, and this does not appear to be a complete fix for the issue (I am still seeing the same problem). @xuyao0127 any input?
    int check_seq_nums() {
      unsigned int comm_id;
      unsigned int seq;
      int target_reached = 0;
      for (comm_seq_pair_t pair : seq_num) {
        comm_id = pair.first;
        seq = pair.second;
        if (target_start_triv_barrier[comm_id] < seq_num[comm_id]) {
          target_reached = 1;
          break;
        }
      }
      return target_reached;
    }
This change is incorrect because the function is used to check if all communicators of a rank have reached their targets. Then the ckpt thread can share this information among other ranks to decide when to checkpoint. All communicators of all ranks need to reach their targets before getting checkpointed.
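To make the distinction concrete, here is a minimal sketch contrasting the two semantics. It is not MANA's implementation: the function names and the `seq_map_t` type are hypothetical, and treating a communicator with no target entry as having target 0 is an assumption (it mirrors what `operator[]` on the map would yield).

```cpp
#include <unordered_map>

// Assumed types: communicator id mapped to a sequence number.
using seq_map_t = std::unordered_map<unsigned int, unsigned long>;

// Look up a target; a missing entry is treated as 0 (assumption).
unsigned long target_of(const seq_map_t &target, unsigned int comm_id) {
  auto it = target.find(comm_id);
  return it == target.end() ? 0UL : it->second;
}

// "All" semantics: true only if every communicator has reached its target.
bool all_targets_reached(const seq_map_t &seq_num, const seq_map_t &target) {
  for (const auto &kv : seq_num) {
    if (kv.second < target_of(target, kv.first)) return false;
  }
  return true;
}

// "Any" semantics, as in the proposed change: true as soon as one
// communicator is strictly past its target.
bool any_target_passed(const seq_map_t &seq_num, const seq_map_t &target) {
  for (const auto &kv : seq_num) {
    if (target_of(target, kv.first) < kv.second) return true;
  }
  return false;
}
```

With one communicator past its target and another still short of it, the "any" version reports true while the "all" version reports false. A rank in that state would report itself as done even though one of its communicators is still behind.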
Calling check_seq_nums() before entering the STOP_BEFORE_CS loop is an optimization to reduce the number of free passes. It is not where check_seq_nums() is primarily used.
When attempting to checkpoint with densely grouped collective calls, the checkpointing process does not complete. Instead, the ranks are unable to progress beyond the PRESUSPEND barrier.
Based on my testing, it appears that the current_phase variable is not being set to IS_READY on every rank, which is what would allow the ranks to proceed past the PRESUSPEND barrier. More specifically, it appears that one or two ranks are entering a collective call with commit_begin, and checkpointing occurs before the other ranks reach this commit_begin. Therefore, the ranks that have entered the commit have current_phase = IN_CS, while the other ranks have current_phase = IS_READY. None of the ranks are then able to proceed, and the ranks that entered the commit early do not reach commit_finish.
I believe this logic is related to the sequence number changes to the two phase commit. @gc00 @xuyao0127 do you have any pointers on where you think the error might be?
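The stuck state described above can be written as a small predicate over the per-rank phases, which may help when checking gdb output from several ranks. This is only a diagnostic sketch: the `Phase` enum is a hypothetical stand-in that reuses the phase names mentioned in this thread, not MANA's actual type, and `is_split_state` is an invented helper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the phase values named in this thread.
enum class Phase { IS_READY, STOP_BEFORE_CS, IN_CS };

// Number of ranks observed in each of the two phases of interest.
struct PhaseCount {
  std::size_t ready = 0;
  std::size_t in_cs = 0;
};

// Count how many ranks are IS_READY and how many are IN_CS.
PhaseCount count_phases(const std::vector<Phase> &phases) {
  PhaseCount c;
  for (Phase p : phases) {
    if (p == Phase::IS_READY) ++c.ready;
    if (p == Phase::IN_CS) ++c.in_cs;
  }
  return c;
}

// True when some ranks are inside the commit (IN_CS) while others are
// still IS_READY, which is the combination reported as stuck.
bool is_split_state(const std::vector<Phase> &phases) {
  PhaseCount c = count_phases(phases);
  return c.ready > 0 && c.in_cs > 0;
}
```

For four ranks where one reports IN_CS and three report IS_READY, `is_split_state` returns true; if all four report the same phase it returns false.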
To reproduce, you can run:
python3 $MANA_ROOT/mpi-proxy-split/test/mana_test.py $MANA_ROOT/mpi-proxy-split/test/Allgather_test -i 100000000 -n 4
on Cori, then checkpoint manually.