mpickpt / mana

MANA for MPI

Unable to checkpoint: Checkpointing during dense collective calls hangs #232

Closed ghost closed 2 years ago

ghost commented 2 years ago

When attempting to checkpoint with densely grouped collective calls, the checkpointing process does not complete. Instead, the ranks are unable to progress beyond the PRESUSPEND barrier.

Coordinator:
  Host: nid00223
  Port: 7779
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE, BARRIER
1, Allgather_test.mana.exe[40000:21271]@nid00223, 658e8dad50cc029e-40000-4dabcde180831, WorkerState::PRESUSPEND, MANA-PRESUSPEND-469
2, Allgather_test.mana.exe[41000:21270]@nid00223, 658e8dad50cc029e-41000-4dabcde1743cc, WorkerState::PRESUSPEND, MANA-PRESUSPEND-469
3, Allgather_test.mana.exe[42000:21269]@nid00223, 658e8dad50cc029e-42000-4dabcde16dcce, WorkerState::PRESUSPEND, MANA-PRESUSPEND-470
4, Allgather_test.mana.exe[43000:21268]@nid00223, 658e8dad50cc029e-43000-4dabcde155b45, WorkerState::PRESUSPEND, MANA-PRESUSPEND-470

Based on my testing, it appears that current_phase is not being set to IS_READY on every rank, which is what would allow the ranks to proceed past the PRESUSPEND barrier. More specifically, one or two ranks appear to enter a collective call via commit_begin, and the checkpoint request arrives before the other ranks reach commit_begin. The ranks that have entered the commit therefore have current_phase = IN_CS, while the other ranks have current_phase = IS_READY. None of the ranks can then proceed, and the ranks that entered the commit early never reach commit_finish.
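The stuck state described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not MANA's code: the phase names come from this thread, but the enum `phase_t` and the helper functions are hypothetical, and the model assumes that ranks reporting IS_READY are held for the checkpoint while ranks in IN_CS are blocked inside a collective that needs every rank.

```cpp
#include <algorithm>
#include <vector>

// Phase names are taken from the thread; this enum is a simplified stand-in.
enum phase_t { IS_READY, STOP_BEFORE_CS, IN_CS };

// Assumption: the checkpoint can proceed only when every rank reports IS_READY.
bool all_ready(const std::vector<phase_t> &ranks) {
  return std::all_of(ranks.begin(), ranks.end(),
                     [](phase_t p) { return p == IS_READY; });
}

// Assumption: a collective can finish only when every rank has entered it.
bool collective_can_finish(const std::vector<phase_t> &ranks) {
  return std::all_of(ranks.begin(), ranks.end(),
                     [](phase_t p) { return p == IN_CS; });
}

// Stuck: neither the checkpoint nor the collective can make progress.
bool is_stuck(const std::vector<phase_t> &ranks) {
  return !all_ready(ranks) && !collective_can_finish(ranks);
}
```

With two ranks in IN_CS and two in IS_READY, as reported here, `is_stuck` returns true: the checkpoint waits for the ranks inside the collective, and the collective waits for the ranks held at the barrier.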

I believe this logic is related to the sequence-number changes to the two-phase commit. @gc00 @xuyao0127 do you have any pointers on where you think the error might be?

To reproduce, you can run: python3 $MANA_ROOT/mpi-proxy-split/test/mana_test.py $MANA_ROOT/mpi-proxy-split/test/Allgather_test -i 100000000 -n 4 on Cori, then checkpoint manually.

ghost commented 2 years ago

@dahongli I believe this issue is what you are working on

xuyao0127 commented 2 years ago

Thanks for testing the problem. This could be related to some corner case of the hybrid 2pc algorithm. I can't work on it today, but I can take a look during the weekend.

Since there aren't many ranks, can you attach gdb to each rank and print the seq_num and target_seq_num maps? I believe they are unordered maps, so you will probably need to define a print function for them and call that function from gdb.
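A minimal sketch of such a print function follows. It assumes the maps are `std::unordered_map<unsigned int, unsigned long>`; the actual key and value types in seq_num.cpp may differ. The names `seq_map_t`, `format_seq_map`, and `dump_seq_map` are hypothetical and chosen only for this example.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>

// Assumed map type; the real types in seq_num.cpp may differ.
using seq_map_t = std::unordered_map<unsigned int, unsigned long>;

// Hypothetical helper: format one map as text, one "comm -> seq" pair per line,
// followed by the map size.
std::string format_seq_map(const seq_map_t &m) {
  std::string out;
  char buf[96];
  for (const auto &kv : m) {
    std::snprintf(buf, sizeof(buf), "comm 0x%x -> seq %lu\n",
                  kv.first, kv.second);
    out += buf;
  }
  std::snprintf(buf, sizeof(buf), "Map size = %zu\n", m.size());
  out += buf;
  return out;
}

// Takes a pointer so that it is easy to invoke from a debugger.
void dump_seq_map(const seq_map_t *m) {
  std::fputs(format_seq_map(*m).c_str(), stderr);
}
```

After attaching gdb to a rank, something like `call dump_seq_map(&seq_num)` should then print the map, provided the helper is compiled into the binary and has not been optimized away.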

ghost commented 2 years ago

Rank 0:

seq_num:
elem[0]->left: $1 = 1140850688
elem[0]->right: $2 = 4421956594040832
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

Rank 1:

seq_num:
elem[0]->right: $2 = 4421952299073536
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

Rank 2:

seq_num:
elem[0]->right: $2 = 4421943709138944
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

Rank 3:

seq_num:
elem[0]->left: $1 = 1140850688
elem[0]->right: $2 = 4421939414171648
Map size = 1

target_start_triv_barrier:
Map size = 0

target_stop_triv_barrier:
elem[0]->left: $3 = 1140850688
elem[0]->right: $4 = 0
Map size = 1

This data was taken at the first preSuspendBarrier after sending the checkpoint command.

Marc-Miranda commented 2 years ago

Good afternoon!

Sorry to jump in, but I was facing the same issue with iPIC3D. I have been reading the seq_num.cpp file, and it seems to me that check_seq_nums() should be:

int check_seq_nums() {
  unsigned int comm_id;
  unsigned int seq;
  int target_reached = 0;
  for (comm_seq_pair_t pair : seq_num) {
    comm_id = pair.first;
    seq = pair.second;
    if (target_start_triv_barrier[comm_id] < seq_num[comm_id]) {
      target_reached = 1;
      break;
    }
  }
  return target_reached;
}

Then the following code fragment works fine.

if (ckpt_pending && check_seq_nums()) {
      current_phase = STOP_BEFORE_CS;
      while (!freepass && ckpt_pending);
      freepass = false;
      current_phase = IN_CS;
}

If any single communicator has reached its target, the rank stops before proceeding to the critical section. It seems that ranks were being allowed to enter the CS when they should not have been. I have been running iPIC3D with this modification for about an hour and no error has been raised. Before, the error appeared quite frequently.

Best, Marc

ghost commented 2 years ago

I have tested this change with the Allgather test case I identified above, and this does not appear to be a complete fix for the issue (I am still seeing the same problem). @xuyao0127 any input?

xuyao0127 commented 2 years ago

int check_seq_nums() {
  unsigned int comm_id;
  unsigned int seq;
  int target_reached = 0;
  for (comm_seq_pair_t pair : seq_num) {
    comm_id = pair.first;
    seq = pair.second;
    if (target_start_triv_barrier[comm_id] < seq_num[comm_id]) {
      target_reached = 1;
      break;
    }
  }
  return target_reached;
}

This change is incorrect because the function is used to check whether all communicators of a rank have reached their targets. The ckpt thread then shares this information with the other ranks to decide when to checkpoint. All communicators of all ranks need to reach their targets before the checkpoint is taken.
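The difference between the two semantics can be sketched as follows. This is an illustration under assumptions, not the code in seq_num.cpp: it assumes a communicator has reached its target when its current sequence number is at least the target, and that a communicator with no recorded target counts as target 0. The names `seq_map_t`, `comm_reached`, `any_comm_reached`, and `all_comms_reached` are hypothetical.

```cpp
#include <unordered_map>

// Assumed map type; the real types in seq_num.cpp may differ.
using seq_map_t = std::unordered_map<unsigned int, unsigned long>;

// Assumption: "reached" means current sequence number >= target,
// and a missing target entry is treated as 0.
static bool comm_reached(unsigned int comm_id, unsigned long seq,
                         const seq_map_t &targets) {
  auto it = targets.find(comm_id);
  unsigned long target = (it == targets.end()) ? 0 : it->second;
  return seq >= target;
}

// "Any" semantics, as in the proposed change: true once one communicator
// has reached its target.
bool any_comm_reached(const seq_map_t &seqs, const seq_map_t &targets) {
  for (const auto &kv : seqs)
    if (comm_reached(kv.first, kv.second, targets)) return true;
  return false;
}

// "All" semantics, as described in this comment: true only when every
// communicator of the rank has reached its target.
bool all_comms_reached(const seq_map_t &seqs, const seq_map_t &targets) {
  for (const auto &kv : seqs)
    if (!comm_reached(kv.first, kv.second, targets)) return false;
  return true;
}
```

For a rank with two communicators where only one has reached its target, `any_comm_reached` returns true while `all_comms_reached` returns false, so the two versions would report different readiness to the other ranks.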

Calling check_seq_nums() before entering the STOP_BEFORE_CS loop is an optimization to reduce the number of free passes. It is not the primary use of check_seq_nums().

ghost commented 2 years ago

https://github.com/mpickpt/mana/pull/233