Closed gc00 closed 1 year ago
@gc00, can the coordinator put the same unique string as a suffix to the .mana
file that DMTCP adds to the checkpoint images?
@leonidbelyaev , Thanks! (Doing too many things at once. :-( ) It's fixed now.
@JainTwinkle wrote:
Can the coordinator put the same unique string as a suffix to the .mana file that DMTCP adds to the checkpoint images?
I don't think that works. The coordinator creates the .mana file first. And then we call mana_launch, which reads it. There is no ckpt image yet. But maybe I misunderstood your comment.
I see your point. The coordinator is started first and does not have information about the unique string that will be a part of the checkpoint image's name in the future. In that case, your current design is fine.
It says that I closed this PR on Sept. 6. If so, it was an accident. I'm re-opening it now.
This adds the SLURM_JOBID to the bin/mana* scripts. This allows multiple SLURM jobs to run safely.
Otherwise, two jobs could both call bin/mana_coordinator simultaneously, overwriting the .mana file with a single file. This later causes two independent jobs to use a single .mana file when running bin/mana_launch or bin/mana_restart.
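For readers following along, here is a minimal sketch of the idea, not the code from this PR. The helper name mana_file_for_job, the directory argument, and the ".mana-<jobid>" naming are all assumptions made for illustration; the only thing taken from the description above is that SLURM_JOBID is used to keep jobs apart.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): print the path of the .mana file
# that the current job should use.
#   $1 = directory holding the file (assumed; the real location may differ)
mana_file_for_job() {
  dir=$1
  if [ -n "${SLURM_JOBID:-}" ]; then
    # Inside a SLURM job: append the job ID so each job gets its own file.
    # The "-<jobid>" suffix format is an assumption for this sketch.
    printf '%s/.mana-%s\n' "$dir" "$SLURM_JOBID"
  else
    # Outside SLURM: fall back to the single shared file name.
    printf '%s/.mana\n' "$dir"
  fi
}
```

If mana_coordinator, mana_launch, and mana_restart all derive the path this way, two jobs with different SLURM_JOBID values read and write different files, so neither can overwrite the other's coordinator information.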