mpickpt / mana

MANA for MPI

bin/mana_*: Use $HOME/.mana-slurm-$SLURM_JOB_ID.rc #358

Closed gc00 closed 1 year ago

gc00 commented 1 year ago

This adds the SLURM job ID ($SLURM_JOB_ID) to the name of the rc file used by the bin/mana_* scripts, so each job uses its own $HOME/.mana-slurm-$SLURM_JOB_ID.rc. This allows multiple SLURM jobs to run safely at the same time.

Otherwise, two jobs could both call bin/mana_coordinator simultaneously, and one would overwrite the other's .mana file. The two independent jobs would then share a single .mana file when running bin/mana_launch or bin/mana_restart.
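
To illustrate the naming scheme, here is a minimal sketch of how a per-job rc path could be built. The helper name `mana_rc_path` and the job IDs 1001 and 1002 are made up for illustration and are not taken from the actual bin/mana_* scripts; only the path pattern comes from the PR title.

```shell
#!/bin/sh
# Hypothetical helper, not the real MANA code: builds the per-job rc file
# path following the pattern $HOME/.mana-slurm-$SLURM_JOB_ID.rc.
mana_rc_path() {
  # Usage: mana_rc_path <slurm_job_id>
  printf '%s/.mana-slurm-%s.rc\n' "$HOME" "$1"
}

# Two independent jobs (illustrative job IDs) resolve to different files,
# so neither coordinator can overwrite the other's settings.
rc_a=$(mana_rc_path 1001)
rc_b=$(mana_rc_path 1002)
echo "$rc_a"
echo "$rc_b"
```

Inside a real job script the argument would presumably be "$SLURM_JOB_ID", which Slurm sets in the job's environment.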

JainTwinkle commented 1 year ago

@gc00, Can the coordinator put the same unique string as a suffix to the .mana file that DMTCP adds to the checkpoint images?

gc00 commented 1 year ago

@leonidbelyaev, Thanks! (Doing too many things at once. :-( ) It's fixed now.

gc00 commented 1 year ago

@JainTwinkle wrote:

Can the coordinator put the same unique string as a suffix to the .mana file that DMTCP adds to the checkpoint images?

I don't think that works. The coordinator creates the .mana file first. And then we call mana_launch, which reads it. There is no ckpt image yet. But maybe I misunderstood your comment.
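
The ordering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the keys `coord_host` and `coord_port`, the values `node001` and `7779`, and the fallback job ID are all hypothetical, and a temporary directory stands in for $HOME so that no real file is touched. The real contents of the .mana file may differ.

```shell
#!/bin/sh
# Sketch only: file contents and variable names are assumptions.
SLURM_JOB_ID=${SLURM_JOB_ID:-12345}   # placeholder when run outside Slurm
demo_home=$(mktemp -d)                # stand-in for $HOME
rc_file="$demo_home/.mana-slurm-$SLURM_JOB_ID.rc"

# Step 1 (coordinator side): write the rc file. No checkpoint image exists
# yet, so the file name can only depend on values known now, such as the
# job ID.
printf 'coord_host=%s\ncoord_port=%s\n' "node001" "7779" > "$rc_file"

# Step 2 (launch side): rebuild the same name from the same job ID and
# read the settings back.
coord_host=$(sed -n 's/^coord_host=//p' "$rc_file")
coord_port=$(sed -n 's/^coord_port=//p' "$rc_file")
echo "launch would connect to $coord_host:$coord_port"

rm -rf "$demo_home"                   # clean up the temporary directory
```

The point of the sketch is that both steps derive the file name from information available before any checkpoint is taken, which a checkpoint-image suffix would not be.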

JainTwinkle commented 1 year ago

I see your point. The coordinator is started first and does not have information about the unique string that will be a part of the checkpoint image's name in the future. In that case, your current design is fine.

gc00 commented 1 year ago

It says that I closed this PR on Sept. 6. If so, it was an accident. I'm re-opening it now.