mpickpt / mana

MANA for MPI

[REDONE]: Simplified mtcp-restart plugin interface. #371

Closed: karya0 closed this 1 year ago

karya0 commented 1 year ago

This uses the newly proposed environment variable-based mechanism.

karya0 commented 1 year ago

Also, can you try make clean and make?

gc00 commented 1 year ago

I am ensuring that MANA_RestartDir is being set.

I'm looking right now at:

setenv(key.c_str(), ckptImages[i].c_str(), 1);

in dmtcp_restart_plugin.cpp

I think that the problem is there. But I'm still testing.
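One way to confirm that a variable such as MANA_RestartDir actually reached a process is to read it back with std::getenv. The following is only a minimal sketch: the helper names envIsSet and envOr are hypothetical and are not taken from MANA or DMTCP.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: true when `name` is set to a non-empty value.
bool envIsSet(const char *name)
{
  const char *value = std::getenv(name);
  return value != nullptr && value[0] != '\0';
}

// Hypothetical helper: the value of `name`, or `fallback` when it is unset.
std::string envOr(const char *name, const std::string &fallback)
{
  const char *value = std::getenv(name);
  return value != nullptr ? std::string(value) : fallback;
}
```

Calling envIsSet("MANA_RestartDir") just before the restart step would distinguish "never set" from "set but ignored later".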

gc00 commented 1 year ago

Yep. That's the root cause. In:

for (size_t i = 0; i < ckptImages.size(); i++)

I'm seeing: ckptImages.size() == 0
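To illustrate why an empty ckptImages is silent: the loop body, and therefore setenv, never runs, so nothing is exported and no error is reported. The sketch below is not the code from dmtcp_restart_plugin.cpp. The function name exportCkptImages and the key prefix EXAMPLE_CKPT_IMAGE_ are assumptions made for illustration, and setenv is the POSIX call. Returning the number of exported variables lets the caller detect the empty case.

```cpp
#include <cstddef>
#include <stdlib.h>   // POSIX setenv
#include <string>
#include <vector>

// Hypothetical sketch: export each checkpoint image path as an environment
// variable named <keyPrefix><index>. Returns how many variables were set,
// so a result of 0 signals that ckptImages was empty (or every setenv failed).
std::size_t exportCkptImages(const std::vector<std::string> &ckptImages,
                             const std::string &keyPrefix = "EXAMPLE_CKPT_IMAGE_")
{
  std::size_t exported = 0;
  for (std::size_t i = 0; i < ckptImages.size(); i++) {
    const std::string key = keyPrefix + std::to_string(i);
    // The last argument (1) overwrites any existing value for this key.
    if (setenv(key.c_str(), ckptImages[i].c_str(), 1) == 0) {
      exported++;
    }
  }
  return exported;
}
```

A caller could treat a return value of 0 as a fatal error instead of continuing into restart with no image paths in the environment.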

gc00 commented 1 year ago

@karya0, okay. I fixed some more bugs in the original PR and pushed the fixes as a second commit.

On the DMTCP side, I've also reverted the 'Set "stale timeout" (secs), when no peer processes' commit. I needed to revert that commit, or else we don't get a checkpoint image after launching with mana_launch -i5 ... (And if you revert it, don't run git submodule update after that, or it will undo the revert.)

On that basis, the code now seems to work. But I will revert the dev/gdc0/simplifyCopyBits branch that we're using for testing. The code in this PR is still too new.

gc00 commented 1 year ago

Jenkins reports:

15:15:17 + git submodule update --init
15:15:17 Submodule 'dmtcp' (https://github.com/dmtcp/dmtcp) registered for path 'dmtcp'
15:15:17 Cloning into 'dmtcp'...
15:15:22 fatal: reference is not a tree: afc5b3c78594f0f12ece4b65d5e6eeb65f8591a0
15:15:22 Unable to checkout 'afc5b3c78594f0f12ece4b65d5e6eeb65f8591a0' in submodule path 'dmtcp'
15:15:22 + ./configure

Which branch is this using for DMTCP? Could it be that it's using a branch that's not part of the origin repo for DMTCP?

karya0 commented 1 year ago

This branch and PR are irrelevant now. Closing.