mpickpt / mana

MANA for MPI
35 stars 24 forks source link

Fix second restart #357

Open gc00 opened 1 year ago

gc00 commented 1 year ago

_The right solution for the second commit is to modify DMTCP so that it will automatically munmap mtcprestart for its text/data/heap memory after restart. Then MANA doesn't have to worry about it. Until then, this PR should solve the problem, and allow MANA to handle ckpt-restart-ckpt-restart.

This fixes two one bug that occurred in ckpt-restart-ckpt. Please read carefully the commit messages of the two commits (one to fix the bug, and one to print debugging information during MANA ckpt.)

  1. In DMTCP, jalib::XToString seems to have a bug in it. On a second ckpt for wave_mpi.mana.exe, we're saving the address of maxLibsEnd, instead of its value. (See below.)
  2. In MANA, we save the mtcp_restart in the checkpoint image, even though DMTCP can avoid doing this. Then, on the second restart, we try to restore mtcp_restart from the ckpt image on top of the current mtcp_restart, and it fails. This happens because mtcp_restart is always loaded starting at 0x11200000

[ I was wrong about Point 1. Working too late at night. I read GDB wrong, and wrongly attributed the issues to C++. :-( ]

@karya0 , I think this intersect with the design of DMTCP. Could it be that normal DMTCP removes mtcp_restart just fine before a second ckpt-restart, but maybe the restart-plugin feature fails to remove mtcp_restart? Could you check this? It seems to me that the last commit should be something automatically done by DMTCP, and should not rely on MANA. Thanks.

karya0 commented 1 year ago

@gc00: I am not seeing the issue here:

string maxLibsEndStr = jalib::XToString(maxLibsEnd);
(gdb) p/x maxLibsEnd
$18 = 0x7fa83c8e4000
(gdb) p/x &maxLibsEnd
$19 = 0x7fa639421f68
(gdb) ptype maxLibsEnd
type = void *
(gdb) p &maxLibsEndStr.c_str()
$21 = 0x7fa639421c50 "0x7fa83c8e4000"

Isn't maxLibsEndStr supposed to have the "value" of maxLibsEnd variable, which is 0x7fa83c8e4000? What do you expect it to contain instead?

gc00 commented 1 year ago

@karya0 , Maybe I'm wrong about point #1. I'll check my work again in the morning. But when I review it now, I think probably there is no bug in jalib::XToString. I'll test again in the morning.

So, that leaves only the second point. Normally, DMTCP does not save mtcp_restart in its ckpt image. But for MANA on second ckpt-restart, it seems to have saved mtcp_restart in that second ckpt image. I'm not yet sure why this is happening.

karya0 commented 1 year ago

2 is happening because MANA restart plugin is asking mtcp_restart to skip unmapping this region. MANA doesn't mark it as lower half region during ckpt so DMTCP saves it as a regular memory region. This is the issue I talked to you about on the phone the other day.

gc00 commented 1 year ago

@karya0 wrote:

2 is happening because MANA restart plugin is asking mtcp_restart to skip unmapping this region. MANA doesn't mark it as lower half region during ckpt so DMTCP saves it as a regular memory region.

Thanks. What you say is true. But it's a corollary of a wrong design decision in designing the MTCP restart plugin. Essentially, DMTCP knows where its hole is, and MANA does not. I cannot think of any circumstance where MANA would want to save and restore the mtcp_restart application that is owned by DMTCP. If we agree that MANA never wants to restore this, we should not ask MANA to reverse-engineer to find the DMTCP "hole" and unmap it without the help of DMTCP.

I'll explain here in more detail, since this has already consumed a lot of my time in analyzing it.

If there is no restart plugin, then when DMTCP will automatically unmap the mtcp_restart memory regions. In DMTCP, I can look at the generated ckpt images during first and second ckpt. Those images to not have an mtcp_restart region.

Shouldn't DMTCP follow exactly the same policy when the user has a restart plugin? If a memory region is part of the official DMTCP "hole", then DMTCP should munmap its "hole". Right now, we're expecting MANA to discover where is the DMTCP "hole" (or to look at procmaps to discover the mtcp_restart text/data/heap), and then set mtcp_plugin_skip_memory_region_munmap to detect this special case, and return '0'.

In the newest commit, I am using the value 0x112000000 that mtcp/Makefile defines as the beginning of the hole. And I munmap the hole inside MANA. But DMTCP already knows where the hole is. We shouldn't depend on MANA to reverse-engineer where the DMTCP hole is.

For background: Inside restart_plugin, MANA creates the lower half and runs MPI_Init before restoring the upper half. Our design requires this. Therefore, after the lower half is created, the mtcp_restart program now contains: (a) text/data/heap of mtcp_restart (b) the lower-half created by lh_proxy (c) new memory regions created by MPI_Init (and sometimes MPI_Init uses ioctl to create them, so that we can't interpose).

Now, we're asking the restart_plugin to distinguish between the text/data/heap of mtcp_restart (not created by MANA) and the extra ioctl memory regions of MPI_Init (not created by MANA).

The easiest way for MANA to distinguish the two is to reverse-engineer where DMTCP placed its hole, and assume that this is mtcp_restart. The harder way is to reverse-engineer MPI_Init, and guess where it did an ioctl. But that type of code would be very fragile as MPI_Init evolves.

CONCLUSION: By far the cleanest thing to do is for DMTCP to detect its own "hole" and munmap it. Asking MANA to use mtcp_plugin_skip_memory_region_munmap to declare where is the DMTCP "hole" is fragile, since it depends on the 0x11200000 fixed address. And asking MANA to adapt every time DMTCP changes its policy on the "hole" makes no sense. (Of course, DMTCP could add an interface to tell MANA where the DMTCP "hole" is, but that would be really weird.)

Anyway, sorry for the very long comment here. But this is why I'd like DMTCP to unmap its own "hole", and not ask MANA to do this. If the rationale is still not clear, let's talk on the phone. Thanks.

gc00 commented 1 year ago

@karya0 , Thanks for the phone conversation, where we agreed that DMTCP can take responsibility for doing munmap on the text/data segments of mtcp_restart after the restart begins.

Since then, I've discovered that when mtcp_restart begins at 'main', it has a text/data/heap segment. The call to mtcp_plugin_hook() does not extend the heap. So, the heap is part of mtcp_restart, and should be unmapped -- to avoid an address conflict on a second restart.

gc00 commented 1 year ago

When PR #362 is pushed in, it will make this PR unnecessary, and we can then close it.