openhpc / submissions

OpenHPC Component Submissions Project

DMTCP #26

Closed shlomiya closed 6 years ago

shlomiya commented 6 years ago

Software Name

DMTCP


Public URL

http://dmtcp.sourceforge.net


Technical Overview

DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints a single-host or distributed computation in user-space -- with no modifications to user code or to the O/S. It works on most Linux applications. This checkpointing support adds new functionality. Some MPI implementations (e.g., MVAPICH) include an MPI-specific checkpoint-restart service that disconnects the network, uses BLCR for single-host checkpointing, and then re-connects the network. In contrast, DMTCP provides distributed checkpointing at the lower POSIX system call layer, and is independent of the particular MPI implementation being used.


Latest stable version number

DMTCP version 2.5.2


Open-source license type

LGPL-3.0


Relationship to component?

If other, please describe:


Build system

If other, please describe:

Does the current build system support staged path installations? For example: make install DESTDIR=/tmp/foo (or equivalent)


Does component run in user space or are administrative credentials required?


Does component require post-installation configuration?

If yes, please describe briefly:


If component is selected, are you willing and able to collaborate with OpenHPC maintainers during the integration process?


Does the component include test collateral (e.g. regression/verification tests) in the publicly shipped source?

If yes, please briefly describe the intent and location of the tests.

The tests are in TOP_LEVEL/test. To run the full suite, do:

    $ make -j check  

To run a particular test (e.g. the frisbee test), try

    $ make -j check-frisbee

The test invokes:

    $ (cd test && ./autotest.py frisbee)

At a still lower level, one can do from top-level:

    $ make tidy
    $ dmtcp_launch -i5 test/frisbee # checkpoint at interval of 5 seconds
    $ ^C  # press Ctrl-C to kill the running process
    $ dmtcp_restart ckpt_*.dmtcp
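
The manual launch, kill, and restart steps above can also be scripted. The following is only a minimal sketch in Python, not part of DMTCP: the helper names (`launch_command`, `find_checkpoints`, `restart_command`) are hypothetical, and it assumes the checkpoint images are written to the working directory as `ckpt_*.dmtcp`, as in the commands above. It only builds the command lines and does not run anything by itself.

```python
import glob
import os


def launch_command(program, interval_s):
    """Build the argv for running `program` under dmtcp_launch.

    Mirrors the `dmtcp_launch -i5 test/frisbee` form shown above, where
    -i<N> requests a checkpoint every N seconds.
    """
    if interval_s <= 0:
        raise ValueError("checkpoint interval must be positive")
    return ["dmtcp_launch", f"-i{interval_s}", program]


def find_checkpoints(directory="."):
    """Return the ckpt_*.dmtcp files found in `directory`, sorted by name.

    Assumption: images land in `directory`; adjust if your setup writes
    them elsewhere.
    """
    return sorted(glob.glob(os.path.join(directory, "ckpt_*.dmtcp")))


def restart_command(checkpoints):
    """Build the argv for dmtcp_restart from a list of checkpoint images."""
    if not checkpoints:
        raise ValueError("no checkpoint images to restart from")
    return ["dmtcp_restart", *checkpoints]
```

Each returned list can be passed to `subprocess.run`, for example `subprocess.run(launch_command("test/frisbee", 5))`, followed after an interruption by `subprocess.run(restart_command(find_checkpoints()))`.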

Does the component have additional software dependencies (beyond compilers/MPI) that are not part of standard Linux distributions?

If yes, please list the dependencies and associated licenses.


Does the component include online or installable documentation?

If available online, please provide URL.

http://dmtcp.sourceforge.net/FAQ.html
Also, a DMTCP install will install various man pages for dmtcp_launch, dmtcp_restart, etc.
Most development occurs on:
https://github.com/dmtcp/dmtcp
DMTCP home page:
http://dmtcp.github.io/index.html


[Optional]: Would you like to receive additional review feedback by email?

- [x] yes - [ ] no

shlomiya commented 6 years ago

This is the dmtcp.spec file (renamed here to dmtcp.spec.txt) that was used in the latest Fedora release. Note that DMTCP has since advanced its stable version to 2.5.2.

dmtcp.spec

adrianreber commented 6 years ago

Is there a special reason the package is not part of EPEL-7? I see it is available for EPEL-6 but not for 7. Just curious, as that is one of the platforms OpenHPC is targeting and exactly that platform is missing.

shlomiya commented 6 years ago

We were quite loaded and didn't get to it, and there wasn't an urgent need for it. If this is required for this project then we can push for an EPEL-7 version.

adrianreber commented 6 years ago

It is not a requirement. I was just curious as DMTCP seems to have no dependency on any OpenHPC provided packages and getting it into EPEL-7 would open it up to many more users. Especially as it is already part of EPEL-6 it seems like a small step to get it into EPEL-7.

Anyway, I will have a look at the spec file and provide some feedback.

adrianreber commented 6 years ago

I am still unsure if it makes sense for DMTCP to be part of OpenHPC but the spec file needs a few changes:

OpenHPC packages usually do not differentiate between the main package and a devel package.

With those changes the spec file would match the existing spec files and be ready for inclusion.

shlomiya commented 6 years ago

I've updated the spec file as you've suggested. Regarding the necessary module files, I'm not sure what's missing from the spec file.

dmtcp.spec.txt

Regarding the question of whether it makes sense for DMTCP to be part of OpenHPC: we would have thought that would be obvious. The rationale is that many HPC jobs run for long periods and need the ability to transparently checkpoint and restart in case of a crash. DMTCP transparently supports MPI as well as single-host jobs. It's robust and highly scalable. See "System-level Scalable Checkpoint-Restart for Petascale Computing" (ICPADS'16), which shows scalability to 32,000 cores.
DMTCP already has many users in HPC. If this is not convincing, could you let us know what issues you see regarding DMTCP's support for HPC? Thanks!

adrianreber commented 6 years ago

Thanks for the spec file, but it fails in %prep during %setup

As DMTCP has no OpenHPC dependency and already has an EPEL-6 branch, I think it would potentially reach more people by having an EPEL-7 branch. This is not about DMTCP's functionality; I think it would just get a bigger possible user base by being part of EPEL-7. I am just trying to understand why OpenHPC instead of EPEL-7. But that is of course totally up to you.

Did you try to build an RPM with the attached spec file?

sidio47 commented 6 years ago

DMTCP is currently in the CentOS 7 EPEL Testing repository: https://centos.pkgs.org/7/epel-testing-x86_64/dmtcp-2.5.2-1.el7.x86_64.rpm.html

adrianreber commented 6 years ago

Nice. I am adding a link to the actual update request:

https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-9dce83b52a

So, what does that mean for this submission?

adrianreber commented 6 years ago

@shlomiya do you still want to submit it for OpenHPC now that it is available in EPEL-7 stable branch?

shlomiya commented 6 years ago

After your previous comments, we decided to make an effort to push it to EPEL-7. I'm a bit unclear regarding the relation between OpenHPC and EPEL-7. Is OpenHPC taking all the packages from EPEL-7, so that DMTCP will be included automatically? If not, we definitely still want to submit it to OpenHPC, so please let me know what changes are required relative to the EPEL-7 package.

adrianreber commented 6 years ago

Some OpenHPC packages already depend on EPEL at runtime, and a very small number of packages (one, if I remember correctly) depend on EPEL at build time. So in practice I expect that almost all OpenHPC deployments based on CentOS also have EPEL enabled. Looking at one of the install recipes (https://github.com/openhpc/ohpc/releases/download/v1.3.3.GA/Install_guide-CentOS7-Warewulf-SLURM-1.3.3-x86_64.pdf), it actually says:

"The public EPEL repository will be enabled automatically upon installation of the ohpc-release package."

So, yes, for CentOS based deployments EPEL packages are available.

koomie commented 6 years ago

@shlomiya, as Adrian mentioned, we already rely on EPEL as a dependency for installs on CentOS. So, now that you have successfully landed DMTCP in EPEL-7, that should extend exposure to OpenHPC as well. Consequently, the TSC review panel did not see a need to duplicate the effort, given that DMTCP is not dependent on any specific MPI family.