open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Dmtcp Launch with Mpi example throws segmentation fault 11 #9205

Open · Wosch96 opened this issue 2 years ago

Wosch96 commented 2 years ago

Hello everyone,

I'm trying to use DMTCP on my VM cluster (CentOS 7) and want to run an example with MPI. I should say that I'm only running the yum package versions of DMTCP (2.6.1, if I'm not mistaken) and Open MPI (1.10.7). The cluster consists of 4 compute nodes plus a master node, connected via SSH. The example I'm trying to run is this one: https://cvw.cac.cornell.edu/Checkpoint/dmtcpmpicount. When I run it with plain MPI, the output looks right:

mpirun -n 16 -hostfile hostfile /mpi_count

(mpi_count is built from https://raw.githubusercontent.com/cornellcac/CR-demos/master/demos/MPI/mpi_count.c)

Message from process 1 on host node1: Hello, world
Message from process 2 on host node1: Hello, world
Message from process 3 on host node1: Hello, world
Message from process 6 on host node2: Hello, world
Message from process 7 on host node2: Hello, world
Message from process 4 on host node2: Hello, world
Message from process 5 on host node2: Hello, world
Message from process 8 on host node3: Hello, world
Message from process 9 on host node3: Hello, world
Message from process 10 on host node3: Hello, world
Message from process 12 on host node4: Hello, world
Message from process 11 on host node3: Hello, world
Message from process 0 on host node1: Hello, world
Message from process 13 on host node4: Hello, world
Message from process 14 on host node4: Hello, world
Message from process 15 on host node4: Hello, world
3:0 2:0 10:0 8:0 9:0 11:0 5:0 12:0 14:0 7:0 4:0 13:0 1:0 6:0 0:0 15:0 3:1 2:1 5:1 12:1 14:1 10:1 8:1 11:1
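Roughly, mpi_count.c is an MPI hello-world where each rank then prints a running counter. A simplified sketch of that pattern (not the exact file; the real program also passes the greeting between ranks with MPI_Send/MPI_Recv, which I've left out here):

```c
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int rank, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &hostlen);

    printf("Message from process %d on host %s: Hello, world\n", rank, host);

    /* Each rank prints "rank:iteration" (e.g., once per second), giving a
     * long-running job that a checkpoint can interrupt and resume. */
    for (int i = 0; i < 300; i++) {
        printf("%d:%d\n", rank, i);
        fflush(stdout);
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}
```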

But when I try to checkpoint the program with the dmtcp command (specifically doing this with only one node for the test):

dmtcp_launch --rm mpiexec -n 1 mpi_count

I get this segmentation fault 11 error:

mpiexec noticed that process rank 0 with PID 79000 on node master exited on signal 11 (Segmentation fault).

Is there a bug in the mpi_count.c file, or in my usage of MPI? I've just started with MPI, so any help is appreciated.

Thank you.

jsquyres commented 2 years ago

I'm afraid that the DMTCP support in Open MPI has long since been removed; the 1.10.x series is very old and, unfortunately, unsupported at this point.

With a quick glance, I don't see any obvious errors in your mpi_count.c app.

awlauria commented 2 years ago

Is there a core file generated? That might provide some clue.

Wosch96 commented 2 years ago

@awlauria The core file points at a failure in calloc(). The backtrace looks like this:

Program terminated with signal 11, Segmentation fault.
Reading symbols from /var/nfs/dmtcp_mpi/mpi_count...done.
[New LWP 1994]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./mpi_count'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007ffff62f1154 in calloc () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 numactl-libs-2.0.12-5.el7.x86_64 openmpi-1.10.7-5.el7.x86_64

So the segmentation fault happens inside calloc(). Any ideas?

jsquyres commented 2 years ago

A segv in malloc/calloc/etc. almost always means that memory corruption occurred before the specific malloc/calloc/etc. call in question. I.e., malloc/calloc/etc. tried to use some of its bookkeeping data in memory, but that memory had been corrupted, so it behaved unpredictably (e.g., it followed a pointer whose value had been corrupted and therefore accessed a nonsense address, which resulted in the segv).
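A contrived example of that failure mode (nothing to do with mpi_count.c; everything here is made up): the invalid write happens in user code, but the crash or abort only shows up later, inside the allocator.

```c
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *a = malloc(16);
    char *b = malloc(16);

    /* Bug: writes 64 bytes into a 16-byte block, trampling the
     * allocator's bookkeeping for the neighboring allocation. */
    memset(a, 'A', 64);

    /* The damage only surfaces here, inside the allocator itself:
     * depending on the allocator this typically aborts or segfaults,
     * even though the faulty write happened earlier in user code. */
    free(b);
    char *c = calloc(1, 32);

    free(c);
    free(a);
    return 0;
}
```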

awlauria commented 2 years ago

I do notice you use strcpy() to copy in the "Hello, world" string. Can you change that to a safer call, like snprintf(), which will append a null terminator? Your send message isn't guaranteed to be null-terminated since you didn't initialize it to all 0's.
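Something like this (assuming the buffer is declared along the lines of char message[100]; the actual name and size in your file may differ):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    char message[100];   /* assumed name/size */

    /* Instead of: strcpy(message, "Hello, world"); */

    /* Zero the buffer so any trailing bytes are deterministic, then use
     * snprintf, which bounds the write to the buffer size and always
     * NUL-terminates within that size. */
    memset(message, 0, sizeof(message));
    snprintf(message, sizeof(message), "Hello, world");

    printf("%s\n", message);
    return 0;
}
```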

Unlikely to be the cause, but something to verify.

Wosch96 commented 2 years ago

I tried snprintf(), but sadly no change. Meanwhile, I'm also trying my luck on the DMTCP GitHub page; they might have some information that helps me out.

But thanks for the answers.

awlauria commented 2 years ago

Can you run with valgrind?

dmtcp_launch --rm mpiexec -n 1 valgrind mpi_count

That will point you to any heap corruption.

Wosch96 commented 2 years ago

No valgrind problems. Anyway, can someone recommend an MPI implementation that supports checkpointing? It seems Open MPI is cutting all ties with checkpointing. I'd appreciate any recommendations. I'm trying to run MPI with checkpointing and maybe also containerization.

Thank you.

bosilca commented 2 years ago

DMTCP does not need special support in the MPI layer for its checkpointing (at least over IB and sockets). So instead of using an ancient version of OMPI, I would try again with the master branch.