upperwal / EntangledMPI

Fault Tolerance framework for High Performance Computing [Supports ULFM, replication and checkpointing]
MIT License

Replication not working with MPI_Finalize() #4

Closed upperwal closed 6 years ago

upperwal commented 6 years ago

MPI_Finalize() crashes with a segmentation fault:

[master:122165] *** Process received signal ***
[master:122165] Signal: Segmentation fault (11)
[master:122165] Signal code: Address not mapped (1)
[master:122165] Failing at address: 0x100
INFO: [Rank: 1] | Message: File Check Sleep | Module: ../src/mpi/init.c | Line: 104
INFO: [Rank: 1] | Message: File NOT Updated. | Module: ../src/misc/file.c | Line: 18
[master:122165] [ 0] /lib64/libpthread.so.0[0x37fd40f500]
[master:122165] [ 1] /home/mpiuser/ulfm2_/install/lib/libmpi.so.0(ompi_comm_is_proc_active+0x22)[0x2aaaaaceea72]
[master:122165] [ 2] /home/mpiuser/ulfm2_/install/lib/openmpi/mca_coll_ftbasic.so(+0x54d3)[0x2aaabd6c24d3]
[master:122165] [ 3] /home/mpiuser/ulfm2_/install/lib/openmpi/mca_coll_ftbasic.so(+0x61a1)[0x2aaabd6c31a1]
[master:122165] [ 4] /home/mpiuser/ulfm2_/install/lib/openmpi/mca_coll_ftbasic.so(mca_coll_ftbasic_agreement_era_intra+0x3d)[0x2aaabd6c509d]
[master:122165] [ 5] /home/mpiuser/ulfm2_/install/lib/libmpi.so.0(ompi_comm_shrink_internal+0xbc)[0x2aaaaacee6cc]
[master:122165] [ 6] /home/mpiuser/ulfm2_/install/lib/libmpi.so.0(ompi_mpi_finalize+0x63a)[0x2aaaaad0984a]
[master:122165] [ 7] ./rep_test[0x401195]
[master:122165] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd)[0x37fcc1ecdd]
[master:122165] [ 9] ./rep_test[0x400a39]
[master:122165] *** End of error message ***

Possibly a memory leak or an invalid memory access originating in the framework.

upperwal commented 6 years ago

Working now, but it still needs intensive testing. Ref: c973b23ceaff38505b5767edaf5f2ac7bad1759f