Open abouteiller opened 4 years ago
Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
In runs that use the OpenIB BTL and experience failures, we can see the following assertion failure during MPI_Finalize.
The impact is limited because the bug manifests in MPI_Finalize, when the program has presumably completed its work.
The error suggests that the root cause is related to cleaning up the RDMA fragments without removing them from the registration cache.
[1,10]<stderr>:12.buddycr: ../../src/opal/class/opal_free_list.c:99: opal_free_list_destruct: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (fl_item))->obj_magic_id' failed.
[1,10]<stderr>:[d06:32439] *** Process received signal ***
[1,10]<stderr>:[d06:32439] Signal: Aborted (6)
[1,10]<stderr>:[d06:32439] Signal code: (-6)
[1,10]<stderr>:[d06:32439] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7ffff74d95d0]
[1,10]<stderr>:[d06:32439] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ffff7133207]
[1,10]<stderr>:[d06:32439] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ffff71348f8]
[1,10]<stderr>:[d06:32439] [ 3] /lib64/libc.so.6(+0x2f026)[0x7ffff712c026]
[1,10]<stderr>:[d06:32439] [ 4] /lib64/libc.so.6(+0x2f0d2)[0x7ffff712c0d2]
[1,10]<stderr>:[d06:32439] [ 5] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(+0x2bf02)[0x7ffff6aabf02]
[1,10]<stderr>:[d06:32439] [ 6] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_rcache_grdma.so(+0x18e7)[0x7fffeb7ea8e7]
[1,10]<stderr>:[d06:32439] [ 7] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_rcache_grdma.so(+0x3504)[0x7fffeb7ec504]
[1,10]<stderr>:[d06:32439] [ 8] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(mca_rcache_base_module_destroy+0x7e)[0x7ffff6b864e5]
[1,10]<stderr>:[d06:32439] [ 9] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(+0x1437d)[0x7fffea5ad37d]
[1,10]<stderr>:[d06:32439] [10] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(+0x8d05)[0x7fffea5a1d05]
[1,10]<stderr>:[d06:32439] [11] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(+0xedc1)[0x7fffea5a7dc1]
[1,10]<stderr>:[d06:32439] [12] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x8c)[0x7fffea5a7ec7]
[1,10]<stderr>:[d06:32439] [13] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(+0x9ef77)[0x7ffff6b1ef77]
[1,10]<stderr>:[d06:32439] [14] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(mca_base_framework_close+0x113)[0x7ffff6b0260b]
[1,10]<stderr>:[d06:32439] [15] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.40(+0xf547f)[0x7ffff77db47f]
[1,10]<stderr>:[d06:32439] [16] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(mca_base_framework_close+0x113)[0x7ffff6b0260b]
[1,10]<stderr>:[d06:32439] [17] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.40(ompi_mpi_finalize+0xf3c)[0x7ffff7752661]
[1,10]<stderr>:[d06:32439] [18] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.40(PMPI_Finalize+0x60)[0x7ffff779262c]
[1,10]<stderr>:[d06:32439] [19] 12.buddycr[0x401d73]
[1,10]<stderr>:[d06:32439] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff711f3d5]
[1,10]<stderr>:[d06:32439] [21] 12.buddycr[0x401439]
[1,10]<stderr>:[d06:32439] *** End of error message ***
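To illustrate the kind of failure the assertion is guarding against, here is a minimal, self-contained sketch in C. It is not Open MPI code: the `toy_*` names are hypothetical, and the sketch assumes that destructing an item clears its magic id. Only the magic value is taken from the assertion message above. The list check counts stale items instead of aborting, so both orderings can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Magic value copied from the assertion message in the trace. */
#define TOY_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for an object header that carries a magic id. */
typedef struct toy_item {
    uint64_t magic_id;
    struct toy_item *next;
} toy_item_t;

/* Hypothetical stand-in for a list that still references its items. */
typedef struct {
    toy_item_t *head;
} toy_list_t;

static void toy_item_construct(toy_item_t *it) {
    it->magic_id = TOY_MAGIC_ID;
    it->next = NULL;
}

/* Assumption: destruction invalidates the magic id. */
static void toy_item_destruct(toy_item_t *it) {
    it->magic_id = 0;
}

static void toy_list_push(toy_list_t *l, toy_item_t *it) {
    it->next = l->head;
    l->head = it;
}

/* Unlink an item from the list; returns 1 if it was found, 0 otherwise. */
static int toy_list_remove(toy_list_t *l, toy_item_t *it) {
    for (toy_item_t **p = &l->head; *p != NULL; p = &(*p)->next) {
        if (*p == it) {
            *p = it->next;
            it->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* Count items whose magic id no longer matches. A debug-style destructor
 * would assert on the first such item rather than count them. */
static size_t toy_list_count_stale(const toy_list_t *l) {
    size_t n = 0;
    for (const toy_item_t *it = l->head; it != NULL; it = it->next) {
        if (it->magic_id != TOY_MAGIC_ID) {
            n++;
        }
    }
    return n;
}

/* Suspected ordering: the item is destructed while still on the list. */
static size_t scenario_destruct_without_remove(void) {
    toy_item_t a, b;
    toy_list_t l = { NULL };
    toy_item_construct(&a);
    toy_item_construct(&b);
    toy_list_push(&l, &a);
    toy_list_push(&l, &b);
    toy_item_destruct(&a);          /* still linked in l */
    return toy_list_count_stale(&l);
}

/* Safe ordering: unlink the item first, then destruct it. */
static size_t scenario_remove_then_destruct(void) {
    toy_item_t a, b;
    toy_list_t l = { NULL };
    toy_item_construct(&a);
    toy_item_construct(&b);
    toy_list_push(&l, &a);
    toy_list_push(&l, &b);
    toy_list_remove(&l, &a);
    toy_item_destruct(&a);
    return toy_list_count_stale(&l);
}
```

In the first scenario the list still holds an item whose magic id has been cleared, which is the condition the check in the trace would reject. In the second, the item is unlinked before it is destructed and the list stays consistent. Whether the real fragments and registration cache follow this exact pattern is an assumption that still needs to be confirmed against the code.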