ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

OpenIB Finalize assert in regcache #52

Open abouteiller opened 4 years ago

abouteiller commented 4 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


In runs that use the OpenIB BTL and encounter failures, the following assertion fails during MPI_Finalize.

The impact is limited because the bug manifests in MPI_Finalize, when the program has presumably completed its work.

The error suggests that the root cause is related to RDMA fragments being cleaned up without being removed from the registration cache.

[1,10]<stderr>:12.buddycr: ../../src/opal/class/opal_free_list.c:99: opal_free_list_destruct: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (fl_item))->obj_magic_id' failed.
[1,10]<stderr>:[d06:32439] *** Process received signal ***
[1,10]<stderr>:[d06:32439] Signal: Aborted (6)
[1,10]<stderr>:[d06:32439] Signal code:  (-6)
[1,10]<stderr>:[d06:32439] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7ffff74d95d0]
[1,10]<stderr>:[d06:32439] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ffff7133207]
[1,10]<stderr>:[d06:32439] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ffff71348f8]
[1,10]<stderr>:[d06:32439] [ 3] /lib64/libc.so.6(+0x2f026)[0x7ffff712c026]
[1,10]<stderr>:[d06:32439] [ 4] /lib64/libc.so.6(+0x2f0d2)[0x7ffff712c0d2]
[1,10]<stderr>:[d06:32439] [ 5] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(+0x2bf02)[0x7ffff6aabf02]
[1,10]<stderr>:[d06:32439] [ 6] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_rcache_grdma.so(+0x18e7)[0x7fffeb7ea8e7]
[1,10]<stderr>:[d06:32439] [ 7] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_rcache_grdma.so(+0x3504)[0x7fffeb7ec504]
[1,10]<stderr>:[d06:32439] [ 8] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(mca_rcache_base_module_destroy+0x7e)[0x7ffff6b864e5]
[1,10]<stderr>:[d06:32439] [ 9] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(+0x1437d)[0x7fffea5ad37d]
[1,10]<stderr>:[d06:32439] [10] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(+0x8d05)[0x7fffea5a1d05]
[1,10]<stderr>:[d06:32439] [11] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(+0xedc1)[0x7fffea5a7dc1]
[1,10]<stderr>:[d06:32439] [12] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x8c)[0x7fffea5a7ec7]
[1,10]<stderr>:[d06:32439] [13] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(+0x9ef77)[0x7ffff6b1ef77]
[1,10]<stderr>:[d06:32439] [14] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(mca_base_framework_close+0x113)[0x7ffff6b0260b]
[1,10]<stderr>:[d06:32439] [15] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.40(+0xf547f)[0x7ffff77db47f]
[1,10]<stderr>:[d06:32439] [16] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libopen-pal.so.40(mca_base_framework_close+0x113)[0x7ffff6b0260b]
[1,10]<stderr>:[d06:32439] [17] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.40(ompi_mpi_finalize+0xf3c)[0x7ffff7752661]
[1,10]<stderr>:[d06:32439] [18] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.40(PMPI_Finalize+0x60)[0x7ffff779262c]
[1,10]<stderr>:[d06:32439] [19] 12.buddycr[0x401d73]
[1,10]<stderr>:[d06:32439] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff711f3d5]
[1,10]<stderr>:[d06:32439] [21] 12.buddycr[0x401439]
[1,10]<stderr>:[d06:32439] *** End of error message ***
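
The assertion in the trace compares each free-list item's `obj_magic_id` against a fixed magic value. As a rough illustration of how such a check can trip when an item is cleaned up while a container still references it, here is a minimal sketch. It is not the OPAL implementation: every `sketch_`-prefixed name is hypothetical, and it assumes that destroying an item invalidates its magic id. Only the magic constant is taken from the assertion message above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Magic value copied from the assertion message in the trace. */
#define SKETCH_MAGIC ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for an object header; NOT the real opal_object_t. */
typedef struct sketch_item {
    uint64_t magic_id;          /* equals SKETCH_MAGIC while the item is alive */
    struct sketch_item *next;   /* link used by the owning list */
} sketch_item_t;

/* Mark the item as alive and unlinked. */
static void sketch_construct(sketch_item_t *item)
{
    item->magic_id = SKETCH_MAGIC;
    item->next = NULL;
}

/* Assumption: destruction invalidates the magic id.
 * Note that this does NOT unlink the item from any list. */
static void sketch_destruct(sketch_item_t *item)
{
    item->magic_id = 0;
}

/* Walk the list the way a teardown routine might, and report whether
 * every item still linked into it carries a valid magic id. */
static bool sketch_list_is_valid(const sketch_item_t *head)
{
    for (const sketch_item_t *it = head; it != NULL; it = it->next) {
        if (it->magic_id != SKETCH_MAGIC) {
            return false;   /* stale item: destroyed but still linked */
        }
    }
    return true;
}
```

If two items are constructed and linked, `sketch_list_is_valid` returns true. Calling `sketch_destruct` on the second item without unlinking it makes the same walk return false, which is the kind of condition the assertion in `opal_free_list_destruct` appears to be guarding against.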