ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Occasionally assert on req_complete when a failure is reported during wait #19

Closed abouteiller closed 6 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


[c02.cauchy:07866] [[55350,1],0] ompi_request_state_ok: Request is part of a collective, and some process died. (rank   2)
Rank 1 / 4: Notified of error MPI_ERR_PROC_FAILED: Process Failure. Stayin' alive!
ex0.1.notification: ../../../../../src/ompi/request/request.h:457: ompi_request_wait_completion: Assertion `((void*)1L == (req)->req_complete)' failed.
[c02:07866] *** Process received signal ***
[c02:07866] Signal: Aborted (6)
[c02:07866] Signal code:  (-6)
[c02:07866] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7ffff780c5e0]
[c02:07866] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff746f1f7]
[c02:07866] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff74708e8]
[c02:07866] [ 3] /usr/lib64/libc.so.6(+0x2e266)[0x7ffff7468266]
[c02:07866] [ 4] /usr/lib64/libc.so.6(+0x2e312)[0x7ffff7468312]
[c02:07866] [ 5] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_pml_ob1.so(+0x104e7)[0x7fffe3bc44e7]
[c02:07866] [ 6] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d0)[0x7fffe3bc6a2d]
[c02:07866] [ 7] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.0(+0x10b949)[0x7ffff7b24949]
[c02:07866] [ 8] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.0(ompi_coll_base_barrier_intra_recursivedoubling+0x15a)[0x7ffff7b24de9]
[c02:07866] [ 9] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0xa4)[0x7fffe24fcc1a]
[c02:07866] [10] /home/bouteill/ompi/ulfm/ulfm2/debug.build/lib/libmpi.so.0(MPI_Barrier+0x181)[0x7ffff7a9f064]
[c02:07866] [11] /home/bouteill/ompi/ulfm/ulfm-testing/tutorial/ex0.1.notification[0x400a86]
[c02:07866] [12] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff745bc05]
[c02:07866] [13] /home/bouteill/ompi/ulfm/ulfm-testing/tutorial/ex0.1.notification[0x400949]
[c02:07866] *** End of error message ***

abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


f403bef6