ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Cori: Falls back to IBoGNI after a fault is reported, then crashes #29

Closed abouteiller closed 5 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


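It looks like on Cori we fall back to Open IB (the openib BTL, running over IBoGNI) when uGNI fails after a fault is reported. The process then aborts on an assertion inside libibgni: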

Error handler called on rank 62 for communicator 0x8f33f0 (try catch step 2)
    error was 76 MPI_ERR_REVOKED: Communication Object Revoked

ex5.transactions: peers.c:504: ibg_find_or_create_peer: Assertion `index < ibg_alps_info.pe_count' failed.
[nid00500:54759] *** Process received signal ***
[nid00500:54759] Signal: Aborted (6)
[nid00500:54759] Signal code:  (-6)
[nid00500:54759] [ 0] /lib64/libpthread.so.0(+0x10b10)[0x2aaaabbe1b10]
[nid00500:54759] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaabe228d7]
[nid00500:54759] [ 2] /lib64/libc.so.6(abort+0x13a)[0x2aaaabe23caa]
[nid00500:54759] [ 3] /lib64/libc.so.6(+0x2d866)[0x2aaaabe1b866]
[nid00500:54759] [ 4] /lib64/libc.so.6(+0x2d912)[0x2aaaabe1b912]
[nid00500:54759] [ 5] /usr/lib64/libibgni.so.1(ibg_find_or_create_peer+0x368)[0x2aaabd25d798]
[nid00500:54759] [ 6] /usr/lib64/libibgni.so.1(ibg_poll_doorbell+0x135)[0x2aaabd25d8e5]
[nid00500:54759] [ 7] /usr/lib64/libibgni.so.1(ibg_process_hsn_completions+0x5c3)[0x2aaabd259d23]
[nid00500:54759] [ 8] /usr/lib64/libibgni.so.1(+0xdf65)[0x2aaabd25af65]
[nid00500:54759] [ 9] /global/homes/b/bouteill/ulfm2/fast.build/lib/openmpi/mca_btl_openib.so(+0x14d7d)[0x2aaabf63bd7d]
[nid00500:54759] [10] /global/homes/b/bouteill/ulfm2/fast.build/lib/openmpi/mca_btl_openib.so(+0x15d2e)[0x2aaabf63cd2e]
[nid00500:54759] [11] /global/homes/b/bouteill/ulfm2/fast.build/lib/libopen-pal.so.0(opal_progress+0x3c)[0x2aaaac47d5bc]
[nid00500:54759] [12] /global/homes/b/bouteill/ulfm2/fast.build/lib/openmpi/mca_coll_ftbasic.so(mca_coll_ftbasic_agreement_era_intra+0x55)[0x2aaac1722255]
[nid00500:54759] [13] /global/homes/b/bouteill/ulfm2/fast.build/lib/libmpi.so.0(MPIX_Comm_agree+0xab)[0x2aaaaadac48b]
[nid00500:54759] [14] /global/homes/b/bouteill/ulfm2/ulfm-testing/tutorial/ex5.transactions[0x4011f5]
[nid00500:54759] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaabe0e6e5]
[nid00500:54759] [16] /global/homes/b/bouteill/ulfm2/ulfm-testing/tutorial/ex5.transactions[0x400d69]
[nid00500:54759] *** End of error message ***
abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


The problem can be mitigated by disabling the BTLs that are not useful (we should not fall back to OpenIB on GNI machines, that's silly):

    -mca btl vader,ugni,self

I'm not sure exactly how that can be automated in a meaningful way.
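
As a rough sketch of how the workaround could be applied without retyping the flag on every run: Open MPI also reads MCA parameters from environment variables named `OMPI_MCA_<param>`, so the same BTL list can be exported once in a job script. This is only one possible approach, not a fix adopted in this issue; the process count and the launch line are illustrative assumptions.

```shell
# Restrict Open MPI to the shared-memory (vader), uGNI, and self BTLs,
# so that openib is never selected as a fallback.
# Same effect as passing "-mca btl vader,ugni,self" on the command line.
export OMPI_MCA_btl=vader,ugni,self

# Illustrative launch line (assumption: 64 ranks; adjust to the job):
# mpirun -np 64 ./ex5.transactions

# Alternative, assuming only openib should be excluded and every other
# BTL left available, is the exclusion syntax:
# export OMPI_MCA_btl=^openib
```

The same `btl = vader,ugni,self` setting can also be placed in a per-user `$HOME/.openmpi/mca-params.conf` file. Either way it has to be set per user or per site, so it does not answer the question of how to pick the right BTLs automatically on GNI machines.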