ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Recursive error notification in ERA Agree #51

Closed abouteiller closed 4 years ago

abouteiller commented 4 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Problem

Under some timings, multiple faults may be detected while an ERA agreement is repairing itself. This is fine by the algorithm, but it may result in a recursive call to era_mark_as_failed that modifies the ERA tree while we are iterating over the children of a node. This causes trouble because the view of the tree is no longer consistent.
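For illustration, here is a toy C sketch of the hazard (hypothetical names, not the actual ERA code): a send to a dead child fails synchronously, the error callback re-enters mark_as_failed, and the children array is compacted while the loop is still walking it, so a sibling is silently skipped — exactly the kind of inconsistent tree view the assertion in the backtrace below catches.

```c
/* Toy illustration only; names (children, mark_as_failed, send_msg)
 * are hypothetical stand-ins for the ERA tree structures. */
#include <stdio.h>

static int children[] = {1, 2, 3, 4};
static int nchildren  = 4;

/* Compacts the children array: indices held by any in-flight
 * iteration over it become stale. */
static void mark_as_failed(int rank)
{
    for (int i = 0; i < nchildren; i++) {
        if (children[i] == rank) {
            for (int j = i; j < nchildren - 1; j++)
                children[j] = children[j + 1];
            nchildren--;
            break;
        }
    }
}

static void send_msg(int rank)
{
    if (rank == 2)             /* pretend this peer just died: the    */
        mark_as_failed(rank);  /* error callback re-enters right here */
}

int main(void)
{
    for (int i = 0; i < nchildren; i++)
        send_msg(children[i]); /* array shifts mid-loop: rank 3 is
                                * never notified */
    for (int i = 0; i < nchildren; i++)
        printf("child left in tree: %d\n", children[i]);
    return 0;
}
```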

0011: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
Assertion failed: (ci->ags->tree[r_in_tree].rank_in_comm == -1 || ci->ags->tree[r_in_tree].rank_in_comm > r_in_comm), function era_tree_rank_from_comm_rank, file ../../../../../ulfm2/ompi/mca/coll/ftagree/coll_ftagree_earlyreturning.c, line 1354.
[aurelien:01098] *** Process received signal ***
[aurelien:01098] Signal: Abort trap: 6 (6)
[aurelien:01098] Signal code:  (0)
[aurelien:01098] [ 0] 0   libsystem_platform.dylib            0x00007fff67b3bb1d _sigtramp + 29
[aurelien:01098] [ 1] 0   ???                                 0x0000000109b87268 0x0 + 4458050152
[aurelien:01098] [ 2] 0   libsystem_c.dylib                   0x00007fff67a11a08 abort + 120
[aurelien:01098] [ 3] 0   libsystem_c.dylib                   0x00007fff67a10cc2 err + 0
[aurelien:01098] [ 4] 0   mca_coll_ftagree.so                 0x0000000102fa8296 era_tree_rank_from_comm_rank + 230
[aurelien:01098] [ 5] 0   mca_coll_ftagree.so                 0x0000000102faac53 era_next_child + 131
[aurelien:01098] [ 6] 0   mca_coll_ftagree.so                 0x0000000102facff9 restart_agreement_from_me + 201
[aurelien:01098] [ 7] 0   mca_coll_ftagree.so                 0x0000000102fac79f era_mark_process_failed + 575
[aurelien:01098] [ 8] 0   mca_coll_ftagree.so                 0x0000000102fa2930 era_on_comm_rank_failure + 672
[aurelien:01098] [ 9] 0   libmpi.40.dylib                     0x0000000101b63b14 ompi_comm_set_rank_failed + 100
[aurelien:01098] [10] 0   libmpi.40.dylib                     0x0000000101b72b98 ompi_errhandler_proc_failed_internal + 520
[aurelien:01098] [11] 0   mca_pml_ob1.so                      0x0000000102f5364f ompi_errhandler_proc_failed + 31
[aurelien:01098] [12] 0   mca_pml_ob1.so                      0x0000000102f51bda mca_pml_ob1_error_handler + 202
[aurelien:01098] [13] 0   mca_btl_tcp.so                      0x0000000102daff60 mca_btl_tcp_endpoint_close + 560
[aurelien:01098] [14] 0   mca_btl_tcp.so                      0x0000000102db2661 mca_btl_tcp_frag_send + 225
[aurelien:01098] [15] 0   mca_btl_tcp.so                      0x0000000102dae8e9 mca_btl_tcp_endpoint_send + 313
[aurelien:01098] [16] 0   mca_btl_tcp.so                      0x0000000102da6b1e mca_btl_tcp_send + 734
[aurelien:01098] [17] 0   mca_coll_ftagree.so                 0x0000000102fa79d8 send_msg + 5368
[aurelien:01098] [18] 0   mca_coll_ftagree.so                 0x0000000102fac876 era_mark_process_failed + 790
[aurelien:01098] [19] 0   mca_coll_ftagree.so                 0x0000000102fa2930 era_on_comm_rank_failure + 672
[aurelien:01098] [20] 0   libmpi.40.dylib                     0x0000000101b63b14 ompi_comm_set_rank_failed + 100
[aurelien:01098] [21] 0   libmpi.40.dylib                     0x0000000101b72b98 ompi_errhandler_proc_failed_internal + 520
[aurelien:01098] [22] 0   mca_pml_ob1.so                      0x0000000102f5364f ompi_errhandler_proc_failed + 31
[aurelien:01098] [23] 0   mca_pml_ob1.so                      0x0000000102f51bda mca_pml_ob1_error_handler + 202
[aurelien:01098] [24] 0   mca_btl_tcp.so                      0x0000000102daff60 mca_btl_tcp_endpoint_close + 560
[aurelien:01098] [25] 0   mca_btl_tcp.so                      0x0000000102db2aef mca_btl_tcp_frag_recv + 799
[aurelien:01098] [26] 0   mca_btl_tcp.so                      0x0000000102db1124 mca_btl_tcp_endpoint_recv_handler + 1012
[aurelien:01098] [27] 0   libopen-pal.40.dylib                0x00000001020e604b opal_libevent2022_event_base_loop + 1915
[aurelien:01098] [28] 0   libopen-pal.40.dylib                0x00000001020695c0 opal_progress_events + 160
[aurelien:01098] [29] 0   libopen-pal.40.dylib                0x00000001020694ed opal_progress + 205
[aurelien:01098] *** End of error message ***

Solution and Thread safety

A potential solution is to defer all progress outside of the ERA mark_as_failed path; that way the tree stays immutable while we are working on it. However, this does not cover the case of asynchronous progress triggering mark_as_failed independently.

The ERA module is not thread safe (in general). This is generally fine, as the MPI specification forbids calling multiple Agree operations on the same communicator from different threads. One still has to deal, however, with asynchronous events coming from an MPI internal progress thread (or an asynchronous BTL) that calls into the communication callbacks.

  1. Access to the ERA tree structures must be protected (e.g., by atomic operations that emulate a rwlock).
  2. Recursive calls to mark_as_failed must be avoided: if a writer is already modifying the tree (the rwlock is write-taken), queue a progress event and defer the call to later progress, as sketched below.
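A minimal sketch of such a guard, assuming hypothetical names (era_tree_lock, deferred_ranks, era_progress_deferred) and a single atomic word emulating the rwlock (0 = free, -1 = write-taken, n > 0 = n readers; only the writer side is shown). This is not the actual ftagree code, just the deferral idea from item 2:

```c
/* Sketch only: hypothetical names, not the ftagree implementation. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int era_tree_lock;  /* 0 free, -1 write-taken */

/* Tiny deferral queue standing in for a real progress-event queue. */
#define DEFER_MAX 64
static int deferred_ranks[DEFER_MAX];
static atomic_int deferred_count;

static bool era_tree_trywrlock(void)
{
    int expected = 0;  /* take the lock only if nobody holds it */
    return atomic_compare_exchange_strong(&era_tree_lock, &expected, -1);
}

static void era_tree_wrunlock(void)
{
    atomic_store(&era_tree_lock, 0);
}

static void era_mark_process_failed(int rank)
{
    if (!era_tree_trywrlock()) {
        /* A writer is already modifying the tree (we were re-entered
         * from a communication callback): queue the notification for
         * the next progress cycle instead of recursing. */
        int idx = atomic_fetch_add(&deferred_count, 1);
        if (idx < DEFER_MAX)
            deferred_ranks[idx] = rank;
        return;
    }
    printf("rank %d marked as failed under the write lock\n", rank);
    /* ... safely update the ERA tree, restart the agreement ... */
    era_tree_wrunlock();
}

/* Called from the progress loop (e.g., from opal_progress): replay
 * notifications that arrived while the tree was being modified. */
static void era_progress_deferred(void)
{
    int n = atomic_exchange(&deferred_count, 0);
    for (int i = 0; i < n && i < DEFER_MAX; i++)
        era_mark_process_failed(deferred_ranks[i]);
}

int main(void)
{
    era_mark_process_failed(3);  /* lock free: applied immediately */
    era_progress_deferred();     /* nothing pending in this toy run */
    return 0;
}
```

The key property is that a failure notification arriving from a communication callback while the tree is write-locked is queued rather than applied, so an iteration such as the one in restart_agreement_from_me always sees a consistent tree.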

Severity

This bug affects multiple-error cases and occurs rarely. Single-error cases are immune.

abouteiller commented 4 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Proposed resolution in PR #19

abouteiller commented 4 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Resolved