Problem
Under some timings, multiple faults may be detected while an ERA agreement is repairing itself. The algorithm permits this, but it may result in a recursive call to era_mark_as_failed that modifies the ERA tree while we are iterating over the children of a node. This causes trouble because the view of the tree is no longer consistent.
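To make the hazard concrete, here is a minimal sketch, not the actual ERA code. All names (toy_node_t, toy_remove_child, toy_visit_children) are hypothetical, and the tree is reduced to one node with an array of child ranks. A nested removal during the walk shifts the array, so a child that is still alive is never visited.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 8

/* Hypothetical, simplified stand-in for one node of the ERA tree. */
typedef struct {
    int children[TOY_MAX_CHILDREN];
    size_t nchildren;
} toy_node_t;

/* Remove `rank` from the child list, shifting later entries down. */
static void toy_remove_child(toy_node_t *node, int rank)
{
    for (size_t i = 0; i < node->nchildren; i++) {
        if (node->children[i] == rank) {
            for (size_t j = i + 1; j < node->nchildren; j++)
                node->children[j - 1] = node->children[j];
            node->nchildren--;
            return;
        }
    }
}

/* Walk the children. While visiting `trigger`, a second failure (`victim`)
 * is reported and handled immediately, which models the recursive call.
 * Returns how many children were visited and records them in `visited`. */
static size_t toy_visit_children(toy_node_t *node, int trigger, int victim,
                                 int *visited)
{
    size_t nvisited = 0;
    assert(node->nchildren <= TOY_MAX_CHILDREN);
    for (size_t i = 0; i < node->nchildren; i++) {
        int rank = node->children[i];
        visited[nvisited++] = rank;
        if (rank == trigger)
            toy_remove_child(node, victim); /* modifies the list mid-walk */
    }
    return nvisited;
}
```

With children {1, 2, 3, 4}, a walk that removes rank 1 while visiting rank 2 visits only 1, 2 and 4: rank 3 is skipped even though it did not fail.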
Solution and Thread safety
A potential solution is to defer all progress until after era_mark_as_failed returns; that way the tree is immutable while we are working on it. However, that does not solve the case of asynchronous progress triggering the mark_as_failed function independently.
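A minimal single-threaded sketch of this deferral, under stated assumptions: toy_era_t, toy_repair_tree and toy_mark_as_failed are hypothetical names, and the nested fault is hard-coded (repairing rank 1 reports rank 2) purely for illustration. A nested call only records the rank; the outermost call drains the queue once its own walk is complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

/* Hypothetical state; not the actual ERA data structures. */
typedef struct {
    bool in_mark_as_failed;          /* true while the tree is being walked */
    int pending[TOY_MAX_PENDING];    /* failures reported during the walk */
    size_t npending;
    int processed[TOY_MAX_PENDING];  /* order in which failures were handled */
    size_t nprocessed;
} toy_era_t;

static void toy_mark_as_failed(toy_era_t *era, int rank);

/* Placeholder for the real tree repair. */
static void toy_repair_tree(toy_era_t *era, int rank)
{
    assert(era->nprocessed < TOY_MAX_PENDING);
    era->processed[era->nprocessed++] = rank;
    /* Assumption for the demo: a second fault (rank 2) is detected
     * while rank 1 is being repaired. */
    if (rank == 1)
        toy_mark_as_failed(era, 2);
}

static void toy_mark_as_failed(toy_era_t *era, int rank)
{
    if (era->in_mark_as_failed) {
        /* Nested call: leave the tree alone and remember the rank. */
        assert(era->npending < TOY_MAX_PENDING);
        era->pending[era->npending++] = rank;
        return;
    }
    era->in_mark_as_failed = true;
    toy_repair_tree(era, rank);
    /* Drain deferred failures; a drain step may itself queue more,
     * which the loop picks up because npending is re-read each time. */
    for (size_t i = 0; i < era->npending; i++)
        toy_repair_tree(era, era->pending[i]);
    era->npending = 0;
    era->in_mark_as_failed = false;
}
```

A plain bool is enough here only because everything runs in one thread; it offers no protection against a progress thread entering concurrently.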
The ERA module is not thread safe in general. This is usually fine, since the MPI specification does not allow multiple Agree calls on the same communicator from different threads. One still has to deal with the case where asynchronous events come from an MPI internal progress thread (or an asynchronous BTL) calling into the communication callbacks.
Two changes are therefore needed:
Access to the ERA tree structures must be protected, for example with atomic operations that emulate a rwlock.
Recursive calls to mark_as_failed must be avoided: if a writer is already modifying the tree (the rwlock is write-taken), the call is deferred by queuing a progress event for later.
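One possible shape for such an emulated rwlock, sketched with C11 atomics rather than Open MPI's own atomic layer. The type toy_rwlock_t and all function names are hypothetical, and the encoding of the lock word (-1 for a writer, 0 for free, n > 0 for n readers) is an assumption of this sketch. Every acquire is a non-blocking try: a caller that fails to take the write lock would queue a progress event instead of waiting or recursing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: -1 = one writer, 0 = free, n > 0 = n readers. */
typedef struct {
    atomic_int state;
} toy_rwlock_t;

static void toy_rwlock_init(toy_rwlock_t *l)
{
    atomic_init(&l->state, 0);
}

/* Succeeds unless a writer currently holds the lock. Never blocks. */
static bool toy_try_read_lock(toy_rwlock_t *l)
{
    int s = atomic_load_explicit(&l->state, memory_order_relaxed);
    while (s >= 0) {
        if (atomic_compare_exchange_weak_explicit(&l->state, &s, s + 1,
                memory_order_acquire, memory_order_relaxed))
            return true;
        /* The failed CAS reloaded s; retry unless a writer appeared. */
    }
    return false;
}

static void toy_read_unlock(toy_rwlock_t *l)
{
    int prev = atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
    assert(prev > 0);
    (void)prev;
}

/* Succeeds only when there is no reader and no writer. On failure the
 * caller is expected to defer its work rather than spin. */
static bool toy_try_write_lock(toy_rwlock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(&l->state, &expected, -1,
            memory_order_acquire, memory_order_relaxed);
}

static void toy_write_unlock(toy_rwlock_t *l)
{
    int prev = atomic_exchange_explicit(&l->state, 0, memory_order_release);
    assert(prev == -1);
    (void)prev;
}
```

Because nothing blocks, a communication callback invoked from a progress thread cannot deadlock against the thread that is repairing the tree; the cost is that deferred events need a queue that is itself safe to append to concurrently, which this sketch does not cover.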
Severity
This bug affects only multiple-error cases and occurs rarely. Single-error cases are immune.
Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).