ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Round of review #56

Open abouteiller opened 3 years ago

abouteiller commented 3 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


  1. the MPI_FT attribute should not be inherited

    1. pr #24
  2. (x) instead of communicator/ft/* we should have communicator/ulfm/*

    1. We call it ft everywhere else; not convinced.
  3. (x) why is ompi_group_afp_mutex global instead of being per communicator?

    1. We update a global group when failures are detected, this avoids duplicating the storage and update cost on all existing communicators. In typical use cases, these comms get discarded immediately after. Because we access a global group, we need a global lock.
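
To illustrate the trade-off described above, a minimal self-contained sketch (hypothetical names and a toy array, not the actual ULFM code): one process-global set of failed procs guarded by one global mutex, which every communicator consults lazily, so detecting a failure updates a single structure instead of one copy per communicator.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_PROCS 1024

static pthread_mutex_t failed_group_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool failed_group[MAX_PROCS];   /* global, shared by all communicators */

/* Called once per detected failure, regardless of how many comms exist. */
void mark_proc_failed(int proc)
{
    pthread_mutex_lock(&failed_group_mutex);
    failed_group[proc] = true;
    pthread_mutex_unlock(&failed_group_mutex);
}

/* Any communicator checks the shared set under the same global lock. */
bool proc_has_failed(int proc)
{
    pthread_mutex_lock(&failed_group_mutex);
    bool failed = failed_group[proc];
    pthread_mutex_unlock(&failed_group_mutex);
    return failed;
}
```
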
  4. reordered fields in the communicator structure to follow type alignment

    1. pr #24
  5. (x) there are calls to opal_progress in ompi_comm_iface_coll_check and ompi_comm_iface_p2p_check_proc. I’m quite sure they should not be there

    1. We may never enter progress again if we are waiting on an operation that depends only on a failed proc, because the iface_xxx_check bails out at the API level. This can cause revoke propagation to stall and has caused deadlocks.
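
A condensed sketch of that rationale (hypothetical names; opal_progress is stubbed out here, and this is not the actual ompi_comm_iface_coll_check code): when the check bails out at the API level, it ticks the progress engine once so revoke/failure propagation can still make headway.

```c
#include <stdbool.h>

/* Stub standing in for OPAL's progress engine entry point. */
static void opal_progress(void) { /* drive pending FT/revoke traffic */ }

#define ERR_REVOKED 1
#define SUCCESS     0

/* Simplified stand-in for the iface_xxx_check functions. */
static int iface_check(bool comm_is_revoked)
{
    if (comm_is_revoked) {
        /* We are about to return an error without posting any operation;
         * if the caller was only waiting on a failed proc we might never
         * re-enter the progress engine otherwise, stalling revoke
         * propagation. Tick it once before bailing out. */
        opal_progress();
        return ERR_REVOKED;
    }
    return SUCCESS;
}
```
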
  6. why is the function mca_coll_ftagree_era_free_comm never used?

    1. Turning on this feature causes COMM_FREE of a revoked communicator to deadlock. The small memory leak is worth the optimization/simplification of not using the strict early cleanup. The MPI library remains finalize-clean, as we remove all era structures during finalize. pr #26 implements this solution.
  7. (x) I see no need for errhandler_ftmpi_lock to be global.

    1. We could use an atomic CAS on the proc->proc_active field instead. We would need mark_as_failed to return whether it performed the update, so that the CAS outcome percolates up to proc_failed_internal (see the sketch below). Is it worth the trouble, though?
    2. commented on rationale in commit 213c49032de9b6a49912c91803df46aa7a6d2007
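
For reference, a hypothetical sketch of the CAS alternative from reply 1, written with C11 atomics rather than OPAL's atomics (names such as proc_t are invented here). The key point is that mark_as_failed must report whether this particular call flipped the flag, so only one caller proceeds with the internal failure handling.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int proc_active;   /* 1 = alive, 0 = marked failed */
} proc_t;

/* Returns true only for the single caller whose CAS performed the update. */
static bool mark_as_failed(proc_t *proc)
{
    int expected = 1;
    return atomic_compare_exchange_strong(&proc->proc_active, &expected, 0);
}

static void proc_failed_internal(proc_t *proc)
{
    if (!mark_as_failed(proc)) {
        return;   /* another thread already handled this failure */
    }
    /* ... first (and only) failure handling for this proc ... */
}
```
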
  8. I really don’t like the “#if 1” all over the code, especially those that protect typedefs

    1. commit 4e9fbbcb
  9. (x) the monitoring pml does not need a revoke function. I saw that the base automatically sets the base revoke if the PML does not provide one; we should fall back to that mechanism.

    1. I think that would introduce a bug: base_revoke returns ERR_NOT_IMPLEMENTED, while the PML monitoring needs to call the host PML revoke (which is implemented in OB1); that is what the very small shim does. (I have a patch that makes this change, in case I turn out to be mistaken.)
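
A minimal sketch of the shim in question (hypothetical types and field names, not the actual pml_monitoring source): since the monitoring PML merely wraps a host PML, its revoke forwards to the host's implemented revoke instead of falling back to base_revoke, which would return ERR_NOT_IMPLEMENTED.

```c
#include <stdbool.h>

typedef struct comm comm_t;   /* opaque communicator */

typedef struct {
    int (*revoke_comm)(comm_t *comm, bool coll_only);
} pml_module_t;

/* The host PML captured when monitoring was layered on top of it. */
static pml_module_t host_pml;

static int monitoring_revoke_comm(comm_t *comm, bool coll_only)
{
    /* Nothing to monitor on a revoke: delegate straight to the host PML
     * (e.g. OB1, where revoke is actually implemented). */
    return host_pml.revoke_comm(comm, coll_only);
}
```
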
  10. Joseph’s comment from September 2020 in pml_ob1_isend.c is not addressed.

    1. commit 262a7f750fc9a86cc936aa59b535936fb3406db6
  11. Looking at mca_pml_ob1_revoke_comm I am not certain the test “NULL == proc” is correct. The proc is always set: it either points to the real ompi_proc_t structure or contains the proc name.

    1. Replaced with proc_is_sentinel (commit 3eed10fa).
    2. The fix above is incorrect; added a comment about ob1_comm_procs to document why is_sentinel is not used (commit 61d3ffef).
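
For context, an illustrative sketch of the sentinel-pointer scheme that makes the NULL test wrong (simplified from ompi_proc_is_sentinel in ompi/proc/proc.h): a proc entry is never NULL; it is either a real, aligned ompi_proc_t pointer or a tagged value carrying the proc name with the low bit set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified version of the check: real ompi_proc_t pointers are aligned,
 * so their low bit is 0; sentinel values set the low bit and encode the
 * proc name in the remaining bits. A NULL test distinguishes neither case. */
static inline bool proc_is_sentinel(const void *proc)
{
    return 0 != ((intptr_t) proc & 0x1);
}
```
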
  12. There is an “#if 0” in pml_ob1_sendreq.c

    1. commit 4e9fbbcb
  13. The FT MCA parameters (ompi_mpi_ft_verbose and ompi_mpi_ft_enable) should always exist, but be read-only if FT is not enabled.

    1. commit 765cac2b
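
A rough sketch of the requested behavior (illustrative only, not commit 765cac2b), assuming the mca_base_var_register pattern used in ompi_mpi_params.c and the OPAL_ENABLE_FT_MPI build flag: register the parameters unconditionally, but downgrade their scope to a constant when the build has no FT support, so they remain visible yet read-only.

```c
#include <stdbool.h>
#include "opal/mca/base/mca_base_var.h"

static bool ompi_ft_enable  = false;
static int  ompi_ft_verbose = 0;

static void register_ft_params(void)
{
#if OPAL_ENABLE_FT_MPI
    /* FT compiled in: the user may set the parameters before init. */
    const mca_base_var_scope_t scope = MCA_BASE_VAR_SCOPE_READONLY;
#else
    /* No FT support: the parameters still exist, but cannot be changed. */
    const mca_base_var_scope_t scope = MCA_BASE_VAR_SCOPE_CONSTANT;
#endif
    (void) mca_base_var_register("ompi", "mpi", NULL, "ft_enable",
                                 "Enable ULFM fault tolerance",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                 OPAL_INFO_LVL_9, scope, &ompi_ft_enable);
    (void) mca_base_var_register("ompi", "mpi", NULL, "ft_verbose",
                                 "Verbosity of the fault tolerance machinery",
                                 MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
                                 OPAL_INFO_LVL_9, scope, &ompi_ft_verbose);
}
```
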
  14. (x) assuming I understand the changes to the sync behavior correctly, I don’t think this will work, as we can wait on a sync while waiting on another sync (and you explicitly prohibited this). For example: waiting on a request from a non-blocking collective, which triggers blocking communications internally.

    1. TODO: we need to discuss; I do not see the problem. We take the sync->lock for a very short amount of time, following about the same pattern as normal frag completion. I do not see a recursive scenario for the wait_sync_lock: it is possible that we receive failure notifications from multiple sources at the same time, so calls to global_wakeup_mt could appear concurrent, but they are serialized by the lock (as intended). The comment discusses the case where we hold wait_sync_lock while another thread holds one of the considered sync->lock; but holding the sync->lock is transient and does not recursively take wait_sync_lock, so that double-lock is transient and will self-resolve as soon as either the wait enters progress or the frag completion is done (see the sketch after this item).
    2. We discussed and we think it is fine. George is double checking on his side just to be sure.
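
To make the locking argument concrete, a condensed pthread model of the pattern discussed above (hypothetical names; the real code lives in OPAL's wait_sync). The global wakeup walks the active syncs under wait_sync_lock and takes each sync->lock only transiently; the completion path takes a single sync->lock and never takes wait_sync_lock, so the double-lock window cannot recurse or deadlock.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct sync_s {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             status;   /* 0 = pending, <0 = error/wakeup, >0 = done */
    struct sync_s  *next;
} sync_t;

static pthread_mutex_t wait_sync_lock = PTHREAD_MUTEX_INITIALIZER;
static sync_t *active_syncs = NULL;   /* list protected by wait_sync_lock */

/* Failure-notification path: concurrent calls from multiple sources are
 * serialized (as intended) by wait_sync_lock. */
void global_wakeup(int status)
{
    pthread_mutex_lock(&wait_sync_lock);
    for (sync_t *s = active_syncs; NULL != s; s = s->next) {
        pthread_mutex_lock(&s->lock);     /* transient, never recursive */
        s->status = status;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
    pthread_mutex_unlock(&wait_sync_lock);
}

/* Normal frag-completion path: the same short sync->lock pattern; it never
 * takes wait_sync_lock, so it cannot deadlock against global_wakeup. */
void frag_complete(sync_t *s)
{
    pthread_mutex_lock(&s->lock);
    s->status = 1;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}
```
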