ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Thread-safe Agree #54

Closed abouteiller closed 4 years ago

abouteiller commented 4 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Despite recent progress, COMM_AGREE/IAGREE are not yet completely thread safe.

The following functions access global variables without any synchronization:

  1. In era_cb_fn, the hash table era_incomplete_message is read and written from potentially concurrent threads. The usual approach of ‘bouncing’ the message when the lock cannot be acquired does not work here, because the fragment descriptor must be copied immediately. A possible solution is a fine-grained blocking lock (not a try-lock) on the hash table. The only other access is from finalize (era_cb_fn may also run concurrently with itself), so this would not cause recursive or nested locking with the global era_lock.

  2. In mca_coll_ftagree_era_complete_agreement: this function is called with the lock held from era_check_status in a variety of message callbacks.

    1. It is also called without the lock from agree_req_complete_cb, which is invoked by ompi_request_complete in iera_intra. Taking the lock before calling ompi_request_complete should fix this (no recursive or nested locking).
    2. It is called without the lock from ompi_request_wait_completion in era_intra and era_inter. This is harder to fix, because holding the lock across wait_completion is problematic (recursive locking with the message callbacks). The function modifies era_ongoing_agreements and era_passed_agreements, and accesses AGS(comm)->afr*, which is modified in prepare_agreement (where it is locked). Fine-grained locking of the individual structures could lead to recursive or nested locking (deadlock), so that is problematic as well.
    3. (Case b is invalid, because we will call ompi_request_complete only from era_decide, which is always called during a locked callback.)
  3. In era_free_comm, the call to collect_passed_agreements accesses the global hash table era_passed_agreements without locking; taking the global lock in the up-call (as in prepare_agreement) should fix this.
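
To make the proposal in item 1 concrete, here is a minimal sketch of a table guarded by its own blocking lock. It is not the Open MPI code: it uses plain pthreads instead of the OPAL mutex layer, and the fixed-size table and the names `store_fragment`, `drain_fragments`, `callback_thread` and `run_demo` are all hypothetical stand-ins for era_incomplete_message, era_cb_fn and the finalize path. The point it illustrates is that the table lock is a leaf lock: it is never held while another lock is acquired, so it cannot nest or recurse with a global lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SLOTS 64   /* hypothetical fixed capacity */
#define FRAG_BYTES  32   /* hypothetical max fragment size */
#define NUM_THREADS 4
#define PER_THREAD  8

/* Hypothetical stand-in for one entry of the incomplete-message table. */
typedef struct {
    int      used;
    uint64_t key;
    size_t   len;
    char     frag[FRAG_BYTES];
} slot_t;

static slot_t table[TABLE_SLOTS];

/* Leaf lock: protects only the table; no other lock is taken while held. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy the fragment into the table right away. A blocking lock is used
 * because the caller cannot defer ("bounce") the copy. Returns 0 on
 * success, -1 if the fragment is too large or the table is full. */
int store_fragment(uint64_t key, const void *data, size_t len)
{
    int rc = -1;
    if (len > FRAG_BYTES) {
        return -1;
    }
    pthread_mutex_lock(&table_lock);
    for (size_t probe = 0; probe < TABLE_SLOTS; probe++) {
        slot_t *s = &table[(key + probe) % TABLE_SLOTS];  /* linear probing */
        if (!s->used) {
            s->used = 1;
            s->key  = key;
            s->len  = len;
            memcpy(s->frag, data, len);
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return rc;
}

/* Finalize-like path: the only other accessor, using the same leaf lock.
 * Returns the number of entries that were released. */
int drain_fragments(void)
{
    int drained = 0;
    pthread_mutex_lock(&table_lock);
    for (size_t i = 0; i < TABLE_SLOTS; i++) {
        if (table[i].used) {
            table[i].used = 0;
            drained++;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return drained;
}

/* Simulates a message callback running concurrently with itself. */
void *callback_thread(void *arg)
{
    uint64_t tid = (uint64_t)(uintptr_t)arg;
    for (uint64_t i = 0; i < PER_THREAD; i++) {
        uint64_t key = tid * 100 + i;
        int rc = store_fragment(key, &key, sizeof key);
        assert(rc == 0);
        (void)rc;
    }
    return NULL;
}

/* Runs the concurrent callbacks, then drains the table. */
int run_demo(void)
{
    pthread_t th[NUM_THREADS];
    for (uintptr_t t = 0; t < NUM_THREADS; t++) {
        pthread_create(&th[t], NULL, callback_thread, (void *)t);
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(th[t], NULL);
    }
    return drain_fragments();
}
```

Calling `run_demo()` from a `main` and building with `-pthread` exercises the concurrent path. In the real component the same pattern would presumably be written with the project's own mutex type rather than pthreads.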

abouteiller commented 4 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


PR #21