ghex-org / GHEX

Generic exascale-ready library for halo-exchange operations on variety of grids/meshes
Other
8 stars 14 forks source link

worker and endpoint cleanup #118

Open angainor opened 3 years ago

angainor commented 3 years ago

Currently we destroy the endpoint with

ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FORCE);

We have to use UCP_EP_CLOSE_MODE_FORCE instead of UCP_EP_CLOSE_MODE_FLUSH because of a cleanup glitch. Summary: if one rank destroys the worker and the endpoint, and another rank then tries to destroy the endpoint with a FLUSH, it can to communicate with an already closed remote worker. This causes a segfault. UCP_EP_CLOSE_MODE_FORCE fixes the segfault, but the solution suggested by the developers was to use a barrier after endpoint destructor, only then close the workers.