Currently we destroy the endpoint with:
ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FORCE);
We have to use UCP_EP_CLOSE_MODE_FORCE instead of UCP_EP_CLOSE_MODE_FLUSH because of a cleanup glitch. In summary: if one rank destroys its worker and endpoint, and another rank then tries to close its endpoint with FLUSH, the flush can try to communicate with a remote worker that has already been destroyed. This causes a segfault. UCP_EP_CLOSE_MODE_FORCE avoids the segfault, but the solution suggested by the developers is to place a barrier after the endpoint destructors and only then destroy the workers.
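The ordering problem and the suggested fix can be sketched with a small single-process model. This is only an illustration under stated assumptions: Rank, close_endpoint_with_flush, destroy_worker and the two teardown functions are hypothetical names, not UCX API, and the ranks are assumed to form a ring in which each rank holds one endpoint to its neighbour. In a real program the split between the two loops would be a barrier across ranks (for example MPI_Barrier, if the ranks are MPI processes, which the text above does not state).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one rank's UCX state; not part of the UCX API.
struct Rank {
    bool worker_alive = true;   // models a live ucp_worker_h
    bool endpoint_open = true;  // models a live ucp_ep_h to the neighbour
};

// Models ucp_ep_close_nb(ep, UCP_EP_CLOSE_MODE_FLUSH): the flush has to
// reach the peer's worker. Returns false when that worker is already gone,
// which is the situation reported to segfault.
bool close_endpoint_with_flush(Rank& self, const Rank& peer) {
    assert(self.endpoint_open);
    self.endpoint_open = false;
    return peer.worker_alive;
}

// Models ucp_worker_destroy on this rank.
void destroy_worker(Rank& self) {
    assert(self.worker_alive);
    self.worker_alive = false;
}

// One possible interleaving without a barrier: each rank closes its
// endpoint and destroys its worker back to back, one rank after another.
// Returns how many flushes hit an already destroyed worker.
int teardown_without_barrier(std::vector<Rank>& ranks) {
    int bad_flushes = 0;
    for (std::size_t i = 0; i < ranks.size(); ++i) {
        const Rank& peer = ranks[(i + 1) % ranks.size()];
        if (!close_endpoint_with_flush(ranks[i], peer)) {
            ++bad_flushes;
        }
        destroy_worker(ranks[i]);
    }
    return bad_flushes;
}

// Suggested ordering: every rank closes its endpoint, then a barrier,
// and only then does any rank destroy its worker.
int teardown_with_barrier(std::vector<Rank>& ranks) {
    int bad_flushes = 0;
    for (std::size_t i = 0; i < ranks.size(); ++i) {
        const Rank& peer = ranks[(i + 1) % ranks.size()];
        if (!close_endpoint_with_flush(ranks[i], peer)) {
            ++bad_flushes;
        }
    }
    // <- a real barrier across ranks would sit here
    for (Rank& r : ranks) {
        destroy_worker(r);
    }
    return bad_flushes;
}
```

With two ranks, teardown_without_barrier reports one flush against a destroyed worker, while teardown_with_barrier reports none, because no worker is destroyed until every endpoint has been closed.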