Closed boeschf closed 3 years ago
The hang happens during endpoint cleanup in `ucx::endpoint_t::destroy()`:
```cpp
ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FLUSH);
if (UCS_OK == reinterpret_cast<std::uintptr_t>(ret)) return;
if (UCS_PTR_IS_ERR(ret)) return;
// wait until the ep is destroyed, then free the request
// THIS IS THE LOOP THAT NEVER ENDS
while (UCS_OK != ucp_request_check_status(ret))
    ucp_worker_progress(m_worker);
ucp_request_free(ret);
```
It seems that the EP cannot be closed with a FLUSH because it is waiting for some final message from the peer (maybe an ACK for a previous communication). In my tests all of our communication had already arrived, so the awaited message must be some internal UCX data. The peer, however, is no longer actively progressing the communication and does not send the expected response.
The brute-force solution is to use FORCE instead of FLUSH, but that is not a clean closure. The new barrier implemented in this PR is cleaner, and is not incorrect code: in addition to a barrier, it simply progresses all workers, which can be done at any time.
It's hard to say whether it is our fault (perhaps we have to progress the workers until some state has been reached), or whether this is a bug / problem in UCX. Since our messages have all been received, I'd say UCX should be able to handle its own internal exchanges cleanly. But it doesn't.
In parallel to this discussion, we also discussed using `join()` to wait for thread completion. In my opinion this is in general wrong and can cause deadlocks. `join()` is in fact a barrier that does not progress our communication. Hence, just like `#pragma omp barrier` or `MPI_Barrier()`, it can lead to deadlocks if threads wait for messages from other threads, which in turn hang in a non-progressing barrier/loop:
1. Thread 1 calls `send()` to thread 2, then calls `wait()` on the request, but afterwards no longer progresses the communication and proceeds to `join()`. Although the send request is completed, that does not mean the receiver has received the message, nor that the message has left the sender's context. It merely means that the send buffer can be re-used.
2. Thread 2 calls `recv()` from thread 1 and `wait()` on the request.
3. Thread 1 hangs in `join()`, while thread 2 hangs in `wait()` for communication from thread 1.

To avoid this we have our barrier, which progresses all workers and thereby avoids deadlocks. So IMO, before the end of a thread-parallel region (whether `#pragma omp parallel` or a std thread function), we have to require an explicit call to our thread barrier. Note that there is an implicit thread synchronization point at the end of `#pragma omp parallel`.
A small follow-up on the barrier / `join` issue: in the current example (`test_ucx_context`) there is no dependence between threads running on the same host. Hence, `join()` cannot deadlock in this particular case. But it can do so in the general case, including more complex cases that involve indirect dependence through off-node communication.
Finally, regarding the endpoint closure loop: it is potentially a `while(true)` loop (as in this case). Maybe we should bail out with a warning after some time / number of iterations, to avoid any lockups in the future:
```cpp
// wait until the ep is destroyed, then free the request
// THIS IS THE LOOP THAT NEVER ENDS
int i = 0;
while (++i < ITER_LIMIT && UCS_OK != ucp_request_check_status(ret))
    ucp_worker_progress(m_worker);
if (i == ITER_LIMIT)
    fprintf(stderr, "failed to cleanly close an endpoint\n");
else
    ucp_request_free(ret);
```