ghex-org / GHEX

Generic exascale-ready library for halo-exchange operations on a variety of grids/meshes

Prevent UCX lockup in tests #127

Closed boeschf closed 3 years ago

boeschf commented 3 years ago
angainor commented 3 years ago

The hang happens during endpoint cleanup in ucx::endpoint_t::destroy():

ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FLUSH);
if(UCS_OK == reinterpret_cast<std::uintptr_t>(ret)) return;
if(UCS_PTR_IS_ERR(ret)) return;

// wait until the ep is destroyed, free the request
// THIS IS THE LOOP THAT NEVER ENDS
while(UCS_OK != ucp_request_check_status(ret))
    ucp_worker_progress(m_worker);
ucp_request_free(ret);

It seems that the EP cannot be closed with a FLUSH because it is waiting for some final message from the peer (maybe an ACK on a previous communication). In my tests all of our messages had already arrived, so the awaited message is some internal UCX data. The peer, however, is no longer actively progressing the communication and is not sending the expected response.

The brute-force solution is to use FORCE instead of FLUSH, but that is not a clean closure. The new barrier implemented in this PR is cleaner, and is correct in any case: in addition to acting as a barrier it simply progresses all workers, which can safely be done at any time.

It's hard to say whether this is our fault (i.e., we are required to progress the workers until some state has been reached), or whether it is a bug / problem in UCX. Since all of our messages have been received, I'd say UCX should be able to handle its own internal exchanges cleanly. But it doesn't.

angainor commented 3 years ago

As a parallel to this discussion we also discussed using join() to wait for thread completion. In my opinion this is wrong in general and can cause deadlocks. join() is effectively a barrier that does not progress our communication. Hence, just like #pragma omp barrier or MPI_Barrier(), it can lead to deadlocks if threads wait for messages from other threads, which in turn hang on a non-progressing barrier/loop:

  1. thread 1 calls send() to thread 2, then calls wait() on the request, but afterwards does not progress the communication and proceeds to join(). Although the send request is completed, that doesn't mean the receiver has received the message, nor that the message has left the sender's context. It merely means that the send buffer can be re-used.
  2. thread 2 calls recv() from thread 1 and wait() on the request.
  3. thread 1 waits for thread 2 on the join(), while thread 2 hangs in wait() for communication from thread 1.

To avoid that we have our barrier, which progresses all workers and avoids deadlocks. So IMO, before the end of a thread-parallel region (whether #pragma omp parallel, or a std::thread function), we have to require an explicit call to our thread barrier. Note that there is an implicit thread synchronization point at the end of #pragma omp parallel.

angainor commented 3 years ago

A small follow-up on the barrier / join issue: in the current example (test_ucx_context) there is no dependence between threads running on the same host. Hence, join() cannot deadlock in this particular case. But it can in the general case, including more complex scenarios that involve an indirect dependence through off-node communication.

angainor commented 3 years ago

Finally, regarding the endpoint closure loop: it is potentially a while(true) loop (as it was in this case). Maybe we should bail out with a warning after some time / number of iterations, to avoid any lockups in the future.

// wait until the ep is destroyed, free the request
// bail out after ITER_LIMIT iterations instead of looping forever
int i = 0;
while(i++ < ITER_LIMIT && UCS_OK != ucp_request_check_status(ret))
    ucp_worker_progress(m_worker);
if(UCS_OK != ucp_request_check_status(ret))
    fprintf(stderr, "failed to cleanly close an endpoint\n");
ucp_request_free(ret);
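Alternatively, the bail-out could be time-based rather than iteration-based. A rough sketch of such a helper (hypothetical, not part of GHEX) that progresses until a condition holds or a deadline expires:

```cpp
#include <chrono>

// Progress until `done()` returns true or `timeout` expires.
// `done` could wrap ucp_request_check_status(ret) == UCS_OK,
// `progress` could wrap ucp_worker_progress(m_worker).
// Returns false if we gave up, so the caller can emit a warning
// (and possibly fall back to a forced close) instead of hanging.
template<typename Done, typename Progress>
bool progress_until(Done&& done, Progress&& progress,
                    std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;   // give up: report instead of looping forever
        progress();
    }
    return true;
}
```

This avoids having to tune ITER_LIMIT against progress-call cost: the bound is wall-clock time, which is what we actually care about in a hanging test.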