rapidsai / ucxx

BSD 3-Clause "New" or "Revised" License
19 stars 27 forks source link

Failing or hanging tests #26

Open pentschev opened 1 year ago

pentschev commented 1 year ago

Some of the tests may non-deterministically fail. They seem to be more prevalent in non-InfiniBand setups, such as CI.

C++ tests (some are possibly related to https://github.com/rapidsai/ucxx/issues/15):

Python (unable to reproduce locally so far):

wence- commented 1 year ago

FWIW, I consistently see hangs of DelayedSubmission/RequestTest.ProgressTagMulti/2 when run like so:

./UCXX_TEST --gtest_filter=DelayedSubmission/RequestTest.ProgressTagMulti/*

But not when run on its own

./UCXX_TEST --gtest_filter=DelayedSubmission/RequestTest.ProgressTagMulti/2
pentschev commented 1 year ago

Does it hang or is it just slow (takes ~20 seconds, as described in #15 )? Is this happening with or without changes from https://github.com/rapidsai/ucxx/pull/30 ? And does UCX_PROTO_ENABLE=y change anything?

pentschev commented 1 year ago

I actually see the same, but in my case it's mostly DelayedSubmission/RequestTest.ProgressTagMulti/5 (but /2 hung once too), UCX_PROTO_ENABLE=y didn't help, neither did #30 .

wence- commented 1 year ago

Here's a backtrace (not tried with #30 yet)

Thread 6 (Thread 0x7fbf69109000 (LWP 34708) "UCXX_TEST"):
#0  0x00007fbf73f49fde in epoll_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf73dbd502 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7fbf69106774, timeout_ms=timeout_ms@entry=0, event_set_handler=event_set_handler@entry=0x7fbf73e0b6b0 <uct_tcp_iface_handle_events>, arg=arg@entry=0x7fbf69106770) at sys/event_set.c:198
#2  0x00007fbf73e0a5ce in uct_tcp_iface_progress (tl_iface=0x5608c496a9c0) at tcp/tcp_iface.c:333
#3  0x00007fbf745e1b9a in ucs_callbackq_dispatch (cbq=<optimized out>) at /usr/local/src/conda/ucx-1.14.0/src/ucs/datastruct/callbackq.h:211
#4  uct_worker_progress (worker=<optimized out>) at /usr/local/src/conda/ucx-1.14.0/src/uct/api/uct.h:2768
#5  ucp_worker_progress (worker=0x5608c4996e80) at core/ucp_worker.c:2807
#6  0x00007fbf7469acde in ucxx::Worker::progressOnce (this=this@entry=0x5608c49035a0) at /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker.cpp:178
#7  0x00007fbf7469ad0c in ucxx::Worker::progressPending (this=this@entry=0x5608c49035a0) at /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker.cpp:184
#8  0x00007fbf7469d0d0 in ucxx::Worker::progress (this=0x5608c49035a0) at /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker.cpp:192
#9  0x00007fbf7469e42c in std::function<bool ()>::operator()() const (this=0x7fbf691068a0) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_function.h:590
#10 ucxx::WorkerProgressThread::progressUntilSync(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>) (progressFunction=..., stop=@0x5608c49965a8: false, startCallback=..., startCallbackArg=<optimized out>, delayedSubmissionCollection=...) at /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker_progress_thread.cpp:24
#11 0x00007fbf7469eb6e in std::__invoke_impl<void, void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >(std::__invoke_other, void (*&&)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>&&, std::reference_wrapper<bool>&&, std::function<void (void*)>&&, void*&&, std::shared_ptr<ucxx::DelayedSubmissionCollection>&&) (__f=<optimized out>, __f=<optimized out>) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/invoke.h:61
#12 std::__invoke<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >(void (*&&)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>&&, std::reference_wrapper<bool>&&, std::function<void (void*)>&&, void*&&, std::shared_ptr<ucxx::DelayedSubmissionCollection>&&) (__fn=<optimized out>) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/invoke.h:96
#13 std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul, 5ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul, 5ul>) (this=<optimized out>) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_thread.h:253
#14 std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > >::operator()() (this=<optimized out>) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_thread.h:260
#15 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > > >::_M_run() (this=<optimized out>) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_thread.h:211
#16 0x00007fbf74119a93 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#17 0x00007fbf73eb8b43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#18 0x00007fbf73f49bb4 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7fbf6a13c000 (LWP 34707) "async"):
#0  0x00007fbf73f49fde in epoll_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf73dbd502 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7fbf6a1398ac, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7fbf73da5650 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7fbf6a1398b0) at sys/event_set.c:198
#2  0x00007fbf73da5d94 in ucs_async_thread_func (arg=0x5608c49130b0) at async/thread.c:131
#3  0x00007fbf73eb8b43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007fbf73f49bb4 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7fbf63d1d000 (LWP 34706) "UCXX_TEST"):
#0  0x00007fbf73eb5197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf73eb7ac1 in pthread_cond_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fbf72004759 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
--Type <RET> for more, q to quit, c to continue without paging--
#3  0x00007fbf72095998 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fbf73eb8b43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fbf73f49bb4 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7fbf68908000 (LWP 34705) "cuda-EvtHandlr"):
#0  0x00007fbf73f3cd7f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf72093069 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fbf7214dc0b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fbf72095998 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fbf73eb8b43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fbf73f49bb4 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7fbf6a93d000 (LWP 34702) "cuda-EvtHandlr"):
#0  0x00007fbf73f3cd7f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbf72093069 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fbf7214dc0b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fbf72095998 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fbf73eb8b43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fbf73f49bb4 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7fbf744b4000 (LWP 34697) "UCXX_TEST"):
#0  waitRequestsTagMulti(std::shared_ptr<ucxx::Worker>, std::vector<std::shared_ptr<ucxx::RequestTagMulti>, std::allocator<std::shared_ptr<ucxx::RequestTagMulti> > > const&, std::function<void ()> const&) (worker=..., requests=..., progressWorker=...) at /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_function.h:247
#1  0x00005608c2ea0a67 in (anonymous namespace)::RequestTest_ProgressTagMulti_Test::TestBody (this=0x5608c48fced0) at /home/wence/Documents/src/rapids/first-party/ucxx/cpp/tests/request.cpp:188
#2  0x00007fbf7458498e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so
#3  0x00007fbf74584be1 in testing::Test::Run() () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so
#4  0x00007fbf74584f0f in testing::TestInfo::Run() () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so
#5  0x00007fbf74585036 in testing::TestSuite::Run() () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so
#6  0x00007fbf745855ec in testing::internal::UnitTestImpl::RunAllTests() () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so
#7  0x00007fbf74585859 in testing::UnitTest::Run() () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so
#8  0x00007fbf7466407d in main () from /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgmock_main.so
#9  0x00007fbf73e4dd90 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fbf73e4de40 in __libc_start_main () from /usr/lib/x86_64-linux-gnu/libc.so.6
#11 0x00005608c2e6e8de in _start ()
wence- commented 1 year ago

I had a try building with -fsanitize=thread but can make neither head nor tail of the results (all the errors look like false positives to me).

pentschev commented 1 year ago

It looks as if we're not progressing the worker, but it seems like we're on an endless loop of that. So we're not exactly deadlocked, but we're waiting for a message to complete.

pentschev commented 1 year ago

I'm stupid. https://github.com/rapidsai/ucxx/pull/32 is supposed to resolve this, I forgot it wasn't merged yet. Can you check if it works with that PR?

wence- commented 1 year ago

Yes, with #32 I don't get a hang, I just have (running the whole test suite) on failure:

/home/wence/Documents/src/rapids/first-party/ucxx/cpp/tests/request.cpp:250: Failure
Value of: status
Expected: is equal to 0
  Actual: 1 (of type ucs_status_t)
[  FAILED  ] ProgressModes/RequestTest.TagUserCallback/9, where GetParam() = (4-byte object <00-00 00-00>, false, 4-byte object <04-00 00-00>, 1) (10 ms)
pentschev commented 1 year ago

Yes, that's why I couldn't merge #32 yet. 🙁

wence- commented 1 year ago

But not all the time :(

pentschev commented 1 year ago

Same for the PR, it happens just enough to make the PR less stable than what is currently running.

wence- commented 1 year ago

Hmm, if I do some not very scientific printf-debugging (basically this diff):

diff --git a/cpp/tests/request.cpp b/cpp/tests/request.cpp
index 723dda7..0b67358 100644
--- a/cpp/tests/request.cpp
+++ b/cpp/tests/request.cpp
@@ -221,6 +221,8 @@ TEST_P(RequestTest, TagUserCallback)

   auto checkStatus = [&requests, &requestStatus](std::shared_ptr<void> data) {
     auto idx = *std::static_pointer_cast<size_t>(data);
+    std::cout << "Running callback" << std::endl;
+    std::cout.flush();
     if (requests[idx] == nullptr) {
       /**
        * Unfortunately, we can't check the status this way if the request completes
@@ -242,12 +244,15 @@ TEST_P(RequestTest, TagUserCallback)
   requests[0] = _ep->tagSend(_sendPtr[0], _messageSize, 0, false, checkStatus, sendIndex);
   requests[1] = _ep->tagRecv(_recvPtr[0], _messageSize, 0, false, checkStatus, recvIndex);
   waitRequests(_worker, requests, _progressWorker);
+  std::cout << "Waited" << std::endl;
+  std::cout.flush();

   copyResults();

   // Assert status was set before user callback is executed
   for (const auto status : requestStatus)
-    ASSERT_THAT(status, UCS_OK);
+    if (status != UCS_OK) abort();
+  //    ASSERT_THAT(status, UCS_OK);
   for (const auto request : requests)
     ASSERT_THAT(request->getStatus(), UCS_OK);

I see that sometimes the "Running Callback" print comes after the "Waited" one. Is there somehow a lack of synchronisation somewhere? I suppose the callback is run on a different thread from the main test thread.

wence- commented 1 year ago

Sometimes address santizer complains like this: (compiled with -g -O0 -fsanitize=address, running with ./UCXX_TEST --gtest_filter=-RMM*)

 RUN      ] DelayedSubmission/RequestTest.TagUserCallback/0
Running callback
Waited
Running callback
=================================================================
==55891==ERROR: AddressSanitizer: heap-use-after-free on address 0x61200002ba28 at pc 0x55b58c340b1d bp 0x7fbdfa618340 sp 0x7fbdfa618338
READ of size 8 at 0x61200002ba28 thread T131
    #0 0x55b58c340b1c in (anonymous namespace)::RequestTest_TagUserCallback_Test::TestBody()::{lambda(std::shared_ptr<void>)#1}::operator()(std::shared_ptr<void>) const (/home/wence/Documents/src/rapids/first-party/ucxx/cpp/build/gtests/UCXX_TEST+0xc7b1c)
    #1 0x55b58c3567d2 in __invoke_impl<void, (anonymous namespace)::RequestTest_TagUserCallback_Test::TestBody()::<lambda(std::shared_ptr<void>)>&, std::shared_ptr<void> > /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/invoke.h:61
    #2 0x55b58c352300 in __invoke_r<void, (anonymous namespace)::RequestTest_TagUserCallback_Test::TestBody()::<lambda(std::shared_ptr<void>)>&, std::shared_ptr<void> > /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/invoke.h:111
    #3 0x55b58c34ce49 in _M_invoke /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_function.h:290
    #4 0x7fbe090e0919 in std::function<void (std::shared_ptr<void>)>::operator()(std::shared_ptr<void>) const /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_function.h:590
    #5 0x7fbe090e0919 in ucxx::Request::callback(void*, ucs_status_t) /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/request.cpp:108
    #6 0x7fbe09062a64 in ucp_request_complete_tag_recv /usr/local/src/conda/ucx-1.14.0/src/ucp/core/ucp_request.inl:246
    #7 0x7fbe09062a64 in ucp_eager_tagged_handler tag/eager_rcv.c:107
    #8 0x7fbe09062a64 in ucp_eager_only_handler tag/eager_rcv.c:141
    #9 0x7fbe07de842b in uct_iface_invoke_am /usr/local/src/conda/ucx-1.14.0/src/uct/base/uct_iface.h:904
    #10 0x7fbe07de842b in uct_tcp_ep_comp_recv_am tcp/tcp_ep.c:1289
    #11 0x7fbe07de842b in uct_tcp_ep_progress_am_rx tcp/tcp_ep.c:1421
    #12 0x7fbe07de842b in uct_tcp_ep_progress_data_rx tcp/tcp_ep.c:1526
    #13 0x7fbe07deb6d7 in uct_tcp_iface_handle_events tcp/tcp_iface.c:316
    #14 0x7fbe07d9d558 in ucs_event_set_wait sys/event_set.c:215
    #15 0x7fbe07dea5cd in uct_tcp_iface_progress tcp/tcp_iface.c:333
    #16 0x7fbe0902fb99 in ucs_callbackq_dispatch /usr/local/src/conda/ucx-1.14.0/src/ucs/datastruct/callbackq.h:211
    #17 0x7fbe0902fb99 in uct_worker_progress /usr/local/src/conda/ucx-1.14.0/src/uct/api/uct.h:2768
    #18 0x7fbe0902fb99 in ucp_worker_progress core/ucp_worker.c:2807
    #19 0x7fbe090e8ced in ucxx::Worker::progressOnce() /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker.cpp:178
    #20 0x7fbe090e8d1b in ucxx::Worker::progressPending() /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker.cpp:184
    #21 0x7fbe090eb0df in ucxx::Worker::progress() /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker.cpp:192
    #22 0x7fbe090ec43b in std::function<bool ()>::operator()() const /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_function.h:590
    #23 0x7fbe090ec43b in ucxx::WorkerProgressThread::progressUntilSync(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>) /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/worker_progress_thread.cpp:24
    #24 0x7fbe090ecb7d in void std::__invoke_impl<void, void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >(std::__invoke_other, void (*&&)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>&&, std::reference_wrapper<bool>&&, std::function<void (void*)>&&, void*&&, std::shared_ptr<ucxx::DelayedSubmissionCollection>&&) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/invoke.h:61
    #25 0x7fbe090ecb7d in std::__invoke_result<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >::type std::__invoke<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >(void (*&&)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>&&, std::reference_wrapper<bool>&&, std::function<void (void*)>&&, void*&&, std::shared_ptr<ucxx::DelayedSubmissionCollection>&&) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/invoke.h:96
    #26 0x7fbe090ecb7d in void std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul, 5ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul, 5ul>) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_thread.h:253
    #27 0x7fbe090ecb7d in std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > >::operator()() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_thread.h:260
    #28 0x7fbe090ecb7d in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > > >::_M_run() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/std_thread.h:211
    #29 0x7fbe08119a92 in execute_native_thread_routine ../../../../../libstdc++-v3/src/c++11/thread.cc:82
    #30 0x7fbe07eb8b42 in start_thread nptl/pthread_create.c:442
    #31 0x7fbe07f49bb3 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x125bb3)

0x61200002ba28 is located 104 bytes inside of 280-byte region [0x61200002b9c0,0x61200002bad8)
freed by thread T0 here:
    #0 0x7fbe08693d3f in operator delete(void*, unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x55b58c2b3ab1 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/shared_ptr_base.h:168
    #2 0x55b58c2b11b6 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/shared_ptr_base.h:705
    #3 0x55b58c31f015 in std::__shared_ptr<ucxx::Request, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/shared_ptr_base.h:1154
    #4 0x55b58c31f031 in std::shared_ptr<ucxx::Request>::~shared_ptr() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/shared_ptr.h:122
    #5 0x55b58c32ebae in void std::_Destroy<std::shared_ptr<ucxx::Request> >(std::shared_ptr<ucxx::Request>*) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_construct.h:151
    #6 0x55b58c32c064 in void std::_Destroy_aux<false>::__destroy<std::shared_ptr<ucxx::Request>*>(std::shared_ptr<ucxx::Request>*, std::shared_ptr<ucxx::Request>*) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_construct.h:163
    #7 0x55b58c329241 in void std::_Destroy<std::shared_ptr<ucxx::Request>*>(std::shared_ptr<ucxx::Request>*, std::shared_ptr<ucxx::Request>*) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_construct.h:196
    #8 0x55b58c3274fa in void std::_Destroy<std::shared_ptr<ucxx::Request>*, std::shared_ptr<ucxx::Request> >(std::shared_ptr<ucxx::Request>*, std::shared_ptr<ucxx::Request>*, std::allocator<std::shared_ptr<ucxx::Request> >&) /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/alloc_traits.h:848
    #9 0x55b58c32575c in std::vector<std::shared_ptr<ucxx::Request>, std::allocator<std::shared_ptr<ucxx::Request> > >::~vector() /home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_vector.h:680
    #10 0x55b58c341ff6 in (anonymous namespace)::RequestTest_TagUserCallback_Test::TestBody() (/home/wence/Documents/src/rapids/first-party/ucxx/cpp/build/gtests/UCXX_TEST+0xc8ff6)
    #11 0x7fbe085e698d in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so+0x4c98d)

previously allocated by thread T0 here:
    #0 0x7fbe086931f2 in operator new(unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7fbe090e3c4d in ucxx::createRequestTag(std::shared_ptr<ucxx::Component>, bool, void*, unsigned long, unsigned long, bool, std::function<void (std::shared_ptr<void>)>, std::shared_ptr<void>) /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/request_tag.cpp:33
    #2 0x7fbe090daa53 in ucxx::Endpoint::tagRecv(void*, unsigned long, unsigned long, bool, std::function<void (std::shared_ptr<void>)>, std::shared_ptr<void>) /home/wence/Documents/src/rapids/first-party/ucxx/cpp/src/endpoint.cpp:234
    #3 0x55b58c3416b2 in (anonymous namespace)::RequestTest_TagUserCallback_Test::TestBody() (/home/wence/Documents/src/rapids/first-party/ucxx/cpp/build/gtests/UCXX_TEST+0xc86b2)
    #4 0x7fbe085e698d in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/wence/Documents/src/rapids/compose/etc/conda/cuda_11.8/envs/rapids/lib/libgtest.so+0x4c98d)

Thread T131 created by T0 here:
    #0 0x7fbe0866f6fb in __interceptor_pthread_create ../../../../libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x7fbe081199f1 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/x86_64-conda-linux-gnu/bits/gthr-default.h:663

SUMMARY: AddressSanitizer: heap-use-after-free (/home/wence/Documents/src/rapids/first-party/ucxx/cpp/build/gtests/UCXX_TEST+0xc7b1c) in (anonymous namespace)::RequestTest_TagUserCallback_Test::TestBody()::{lambda(std::shared_ptr<void>)#1}::operator()(std::shared_ptr<void>) const
Shadow bytes around the buggy address:
  0x0c247fffd6f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd700: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd710: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd720: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd730: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x0c247fffd740: fd fd fd fd fd[fd]fd fd fd fd fd fd fd fd fd fd
  0x0c247fffd750: fd fd fd fd fd fd fd fd fd fd fd fa fa fa fa fa
  0x0c247fffd760: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd770: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd780: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c247fffd790: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==55891==ABORTING
wence- commented 1 year ago

Which kind of aligns with the hypothesis that there's no synchronisation between the test main thread and the one running the callbacks. That use-after-free looks like the main thread has got to the end of the test, deallocated the requests vector, and now the thread running the callback is trying to access it.

Note I have this diff in the test:

diff --git a/cpp/tests/request.cpp b/cpp/tests/request.cpp
index 723dda7..81cfce4 100644
--- a/cpp/tests/request.cpp
+++ b/cpp/tests/request.cpp
@@ -221,6 +221,8 @@ TEST_P(RequestTest, TagUserCallback)

   auto checkStatus = [&requests, &requestStatus](std::shared_ptr<void> data) {
     auto idx = *std::static_pointer_cast<size_t>(data);
+    std::cout << "Running callback" << std::endl;
+    std::cout.flush();
     if (requests[idx] == nullptr) {
       /**
        * Unfortunately, we can't check the status this way if the request completes
@@ -231,6 +233,7 @@ TEST_P(RequestTest, TagUserCallback)
        */
       requestStatus[idx] = UCS_OK;
     } else {
+      if (requests[idx]->getStatus() != UCS_OK) abort();
       requestStatus[idx] = requests[idx]->getStatus();
     }
   };
@@ -242,12 +245,15 @@ TEST_P(RequestTest, TagUserCallback)
   requests[0] = _ep->tagSend(_sendPtr[0], _messageSize, 0, false, checkStatus, sendIndex);
   requests[1] = _ep->tagRecv(_recvPtr[0], _messageSize, 0, false, checkStatus, recvIndex);
   waitRequests(_worker, requests, _progressWorker);
+  std::cout << "Waited" << std::endl;
+  std::cout.flush();

   copyResults();

   // Assert status was set before user callback is executed
-  for (const auto status : requestStatus)
-    ASSERT_THAT(status, UCS_OK);
+  // for (const auto status : requestStatus)
+  //   if (status != UCS_OK) abort();
+  //    ASSERT_THAT(status, UCS_OK);
   for (const auto request : requests)
     ASSERT_THAT(request->getStatus(), UCS_OK);
pentschev commented 1 year ago

Based on your debug output it seems that you're right, interestingly this seems to occur only for UCX_TLS=tcp, but not for UCX_TLS=rc. I'll see if I can understand what kind of synchronization we're missing.

pentschev commented 1 year ago

I think what we need is to actually split the status the callback sees from what the user sees (request's result). By setting the request's status before calling the callback it may really be the case that the user will check the status before the callback runs. I'm thinking the better way is to switch back setting the status after the callback runs, and insert the status as part of the callback, similar to what UCP does ucp_send_nbx_callback_t that it receives the status and the user argument. Essentially we would change ucxx::Request::_callback by std::function<void(ucs_status_t, std::shared_ptr<void>)>. Thoughts on this?

wence- commented 1 year ago

I think it makes sense to mimic ucp_send_nbx_callback_t. That also matches the principle of least surprise for a user coming from the C API.