pentschev / ucx-py-ci

UCX-Py CI Issue Tracker

Nightly Tests for ucx-master from 2021-09-12 22:00: 2 Failures #260

Open pentschev opened 3 years ago

pentschev commented 3 years ago

Test failures

0 in ucx-py-libs-ib-test
0 in ucx-py-ib-test
0 in ucx-py-libs-nvlink-test
0 in ucx-py-nvlink-test
0 in ucx-py-libs-tcp-test
2 in ucx-py-tcp-test

```python
size = 16, blocking_progress_mode = True, recv_wait = True
data = {'allocator': functools.partial(, dtype=), 'generator': functool...pe=), 'memory_type': 'cuda', 'validator': . at 0x7f04f9c7f700>}

    @pytest.mark.skipif(
        not ucp._libs.ucx_api.is_am_supported(), reason="AM only supported in UCX >= 1.11"
    )
    @pytest.mark.asyncio
    @pytest.mark.parametrize("size", msg_sizes)
    @pytest.mark.parametrize("blocking_progress_mode", [True, False])
    @pytest.mark.parametrize("recv_wait", [True, False])
    @pytest.mark.parametrize("data", get_data())
    async def test_send_recv_bytes(size, blocking_progress_mode, recv_wait, data):
        rndv_thresh = 8192
        ucp.init(
            options={"RNDV_THRESH": str(rndv_thresh)},
            blocking_progress_mode=blocking_progress_mode,
        )

        ucp.register_am_allocator(data["allocator"], data["memory_type"])
        msg = data["generator"](size)

        recv = []
        listener = ucp.create_listener(simple_server(size, recv))
        num_clients = 1
        clients = [
            await ucp.create_endpoint(ucp.get_address(), listener.port)
            for i in range(num_clients)
        ]
        for c in clients:
            if recv_wait:
                # By sleeping here we ensure that the listener's
                # ep.am_recv call will have to wait, rather than return
                # immediately as receive data is already available.
                await asyncio.sleep(1)
            await c.am_send(msg)
        for c in clients:
            await c.close()
        listener.close()

        if data["memory_type"] == "cuda" and msg.nbytes < rndv_thresh:
            # Eager messages are always received on the host, if no host
            # allocator is registered UCX-Py defaults to `bytearray`.
>           assert recv[0] == bytearray(msg.get())
E           IndexError: list index out of range

tests/test_send_recv_am.py:114: IndexError

scope = 'function'

    @pytest.fixture()
    def event_loop(scope="function"):
        loop = asyncio.new_event_loop()
        loop.set_exception_handler(handle_exception)
        ucp.reset()
        yield loop
>       ucp.reset()

tests/test_send_recv_am.py:62:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def reset():
        """Resets the UCX library by shutting down all of UCX.

        The library is initiated at next API call.
        """
        global _ctx
        if _ctx is not None:
            weakref_ctx = weakref.ref(_ctx)
            _ctx = None
            gc.collect()
            if weakref_ctx() is not None:
                msg = (
                    "Trying to reset UCX but not all Endpoints and/or Listeners "
                    "are closed(). The following objects are still referencing "
                    "ApplicationContext: "
                )
                for o in gc.get_referrers(weakref_ctx()):
                    msg += "\n %s" % str(o)
>               raise UCXError(msg)
E               ucp.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationContext:
E                (.server at 0x7f084bc9b160>, , True)

../../../miniconda3/envs/gdf/lib/python3.8/site-packages/ucp/core.py:920: UCXError
```
0 in dask-cuda
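
For anyone reading the second traceback: the check in `reset()` drops the global reference, forces a garbage collection, and then uses a weak reference plus `gc.get_referrers` to list whatever is still keeping the object alive. Below is a minimal standalone sketch of that pattern using only the standard library. `Context` and `describe_referrers` are hypothetical names made up for illustration; this is not UCX-Py code.

```python
import gc
import weakref


class Context:
    """Hypothetical stand-in for an object that should be released on reset."""


def describe_referrers(ref):
    """Return None if the weakly referenced object is gone, otherwise the
    type names of the objects that still refer to it."""
    gc.collect()  # give the collector a chance to free unreachable cycles
    target = ref()
    if target is None:
        return None
    # Report type names rather than the referrers themselves, so the result
    # does not keep the target alive after this function returns.
    return [type(r).__name__ for r in gc.get_referrers(target)]


if __name__ == "__main__":
    holder = [Context()]          # the list plays the role of a lingering owner
    ref = weakref.ref(holder[0])
    print(describe_referrers(ref))
    holder.clear()                # release the last strong reference
    print(describe_referrers(ref))
```

The exact set of referrers reported can vary between Python versions (interpreter frames may or may not show up), so the sketch should be read as showing the shape of the check rather than its precise output.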

Complete test result logs

ucx-py-libs-ib-test
ucx-py-ib-test
ucx-py-libs-nvlink-test
ucx-py-nvlink-test
ucx-py-libs-tcp-test
ucx-py-tcp-test
dask-cuda

pentschev commented 3 years ago

The failing test is test_send_recv_bytes[data2-True-True-16]. Two failures are reported, but there is actually only one: the second report is the teardown of the same test. The test seems flaky; running it 300 times on the same build has not reproduced the failure.
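
A repeated run like the one described can be scripted roughly as follows. This is only a sketch of one way to do it, not necessarily how the 300 runs above were done; `count_failures` and `pytest_command` are hypothetical helper names, and the pytest flags are an assumption.

```python
import subprocess
import sys


def pytest_command(node_id):
    """Build a pytest invocation for a single test node ID (flags are an assumption)."""
    return [sys.executable, "-m", "pytest", "-q", node_id]


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited with a non-zero status."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Assuming it is run from the UCX-Py source tree, the call for this test would presumably be `count_failures(pytest_command("tests/test_send_recv_am.py::test_send_recv_bytes[data2-True-True-16]"), 300)`. A result of 0 would match the observation above.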