rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

[BUG]: `test_triangle_count_mg` Hangs on `test_triangles` #4504

Closed: nv-rliu closed this issue 1 week ago

nv-rliu commented 3 weeks ago

Version

24.08

Which installation method(s) does this occur on?

Docker, Conda, Pip, Source

Describe the bug.

If I run the entire test file with `pytest --import-mode=append test_triangle_count_mg.py`, the run hangs once it reaches `test_triangles`.

However, if I run only that test with `pytest --import-mode=append test_triangle_count_mg.py::test_triangles`, all 4 test cases pass.

I have been able to reproduce this on dgx18 with both 2 GPUs and 8 GPUs.

Minimum reproducible example

cd cugraph/python/cugraph/cugraph/tests/community/
pytest --import-mode=append test_triangle_count_mg.py
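
For contrast, the single-test invocation from the description above (the same command with the `::test_triangles` node id appended) passes:

pytest --import-mode=append test_triangle_count_mg.py::test_triangles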

Relevant log output

06/25/24-09:36:30.353826845_UTC>>>> NODE 0: ******** STARTING TESTS FROM: tests/community/test_triangle_count_mg.py, using 2 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/python/cugraph
configfile: pytest.ini
plugins: benchmark-4.0.0, cov-5.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 8 items

tests/community/test_triangle_count_mg.py::test_sg_triangles[True-dataset0] [1719308199.470541] [rno1-m02-d01-dgx1-054:3448418:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719308199.470541] [rno1-m02-d01-dgx1-054:3448418:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719308199.470236] [rno1-m02-d01-dgx1-054:3448421:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719308199.470236] [rno1-m02-d01-dgx1-054:3448421:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)

Dask client/cluster created using LocalCUDACluster
PASSED
tests/community/test_triangle_count_mg.py::test_sg_triangles[True-dataset1] PASSED
tests/community/test_triangle_count_mg.py::test_sg_triangles[False-dataset0] PASSED
tests/community/test_triangle_count_mg.py::test_sg_triangles[False-dataset1] PASSED
tests/community/test_triangle_count_mg.py::test_triangles[True-dataset0] 2024-06-25 02:46:36,775 - distributed.scheduler - WARNING - Worker failed to heartbeat for 573s; attempting restart: <WorkerState 'tcp://127.0.0.1:33497', name: 0, status: running, memory: 1, processing: 1>
2024-06-25 02:46:36,776 - distributed.scheduler - WARNING - Worker failed to heartbeat for 573s; attempting restart: <WorkerState 'tcp://127.0.0.1:36725', name: 1, status: running, memory: 1, processing: 1>
2024-06-25 02:46:40,781 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-06-25 02:46:40,781 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-06-25 02:46:41,504 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:33497' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-1a8448be-f346-4fb3-8c1a-d30e8927ceea'} (stimulus_id='handle-worker-cleanup-1719308801.5040765')
2024-06-25 02:46:41,615 - distributed.nanny - WARNING - Restarting worker
2024-06-25 02:46:41,678 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:36725' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-fe0f6523-9050-4b79-a59d-27b05ae06c06'} (stimulus_id='handle-worker-cleanup-1719308801.6782262')
2024-06-25 02:46:41,771 - distributed.nanny - WARNING - Restarting worker
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-06-25 03:36:32,399 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-06-25 03:36:32,399 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = gen.convert_yielded(ret)
  File "/opt/conda/lib/python3.10/functools.py", line 884, in wrapper
    def wrapper(*args, **kw):
KeyboardInterrupt
sys:1: RuntimeWarning: coroutine 'PeriodicCallback._run' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2024-06-25 03:37:01,016 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33497 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:33497 after 30 s
2024-06-25 03:37:01,017 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:36725 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:36725 after 30 s

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 546, in connect
    stream = await self.client.connect(
  File "/opt/conda/lib/python3.10/site-packages/tornado/tcpclient.py", line 279, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
    comm = await wait_for(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 1961, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/pytest", line 10, in <module>
    sys.exit(console_main())
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206, in console_main
    code = main()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178, in main
    ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 332, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 320, in wrap_session
    config.hook.pytest_sessionfinish(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 872, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 867, in pytest_sessionfinish
    result = yield
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 140, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 110, in pytest_sessionfinish
    session._setupstate.teardown_exact(None)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1023, in finish
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1012, in finish
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 896, in _teardown_yield_fixture
    next(it)
  File "/root/cugraph/python/cugraph/cugraph/tests/conftest.py", line 52, in dask_client
    stop_dask_client(dask_client, dask_cluster)
  File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 178, in stop_dask_client
    Comms.destroy()
  File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 216, in destroy
    __instance.destroy()
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
    self.client.run(
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2991, in run
    return self.sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 358, in sync
    return sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 434, in sync
    raise error
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 408, in f
    result = yield future
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2896, in _run
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6458, in send_message
    comm = await self.rpc.connect(addr)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1674, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1564, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:33497 after 30 s
Exception ignored in: <function Comms.__del__ at 0x15089c0e4670>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2991, in run
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 358, in sync
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 425, in sync
  File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
06/25/24-10:37:02.265345794_UTC>>>> ERROR: command timed out after 3600 seconds
06/25/24-10:37:02.266951355_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
06/25/24-10:37:02.369564140_UTC>>>> NODE 0: remaining python processes: [ 3437613 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
06/25/24-10:37:02.403092539_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

Using nightly containers from here:
gitlab-master.nvidia.com:5005/rratzel/cugraph-testing/cugraph_testing-24.08

Other/Misc.

No response

Code of Conduct

jnke2016 commented 2 weeks ago

Our nightly MG triangle_count run hangs on 4+ GPUs in the C++ test below.

INSTANTIATE_TEST_SUITE_P(
  file_tests,
  Tests_MGTriangleCount_File,
  ::testing::Combine(
    // enable correctness checks
    ::testing::Values(TriangleCount_Usecase{0.1, false}),
    ::testing::Values(cugraph::test::File_Usecase("test/datasets/karate.mtx"))));

The hang occurs when the vertex_subset_ratio is small relative to the number of ranks, which translates to certain ranks having no vertices to process.
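
For a rough sense of the arithmetic (a hypothetical sketch, not cuGraph's actual partitioning logic): karate.mtx has 34 vertices, so a vertex_subset_ratio of 0.1 selects roughly 3 of them, and spreading 3 vertices evenly across 4 or more ranks necessarily leaves some ranks empty. If an empty rank then skips a collective step that the busy ranks still enter, those ranks wait indefinitely, which is consistent with the hang appearing only at 4+ GPUs.

# Hypothetical illustration of the empty-rank scenario described above;
# numbers assume karate.mtx (34 vertices) and an even split of the subset.
num_vertices = 34          # vertices in test/datasets/karate.mtx
vertex_subset_ratio = 0.1  # from TriangleCount_Usecase{0.1, false}

for num_ranks in (2, 4, 8):
    subset_size = max(1, int(num_vertices * vertex_subset_ratio))  # ~3 vertices
    # Even split: each rank gets floor(subset_size / num_ranks), plus one for the remainder ranks.
    per_rank = [subset_size // num_ranks + (1 if r < subset_size % num_ranks else 0)
                for r in range(num_ranks)]
    empty = sum(1 for n in per_rank if n == 0)
    print(f"{num_ranks} ranks -> per-rank subset sizes {per_rank}, {empty} empty rank(s)")

Running this prints 0 empty ranks at 2 ranks, but 1 and 5 empty ranks at 4 and 8 ranks respectively.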