rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

[BUG]: `test_k_truss_subgraph` Memory Error on 2-GPUs #4617

Open nv-rliu opened 3 weeks ago

nv-rliu commented 3 weeks ago

Version

24.10

Which installation method(s) does this occur on?

Conda

Describe the bug.

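When running tests/community/test_k_truss_subgraph_mg.py on 2 GPUs on draco-rno, the test hits a memory error and fails: both Dask workers crash with a segmentation fault (signal 11) inside NCCL (`ncclGroupEnd`), reached from `edge_triangle_count` during `cugraph_k_truss_subgraph`. The workers are then restarted, and the run is killed when the command times out after 900 seconds.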

Minimum reproducible example

pytest -v --import-mode=append test_k_truss_subgraph_mg.py

Relevant log output

08/19/24-09:31:10.033434924_UTC>>>> NODE 0: ******** STARTING TESTS FROM: tests/community/test_k_truss_subgraph_mg.py, using 2 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
rapids_pytest_benchmark: 0.0.15
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /root/cugraph/python/cugraph
configfile: pytest.ini
plugins: cov-5.0.0, rapids-pytest-benchmark-0.0.15, benchmark-4.0.0
collecting ... collected 18 items

tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset0] 
Dask client/cluster created using LocalCUDACluster
PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset1] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset2] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset0] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset1] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset2] [rno1-m02-f01-dgx1-116:3535017:0:3535158] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[rno1-m02-f01-dgx1-116:3535020:0:3535160] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3535158) ====
 0  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(ucs_handle_error+0x2fd) [0x14eafd95dcfd]
 1  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2def4) [0x14eafd95def4]
 2  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2e0ba) [0x14eafd95e0ba]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x14eb71edf520]
 4  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x55c60) [0x14ead105dc60]
 5  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3217e) [0x14ead103a17e]
 6  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x345bf) [0x14ead103c5bf]
 7  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x38e73) [0x14ead1040e73]
 8  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3bfc5) [0x14ead1043fc5]
 9  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3d183) [0x14ead1045183]
10  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(ncclGroupEnd+0x6a) [0x14ead104592a]
11  /opt/conda/lib/python3.10/site-packages/raft_dask/common/comms_utils.cpython-310-x86_64-linux-gnu.so(+0x32573) [0x14eafdb2b573]
12  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(+0x348a069) [0x14e9e43ef069]
13  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph6detail24edge_triangle_count_implIiiLb0ELb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT2_EvEES5_EERKN4raft8handle_tERKNS3_IS4_S5_XT1_EXT2_EvEE+0x86f) [0x14e9e43f6bef]
14  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph19edge_triangle_countIiiLb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT1_EvEES4_EERKN4raft8handle_tERKS5_+0xa) [0x14e9e43f994a]
15  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph7k_trussIiifLb1EEESt5tupleIJN3rmm14device_uvectorIT_EES5_St8optionalINS3_IT1_EEEEERKN4raft8handle_tERKNS_12graph_view_tIS4_T0_Lb0EXT2_EvEES6_INS_20edge_property_view_tISG_PKS7_N6thrust15iterator_traitsISM_E10value_typeEEEESG_b+0x10f9) [0x14e9e55d9bd9]
16  /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(+0x1e0b4f) [0x14e958121b4f]
17  /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(cugraph_k_truss_subgraph+0xde) [0x14e9581297ae]
18  /opt/conda/lib/python3.10/site-packages/pylibcugraph/k_truss_subgraph.cpython-310-x86_64-linux-gnu.so(+0x6cde) [0x14eaad202cde]
19  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x13ca) [0x560dda68c8fa]
20  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
21  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x560dda68e2b3]
22  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
23  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x320) [0x560dda68b850]
24  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
25  /opt/conda/bin/python3.10(+0x25f60c) [0x560dda7b660c]
26  /opt/conda/bin/python3.10(+0xfdd90) [0x560dda654d90]
27  /opt/conda/bin/python3.10(+0x13c2a3) [0x560dda6932a3]
28  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x5cd5) [0x560dda691205]
29  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
30  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x560dda68e2b3]
31  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
32  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x560dda68bc5c]
33  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
34  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x560dda68e2b3]
35  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
36  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x560dda68bc5c]
37  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
38  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x560dda68bc5c]
39  /opt/conda/bin/python3.10(+0x150804) [0x560dda6a7804]
40  /opt/conda/bin/python3.10(+0x228372) [0x560dda77f372]
41  /opt/conda/bin/python3.10(+0x228324) [0x560dda77f324]
42  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x14eb71f31ac3]
43  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x14eb71fc3850]
=================================
==== backtrace (tid:3535160) ====
 0  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(ucs_handle_error+0x2fd) [0x154101c86cfd]
 1  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2def4) [0x154101c86ef4]
 2  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2e0ba) [0x154101c870ba]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x15416c0f9520]
 4  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x55c60) [0x1540c905dc60]
 5  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3217e) [0x1540c903a17e]
 6  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x345bf) [0x1540c903c5bf]
 7  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x38e73) [0x1540c9040e73]
 8  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3bfc5) [0x1540c9043fc5]
 9  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3d183) [0x1540c9045183]
10  /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(ncclGroupEnd+0x6a) [0x1540c904592a]
11  /opt/conda/lib/python3.10/site-packages/raft_dask/common/comms_utils.cpython-310-x86_64-linux-gnu.so(+0x32573) [0x154101e92573]
12  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(+0x348a069) [0x153fdea7e069]
13  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph6detail24edge_triangle_count_implIiiLb0ELb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT2_EvEES5_EERKN4raft8handle_tERKNS3_IS4_S5_XT1_EXT2_EvEE+0x86f) [0x153fdea85bef]
14  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph19edge_triangle_countIiiLb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT1_EvEES4_EERKN4raft8handle_tERKS5_+0xa) [0x153fdea8894a]
15  /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph7k_trussIiifLb1EEESt5tupleIJN3rmm14device_uvectorIT_EES5_St8optionalINS3_IT1_EEEEERKN4raft8handle_tERKNS_12graph_view_tIS4_T0_Lb0EXT2_EvEES6_INS_20edge_property_view_tISG_PKS7_N6thrust15iterator_traitsISM_E10value_typeEEEESG_b+0x10f9) [0x153fdfc68bd9]
16  /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(+0x1e0b4f) [0x153f52220b4f]
17  /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(cugraph_k_truss_subgraph+0xde) [0x153f522287ae]
18  /opt/conda/lib/python3.10/site-packages/pylibcugraph/k_truss_subgraph.cpython-310-x86_64-linux-gnu.so(+0x6cde) [0x1540a2db9cde]
19  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x13ca) [0x56501c88e8fa]
20  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
21  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x56501c8902b3]
22  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
23  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x320) [0x56501c88d850]
24  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
25  /opt/conda/bin/python3.10(+0x25f60c) [0x56501c9b860c]
26  /opt/conda/bin/python3.10(+0xfdd90) [0x56501c856d90]
27  /opt/conda/bin/python3.10(+0x13c2a3) [0x56501c8952a3]
28  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x5cd5) [0x56501c893205]
29  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
30  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x56501c8902b3]
31  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
32  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x56501c88dc5c]
33  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
34  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x56501c8902b3]
35  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
36  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x56501c88dc5c]
37  /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
38  /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x56501c88dc5c]
39  /opt/conda/bin/python3.10(+0x150804) [0x56501c8a9804]
40  /opt/conda/bin/python3.10(+0x228372) [0x56501c981372]
41  /opt/conda/bin/python3.10(+0x228324) [0x56501c981324]
42  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x15416c14bac3]
43  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x15416c1dd850]
=================================
2024-08-19 02:33:04,329 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:44047' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-9385f2a5-2e1f-4f4f-99c9-556a0d63fd42'} (stimulus_id='handle-worker-cleanup-1724059984.328924')
2024-08-19 02:33:04,424 - distributed.nanny - WARNING - Restarting worker
2024-08-19 02:33:04,487 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:46495' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-ea9ace46-435d-45b8-bbe5-7f0ba731f9de'} (stimulus_id='handle-worker-cleanup-1724059984.487717')
2024-08-19 02:33:04,584 - distributed.nanny - WARNING - Restarting worker
2024-08-19 02:46:12,072 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-08-19 02:46:12,072 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1015, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1015, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-08-19 02:46:40,513 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:44047 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:44047 after 30 s
2024-08-19 02:46:40,514 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:46495 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:46495 after 30 s

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x145f18c168c0>: ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/pytest", line 10, in <module>
    sys.exit(console_main())
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201, in console_main
    code = main()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175, in main
    ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 318, in wrap_session
    config.hook.pytest_sessionfinish(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 870, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 893, in pytest_sessionfinish
    result = yield
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 141, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 107, in pytest_sessionfinish
    session._setupstate.teardown_exact(None)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1031, in finish
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1020, in finish
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 906, in _teardown_yield_fixture
    next(it)
  File "/root/cugraph/python/cugraph/cugraph/tests/conftest.py", line 52, in dask_client
    stop_dask_client(dask_client, dask_cluster)
  File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 182, in stop_dask_client
    Comms.destroy()
  File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 214, in destroy
    __instance.destroy()
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
    self.client.run(
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3192, in run
    return self.sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 363, in sync
    return sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 439, in sync
    raise error
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 413, in f
    result = yield future
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3097, in _run
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6653, in send_message
    comm = await self.rpc.connect(addr)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1535, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1425, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:44047 after 30 s
Exception ignored in: <function Comms.__del__ at 0x146005005120>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3192, in run
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 363, in sync
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 430, in sync
  File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
08/19/24-09:46:41.556836520_UTC>>>> ERROR: command timed out after 900 seconds
08/19/24-09:46:41.557996984_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
08/19/24-09:46:41.633951285_UTC>>>> NODE 0: remaining python processes: [ 3526387 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
08/19/24-09:46:41.657663379_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

Running on 1 node with 2 GPUs on draco-rno, using LocalCUDACluster.

Other/Misc.

I was unable to reproduce this failure on the lab machines. On draco-rno, the failure also shows up inside an interactive Slurm session, without running the entire suite of cuGraph MG tests.
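To isolate the crashing case without the rest of the suite, one option is to pull the last test ID that never received an outcome from the `pytest -v` log and rerun only that parametrization. The following is a minimal sketch, not part of the cuGraph test tooling: the helper name `unfinished_test` is hypothetical, and the sample lines are abbreviated copies of the log output above.

```python
import re
import shlex

# Outcome words that pytest -v prints after a test ID.
OUTCOMES = {"PASSED", "FAILED", "SKIPPED", "ERROR", "XFAIL", "XPASS"}
# A verbose pytest line starts with "<path>.py::<test name>[params]".
TEST_ID = re.compile(r"^(\S+\.py::\S+)\s*(.*)$")


def unfinished_test(log_lines):
    """Return the last test ID that has no outcome, or None (hypothetical helper)."""
    pending = None
    for line in log_lines:
        match = TEST_ID.match(line)
        if match:
            pending = match.group(1)
            rest = match.group(2).split()
            # Outcome printed on the same line as the test ID.
            if rest and rest[0] in OUTCOMES:
                pending = None
        elif line.strip() in OUTCOMES:
            # Outcome printed on its own line after interleaved output.
            pending = None
    return pending


sample = [
    "tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset0] ",
    "Dask client/cluster created using LocalCUDACluster",
    "PASSED",
    "tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset1] PASSED",
    "tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset2] "
    "[rno1-m02-f01-dgx1-116:3535017:0:3535158] Caught signal 11 (Segmentation fault)",
]

node_id = unfinished_test(sample)
# Quote the node ID so the shell does not interpret the square brackets.
command = "pytest -v --import-mode=append " + shlex.quote(node_id)
print(command)
```

The path prefix in the node ID is relative to the pytest rootdir shown in the log, so it may need adjusting when the command is run from inside the test directory, as in the minimum reproducible example.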


nv-rliu commented 3 weeks ago

The failure occurs when using the netscience dataset without renumbering.

nv-rliu commented 3 weeks ago

Reproduction strategy: take the non-renumbered data from Python and convert it to C++ vectors, to see what the algorithm is doing and why it runs into a memory error.
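A minimal sketch of the conversion step, assuming the edge list is available in Python as (source, destination) pairs. The helper `to_cpp_vectors` and the variable names `h_srcs` and `h_dsts` are hypothetical, and the four edges are placeholder values, not netscience data. The default `int32_t` matches the `k_truss<int, int, float, true>` instantiation visible in the backtrace.

```python
def to_cpp_vectors(edges, vertex_type="int32_t"):
    """Render an edge list as C++ std::vector initializers (hypothetical helper).

    Vertex IDs are written out exactly as given, with no renumbering.
    """
    srcs = ", ".join(str(src) for src, _ in edges)
    dsts = ", ".join(str(dst) for _, dst in edges)
    return (
        f"std::vector<{vertex_type}> h_srcs = {{{srcs}}};\n"
        f"std::vector<{vertex_type}> h_dsts = {{{dsts}}};\n"
    )


# Placeholder edges for illustration only; replace with the real edge list.
edges = [(0, 1), (1, 2), (0, 2), (2, 5)]
cpp_snippet = to_cpp_vectors(edges)
print(cpp_snippet)
```

The printed lines can be pasted into a C++ test as host-side input, which keeps the vertex IDs identical to what the Python test passes in.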