Closed: barius closed this issue 4 months ago.
Seems like my cluster configuration has something to do with this. Restarting the scheduler and the workers and then doing a fresh start seems to help.
Closing, as the issue was resolved by a fresh restart of the scheduler and workers.
Describe the bug
I have 2 nodes with 8 A100s each, using dask-scheduler and dask-cuda-worker to start a 2-node, 16-GPU cluster. KMeans MNMG hangs when K grows to around 8000. A smaller K (1000) works fine, and a single node with K = 50000 is also fine. When cuKMeans.fit() gets stuck, GPU utilization stays at 100% but power consumption is very low. Killing the script does not release the GPUs (they stay at 100% utilization) until the dask-cuda-worker processes are killed. Restarting the workers and the script does not help.
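For context, this is roughly the kind of script that hangs; the sketch below assumes the public cuml.dask KMeans API with synthetic data from cuml.dask.datasets.make_blobs, and the data sizes, scheduler address, and variable names are placeholders rather than my exact code:

```python
# Rough sketch of the MNMG KMeans call that hangs (data sizes, scheduler
# address, and names are placeholders, not the exact production script).
from dask.distributed import Client
from cuml.dask.datasets import make_blobs
from cuml.dask.cluster import KMeans

client = Client("host_ip:8786")  # scheduler started separately with dask-scheduler

# Synthetic data distributed over the 16 GPU workers (placeholder sizes).
X, _ = make_blobs(n_samples=1_000_000, n_features=64,
                  centers=8000, client=client)

cu_kmeans = KMeans(n_clusters=8000, client=client)
cu_kmeans.fit(X)  # hangs around here: GPU util pinned at 100%, power draw very low
```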
Steps/Code to reproduce bug
Run dask-scheduler on one node and dask-cuda-worker host_ip:8786 on each node; the cluster starts successfully (a quick worker-count check is sketched below).

Expected behavior
2-node MNMG KMeans works under large K.
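As a sanity check after starting the scheduler and workers and before calling fit, I can confirm that all 16 workers have registered; a minimal sketch using a standard dask.distributed Client (the scheduler address is a placeholder):

```python
# Verify that all 16 dask-cuda workers joined the scheduler before fitting
# (scheduler address is a placeholder).
from dask.distributed import Client

client = Client("host_ip:8786")
client.wait_for_workers(n_workers=16)           # block until 16 workers register
print(len(client.scheduler_info()["workers"]))  # expect 16
```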