rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] KMeans MNMG hangs when K is larger than ~8000 #5958

Closed: barius closed this issue 2 months ago

barius commented 2 months ago

Describe the bug

I have 2 nodes with 8 A100s each and use dask-scheduler and dask-cuda-worker to start a 2-node, 16-GPU cluster. KMeans MNMG hangs once K grows to around 8000. A smaller K (1000) works fine, and a single node with K = 50000 also works. While cuKMeans.fit() is hung, GPU utilization stays at 100% but power consumption is very low. Killing the script does not release the GPUs (they remain at 100%) until the dask-cuda-worker processes are killed. Restarting the workers and the script does not help.

Steps/Code to reproduce bug

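  1. Run dask-scheduler on one node.
  2. Run dask-cuda-worker host_ip:8786 on each node; the cluster starts successfully.
  3. Run the following script:
    from dask.distributed import Client
    from cuml.dask.cluster import KMeans as cuKMeans
    from cuml.dask.datasets import make_blobs

    # Connect to the scheduler started in step 1.
    client = Client('host_ip:8786')
    # Not defined in the original report; one partition per GPU (16) is assumed here.
    n_total_partitions = 16
    num_rows = 1000000
    num_dims = 32
    X_gpu, _ = make_blobs(num_rows,
                      num_dims,
                      centers=2,
                      n_parts=n_total_partitions,
                      cluster_std=0.1,
                      client=client,
                      verbose=True)
    # fit() hangs here on the 2-node cluster.
    kmeans_model = cuKMeans(n_clusters=50000, init='random', max_iter=50)
    kmeans_model.fit(X_gpu)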

Expected behavior

KMeans MNMG on 2 nodes should complete with a large K.

barius commented 2 months ago

It seems my cluster configuration has something to do with this. Restarting the scheduler and the workers and then starting fresh seems to help.

barius commented 2 months ago

Closing, as the issue is resolved by a fresh restart of the scheduler and workers.
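
For anyone who hits a similar hang, here is a minimal sketch of a watchdog that reports when a call such as kmeans_model.fit(X_gpu) has not returned within a time limit, instead of blocking the script forever. The helper name run_with_timeout and the 600-second limit in the usage comment are hypothetical and not part of cuML or Dask. It only detects the hang: it does not stop the stuck call or release the GPUs, which per the report above still requires restarting the scheduler and workers.

```python
import threading


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread.

    Returns (True, result) if fn finishes within timeout_s seconds,
    or (False, None) if it is still running when the timeout expires.
    Exceptions raised by fn are re-raised in the caller.
    """
    outcome = {}

    def target():
        try:
            outcome["result"] = fn(*args, **kwargs)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if fn never returns.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return False, None
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["result"]


# Hypothetical usage with the objects from the reproduction script:
# finished, _ = run_with_timeout(kmeans_model.fit, 600, X_gpu)
# if not finished:
#     print("fit() did not return in time; restart the scheduler and workers")
```

The stuck thread is left running in the background, so this is suited to flagging the problem in a test or batch script rather than recovering from it.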