rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Importing cuml causes all Dask partitions to associate with GPU 0 #5206

Open · hcho3 opened this issue 1 year ago

hcho3 commented 1 year ago

Describe the bug
On a LocalCUDACluster with multiple GPUs, all Dask partitions are allocated to GPU 0, causing XGBoost to error out. Strangely, removing import cuml fixes the problem.

Steps/Code to reproduce bug
Run this Python script:

import dask_cudf
from dask import array as da
from dask import dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import xgboost as xgb

# Uncomment this line to observe the difference in behavior
# import cuml

if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            # Generate random data as Dask arrays, then convert them to dask_cudf collections
            m = 100000
            n = 100
            X = da.random.random(size=(m, n), chunks=10000)
            y = da.random.random(size=(m, ), chunks=10000)
            X = dask_cudf.from_dask_dataframe(dd.from_dask_array(X))
            y = dask_cudf.from_dask_dataframe(dd.from_dask_array(y))
            params = {
                "verbosity": 2,
                "tree_method": "gpu_hist"
            }
            # Build the quantized DMatrix and train with XGBoost's Dask API
            dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
            output = xgb.dask.train(client, params, dtrain, num_boost_round=4, evals=[(dtrain, 'train')])
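
A hedged variation of the reproducer (an experiment sketch on my part, not something verified in this thread): defer the cuml import until after the cluster and client exist, to check whether client-side import timing is what triggers the misplacement. Only the top of the script changes; the rest stays as above.

import dask_cudf
from dask import array as da
from dask import dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import xgboost as xgb

if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            # Deferred import: cuml is only pulled in after dask-cuda has
            # spawned the workers and pinned each one to its own GPU.
            import cuml  # noqa: F401
            # ... rest of the reproducer above (data generation and training) ...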

With import cuml commented out, the Python program runs successfully:

[06:31:12] task [xgboost.dask-0]:tcp://127.0.0.1:36715 got new rank 0
[06:31:12] task [xgboost.dask-1]:tcp://127.0.0.1:40349 got new rank 1
[06:31:12] task [xgboost.dask-2]:tcp://127.0.0.1:39421 got new rank 2
[06:31:12] task [xgboost.dask-3]:tcp://127.0.0.1:44467 got new rank 3
[0]     train-rmse:0.28842
[1]     train-rmse:0.28799
[2]     train-rmse:0.28754
[3]     train-rmse:0.28705
{'train': OrderedDict([('rmse', [0.28842118658057203, 0.2879935986895539, 0.2875382048036173, 0.28704809112503155])])}

If import cuml is uncommented, we get an error:

xgboost.core.XGBoostError: [06:33:40] src/collective/nccl_device_communicator.cuh:49:
Check failed: n_uniques == world (1 vs. 4) :
Multiple processes within communication group running on same CUDA device is not supported. 

This is because all the Dask partitions were allocated to GPU 0. See the output from nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     42613      C   python                           2022MiB |
|    0   N/A  N/A     42661      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42664      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42669      C   ...da/envs/rapids/bin/python      306MiB |
|    0   N/A  N/A     42672      C   ...da/envs/rapids/bin/python      306MiB |
+-----------------------------------------------------------------------------+
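
To confirm where each worker's CUDA context actually lands, a minimal diagnostic sketch (not part of the original report; it assumes cupy is importable on the workers, which holds in a RAPIDS environment) can be run on every worker via client.run right after the Client is created:

def report_gpu():
    # Runs inside each Dask worker process.
    import os
    import cupy

    return {
        # dask-cuda pins each worker by setting CUDA_VISIBLE_DEVICES.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Index of the device backing the worker's current CUDA context
        # (relative to that worker's visible devices).
        "current_device": cupy.cuda.runtime.getDevice(),
    }

# client.run executes the function once on every worker and returns a dict
# keyed by worker address:
# print(client.run(report_gpu))

If every worker reports the same CUDA_VISIBLE_DEVICES, the per-worker pinning itself is off; if the orderings differ but nvidia-smi still shows every context on GPU 0, the contexts are being created on the wrong device.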

Expected behavior
Importing cuML should not affect the behavior of Dask arrays.

Environment details (please complete the following information):
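
(The template fields were left blank. A minimal sketch, not from the original report, for collecting the versions the template asks about, assuming the packages are importable:)

import importlib

# Print the version of each relevant package, or note that it is missing.
for pkg in ("cuml", "cudf", "dask", "distributed", "dask_cuda", "dask_cudf", "xgboost"):
    try:
        module = importlib.import_module(pkg)
        print(pkg, getattr(module, "__version__", "unknown"))
    except ImportError:
        print(pkg, "not installed")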

hcho3 commented 1 year ago

The bug also exists in the latest nightly Docker image (rapidsai/rapidsai-core-nightly:23.02-cuda11.5-base-ubuntu20.04-py3.9).

hcho3 commented 1 year ago

The bug was probably introduced in 22.12. Reverting to the 22.10 Docker image (nvcr.io/nvidia/rapidsai/rapidsai-core:22.10-cuda11.5-base-ubuntu20.04-py3.9) avoids the problem.