rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] How to Run Multiple Concurrent Tasks on a GPU in Dask cuML #6112

Closed mnlcarv closed 1 month ago

mnlcarv commented 1 month ago

I want to test the execution of multiple concurrent tasks on the GPU in Dask cuML. I'm using K-means (code below) and varying the chunk size so that multiple tasks are created for the fit method and run on the GPU.

from cuml.dask.cluster import KMeans as cumlKMeans
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import dask.array as da
import time

def main():

    # One GPU worker with 8 CPU threads, so up to 8 tasks can be scheduled at once.
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0", n_workers=1, threads_per_worker=8)
    client = Client(cluster)

    n_samples = 12500
    n_features = 100
    start_random_state = 170
    # n_block_rows = 12500 # Case 1 (1 task: a single 12500-row chunk)
    n_block_rows = 1563 # Case 2 (8 tasks: eight ~1563-row chunks)
    n_clusters = 100

    # The chunk size controls how many partitions (and hence tasks) are created.
    X = da.random.random((n_samples, n_features), chunks=(n_block_rows, n_features))
    # persist() materializes the collection so the wait() below has futures to block on.
    dX = X.to_dask_dataframe().to_backend('cudf').persist()

    kmeans_model = cumlKMeans(n_clusters=n_clusters, random_state=start_random_state, max_iter=5)

    # Materialize the input before timing so data generation is excluded from the measurement.
    wait(dX)
    start_time = time.time()
    kmeans_model.fit(dX)
    wait(kmeans_model._func_fit)  # intended to block on async fit work; fit() already blocks, so this is effectively a no-op
    print(f"Fit took: {time.time() - start_time} sec")

    client.close()
    cluster.close()

if __name__ == '__main__':
    main()

Since I have only 1 GPU in my machine, I can use only 1 worker. To allow multiple tasks in that one worker, I increased the number of threads per worker to 8 (the number of CPU cores on my machine).

Basically, I'm testing the following cases:

- Case 1: 1 task on the GPU
- Case 2: 8 tasks on the GPU

I measured the execution time of the fit method directly in the code (also calling wait() to make sure any async task has finished), and case 1 is faster than case 2 for a toy dataset:

- Case 1 (1 task): 1.54s
- Case 2 (8 tasks): 2.49s

I also checked the time of the fit method displayed in the Dask dashboard (Task Stream tab):

- Case 1 (1 task): 0.60s
- Case 2 (8 tasks): 0.94s

Finally, I looked at the graph displayed in the Dask dashboard for each case (figures below). The fit method is shown as the red task and cuDF operations as the blue tasks. In both cases the fit method appears as a single task; however, I was expecting to see 8 fit tasks in parallel for case 2 (i.e., 8 tasks executing concurrently on a single GPU).

[Figure: Dask task graph for Case 1]

[Figure: Dask task graph for Case 2]

Could anyone help me to understand these results? In particular, I have the following questions:

  1. Why is there a difference between the execution time measured directly in the code and the fit execution time displayed in the Dask dashboard? Am I measuring something wrong? (See the measurement sketch after this list.)

  2. Am I generating multiple fit tasks in this code? If so, how are the multiple tasks processed on the GPU in case 2? I have two hypotheses: A) cuML uses concurrent CUDA streams to process each task's kernels in parallel; or B) the kernels of each task are appended to the default CUDA stream and processed sequentially on the GPU. Since case 2 is slower than case 1, hypothesis B seems the most likely.
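Regarding question 1: a minimal sketch of one way to compare wall-clock time against the scheduler's own per-task timings, using Dask's `performance_report` context manager. It reuses `kmeans_model` and `dX` from the snippet above; the gap between the two numbers would then show client-side overhead (graph construction, scheduling, transfers) that the Task Stream does not count.

    # Sketch (not from the original post): capture Dask's own task timings
    # alongside wall-clock time, to compare with the dashboard numbers.
    import time
    from dask.distributed import performance_report

    start_time = time.time()
    # performance_report writes the scheduler's view of every task
    # (including transfer and deserialization time) to an HTML file.
    with performance_report(filename="kmeans-fit-report.html"):
        kmeans_model.fit(dX)
    print(f"Wall-clock fit time: {time.time() - start_time:.2f} s")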

Note: I wanted to profile the code with nsys to see how kernels are executed on the GPU, but I'm using Ubuntu WSL and apparently there is no support yet for collecting kernel data in such a setup, according to this discussion.
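Since nsys kernel tracing is unavailable on WSL, a rough alternative for distinguishing the two hypotheses is a CuPy multi-stream micro-benchmark (a sketch, independent of cuML, whose internal stream usage may differ): if the multi-stream version is no faster than the default-stream version, independent kernels are effectively serialized on this GPU/workload.

    # Sketch: compare sequential vs. multi-stream execution of independent
    # matrix multiplies on one GPU.
    import time
    import cupy as cp

    n_tasks, size = 8, 1024
    mats = [cp.random.random((size, size), dtype=cp.float32) for _ in range(n_tasks)]

    # Sequential: every matmul on the default stream.
    cp.cuda.Device().synchronize()
    t0 = time.time()
    for m in mats:
        m @ m
    cp.cuda.Device().synchronize()
    print(f"default stream: {time.time() - t0:.3f} s")

    # Concurrent: one non-blocking stream per "task".
    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(n_tasks)]
    t0 = time.time()
    for m, s in zip(mats, streams):
        with s:
            m @ m
    for s in streams:
        s.synchronize()
    print(f"{n_tasks} streams:   {time.time() - t0:.3f} s")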


My setup:

- A laptop with one MX450 GPU and 8 CPU cores
- Ubuntu 22.04 (WSL)
- CUDA 11.8
- CUDA driver version 565.90
- RAPIDS for CUDA 11 (cuml-cu11 24.10.0, dask-cudf-cu11 24.10.1)
- dask-cuda 24.10.0
- dask 2024.9.0

divyegala commented 1 month ago

@mnlcarv to check my understanding of your code and what you are trying to achieve: are you expecting cuML to use CPU threads for parallelism? We generally do not optimize for CPU parallelism.

marcosnlc4 commented 1 month ago

@divyegala thanks for your reply. Yes, but my goal was to run multiple concurrent tasks on a single GPU. However, it seems that cuML uses NCCL to manage inter-GPU communication, and NCCL currently supports only one process per GPU. Hence it's only possible to run one task per GPU, no matter how I partition the dataset into chunks (assuming each chunk would be assigned to its own task).

divyegala commented 1 month ago

@marcosnlc4 yes, that is indeed the case at the moment. Have you looked into Multi-Instance GPU (MIG)? Perhaps that might be what you are looking for. I'll close this issue for now.
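For reference, dask-cuda can address individual MIG instances by UUID, so each MIG slice gets its own worker. A sketch with placeholder UUIDs (list the real ones with `nvidia-smi -L`); note MIG requires a MIG-capable data-center GPU:

    # Sketch: one Dask-CUDA worker per MIG instance.
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES=[
            "MIG-<uuid-1>",  # placeholder MIG identifiers; see `nvidia-smi -L`
            "MIG-<uuid-2>",
        ]
    )
    client = Client(cluster)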