rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] multi-GPU settings when data does not fit into single GPU #3981

Closed hep07 closed 2 years ago

hep07 commented 3 years ago

Hi, I have a couple of questions:

First, I am wondering if multi-node multi-GPU algorithms like KNN or k-means support cases where the data does not fit in the memory of a single GPU? It seems that even with dask_cudf, the data has to fit into the memory of one single GPU, and the multi-GPU support is just there to speed up runtime as opposed to scale things up. Is my understanding correct? I got that impression from the following two observations: 1) all the multi-GPU example code I have seen requires you to put the data on one GPU first before putting it into dask_cudf;
2) When I use a local CUDA Dask cluster to load data directly from CSV or Parquet (using read_csv or read_parquet), I am able to run .fit directly without having to put all the data on one GPU. However, I found that, at least in k-means, even when the number of workers = number of partitions in dask_cudf = number of GPUs, there is one GPU that uses a lot more memory than the others.
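For reference, a minimal sketch of the setup described in observation 2, with a placeholder Parquet path and default cluster settings (not taken from the original report):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
from cuml.dask.cluster import KMeans

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)

# Each worker reads its own partitions; nothing is staged on a single GPU.
ddf = dask_cudf.read_parquet("data/*.parquet")   # hypothetical path

km = KMeans(n_clusters=100)
km.fit(ddf)
```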

Second, the random_state does not give me reproducible k-means clustering results with the multi-GPU version if I run this: from cuml.dask.cluster import KMeans; km = KMeans(n_clusters=100, random_state=2021); km.fit(dask_cuda_df)

km.cluster_centers_ will be different every time I run the above block of code. Why is that the case? The single-GPU version gives reproducible results.
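A hedged sketch of the reproducibility check being described, reusing the cluster/client setup above; `ddf` stands for a dask_cudf DataFrame already distributed across the workers:

```python
from cuml.dask.cluster import KMeans

def fit_centers(seed):
    km = KMeans(n_clusters=100, random_state=seed)
    km.fit(ddf)
    return km.cluster_centers_

# With the same seed, the single-GPU KMeans returns identical centers on
# every run; the reporter observes the distributed version differing.
centers_a = fit_centers(2021)
centers_b = fit_centers(2021)
```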

Thanks a lot!

viclafargue commented 3 years ago

Hi @hep07, thanks for opening this issue.

I am wondering if multi-node multi-GPU algorithms like KNN or k-means support cases where the data does not fit in the memory of a single GPU?

Yes, it does. Features of the cuDF library might indeed come in handy to load your data directly onto multiple GPUs. This blog post might help you partition the index and query of a KNN search.
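For illustration, a sketch of a multi-GPU KNN search with separately partitioned index and query frames, roughly along the lines of the linked blog post; the paths and parameters are placeholders:

```python
import dask_cudf
from cuml.dask.neighbors import NearestNeighbors

index_ddf = dask_cudf.read_parquet("index/*.parquet")    # hypothetical paths
query_ddf = dask_cudf.read_parquet("queries/*.parquet")

nn = NearestNeighbors(n_neighbors=10)
nn.fit(index_ddf)
# Both results come back as Dask collections, one block per query partition.
distances, indices = nn.kneighbors(query_ddf)
```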

However, I found that, at least in k-means, even when the number of workers = number of partitions in dask_cudf = number of GPUs, there is one GPU that uses a lot more memory than the others.

Yes, this is indeed a bit annoying. It would be interesting to get more information from the cuDF team, as it might come from input partitioning. However, in my understanding, it's the Dask abstraction that is doing its work in the background. Indeed, exactly which GPU each partition ends up on is left to the Dask library and is thus undefined/unknown to the user. It should, however, be possible to set the partition sizes so that only one partition can be stored on a given GPU, forcing an equal load.
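One way to aim for that, sketched below; this only evens out the load in expectation, since placement is still up to the Dask scheduler:

```python
# Match the number of partitions to the number of workers/GPUs so that,
# ideally, each GPU holds exactly one partition.
n_workers = len(client.scheduler_info()["workers"])
ddf = ddf.repartition(npartitions=n_workers).persist()
```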

Concerning random_state in distributed KMeans, it looks like something has to be fixed. Thanks for letting us know. This should probably have an issue of its own.

hep07 commented 3 years ago

Thanks @viclafargue! I will open another issue with detailed code and data for the random_state issue with multi-GPU kmeans.

Thanks for bringing up the post. I have read it, and the "distribute_data" function in that post motivated my question here. In that function, it looks like we first need to put all the data on a single GPU via cp_array = cp.array(np_array) before sending it to multiple GPUs via dask_cudf. This really defeats the purpose. One pressing need I have is to be able to handle data that cannot fit into the memory of one GPU (but can certainly fit in host memory). Is there a way dask_cudf can read from a pandas DataFrame or numpy array directly and manage the distribution over, say, 8 GPUs through the Dask functionality? I saw a read_csv function but wonder if reading a CSV file from disk is the only way to do this.

Another thing I found lately, from nvidia-smi, is that cuML most of the time uses only a random set of 7 out of 8 GPUs during the computation stage of k-means or KNN. Is that expected, and is it configurable?

Thanks again!

viclafargue commented 3 years ago

Thanks for your reply!

it looks like we need to first put all the data on 1 single GPU [...] before sending it to multiple GPUs via dask_cudf

That's what I am doing in the article, but it is only for the sake of simplicity. Indeed, the embeddings were stored in npy files (which could not be loaded directly) and could fit in a single GPU. You can find tons of examples of how to load data from CSV, JSON, ORC, Parquet, and AWS S3 in the dask_cudf IO testing directory.

It is also possible to load and distribute your data into a Dask DataFrame/Array backed by host memory and then turn the partitions on each worker into cuDF DataFrames or CuPy arrays.

See using GPUs with Dask.
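A sketch of that host-memory route, assuming a pandas DataFrame `pdf` that fits in host RAM; the partition count is arbitrary:

```python
import dask.dataframe as dd
import cudf

# Build a host-backed Dask DataFrame, then convert each partition to cuDF
# on the workers so the data ends up spread across the GPUs.
ddf_host = dd.from_pandas(pdf, npartitions=8)
ddf_gpu = ddf_host.map_partitions(cudf.DataFrame.from_pandas)
```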

cuML most of the time uses only a random set of 7 out of 8 GPUs during the computation stage of k-means or KNN

That's probably due to the problem I mentioned earlier. Indeed, the 8 partitions might fit into 7 GPUs causing Dask to randomly leave out one GPU.
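To check how the partitions actually landed, standard Dask tooling can be used (a sketch, not a cuML API):

```python
from dask.distributed import wait, futures_of

ddf = ddf.persist()
wait(ddf)
# Maps each partition's key to the worker (and hence GPU) holding it.
print(client.who_has(futures_of(ddf)))
```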

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

viclafargue commented 2 years ago

Closing the issue. Please don't hesitate to re-open if needed.

akshayjadiya commented 1 year ago

Hi @viclafargue, is it necessary to distribute the queries and the index before using the multi-node multi-GPU implementation? I notice that when I have only 1 partition in the query Dask dataframe, the distances and indices returned by the kneighbors() function are correct. However, when the query Dask dataframe has multiple partitions, I do not get correct results.