rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.24k stars 532 forks

dask pca with two 1080ti is slower than one 1080ti pca #4913

Open keyword1983 opened 2 years ago

keyword1983 commented 2 years ago

Hi, I tried Dask PCA, but the result I got is weird.

My env: NVIDIA-SMI 465.19.01, Driver Version 465.19.01, CUDA Version 11.3, RAPIDS 22.04

Below is my sample code. This is Dask PCA with two 1080 Tis: fit_transform execution time was 23.5 s and XT.compute() was 5.87 s.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import cupy as cp
from cuml.dask.decomposition import PCA
import dask_cudf

cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

%time X_cudf = dask_cudf.read_parquet('/mnt/gwas_data/gwas_data_80k_5k.parquet.gzip', npartitions=4)
blobs = X_cudf.compute()  # materialize once to inspect the data
print(blobs)

%time cumlModel = PCA(n_components=2, whiten=False)
# fit on the float32 columns only
%time XT = cumlModel.fit_transform(X_cudf[X_cudf.columns[X_cudf.dtypes == cp.float32]])
%time print(XT.compute())

client.close()
cluster.close()

This is cuML PCA without Dask on one 1080 Ti: fit_transform execution time was 3.63 s.

import cupy as cp
from cuml.decomposition import PCA
import cudf

%time X_cudf = cudf.read_parquet('/mnt/gwas_data/gwas_data_80k_5k.parquet.gzip')

# fit on the float32 columns only
%time ddf = X_cudf[X_cudf.columns[X_cudf.dtypes == cp.float32]]

%time cumlModel = PCA(n_components=2, whiten=False)
%time XT = cumlModel.fit_transform(ddf)
%time print(XT)

My question is: isn't Dask PCA with two 1080 Tis supposed to be faster than PCA on one 1080 Ti?

PS: I repeated this experiment with V100 GPUs and got the same result.

viclafargue commented 2 years ago

Dask estimators carry significant overhead because of unavoidable inter-GPU transfers, so they are only worthwhile when each worker is given a sufficient workload. With a dataset large enough to be worth distributing, compute time can be reduced further by using NVLink and InfiniBand, when available, for faster transfers.
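This trade-off can be illustrated with a toy cost model. All numbers below are made-up assumptions for illustration, not measured cuML figures:

```python
# Toy model: distributed runtime = work / workers + fixed overhead.
def runtime_s(total_work_s, n_workers, fixed_overhead_s):
    """Idealized wall-clock time: perfectly parallel work plus a
    fixed per-job cost for inter-GPU transfers and coordination."""
    return total_work_s / n_workers + fixed_overhead_s

# Small job: the fixed overhead dominates, so one worker wins.
one_gpu_small = runtime_s(4.0, 1, 0.0)     # 4.0 s
two_gpu_small = runtime_s(4.0, 2, 20.0)    # 22.0 s

# Large job: the overhead is amortized, so two workers win.
one_gpu_large = runtime_s(600.0, 1, 0.0)   # 600.0 s
two_gpu_large = runtime_s(600.0, 2, 20.0)  # 320.0 s

print(one_gpu_small < two_gpu_small)  # True: small data, one GPU is faster
print(two_gpu_large < one_gpu_large)  # True: large data, two GPUs are faster
```

The crossover point depends on how fast the interconnect is, which is why NVLink/InfiniBand move it in favor of the distributed estimator.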

Also, fit_transform executes lazily and only starts the actual work on the compute call, so the two timings are not measuring the same thing. The 23.5 s execution time could be explained by the cuDF operation.
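This benchmarking pitfall, timing a call that merely returns a lazy handle instead of timing the materialization, can be sketched without any GPU using only the standard library. `slow_task` is a hypothetical stand-in for the distributed kernel work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(n):
    time.sleep(0.2)  # stand-in for the actual (GPU) work
    return n * n

with ThreadPoolExecutor(max_workers=2) as pool:
    t0 = time.perf_counter()
    futures = [pool.submit(slow_task, i) for i in range(4)]
    submit_s = time.perf_counter() - t0  # submitting returns almost instantly

    t1 = time.perf_counter()
    results = [f.result() for f in futures]  # the real work is paid for here
    compute_s = time.perf_counter() - t1

print(f"submit: {submit_s:.3f}s  compute: {compute_s:.3f}s")
```

The same logic applies here: to compare fairly, the distributed run should be timed end to end through `XT.compute()`, not just through the `fit_transform` call.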