rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Dask Multi-GPU logistic regression should convert dtypes if needed rather than fail #5552

Open beckernick opened 1 year ago

beckernick commented 1 year ago

The new Dask multi-GPU logistic regression should convert dtypes when needed rather than fail because the input data structures do not match the dtype expectations of the C++ implementation.

The exact example in the docstring works:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import cudf
import numpy as np
from cuml.dask.linear_model import LogisticRegression

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float64)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float64)
y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float64))

X_ddf = dask_cudf.from_cudf(X, npartitions=2)
y_ddf = dask_cudf.from_cudf(y, npartitions=2)

reg = LogisticRegression()
reg.fit(X_ddf, y_ddf)

As does the same example with all-float32 dtypes:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import cudf
import numpy as np
from cuml.dask.linear_model import LogisticRegression

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))

X_ddf = dask_cudf.from_cudf(X, npartitions=2)
y_ddf = dask_cudf.from_cudf(y, npartitions=2)

reg = LogisticRegression()
reg.fit(X_ddf, y_ddf)
<cuml.dask.linear_model.logistic_regression.LogisticRegression at 0x7f8d1e230d60>

But mixing float32 and float64 fails:

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float64)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))

X_ddf = dask_cudf.from_cudf(X, npartitions=2)
y_ddf = dask_cudf.from_cudf(y, npartitions=2)

reg = LogisticRegression()
reg.fit(X_ddf, y_ddf)
2023-08-11 15:37:06,339 - distributed.worker - WARNING - Compute Failed
Key:       _func_fit-e228a067-be45-42ca-bc34-02e7b2009e3a
Function:  _func_fit
args:      (LogisticRegressionMG(), [(   col1  col2
2   2.0   2.0
3   2.0   3.0, 2    1.0
3    1.0
dtype: float32)], 4, 2, [(1, 2), (0, 2)], 0)
kwargs:    {}
Exception: 'TypeError("Expected input to be of type in [dtype(\'float64\')] but got float32")'
...
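
Until the estimator handles this conversion itself, the workaround I'm using is to cast everything to a single floating-point dtype before calling fit. This is just standard astype on the Dask collections, not a cuML-specific API, and it continues from the failing snippet above:

# Cast features and labels to one common float dtype so the
# MG C++ layer accepts them (float64 would also work here).
X_ddf = X_ddf.astype(np.float32)
y_ddf = y_ddf.astype(np.float32)

reg = LogisticRegression()
reg.fit(X_ddf, y_ddf)

With a cast like this, the mixed-dtype input fits the same way the all-float32 example above does.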

Mixing in int32 columns fails in the same way:

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.int32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.int32)
y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))

X_ddf = dask_cudf.from_cudf(X, npartitions=2)
y_ddf = dask_cudf.from_cudf(y, npartitions=2)

reg = LogisticRegression()
reg.fit(X_ddf, y_ddf)
2023-08-11 15:39:30,280 - distributed.worker - WARNING - Compute Failed
Key:       _func_fit-2a49ecfa-7dd9-4a31-9170-dac07357d902
Function:  _func_fit
args:      (LogisticRegressionMG(), [(   col1  col2
2     2     2
3     2     3, 2    1.0
3    1.0
dtype: float32)], 4, 2, [(1, 2), (0, 2)], 0)
kwargs:    {}
Exception: 'TypeError("Expected input to be of type in [dtype(\'float32\'), dtype(\'float64\')] but got int32")'
...
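
For what it's worth, the kind of coercion I'd expect fit to apply before handing the partitions to the C++ layer is roughly the following. The normalize_dtypes helper is purely illustrative (my assumption about the desired behavior, not cuML code):

import numpy as np

def normalize_dtypes(X_ddf, y_ddf):
    # Illustrative only: pick float64 if any feature column is already
    # float64, otherwise float32, and cast both X and y to that dtype.
    target = np.float64 if (X_ddf.dtypes == np.float64).any() else np.float32
    return X_ddf.astype(target), y_ddf.astype(target)

X_ddf, y_ddf = normalize_dtypes(X_ddf, y_ddf)
reg = LogisticRegression()
reg.fit(X_ddf, y_ddf)
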
conda list | grep rapids
# packages in environment at /home/nicholasb/miniconda3/envs/rapids-23.08:
cudf_kafka                23.08.00a       cuda11_py310_230811_g9d794877fd_216    rapidsai-nightly
cusignal                  23.08.00a       py310_230811_ga644c53_8    rapidsai-nightly
libcucim                  23.08.00a       cuda11_230811_g2ecf819_25    rapidsai-nightly
libcudf                   23.08.00a       cuda11_230811_g9d794877fd_216    rapidsai-nightly
libcudf_kafka             23.08.00a       cuda11_230811_g9d794877fd_216    rapidsai-nightly
libcugraph                23.08.00a       cuda11_230811_g15f8bbaf_83    rapidsai-nightly
libcugraph_etl            23.08.00a       cuda11_230811_g15f8bbaf_83    rapidsai-nightly
libcugraphops             23.08.00a       cuda11_230811_g9c081845_21    rapidsai-nightly
libcuml                   23.08.00a       cuda11_230811_g07176ea74_59    rapidsai-nightly
libcumlprims              23.08.00a       cuda11_230710_gd32fef7_2    rapidsai-nightly
libcuspatial              23.08.00a       cuda11_230811_gf105464c_65    rapidsai-nightly
libkvikio                 23.08.00a       cuda11_230811_gc644fca_30    rapidsai-nightly
libraft                   23.08.00a       cuda11_230811_g48625467_89    rapidsai-nightly
libraft-headers           23.08.00a       cuda11_230811_g48625467_89    rapidsai-nightly
libraft-headers-only      23.08.00a       cuda11_230811_g48625467_89    rapidsai-nightly
librmm                    23.08.00a       cuda11_230811_g314b669a_27    rapidsai-nightly
libxgboost                1.7.4           rapidsai_ha9c50b3_6    rapidsai-nightly
py-xgboost                1.7.4           rapidsai_py310h1395376_6    rapidsai-nightly
rapids                    23.08.00a       cuda11_py310_230803_g72f0ca7_35    rapidsai-nightly
rapids-xgboost            23.08.00a       cuda11_py310_230803_g72f0ca7_35    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
schmidt-ai commented 1 year ago

+1