rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Support cuDF dataframes with Decimal dtype columns as input to estimators #3580

Open beckernick opened 3 years ago

beckernick commented 3 years ago

As cuDF continues to implement support for Decimal columns, we should accept Decimal columns as part of dataframe inputs to cuML estimators.

In PySpark, we can do this with the following:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

# Toy dataset: four feature columns and a binary label.
df = spark.createDataFrame(
    [
        [0.2, 1, 2.2, 5, 1],
        [0.3, 2, 3.2, 4, 0],
        [0.9, 2, 4.2, 4, 1],
    ],
    schema=["a", "b", "c", "d", "label"],
)

# Cast every feature column to a fixed-point decimal type.
df = (df
      .withColumn('a', df['a'].cast('decimal(3,2)'))
      .withColumn('b', df['b'].cast('decimal(3,2)'))
      .withColumn('c', df['c'].cast('decimal(3,2)'))
      .withColumn('d', df['d'].cast('decimal(3,2)'))
)

# VectorAssembler accepts the decimal columns and emits a feature vector.
assembler = VectorAssembler(
    inputCols=["a", "b", "c", "d"],
    outputCol="features")

training = assembler.transform(df)

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Output:

Coefficients: [0.0,0.0,0.0,0.0]
Intercept: 0.6931471805599453
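
For readers less familiar with the `decimal(3,2)` casts in the PySpark example: a fixed-point decimal type is described by a precision (total number of digits) and a scale (digits after the decimal point). The sketch below illustrates that idea with Python's standard `decimal` module only. `to_fixed_decimal` is a hypothetical helper written for this illustration, and returning `None` on overflow is an assumption made here, not a statement about how Spark or cuDF behave.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_fixed_decimal(value, precision, scale):
    """Hypothetical helper: round `value` to `scale` fractional digits and
    return None if the result needs more than `precision` total digits."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 -> Decimal('0.01')
    d = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    # With `scale` fractional digits, only precision - scale integer digits fit.
    limit = Decimal(10) ** (precision - scale)
    if abs(d) >= limit:
        return None
    return d


# Values taken from the examples in this issue.
small = to_fixed_decimal(0.2, precision=3, scale=2)
whole = to_fixed_decimal(5, precision=3, scale=2)
too_big = to_fixed_decimal(20, precision=3, scale=2)
fits_wider = to_fixed_decimal(20, precision=7, scale=2)
print(small, whole, too_big, fits_wider)
```

Under these assumptions, a value such as 20 does not fit a (3, 2) decimal but does fit a (7, 2) decimal.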
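
The equivalent workflow with cuDF and cuML currently fails with a TypeError:

import cudf
import cuml
import pyarrow as pa

df = cudf.DataFrame({
    "a": [0.1, 0.2, 0.9],
    "b": [0.3, 0.9, 20],
    "c": [-3.2, 0.2, 9.2],
    "label": [1, 1, 0]
})

# Convert column "b" to a decimal column by round-tripping through Arrow.
pa_decimal_b = df.b.to_arrow().cast(pa.decimal128(precision=7, scale=2))
df["b"] = cudf.core.column.DecimalColumn.from_arrow(pa_decimal_b)

# Fitting on a dataframe that contains the decimal column raises the error below.
clf = cuml.linear_model.LogisticRegression()
clf.fit(df[["a","b","c"]], df["label"])
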
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-3b21064a47b7> in <module>
     16 
     17 clf = cuml.linear_model.LogisticRegression()
---> 18 clf.fit(df[["a","b","c"]], df["label"])

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
    408                                 target_val=target_val)
    409 
--> 410                 return func(*args, **kwargs)
    411 
    412         @wraps(func)

cuml/linear_model/logistic_regression.pyx in cuml.linear_model.logistic_regression.LogisticRegression.fit()

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
    408                                 target_val=target_val)
    409 
--> 410                 return func(*args, **kwargs)
    411 
    412         @wraps(func)

cuml/solvers/qn.pyx in cuml.solvers.qn.QN.fit()

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner(*args, **kwargs)
    359         def inner(*args, **kwargs):
    360             with self._recreate_cm(func, args):
--> 361                 return func(*args, **kwargs)
    362 
    363         return inner

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/cuml/common/input_utils.py in input_to_cuml_array(X, order, deepcopy, check_dtype, convert_to_dtype, check_cols, check_rows, fail_on_order, force_contiguous)
    313             X_m = CumlArray(data=X.as_gpu_matrix(order='F'))
    314         else:
--> 315             X_m = CumlArray(data=X.as_gpu_matrix(order=order))
    316 
    317     elif isinstance(X, CumlArray):

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/cudf/core/dataframe.py in as_gpu_matrix(self, columns, order)
   3545         if any(
   3546             (is_categorical_dtype(c) or np.issubdtype(c, np.dtype("object")))
-> 3547             for c in cols
   3548         ):
   3549             raise TypeError("non-numeric data not yet supported")

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/cudf/core/dataframe.py in <genexpr>(.0)
   3545         if any(
   3546             (is_categorical_dtype(c) or np.issubdtype(c, np.dtype("object")))
-> 3547             for c in cols
   3548         ):
   3549             raise TypeError("non-numeric data not yet supported")

/raid/nicholasb/miniconda3/envs/rapids-gpu-bdb-automated-tests/lib/python3.7/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    386     """
    387     if not issubclass_(arg1, generic):
--> 388         arg1 = dtype(arg1).type
    389     if not issubclass_(arg2, generic):
    390         arg2 = dtype(arg2).type

TypeError: Cannot interpret '<cudf.core.column.decimal.DecimalColumn object at 0x7f1ca5310170>' as a data type

beckernick commented 3 years ago

The underlying error actually comes from cuDF, but I'm using this issue to track the ability to accept these inputs in cuML.
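
The last frames of the traceback show `np.issubdtype` being handed a column object that NumPy cannot interpret as a data type. The following is a minimal sketch of that failure mode using NumPy alone. `FakeDecimalColumn` and `is_supported_numeric` are hypothetical names introduced for illustration; they are not part of cuDF or cuML, and the guard is only one possible way to reject unsupported inputs with a clearer outcome.

```python
import numpy as np


class FakeDecimalColumn:
    """Hypothetical stand-in for a column object that NumPy cannot
    interpret as a dtype (it deliberately has no `dtype` attribute)."""


def is_supported_numeric(dtype_like):
    """Return True only for NumPy numeric dtypes. Anything NumPy cannot
    interpret as a dtype is treated as unsupported instead of crashing."""
    try:
        dt = np.dtype(dtype_like)
    except TypeError:
        return False
    return bool(np.issubdtype(dt, np.number))


# Calling np.issubdtype directly on the stand-in object raises TypeError,
# which mirrors the final frame of the traceback.
try:
    np.issubdtype(FakeDecimalColumn(), np.dtype("object"))
    raised = False
except TypeError:
    raised = True

print(raised)
print(is_supported_numeric("float64"))
print(is_supported_numeric(np.dtype("object")))
print(is_supported_numeric(FakeDecimalColumn()))
```

A check like this only avoids the confusing error. Actually supporting Decimal inputs would still require deciding how decimal values are converted to the floating-point types the estimators operate on.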

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.