rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Request building KNNImputer, IterativeImputer, PowerTransform #4694

Open jamesee opened 2 years ago

jamesee commented 2 years ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I wish I could use cuML to do [...]

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context, code examples, or references to existing implementations about the feature request here.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

omare334 commented 1 year ago

I wish I could use cuML to accelerate scikit-learn's KNN imputation. Imputation is a major yet sometimes overlooked step in data science, and I think speeding it up would be very useful. As of now I have not found a quick way to impute large amounts of data, and I would appreciate it if someone could add this.

beckernick commented 1 year ago

Would you be able to share any information about your use case such as:

omare334 commented 1 year ago

The current dimensions of my data are 450,000-550,000 observations, depending on the cleaning situation and what question we would like to answer with the data. The number of variables can be as low as 60, which is manageable, but in most cases will be between 90 and 140; sometimes the number of variables (p) will come close to or even exceed the number of observations (n), as in the case of genetic data. We deal with this using dimensionality reduction methods like PCA, which we would like to accelerate as well. As implied, we deal mostly with health data, plus some economic and lifestyle data on the side. We are attempting to use the scikit-learn KNN imputer with its default parameters (uniform weights, Euclidean distance, k=5, and so on). We run this as a batch job on an HPC cluster with 9 CPUs, 32 GB of RAM, and a walltime of 72 hours. Previous attempts with a 24-hour walltime did not finish; this may be a scaling issue, since 50,000 observations took only 12 minutes. We did not specify which CPU, but I believe whatever CPU we get is in the Intel range.

beckernick commented 1 year ago

Thanks for the additional context. In case it's relevant in the short term, the scikit-learn team has made some fantastic performance enhancements to pairwise distance primitives in v1.1 and v1.2. The primitives are used in their KNNImputer implementation, so if you're using an older scikit-learn you may get a nice boost by upgrading versions.

We may not be able to prioritize KNNImputer in the short term, but we'll evaluate the feasibility and share any updates in this thread. We'd also welcome this feature as a community contribution and can help provide reviews and feedback.
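
For illustration only (not part of the original comment), a minimal timing sketch for checking whether a scikit-learn upgrade speeds up KNNImputer on your own data; the array size and missing rate below are placeholders:

    import time
    import numpy as np
    import sklearn
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 100)).astype(np.float32)  # placeholder size
    X[rng.random(X.shape) < 0.15] = np.nan                     # placeholder ~15% missing rate

    # n_neighbors=5 with uniform weights matches the scikit-learn defaults discussed above
    imputer = KNNImputer(n_neighbors=5, weights="uniform")
    start = time.perf_counter()
    imputer.fit_transform(X)
    print(sklearn.__version__, f"took {time.perf_counter() - start:.1f}s")

Running the same script in environments with scikit-learn <1.1 and >=1.1 gives a direct before/after comparison.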

higgins4286 commented 1 year ago

I am also looking for cuML to have the ability to do KNNImputer. Is there a way to boost the sklearn KNNImputer with a cuML product?
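
cuML has no KNNImputer today, but as a stopgap, here is a rough sketch (not an official cuML API, and not algorithmically identical to scikit-learn's nan-Euclidean KNNImputer) that runs the neighbour search on the GPU with cuml.neighbors.NearestNeighbors and fills each missing value with the mean of its neighbours' values:

    import cupy as cp
    from cuml.neighbors import NearestNeighbors

    def gpu_knn_impute(X, n_neighbors=5):
        # Sketch only: neighbours are searched on a mean-filled copy, so the
        # distances differ from scikit-learn's nan_euclidean metric, and missing
        # values are only filled from fully observed "donor" rows.
        X = cp.asarray(X, dtype=cp.float32)
        missing = cp.isnan(X)
        col_means = cp.nanmean(X, axis=0)
        X_search = cp.where(missing, col_means, X)      # NaN-free copy for the distance search

        complete = ~missing.any(axis=1)                 # rows with no missing values
        incomplete = ~complete

        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(X_search[complete])
        _, idx = nn.kneighbors(X_search[incomplete])    # neighbours drawn from the complete rows

        donors = X[complete][idx]                       # (n_incomplete, k, n_features)
        fill = donors.mean(axis=1)                      # per-feature neighbour average

        X_out = X.copy()
        X_out[incomplete] = cp.where(missing[incomplete], fill, X[incomplete])
        return X_out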

vkhodygo commented 1 year ago

I concur; running IterativeImputer with RandomForestRegressor as the estimator takes ages even when I do this on a Threadripper machine with 128 threads.

@beckernick

What I have at the moment is about 1,000,000 records and roughly 40 features, give or take. This dataset is the result of a survey, a mix of categorical and numerical columns. That's a starter that includes:

This number is likely to grow to somewhere around 5,000,000-10,000,000 rows and 100-120 features with roughly the same characteristics.

Current config:

    # Imports added for completeness; min_vals/max_vals are per-feature bounds
    # computed elsewhere from the data.
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor

    imp_num = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=128, n_jobs=128,
                                        random_state=1, verbose=0),
        skip_complete=True,
        min_value=min_vals,
        max_value=max_vals,
        initial_strategy='mean',
        max_iter=100,
        random_state=0,
        verbose=2,
        add_indicator=True
        )

I run it on a machine with a 3990X and it takes about 950 seconds per iteration. It looks to me like it converges, but extremely slowly. I could probably get access to a cluster or something, but for prototyping I'd like to employ an off-the-shelf solution.

beckernick commented 1 year ago

Thanks for sharing details about your use case, system, and current performance @vkhodygo . This helps us understand the potential impact.

Would you be willing to test using cuML's RandomForestRegressor within the scikit-learn IterativeImputer and sharing the performance results compared to the CPU version?

E.g., something like this (but on your larger dataset and imputation configuration):


import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (IterativeImputer is still experimental)
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_classification
import cuml

# Create some data
X, y = make_classification(
    n_samples=20000,
    n_features=5
)

# Randomly set some elements as null
null_pct = 0.15
mask = np.random.choice([True, False], size=X.shape, p=[null_pct, 1-null_pct])
X[mask] = None

clf = cuml.ensemble.RandomForestRegressor(n_estimators=100)

imp = IterativeImputer(
    estimator=clf,
    random_state=0,
    verbose=1
)
%time imp.fit(X)
%time imp.transform(X)
[IterativeImputer] Completing matrix with shape (20000, 5)

/home/nicholasb/miniconda3/envs/rapids-23.04-pytorch-py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py:188: UserWarning: To use pickling first train using float32 data to fit the estimator
  ret = func(*args, **kwargs)

[IterativeImputer] Change: 6.242554921969771, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.7837739564820407, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.4404752068462856, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 0.8634601400972206, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.2868782977721804, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.0378326375086948, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 0.885562925881626, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.6769846357842726, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.7261951505714752, scaled tolerance: 0.005423677066770994 
[IterativeImputer] Change: 1.236274386925659, scaled tolerance: 0.005423677066770994 
CPU times: user 1min 41s, sys: 27.4 s, total: 2min 9s
Wall time: 41.1 s
[IterativeImputer] Completing matrix with shape (20000, 5)

/home/nicholasb/miniconda3/envs/rapids-23.04-pytorch-py38/lib/python3.8/site-packages/sklearn/impute/_iterative.py:785: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  warnings.warn(

CPU times: user 57.7 s, sys: 5.28 s, total: 1min 2s
Wall time: 17.2 s

array([[-0.11293499, -0.44194113,  0.01140711, -1.42840088,  0.4762866 ],
       [-0.0854148 ,  0.25017095, -0.50090561, -0.6310341 ,  1.1444239 ],
       [-0.0881311 ,  0.5684028 ,  0.46720709, -1.51880184, -0.33367754],
       ...,
       [-0.03001158,  0.12153241, -0.56244333,  0.13445305,  1.02336402],
       [-0.19626069, -0.43141381, -0.1737855 , -2.31158473,  1.1256754 ],
       [ 0.10166457, -0.65108419,  0.70507894,  0.67257507, -1.49917859]])

vkhodygo commented 1 year ago

@beckernick Apologies for the delay, some other issues required my full attention.

I tried to run my original code, and (a slightly different version of) my dataset takes about 500 seconds per iteration. Your code, with the rest of the parameters staying the same, reduces this to something like 400 seconds. Note, however, that the number of estimators is lower here, and max_depth is limited to 16. Increasing n_estimators back to the initial value of 128 results in roughly the same timings, meaning no performance improvement whatsoever.

I strongly suspect that employing Dask here would improve the situation, but I have no prior experience running it on GPUs.
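
For reference, a minimal multi-GPU sketch with Dask-CUDA and cuml.dask (untested in this thread, and not a drop-in replacement for scikit-learn's IterativeImputer, which expects a single-process estimator; the data sizes are placeholders):

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask.array as da
    import cupy as cp
    from cuml.dask.ensemble import RandomForestRegressor

    cluster = LocalCUDACluster()   # one worker per visible GPU
    client = Client(cluster)

    # Placeholder data; cuml.dask estimators expect GPU-backed collections
    # (CuPy-backed Dask arrays or dask_cudf DataFrames), hence map_blocks.
    X = da.random.random((1_000_000, 40), chunks=(250_000, 40)).astype("float32").map_blocks(cp.asarray)
    y = da.random.random(1_000_000, chunks=250_000).astype("float32").map_blocks(cp.asarray)

    rf = RandomForestRegressor(n_estimators=128)
    rf.fit(X, y)
    pred = rf.predict(X).compute()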

vkhodygo commented 1 year ago

@beckernick I left it running overnight, and the result is pretty disappointing. Not only does it fail to converge (the change just keeps jumping around), it also eats up all of the available VRAM (24 GB) and terminates right after that. Does this look like a memory leak?
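
One thing that may be worth ruling out before calling it a leak (a hedged suggestion, not something verified in this thread): enabling RMM's managed (unified) memory lets GPU allocations oversubscribe the 24 GB of VRAM and spill to host memory instead of aborting:

    import rmm

    # Assumption: this needs to run before any cuML/CUDA allocations in the process.
    rmm.reinitialize(managed_memory=True)

If memory still grows monotonically across IterativeImputer rounds with managed memory enabled, that would point more strongly toward a leak.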

MessDeveloper commented 7 months ago

Hello, I've been running the following on an Intel i7 7700K / 32 GB RAM / Linux Ubuntu for close to one day and it is still running:

    imputer = KNNImputer(n_neighbors=5)
    df_imputed = pd.DataFrame(imputer.fit_transform(df_final), columns=df_final.columns)

On a dataset like the one shown in the attached screenshot.

The CPU benchmarks themselves are not bad, by which I mean the computer is not a bad system (or at least I believe that...), and I have a 12 GB RTX 4070 Ti GPU that I'd like to use for this kind of work. Right now I'm wasting time and increasing electricity costs.