rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
[FEA] Add missing data imputation methods, such as MissForest #2822

Closed miguelusque closed 2 years ago

miguelusque commented 4 years ago

Is your feature request related to a problem? Please describe.

Hi,

I might be wrong, but I have not seen any cuML method for missing data imputation.

I am referring to something similar to MissForest. I assume there is something already implemented, considering RandomForest makes use of it, but I am not sure if it is exposed in the Python layer.

Describe the solution you'd like

To have some data imputation methods available.

wphicks commented 4 years ago

I don't think we currently have anything along these lines, but sklearn supports mean, median, and mode imputation. It might be a good idea for us to start by adding that functionality and then move to more sophisticated methods like MissForest.
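For reference, the mean, median, and mode imputation mentioned above can be sketched with scikit-learn's `SimpleImputer`. This is a minimal CPU-side illustration on a made-up toy array, not cuML code; the data and variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): two numeric columns with one missing value each.
X = np.array([
    [1.0, 2.0],
    [np.nan, 2.0],
    [7.0, np.nan],
    [1.0, 8.0],
    [3.0, 4.0],
])

# Each strategy replaces NaN with a per-column statistic of the observed values.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(X_mean[1, 0], X_median[1, 0], X_mode[1, 0])  # filled values for column 0
print(X_mean[2, 1], X_median[2, 1], X_mode[2, 1])  # filled values for column 1
```

All three strategies ignore relationships between columns, which is the gap that model-based methods such as MissForest aim to close.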

One question for you, @miguelusque: You said that "RandomForest makes use of [MissForest]," but if I'm understanding correctly, the dependency goes the other way: MissForest uses RandomForest models to predict missing values. Am I misreading that? Just want to make sure I'm interpreting the feature you're looking for correctly.
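To make the direction of that dependency concrete, here is a rough sketch of a MissForest-style imputation: a random forest is fitted per feature to predict its missing entries from the other features, and the process is repeated for a few rounds. It uses scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as a stand-in, which is an assumption for illustration only. It is not the reference MissForest implementation (which also handles categorical features), and it is not cuML code. The synthetic data and parameter values are likewise made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic, correlated numeric features (hypothetical data).
n = 200
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=n)
x2 = x0 - x1 + rng.normal(scale=0.1, size=n)
X_true = np.column_stack([x0, x1, x2])

# Hide roughly 10% of the entries at random.
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Each round fits a forest per feature on the rows where that feature is
# observed, then predicts the rows where it is missing.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE on the hidden entries: {rmse:.3f}")
```

Observed entries pass through unchanged; only the masked positions are replaced by forest predictions.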

wphicks commented 4 years ago

Oh! It looks like #2645 includes the data imputation provided by sklearn. So, once that's landed, we can consider adding in MissForest and/or other techniques.

miguelusque commented 4 years ago

> I don't think we currently have anything along these lines, but sklearn supports mean, median, and mode imputation. It might be a good idea for us to start by adding that functionality and then move to more sophisticated methods like MissForest.
>
> One question for you, @miguelusque: You said that "RandomForest makes use of [MissForest]," but if I'm understanding correctly, the dependency goes the other way: MissForest uses RandomForest models to predict missing values. Am I misreading that? Just want to make sure I'm interpreting the feature you're looking for correctly.

You are fully right. It works as you said.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

miguelusque commented 3 years ago

Hi!

I think this feature request is still relevant.

Regards, Miguel

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

beckernick commented 2 years ago

Standard missing data imputation is now covered by SimpleImputer, and KNNImputer is work in progress (https://github.com/rapidsai/cuml/pull/4820).

I'm going to close this issue, as the broader MissForest technique is likely out of scope for cuML itself, but we'd love to see another library use cuML to provide it. @miguelusque, is there a Python library that provides support for MissForest today?