Increase performance of Miss Forest Imputation

Imipenem commented 2 years ago

Is your feature request related to a problem? Please describe.

MissForest imputation is a nice algorithm that gives good results, but it is slow when imputing over multiple columns. This is especially annoying on larger datasets.

Describe the solution you would like

Depending on the size of the dataset and the number of features involved, iterative fitting may take a long time. Two things could help: use all processors by setting n_jobs=-1 for both RandomForestRegressor and RandomForestClassifier, and try patching scikit-learn with the Intel(R) Extension for Scikit-learn, as described here: pypi.org/project/scikit-learn-intelex
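To make the suggestion concrete, here is a minimal sketch of what I have in mind. It is not the current ehrapy implementation: it uses scikit-learn's IterativeImputer with a RandomForestRegressor as a MissForest-style stand-in for numeric columns only, and the helper name `impute_numeric`, the toy data, and the small `n_estimators`/`max_iter` values are my own assumptions for illustration. The Intel patch is optional and skipped if the package is not installed.

```python
# Optional: patch scikit-learn with the Intel extension if it is installed.
# This has to happen before the scikit-learn estimators are imported.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # fall back to stock scikit-learn

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def impute_numeric(X, n_jobs=-1, max_iter=3, random_state=0):
    """Hypothetical helper: MissForest-style imputation of numeric columns."""
    # n_jobs=-1 lets every forest use all available cores.
    estimator = RandomForestRegressor(
        n_estimators=20, n_jobs=n_jobs, random_state=random_state
    )
    imputer = IterativeImputer(
        estimator=estimator, max_iter=max_iter, random_state=random_state
    )
    return imputer.fit_transform(X)


# Toy data (assumption): 200 rows, 4 numeric features, roughly 10% missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan

X_imputed = impute_numeric(X)
```

Categorical columns would need the same treatment with a RandomForestClassifier(n_jobs=-1), which this sketch leaves out.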

See https://stackoverflow.com/questions/64900801/implementing-knn-imputation-on-categorical-variables-in-an-sklearn-pipeline to check whether this could be a possible solution and help with speed.

Zethson commented 2 years ago

Our settings object has a value that controls the number of CPUs, and therefore the number of jobs, here. This implementation has to make use of it if it doesn't already.
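Roughly what I mean, as a sketch only: the `Settings` class and the `make_forest_estimators` helper below are hypothetical stand-ins and not the actual ehrapy API. The point is that the job count comes from one central place and is passed on to both forest estimators.

```python
from dataclasses import dataclass

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


@dataclass
class Settings:
    """Hypothetical stand-in for a global settings object."""
    n_jobs: int = 1  # number of CPUs / parallel jobs to use


def make_forest_estimators(settings, n_estimators=100):
    """Hypothetical helper: build both estimators from the central n_jobs value."""
    regressor = RandomForestRegressor(
        n_estimators=n_estimators, n_jobs=settings.n_jobs
    )
    classifier = RandomForestClassifier(
        n_estimators=n_estimators, n_jobs=settings.n_jobs
    )
    return regressor, classifier


# Usage: one setting controls the parallelism of both estimators.
regressor, classifier = make_forest_estimators(Settings(n_jobs=2))
```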

Zethson commented 2 years ago

Multi-output and sparse data are not supported by the Intel extension. Is this an issue?

theislab / ehrapy

Increase performance of Miss Forest Imputation #283