theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0

Increase performance of Miss Forest Imputation #283

Closed. Imipenem closed this issue 2 years ago

Imipenem commented 2 years ago

Is your feature request related to a problem? Please describe.

MissForest imputation gives good results, but it is slow when imputing over multiple columns. This is especially noticeable on larger datasets.

Describe the solution you would like

Depending on the size of the dataset and the number of features involved, iterative fitting may take a long time. Two options may help: use all available processors by setting n_jobs=-1 for both RandomForestRegressor and RandomForestClassifier, and patch scikit-learn with the Intel(R) Extension for Scikit-learn, as described at pypi.org/project/scikit-learn-intelex
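
To illustrate the n_jobs suggestion, here is a minimal sketch. It assumes a MissForest-style setup built from scikit-learn's IterativeImputer with a RandomForestRegressor estimator. This is an approximation for numeric columns only, not ehrapy's actual implementation. The toy data, n_estimators and max_iter values are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np

# Optional (assumption: scikit-learn-intelex is installed). The patch has to be
# applied before the scikit-learn estimators are imported:
# from sklearnex import patch_sklearn
# patch_sklearn()

# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Toy numeric data with roughly 10% of the entries set to NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # one correlated column
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# n_jobs=-1 lets every forest fit use all available cores.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0),
    max_iter=3,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

Whether this actually speeds things up depends on the data size: for small forests the overhead of spawning workers can outweigh the gain, so it is worth timing both variants.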

See https://stackoverflow.com/questions/64900801/implementing-knn-imputation-on-categorical-variables-in-an-sklearn-pipeline for a possible solution that may also help with speed.

Zethson commented 2 years ago

Our settings object has a field that controls the number of CPUs, and therefore the number of jobs. This implementation has to make use of it if it doesn't already.
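
A rough sketch of what "making use of it" could look like. The `Settings` class, the `settings` instance and the `resolve_n_jobs` helper are all hypothetical stand-ins written for illustration, not the package's real API. The handling of negative values follows the joblib convention (-1 means all cores, -2 means all but one).

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    """Hypothetical stand-in for a package-wide settings object."""

    n_jobs: int = 1  # -1 means "use all available cores"


settings = Settings()


def resolve_n_jobs(n_jobs: Optional[int] = None) -> int:
    """Return a concrete worker count, falling back to the global setting."""
    requested = settings.n_jobs if n_jobs is None else n_jobs
    if requested == 0:
        raise ValueError("n_jobs must not be 0")
    if requested < 0:
        # joblib convention: -1 -> all cores, -2 -> all but one, and so on.
        cpu_count = os.cpu_count() or 1
        return max(1, cpu_count + 1 + requested)
    return requested
```

An imputation function could then accept an optional `n_jobs` argument and pass `resolve_n_jobs(n_jobs)` on to the random forest estimators, so the global setting applies unless the caller overrides it.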

Zethson commented 2 years ago

Multi-output and sparse data are not supported by the Intel extension. Is this an issue?