theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0
214 stars 19 forks source link

MissForest imputation does not impute correctly when no column names are specified #299

Closed Zethson closed 2 years ago

Zethson commented 2 years ago

Describe the bug

Imputation not done.

To Reproduce

Steps to reproduce the behavior:

import ehrapy as ep
adata = ep.dt.diabetes_130(encoded=False, columns_obs_only=["age",
                                                            "age_mean",
                                                            "weight_mean",
                                                            "weight",
                                                            "patient_nbr",
                                                            "payer_code"
                                                            ])
adata=ep.pp.encode(adata,autodetect=True)
ep.pp.miss_forest_impute(adata)

ep.pp.pca(adata)

Crashes with message that the data contains NaNs.

Why is this not caught by the test?

Imipenem commented 2 years ago

This is due to https://github.com/theislab/ehrapy/blob/development/ehrapy/preprocessing/_data_imputation.py#L337 (guess it was introduced in the last PR).

Should be: elif isinstance(var_names, (dict, type(None))): instead.

EDIT: Going through the code, I guess it does not make any sense to pass a list as var_names to miss_forest_impute. Currently, they're just passed to the num_imputer and this could lead to problems if some of the passed columns are non_numerical. That's the reason why I implemented it using a dict instead. IMO this should be removed.

EDIT 2: And we should always impute first and encode after, since when encoding first, all missing values in (at least) non_numerical values are lost (at least in X).