MissForest imputation does not impute correctly when no column names are specified

theislab / ehrapy

Electronic Health Record Analysis with Python.

Apache License 2.0

214 stars 19 forks source link

import ehrapy as ep adata = ep.dt.diabetes_130(encoded=False, columns_obs_only=["age", "age_mean", "weight_mean", "weight", "patient_nbr", "payer_code" ]) adata=ep.pp.encode(adata,autodetect=True) ep.pp.miss_forest_impute(adata) ep.pp.pca(adata)

This is due to https://github.com/theislab/ehrapy/blob/development/ehrapy/preprocessing/_data_imputation.py#L337 (guess it was introduced in the last PR).

Should be: elif isinstance(var_names, (dict, type(None))): instead.

EDIT: Going through the code, I guess it does not make any sense to pass a list as var_names to miss_forest_impute. Currently, they're just passed to the num_imputer and this could lead to problems if some of the passed columns are non_numerical. That's the reason why I implemented it using a dict instead. IMO this should be removed.

EDIT 2: And we should always impute first and encode after, since when encoding first, all missing values in (at least) non_numerical values are lost (at least in X).

theislab / ehrapy

MissForest imputation does not impute correctly when no column names are specified #299