kearnz / autoimpute

Python package for Imputation Methods
MIT License

Question about applying MICEImputer to new data. #54

Closed ragAgar closed 3 years ago

ragAgar commented 3 years ago

Hi,

Thank you for sharing such a useful and flexible implementation! I find it interesting because I've been wanting to do multiple imputation in Python.

I have two questions about applying MiceImputer to new data.

I want to run MICE the way the mice package in R does, but I also want to apply the fitted imputer to new data, which is a common situation in machine learning. I tried the following code to apply MiceImputer to new data, but I got a ValueError: Variable name p_pred already exists. When I used binary logistic and least squares instead of the bayesian methods, there were no errors.

First, are binary logistic and least squares the same as logreg and norm in mice? Second, can the bayesian strategies be applied to new data?

Thank you for reading my long request, and sorry for my poor English.

# make experiment dataset 

import numpy as np
import pandas as pd
import warnings; warnings.simplefilter('ignore')
from autoimpute.imputations import MiceImputer

N_sample = 2000
N_missing = 1000

## make data cont & cat columns
np.random.seed(0)
cat = np.random.choice([0,1], N_sample)
cont = np.random.normal(0,1,N_sample)

df_imcomplete = pd.DataFrame(np.c_[cat,cont], columns=["cat", "cont"])

## filling with NA
np.random.seed(1)
ix = np.random.choice(np.arange(N_sample), N_missing)
iy = np.random.choice([0,1], N_missing)
for i in range(N_missing):
    df_imcomplete.iloc[ix[i], iy[i]] = np.nan

## split
X_train = df_imcomplete[:1000]
X_test  = df_imcomplete[1000:]

# Mice imputation
imp = MiceImputer(
    n = 2,
    seed = 2,
    strategy = {"cat":"bayesian binary logistic", "cont":"bayesian least squares"},
    return_list = True,
)

# learn & apply impute
imputer = imp.fit(X_train)
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

kearnz commented 3 years ago

Hi @ragAgar, will look into this in the next few days and let you know.

kearnz commented 3 years ago

Hi @ragAgar, apologies it took so long to get back to you.

You've found a bug in how the MiceImputer implements bayesian methods. Note that both bayesian binary logistic and bayesian least squares suffer from this issue. Essentially, pymc3, the underlying package we leverage for building bayesian models, does not allow you to redefine an existing deterministic variable within the same model. autoimpute tries to do exactly that when it iterates through imputations. I will work on a fix for this bug this weekend, when I'm tackling another issue.

If you're interested, here's where that pops up in autoimpute. Both the MultipleImputer and the MiceImputer create n SingleImputer instances under the hood (in your example, n=2). In the MultipleImputer, each of those n instances iterates k=1 time. So if you use a bayesian method, the bayesian model variables are created once for each of the n instances. Perfectly valid. But in the MiceImputer, each of the n SingleImputer instances iterates k=5 times (by default). So each instance tries to recreate its bayesian variables k times, and the second attempt throws the error.
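
To make the n-versus-k point concrete, here is a rough pure-Python sketch. It is not autoimpute's or pymc3's actual code: ToyModel, add_deterministic, run_imputer, and outcome are hypothetical names made up for illustration, and the only behavior assumed is that a model refuses to register the same variable name twice.

```python
class ToyModel:
    """Hypothetical stand-in for a model that rejects duplicate variable names."""

    def __init__(self):
        self.named_vars = {}

    def add_deterministic(self, name, value):
        # Mimics the reported error message; this is an assumption about
        # the behavior, not a copy of any library's internals.
        if name in self.named_vars:
            raise ValueError(f"Variable name {name} already exists.")
        self.named_vars[name] = value


def run_imputer(n, k):
    """Build n independent toy models and define 'p_pred' k times in each."""
    for _ in range(n):
        model = ToyModel()  # one model per SingleImputer-like instance
        for iteration in range(k):  # k passes reuse the same model
            model.add_deterministic("p_pred", iteration)
    return "ok"


def outcome(n, k):
    """Return 'ok' on success, or the error message on failure."""
    try:
        return run_imputer(n, k)
    except ValueError as err:
        return str(err)


multiple_result = outcome(n=2, k=1)  # MultipleImputer-like: one pass per instance
mice_result = outcome(n=2, k=5)      # MiceImputer-like: five passes per instance

print(multiple_result)
print(mice_result)
```

With k=1 every name is defined exactly once per model, so nothing collides. With k=5 the second pass over the same model hits the duplicate name, which matches the ValueError in your traceback.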

I'll keep you updated when the release is ready. For now, I'd recommend just using the default strategies.

kearnz commented 3 years ago

Closing this issue and creating a separate bug report.