oldoc63 / learningDS

Learning DS with Codecademy and Books

Multiple imputation: Try and try again #562

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

Imagine you are taking a final exam for a science class. As you go through the test, you find some questions that you can't remember the answers to, so you decide to take a guess. Later on in the exam, some of the later questions jog your memory because they contain clues about the earlier answers. Luckily, you can use this new knowledge to go back and fill in your previous guesses, and hopefully you will get a better score on the exam!

This kind of iterative process happens all the time in data and analytical systems, and it is something we can apply to missing data as well. The technique is known as multiple imputation.

oldoc63 commented 1 year ago

What is Multiple Imputation?

Multiple imputation is a technique for filling in missing data in which we replace the missing values multiple times. It is particularly useful when we have missing data across multiple columns of our dataset. After trying different values, we use an algorithm to pick the best ones to replace our missing data. By repeating this process, we gradually converge on plausible values for our missing data.
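To make the idea concrete, here is a toy sketch of the "try and try again" loop, assuming purely numeric columns with at least some observed values in each, and using a simple linear model to re-predict each column from the others (the function name `iterative_impute` is illustrative, not part of any library):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def iterative_impute(df, n_iter=10):
    """Round-robin imputation sketch: start from column means,
    then repeatedly re-predict each missing value from the other columns."""
    filled = df.fillna(df.mean())   # initial guess: column means
    missing = df.isna()             # remember where the gaps were
    for _ in range(n_iter):
        for col in df.columns:
            if not missing[col].any():
                continue
            others = [c for c in df.columns if c != col]
            model = LinearRegression()
            # train on rows where this column was actually observed
            model.fit(filled.loc[~missing[col], others],
                      df.loc[~missing[col], col])
            # re-predict the rows where it was missing
            filled.loc[missing[col], col] = model.predict(
                filled.loc[missing[col], others])
    return filled
```

This round-robin refinement is the same basic idea that dedicated implementations such as sklearn's IterativeImputer carry out with more care.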

oldoc63 commented 1 year ago

When to use it

Multiple imputation is a powerful technique for replacing missing data, but there are some requirements and considerations to take into account before using it.

Multiple imputation is best suited to MAR (missing at random) data, so we should ensure that our data fits that description. With MAR data, the assumption is that there is an underlying reason for the missing values and that we have a good understanding of why that data is missing. Since the missingness is not completely random, filling in the blanks with random values is not sufficient; we must use the context of the rest of the data to help.

Assuming we meet the criteria for using multiple imputation, our dataset will gain a couple of key benefits: because the imputed values are modeled on the rest of the data, they tend to introduce less bias than simpler approaches such as filling every gap with a single mean value.

oldoc63 commented 1 year ago

How to use it

Now that we understand what multiple imputation is trying to do, let’s go ahead and try it out! We don’t have to create our own algorithm to fill in our data, as there are many different approaches and pre-built libraries that can help us out.

One place to start is the IterativeImputer module within sklearn. It performs multiple imputation while fitting into the existing sklearn and pandas DataFrame workflows. Let's assume that we have the following dataset:

| X | Y | Z |
| -- | -- | -- |
| 5.4 | 18.0 | 7.6 |
| 13.8 | 27.4 | 4.6 |
| 14.7 | | 4.2 |
| 17.6 | 18.3 | |
| | 49.6 | 4.7 |
| 1.1 | 48.9 | 8.5 |
| 12.9 | | 3.5 |
| 3.4 | 13.6 | |
| | 16.1 | 1.8 |
| 10.2 | 42.7 | 4.7 |

If we wanted to use the IterativeImputer module here, our code would look like the following:
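A minimal sketch, assuming the table above is rebuilt by hand as a DataFrame (here called dTest, an illustrative name) and that the results are rounded to one decimal place; the exact imputed values depend on choices such as the estimator and random_state:

```python
import numpy as np
import pandas as pd

# IterativeImputer is still experimental, so it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Rebuild the dataset from the table above (np.nan marks the missing values)
d = {
    'X': [5.4, 13.8, 14.7, 17.6, np.nan, 1.1, 12.9, 3.4, np.nan, 10.2],
    'Y': [18.0, 27.4, np.nan, 18.3, 49.6, 48.9, np.nan, 13.6, 16.1, 42.7],
    'Z': [7.6, 4.6, 4.2, np.nan, 4.7, 8.5, 3.5, np.nan, 1.8, 4.7],
}
dTest = pd.DataFrame(data=d)

# Create the imputer, fit it to the data, and fill in the missing values
imp = IterativeImputer(max_iter=10, random_state=1)
dTest = pd.DataFrame(
    np.round(imp.fit_transform(dTest), 1),
    columns=['X', 'Y', 'Z'],
)

print(dTest)
```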

oldoc63 commented 1 year ago

After running our code, our dataset now looks like this (the imputed values are in bold):

| X | Y | Z |
| -- | -- | -- |
| 5.4 | 18.0 | 7.6 |
| 13.8 | 27.4 | 4.6 |
| 14.7 | **17.4** | 4.2 |
| 17.6 | 18.3 | **5.6** |
| **11.2** | 49.6 | 4.7 |
| 1.1 | 48.9 | 8.5 |
| 12.9 | **17.4** | 3.5 |
| 3.4 | 13.6 | **5.7** |
| **11.2** | 16.1 | 1.8 |
| 10.2 | 42.7 | 4.7 |

oldoc63 commented 1 year ago

As we can see, the imputed data looks and behaves much like the rest of the dataset. With only a few lines of code, we can use a library like this to fill in our missing data to the best of our ability.

oldoc63 commented 1 year ago

Use the IterativeImputer module within sklearn to impute the missing data values:

- Use 10 iterations.
- Set random_state to 1.
- Set dfComplete equal to the resulting DataFrame.
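
One way these checkpoints could be satisfied is sketched below, with a small placeholder DataFrame standing in for the exercise's actual data (the variable df and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Placeholder data standing in for the exercise's DataFrame (illustrative only)
df = pd.DataFrame({
    'X': [5.4, np.nan, 14.7],
    'Y': [18.0, 27.4, np.nan],
    'Z': [np.nan, 4.6, 4.2],
})

# 10 iterations and random_state of 1, as the checkpoints require
imp = IterativeImputer(max_iter=10, random_state=1)

# dfComplete is the resulting DataFrame with the missing values filled in
dfComplete = pd.DataFrame(imp.fit_transform(df), columns=df.columns)

print(dfComplete)
```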