njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

geom_imputed_* and friends #35

Open njtierney opened 7 years ago

njtierney commented 7 years ago

This would be a new geom built for imputed data / imputed dataframes.

Not sure how the specifics of this would work, but something like:

ggplot(data = data_imputed,
       aes(x = var1,
           y = var2)) + 
  geom_imputed_point()

Then this could display something similar to geom_missing_point(), but instead show the imputed values in addition to the regular data.

This might use shadow_bind or shadow_augment or similar to represent the imputations somehow.
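
As a rough illustration of the shadow idea, here is a minimal base R sketch that keeps a logical indicator column next to each variable before imputing. The `_NA` suffix, the `impute_mean()` helper, and the `any_imputed` column are assumptions made for this example, not naniar's API, and mean imputation is only a stand-in for a real method.

```r
# Small example data with missing values in both columns.
dat <- data.frame(
  var1 = c(1.2, NA, 3.1, 4.8, NA),
  var2 = c(10, 20, NA, 40, 50)
)

# Shadow matrix: one logical indicator column per variable.
# The "_NA" suffix is an illustrative naming choice.
shadow <- as.data.frame(lapply(dat, is.na))
names(shadow) <- paste0(names(dat), "_NA")

# Hypothetical helper: mean imputation as a placeholder method.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Imputed values and their shadow indicators side by side.
dat_imputed <- cbind(as.data.frame(lapply(dat, impute_mean)), shadow)

# Flag rows where either plotted variable was imputed.
dat_imputed$any_imputed <- dat_imputed$var1_NA | dat_imputed$var2_NA
```

With the indicator in place, an ordinary `geom_point()` with `colour = any_imputed` in `aes()` would already distinguish imputed from observed points.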

njtierney commented 7 years ago

Perhaps this could include something like stat_function where users suggest an imputation method and then pass args to it.

Something like:

ggplot(data = data,
       aes(x = var1,
           y = var2)) + 
  geom_imputed_point(fun = mice,
                     args = list(mice_options...))

ggplot(data = data,
       aes(x = var1)) + 
  geom_imputed_density(fun = mice,
                       args = list(mice_options...))

Just an idea to keep track of
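
The `fun` plus `args` pattern above is essentially `do.call()`. A minimal sketch of how the arguments might be forwarded, where `impute_with()` and `impute_constant()` are hypothetical helpers written for this example (not part of naniar, ggplot2, or mice):

```r
# Hypothetical helper: call a user-supplied imputation function,
# passing the data first and any extra arguments after it.
impute_with <- function(x, fun, args = list()) {
  do.call(fun, c(list(x), args))
}

# Stand-in imputation function: replace NA with a constant.
impute_constant <- function(x, value = 0) {
  x[is.na(x)] <- value
  x
}

x <- c(1, NA, 3)
x_imputed <- impute_with(x, fun = impute_constant, args = list(value = -99))
```

A real method such as mice would need a wrapper that returns a completed vector or data frame, since it returns its own object type rather than the filled-in data.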

njtierney commented 7 years ago

In this way, geom_imputed_* would have similar options to geom_smooth(), where you can specify method = "lm", "loess", etc.

However, I am not convinced that imputations should have the same treatment - you often want to use these values again.

So I think that imputing values should be a separate data tidying step.

There needs to be a clever way to keep track of the values that are imputed, without blowing out the size of the dataframe by storing the entire dataset twice (or m times, for multiple imputation). This is the idea behind the shadow matrix, but I wonder if there should be a better way to store this info in a nice print method, where users don't see some shadow vector/index that sits behind the data.
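
One possible shape for this, sketched in base R under stated assumptions: keep the original data once, and store only the imputed cells in a long table with one row per missing cell per imputation. The column names (`.imp`, `row`, `variable`, `value`) and the `complete_data()` helper are made up for this example, and the random draw from observed values is a placeholder for a real imputation model.

```r
set.seed(2024)
dat <- data.frame(x = c(5, NA, 7, NA, 9), y = c(NA, 2, 3, 4, 5))

# (row, col) positions of the missing cells.
miss_idx <- which(is.na(as.matrix(dat)), arr.ind = TRUE)

m <- 3  # number of imputations

# Long table: m * (number of missing cells) rows, not m full copies of dat.
imputations <- do.call(rbind, lapply(seq_len(m), function(i) {
  data.frame(
    .imp = i,
    row = miss_idx[, "row"],
    variable = names(dat)[miss_idx[, "col"]],
    # Placeholder imputation: draw one observed value from the same column.
    value = vapply(miss_idx[, "col"], function(j) {
      obs <- dat[[j]][!is.na(dat[[j]])]
      obs[sample.int(length(obs), 1)]
    }, numeric(1))
  )
}))

# Hypothetical helper: rebuild the i-th completed dataset on demand.
complete_data <- function(dat, imputations, i) {
  imp <- imputations[imputations$.imp == i, ]
  for (k in seq_len(nrow(imp))) {
    dat[imp$row[k], imp$variable[k]] <- imp$value[k]
  }
  dat
}

completed_1 <- complete_data(dat, imputations, 1)
```

The table also doubles as the record of which values were imputed, so no separate shadow columns are needed for this layout. A print method could hide it from the user.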

Need to collate all of these thoughts together.

dicook commented 7 years ago

Imputation should be a stat, not a geom, if it is to be included with ggplot.

njtierney commented 7 years ago

Thanks Di!

So, this should be something along the lines of stat_impute - which you can almost imagine existing for this example here.

Here, the code might look like the following:

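library(ggplot2)

set.seed(1492)
df <- data.frame(
  x = rnorm(100)
)

# Set 10 randomly chosen values to missing
df$x[sample(x = 100, size = 10)] <- NA

df

# Density of the observed values (ggplot2 drops the NAs with a warning)
base <- ggplot(df, aes(x)) + geom_density()

# stat_impute() is the proposed stat and is not implemented;
# this call only shows the intended interface
base + stat_impute(fun = impute_lm,
                   colour = "red",
                   args = list(x ~ .))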

dicook commented 7 years ago

stat_impute doesn't work for me with the current naniar, but that is ok. I get the gist of it.

colour should be associated with geom_density, and it should be mapped to a variable indicating missing status.
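
A minimal sketch of this suggestion, treating imputation as a separate tidying step and using only existing ggplot2 functions. The `x_missing` column name is an assumption for this example, and drawing replacements at random from the observed values is a placeholder for a real imputation method.

```r
library(ggplot2)

set.seed(1492)
df <- data.frame(x = rnorm(100))
df$x[sample(100, 10)] <- NA

# Record missingness before imputing, so colour can be mapped to it.
df$x_missing <- ifelse(is.na(df$x), "imputed", "observed")

# Placeholder imputation: sample replacements from the observed values.
observed <- df$x[!is.na(df$x)]
df$x[is.na(df$x)] <- sample(observed, size = sum(is.na(df$x)), replace = TRUE)

# One density per missingness group, coloured by status.
p <- ggplot(df, aes(x = x, colour = x_missing)) +
  geom_density()
```

Printing `p` draws two density curves, one for the observed values and one for the imputed values.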