njtierney / mputr

Package for handling multiple imputations in a tidy format

`impl`: an imputation data structure in naniar #6

Open njtierney opened 7 years ago

njtierney commented 7 years ago

You get some data, and it has missing values, which you might identify with a tool like visdat::vis_miss:

library(visdat)
vis_miss(airquality)

You might then decide that you are going to impute the values of Solar.R using all available data. The package simputation makes this really easy:

library(simputation)
da1 <- impute_lm(airquality, Solar.R ~ .)

We can look at where the values were imputed:

library(tibble)
as_tibble(airquality)

#> # A tibble: 153 × 6
#>    Ozone Solar.R  Wind  Temp Month   Day
#>    <int>   <int> <dbl> <int> <int> <int>
#> 1     41     190   7.4    67     5     1
#> 2     36     118   8.0    72     5     2
#> 3     12     149  12.6    74     5     3
#> 4     18     313  11.5    62     5     4
#> 5     NA      NA  14.3    56     5     5
#> 6     28      NA  14.9    66     5     6
#> 7     23     299   8.6    65     5     7
#> 8     19      99  13.8    59     5     8
#> 9      8      19  20.1    61     5     9
#> 10    NA     194   8.6    69     5    10
#> # ... with 143 more rows

as_tibble(da1)
#> # A tibble: 153 × 6
#>    Ozone  Solar.R  Wind  Temp Month   Day
#> *  <int>    <dbl> <dbl> <int> <int> <int>
#> 1     41 190.0000   7.4    67     5     1
#> 2     36 118.0000   8.0    72     5     2
#> 3     12 149.0000  12.6    74     5     3
#> 4     18 313.0000  11.5    62     5     4
#> 5     NA       NA  14.3    56     5     5
#> 6     28 194.8581  14.9    66     5     6
#> 7     23 299.0000   8.6    65     5     7
#> 8     19  99.0000  13.8    59     5     8
#> 9      8  19.0000  20.1    61     5     9
#> 10    NA 194.0000   8.6    69     5    10
#> # ... with 143 more rows

But it is not very clear now which values were imputed.

You can do something with vis_compare:

vis_compare(airquality, da1)
#> vis_compare is in BETA! If you have suggestions or errors
#> post an issue at https://github.com/njtierney/visdat/issues

Unfortunately it picks up on the different class, as Solar.R changed from integer to double.

To help manage these values there should be a way to store which values were imputed, while remaining firmly in the tidyverse.

I think that something like an impl is needed:

imp_df = imputation + tbl_df

An impl object stores the locations of the values that were missing when the data was imputed, and updates that record as the data changes, while otherwise behaving as normal for dplyr functions. It may even have its own methods for dplyr verbs, as sf does: mutate.impl, summarise.impl, etc.
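As a rough illustration of the storage half of this idea, the sketch below records the initially missing cells as an attribute on a data frame subclass. It uses base R only; `new_impl()` and the `initial_missing` attribute are hypothetical names made up for this example, not existing naniar functions.

```r
# Hypothetical constructor: `new_impl()` and the "initial_missing"
# attribute are illustrative names, not part of naniar.
new_impl <- function(data) {
  structure(
    data,
    initial_missing = is.na(data),   # logical matrix, one cell per value
    class = c("impl", class(data))
  )
}

aq_impl <- new_impl(airquality)

inherits(aq_impl, "data.frame")            # still a data frame for other tools
sum(attr(aq_impl, "initial_missing"))      # cells missing before any imputation
colSums(attr(aq_impl, "initial_missing"))  # per-variable counts
```

Keeping the attribute in sync when rows are filtered or columns are changed is the part that would need dedicated methods, which this sketch does not attempt.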

The print method would look very similar to a tibble's, with some additional features: it would put braces around the values that were imputed, and state the number of imputed values overall. An idea of what this might look like is shown below:

#> # An impl: 153 × 6
#> # <value> imputed values
#>    Ozone Solar.R  Wind  Temp Month   Day
#>    <int>   <int> <dbl> <int> <int> <int>
#> 1     41     190   7.4    67     5     1
#> 2     36     118   8.0    72     5     2
#> 3     12     149  12.6    74     5     3
#> 4     18     313  11.5    62     5     4
#> 5   {21}   {312}  14.3    56     5     5
#> 6     28   {109}  14.9    66     5     6
#> 7     23     299   8.6    65     5     7
#> 8     19      99  13.8    59     5     8
#> 9      8      19  20.1    61     5     9
#> 10    NA     194   8.6    69     5    10
#> # ... with 143 more rows
#> # ... and  <value> more imputed values
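The brace notation could be produced by a small formatting helper along these lines. This is only a sketch: `brace_imputed()` is a hypothetical name, the values are the first six Ozone entries from the mock-up above, and a real print method would need to hook into tibble's printing machinery instead.

```r
# Hypothetical helper: wrap the imputed entries of one column in braces
# and right-align the result, as in the mock-up print method.
brace_imputed <- function(values, imputed) {
  txt <- as.character(values)
  txt[imputed] <- paste0("{", txt[imputed], "}")
  format(txt, justify = "right")
}

# Illustrative values from the mock-up; the fifth entry is the imputed one
ozone   <- c(41, 36, 12, 18, 21, 28)
imputed <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)

formatted <- brace_imputed(ozone, imputed)
writeLines(formatted)
```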

Now, this object may then play well with dplyr verbs. Perhaps an impl object will allow numerical summaries of missing values to be completed as normal, and behave differently for summarise functions in dplyr. For example, something like this (sans imputation):

library(dplyr)
airquality %>%
  group_by(Month) %>%
  summarise(mean_ozone = mean(Ozone, na.rm = TRUE))
#> # A tibble: 5 × 2
#>   Month mean_ozone
#>   <int>      <dbl>
#> 1     5   23.61538
#> 2     6   29.44444
#> 3     7   59.11538
#> 4     8   59.96154
#> 5     9   31.44828

With impl objects it may provide something different, giving summaries of the updated, imputed values:

#> # A tibble: 5 × 3
#>   Month mean_ozone mean_ozone_imp
#>   <int>      <dbl>          <dbl>
#> 1     5   23.61538        <value>
#> 2     6   29.44444        <value>
#> 3     7   59.11538        <value>
#> 4     8   59.96154        <value>
#> 5     9   31.44828        <value>
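A side-by-side summary like this can already be approximated by hand if a flag column records which values were imputed. The sketch below uses base R only, with a plain linear model as a stand-in imputation; the `Ozone_imputed` column and the choice of predictors are assumptions made for illustration, not a proposed naniar interface.

```r
# Stand-in imputation: fill missing Ozone from a linear model on the
# fully observed variables (Solar.R is left out because it has gaps too).
fit <- lm(Ozone ~ Wind + Temp + Month + Day, data = airquality)

aq <- airquality
aq$Ozone_imputed <- is.na(aq$Ozone)   # hypothetical flag column
aq$Ozone[aq$Ozone_imputed] <- predict(fit, newdata = aq[aq$Ozone_imputed, ])

observed <- !aq$Ozone_imputed
summary_tbl <- data.frame(
  Month          = sort(unique(aq$Month)),
  # mean of the originally observed values only
  mean_ozone     = as.vector(tapply(aq$Ozone[observed], aq$Month[observed], mean)),
  # mean over observed and imputed values together
  mean_ozone_imp = as.vector(tapply(aq$Ozone, aq$Month, mean))
)
summary_tbl
```

The `mean_ozone` column reproduces the `na.rm = TRUE` summary above, while `mean_ozone_imp` depends entirely on the imputation model chosen.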

This may also play nicely with ggplot2, further down the track.

Related to this topic is how these values will be stored, and how imputation will occur in naniar.

This is related to njtierney/naniar#35 and njtierney/naniar#28

njtierney commented 7 years ago

This sort of approach (as in njtierney/naniar#48) might be more appropriate when you are working with larger data structures.

I think that ideally you should instead first use bind_shadow() on the data, then impute:

airquality %>%
  bind_shadow() %>%
  na_impute(var ~ ., method = ...)

Although from here I need to think more carefully about what happens when the imputation does not impute all of the values in a variable.
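The partial-imputation case is easy to reproduce. The sketch below uses a base R stand-in for the shadow step (`add_shadow()` is a hypothetical helper that adds logical `_NA` columns, a simplification of what naniar's bind_shadow() does) and a linear model that includes Ozone as a predictor, so rows where Ozone is also missing cannot be filled.

```r
# Hypothetical base R stand-in for the shadow step: one logical
# `<variable>_NA` column per variable, TRUE where the value was missing.
add_shadow <- function(data) {
  shadow <- as.data.frame(is.na(data))
  names(shadow) <- paste0(names(data), "_NA")
  cbind(data, shadow)
}

aq <- add_shadow(airquality)

# Impute Solar.R from a model that uses Ozone, which has its own gaps
fit <- lm(Solar.R ~ Ozone + Wind + Temp + Month + Day, data = aq)
to_fill <- is.na(aq$Solar.R)
aq$Solar.R[to_fill] <- predict(fit, newdata = aq[to_fill, ])

sum(aq$Solar.R_NA)      # values missing before imputation
sum(is.na(aq$Solar.R))  # values the model could not fill
```

Because the shadow column is fixed before imputation, the difference between the two counts is the number of values that were actually imputed.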

njtierney commented 7 years ago

The which_na function stores the rows and cols of the missing values in naniar. This could be stored behind the scenes to give locations of imputed values, and missing values.

naniar::which_na(airquality)
      row col
 [1,]   5   1
 [2,]  10   1
 [3,]  25   1
 [4,]  26   1
 [5,]  27   1
.
.
.
[40,]  11   2
[41,]  27   2
[42,]  96   2
[43,]  97   2
[44,]  98   2

An impl should then use this information to identify which values are still missing, and which are imputed.
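One way this lookup might work is sketched below using base R's `which(..., arr.ind = TRUE)`, which yields the same kind of row/col matrix. Mean imputation of Ozone alone is an arbitrary stand-in, chosen so that the Solar.R gaps remain; the `locations` table and its `status` column are illustrative names.

```r
# Row/col positions of every missing value before imputation
na_before <- which(is.na(airquality), arr.ind = TRUE)

# Stand-in imputation: fill Ozone with its mean, leave Solar.R untouched
aq <- airquality
aq$Ozone[is.na(aq$Ozone)] <- mean(aq$Ozone, na.rm = TRUE)

# Revisit the stored positions and check which are still NA
still_na <- is.na(as.matrix(aq)[na_before])

locations <- data.frame(
  row    = na_before[, "row"],
  col    = names(airquality)[na_before[, "col"]],
  status = ifelse(still_na, "missing", "imputed")
)
table(locations$col, locations$status)
```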