Open njtierney opened 7 years ago
This sort of approach (as in njtierney/naniar#48) might be more appropriate when you are working with larger data structures.
I think that ideally you should instead first apply bind_shadow() to the data, then impute:

    airquality %>%
      bind_shadow() %>%
      na_impute(var ~ ., method = ...)

Although from here I need to think more carefully about what happens when the imputation does not impute all of the values in a variable.
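The bind-then-impute pattern above can be roughly sketched in base R. This is only an illustration under assumptions: the shadow columns are built by hand rather than with bind_shadow() (the "_NA" suffix and the "NA"/"!NA" labels mimic naniar's convention), and mean imputation of Ozone stands in for the proposed na_impute() step, which does not exist yet.

```r
# Build a shadow matrix by hand (assumption: naniar's bind_shadow() would do
# this in practice; the "_NA" suffix and "NA"/"!NA" labels mimic its convention).
shadow <- as.data.frame(lapply(airquality, function(x) {
  factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(airquality), "_NA")
aq_shadow <- cbind(airquality, shadow)

# Stand-in for the proposed na_impute() step (hypothetical): mean-impute Ozone only.
aq_shadow$Ozone[is.na(aq_shadow$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

# Ozone no longer contains missing values, but the shadow column still
# records which rows were originally missing.
anyNA(aq_shadow$Ozone)
table(aq_shadow$Ozone_NA)
```

Because the shadow columns are bound before imputing, the record of what was missing survives the imputation step without any extra bookkeeping.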
The which_na function in naniar returns the rows and columns of the missing values. This could be stored behind the scenes to give the locations of both imputed values and missing values.
    naniar::which_na(airquality)
          row col
     [1,]   5   1
     [2,]  10   1
     [3,]  25   1
     [4,]  26   1
     [5,]  27   1
      ...
    [40,]  11   2
    [41,]  27   2
    [42,]  96   2
    [43,]  97   2
    [44,]  98   2

An impl should then use this information to identify which values are still missing and which are imputed.
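As a rough sketch of how stored locations could separate "still missing" from "imputed", the example below records the row/col matrix before imputing and checks it against the data afterwards. It uses base R's which(..., arr.ind = TRUE) to produce the same kind of matrix as shown above, and imputes only Ozone with its mean to mimic an imputation that leaves some values unfilled. The names na_locs, imputed and still_missing are illustrative assumptions, not part of naniar.

```r
# Record where the missing values are before any imputation.
# (Assumption: base R stand-in for the row/col matrix shown above.)
na_locs <- which(is.na(airquality), arr.ind = TRUE)

# Impute only Ozone (column 1) with its mean and leave Solar.R untouched,
# mimicking an imputation that does not fill every missing value.
imputed <- airquality
imputed$Ozone[is.na(imputed$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

# Look up each stored location in the current data: a location that is
# still NA is "still missing", any other stored location was imputed.
still_missing <- is.na(as.matrix(imputed)[na_locs])

# Tabulate the status of every originally missing cell by column index.
table(col = na_locs[, "col"], still_missing)
```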
You get some data, and it has missing values, which you might identify with a tool like visdat::vis_miss.
You might then decide that you are going to impute the values of Solar.R using all available data. The simputation package makes this really easy, and we can then look at where the values were imputed.
But it is not very clear now which values were imputed.
You can do something with vis_compare. Unfortunately it picks up on the different class, as Solar.R changed from integer to double.
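The class change can be reproduced without any plotting. The sketch below uses mean imputation in base R as a stand-in (an assumption, not the simputation call from the original example) to show how filling an integer column with imputed values turns it into a double, and how a comparison based on missingness alone avoids the problem.

```r
# Solar.R is stored as an integer column in the built-in airquality data.
before <- airquality
class(before$Solar.R)

# Stand-in imputation (assumption): replace missing values with the column mean.
after <- before
after$Solar.R[is.na(after$Solar.R)] <- mean(before$Solar.R, na.rm = TRUE)

# mean() returns a double, so R coerces the whole column to double.
class(after$Solar.R)

# A comparison based on missingness rather than class: flag only the cells
# that were missing before and are filled in afterwards.
changed <- is.na(before$Solar.R) & !is.na(after$Solar.R)
sum(changed)
```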
To help manage these values there should be a way to store which values were imputed, while remaining firmly in the tidyverse.
I think that something like an impl is needed.
An impl object stores the initial missing values from when the data was imputed and updates as the data changes, but otherwise behaves as normal for dplyr functions. It may even have its own dplyr verbs, similar to sf: mutate.impl, summarise.impl, etc.
The print method would look really similar to that of a tibble, with some additional features: it would add braces around the missing values that were imputed, and also state some information about the number of imputed values overall.
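One possible shape for such an object is a thin S3 class on top of a data frame. The sketch below is purely hypothetical: as_impl(), print.impl() and the "imputed_at" attribute are invented here for illustration and are not part of naniar, and the attribute would not survive most dplyr verbs without dedicated methods.

```r
# Hypothetical constructor: wrap the imputed data frame and remember which
# cells were missing in the original but are filled in now.
as_impl <- function(original, imputed) {
  stopifnot(identical(dim(original), dim(imputed)))
  attr(imputed, "imputed_at") <- is.na(original) & !is.na(imputed)
  class(imputed) <- c("impl", class(imputed))
  imputed
}

# Hypothetical print method: report the number of imputed values and wrap
# each imputed value in braces for the first n rows.
print.impl <- function(x, n = 6, ...) {
  flags <- attr(x, "imputed_at")
  cat("# An impl:", nrow(x), "x", ncol(x), "with", sum(flags), "imputed values\n")
  rows <- seq_len(min(n, nrow(x)))
  shown <- lapply(seq_len(ncol(x)), function(j) {
    vals <- format(x[[j]][rows])
    ifelse(flags[rows, j], paste0("{", trimws(vals), "}"), vals)
  })
  names(shown) <- names(x)
  print(as.data.frame(shown, stringsAsFactors = FALSE))
  invisible(x)
}

# Usage: impute Solar.R with its rounded mean (a stand-in imputation), then print.
aq_imputed <- airquality
aq_imputed$Solar.R[is.na(aq_imputed$Solar.R)] <-
  round(mean(airquality$Solar.R, na.rm = TRUE))
aq_impl <- as_impl(airquality, aq_imputed)
print(aq_impl)
```

Storing a logical matrix as an attribute keeps the sketch small; a real implementation would need to decide how that record is updated when rows are filtered or columns are mutated.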
Now, this object may then play well with dplyr verbs. Perhaps an impl object will allow numerical summaries of missing values to be completed as normal, and might, for example, behave differently for summarise functions in dplyr: where a summary computed sans imputation gives the usual result, the same call may provide something different with impl objects, giving summaries about the updated imputed values.
This may also play nicely with ggplot2 further down the track.
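To make the idea of an imputation-aware summary concrete, here is a minimal base R sketch. It is an assumption about what such a summary might report, not a proposed summarise.impl: median imputation stands in for a real imputation method, and the column names n_imputed, mean_with_imputed and mean_observed_only are made up for illustration.

```r
# Stand-in for an imputed data set (assumption: median imputation per column).
original <- airquality
imputed <- original
for (v in names(imputed)) {
  miss <- is.na(imputed[[v]])
  imputed[[v]][miss] <- median(original[[v]], na.rm = TRUE)
}

# Logical matrix marking the cells that were filled in by the imputation.
imputed_at <- is.na(original) & !is.na(imputed)

# Hypothetical summary: per variable, the number of imputed values, and the
# mean computed with and without the imputed values.
impl_summary <- data.frame(
  variable = names(imputed),
  n_imputed = unname(colSums(imputed_at)),
  mean_with_imputed = unname(vapply(imputed, mean, numeric(1))),
  mean_observed_only = unname(vapply(original, mean, numeric(1), na.rm = TRUE))
)
impl_summary
```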
Related to this topic is how these values will be stored, and how imputation will occur in naniar. This is related to njtierney/naniar#35 and njtierney/naniar#28.