njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

Useful missing data data structure and visualisation #165

Open njtierney opened 6 years ago

njtierney commented 6 years ago
library(tidyverse)
library(naniar)

which_are_shadow <- function(data) which(are_shadow(data))

aq_shadow_gather <- airquality %>%
  bind_shadow() %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 264 rows containing non-finite values (stat_density).


# and now only showing the variables that contain missings

aq_shadow_gather <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 88 rows containing non-finite values (stat_density).

Created on 2018-05-23 by the reprex package (v0.2.0).

njtierney commented 6 years ago
library(tidyverse)
library(naniar)
shadow_gather <- function(shadow_data){

  shadow_data %>%
    tidyr::gather(key = "variable",
                  value = "value",
                  -which_are_shadow(.)) %>%
    tidyr::gather(key = "variable_NA",
                  value = "value_NA",
                  which_are_shadow(.))
}

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_gather(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

ggplot(gathered_ocean_imp_mean,
       aes(x = value,
           fill = value_NA)) + 
  geom_histogram() +
  facet_grid(variable ~ variable_NA,
             scales = "free_x",
             switch = "y")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

Some notes on implementation

naming

The function should be named gather_shadow. A function with that name already exists, but it is rarely used. The new class system defined in #189 would be very helpful in overcoming this, which brings us to the next point.

Methods

gather_shadow should have nabular, data.frame, and shadow methods.
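
As a rough illustration of what per-class methods could look like, here is a minimal S3 dispatch sketch. The generic name gather_shadow_sketch is hypothetical (chosen so it does not clash with naniar), the class names "nabular" and "shadow" are taken from the note above, and the method bodies are placeholders that only report which method ran. How the real classes are constructed is an assumption here.

```r
# Hypothetical generic; suffixed with _sketch so it does not clash with naniar.
gather_shadow_sketch <- function(data, ...) {
  UseMethod("gather_shadow_sketch")
}

# Plain data frame: assumed to have no shadow columns yet.
gather_shadow_sketch.data.frame <- function(data, ...) {
  "data.frame method: bind the shadow, then gather"
}

# Assumed: an object holding shadow columns only.
gather_shadow_sketch.shadow <- function(data, ...) {
  "shadow method: gather the shadow columns only"
}

# Assumed: data and shadow columns side by side.
gather_shadow_sketch.nabular <- function(data, ...) {
  "nabular method: gather data and shadow columns separately"
}

# Toy objects; the class attributes are set by hand purely for illustration.
df  <- data.frame(x = c(1, NA, 3))
shd <- structure(df, class = c("shadow", class(df)))
nab <- structure(df, class = c("nabular", class(df)))

gather_shadow_sketch(df)
gather_shadow_sketch(shd)
gather_shadow_sketch(nab)
```

Because both toy classes sit in front of "data.frame" in the class vector, the data.frame method also acts as the fallback for anything without a more specific method.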

Options for extra variables

There should be options to leave certain variables in the data frame untouched, for example the any_missing column that is created by add_label_shadow. This would involve accepting ..., quoting that input, and adding it to the end of the gather statements.
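
A minimal sketch of that idea, under some assumptions: the function name gather_keep_sketch is hypothetical, shadow columns are detected by an "_NA" suffix rather than by naniar's own helpers, and tidyr::pivot_longer() stands in for gather(). The columns passed through ... are resolved with dplyr::select() and then excluded from the first gathering step.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: gather data and shadow columns, leaving the columns
# named in `...` untouched.
gather_keep_sketch <- function(shadow_data, ...) {
  # Assumption: shadow columns are exactly those ending in "_NA".
  shadow_cols <- names(shadow_data)[endsWith(names(shadow_data), "_NA")]
  # Resolve `...` to column names using tidyselect semantics.
  keep_cols <- names(select(shadow_data, ...))

  shadow_data %>%
    pivot_longer(cols = -all_of(c(shadow_cols, keep_cols)),
                 names_to = "variable",
                 values_to = "value") %>%
    pivot_longer(cols = all_of(shadow_cols),
                 names_to = "variable_NA",
                 values_to = "value_NA")
}

# Toy data built by hand so the example does not depend on naniar.
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(NA, 5, 6),
  x_NA = c("!NA", "NA", "!NA"),
  y_NA = c("NA", "!NA", "!NA"),
  any_missing = c("Missing", "Missing", "Not Missing"),
  stringsAsFactors = FALSE
)

kept <- gather_keep_sketch(toy, any_missing)
kept
```

In this sketch any_missing is carried along as an ordinary column, so it stays available for faceting or filtering after the reshape.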

Notes on the visualisation method

I spent a while trying to NOT use facet_grid - but you need to, otherwise you combine the different datasets.

This smells like a bit of a leaky abstraction.

There should be a nice way to get only the variables and their imputed values into shape for this kind of visualisation. This means getting the visualisations on the diagonal - doing a filter where variable == variable_NA.

Some work so far on this:

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  mutate(temp = paste0(variable, "_NA")) %>%
  # keep only the diagonal: each variable paired with its own shadow column
  filter(variable_NA == temp)
njtierney commented 6 years ago

OK so here is the progress on this:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  filter(variable_NA == paste0(variable,"_NA")) %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

I think the abstraction here would be to specify the variables that you want to focus on, and have the function filter the gathered data down to those.
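
One way that abstraction might look, as a sketch only: the name shadow_long_sketch is hypothetical and is not the shadow_long function used above, shadow columns are again assumed to be those ending in "_NA", and tidyr::pivot_longer() stands in for gather(). When variables are supplied through ..., the result is filtered to those variables and to the rows where each one is paired with its own shadow column.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: reshape to long form, optionally focusing on the
# variables named in `...`.
shadow_long_sketch <- function(shadow_data, ...) {
  # Assumption: shadow columns are exactly those ending in "_NA".
  shadow_cols <- names(shadow_data)[endsWith(names(shadow_data), "_NA")]
  focus <- names(select(shadow_data, ...))

  long <- shadow_data %>%
    pivot_longer(cols = -all_of(shadow_cols),
                 names_to = "variable",
                 values_to = "value") %>%
    pivot_longer(cols = all_of(shadow_cols),
                 names_to = "variable_NA",
                 values_to = "value_NA")

  # With no focus variables, return the full long data.
  if (length(focus) == 0) {
    return(long)
  }

  # Keep the focus variables, each paired only with its own shadow column.
  long %>%
    filter(variable %in% focus,
           variable_NA == paste0(variable, "_NA"))
}

# Toy data built by hand; z has no missing values and no shadow column.
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(NA, 5, 6),
  z = c(7, 8, 9),
  x_NA = c("!NA", "NA", "!NA"),
  y_NA = c("NA", "!NA", "!NA"),
  stringsAsFactors = FALSE
)

all_long <- shadow_long_sketch(toy)
x_only   <- shadow_long_sketch(toy, x)
x_and_y  <- shadow_long_sketch(toy, x, y)
```

The filtered output can go straight into facet_wrap(~variable_NA), since each remaining row already sits on the diagonal.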

njtierney commented 6 years ago

Actually, I just added that filtering step to the shadow_long function. This abstracts away a nice chunk of the code:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

gathered_ocean_imp_mean %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).