Open njtierney opened 6 years ago
library(tidyverse)
library(naniar)

shadow_gather <- function(shadow_data){
  shadow_data %>%
    tidyr::gather(key = "variable",
                  value = "value",
                  -which_are_shadow(.)) %>%
    tidyr::gather(key = "variable_NA",
                  value = "value_NA",
                  which_are_shadow(.))
}

ocean_imp_mean <- oceanbuoys %>%
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_gather(ocean_imp_mean)
gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>
#>  1 year      1997 sea_temp_c_NA !NA
#>  2 year      1997 sea_temp_c_NA !NA
#>  3 year      1997 sea_temp_c_NA !NA
#>  4 year      1997 sea_temp_c_NA !NA
#>  5 year      1997 sea_temp_c_NA !NA
#>  6 year      1997 sea_temp_c_NA !NA
#>  7 year      1997 sea_temp_c_NA !NA
#>  8 year      1997 sea_temp_c_NA !NA
#>  9 year      1997 sea_temp_c_NA !NA
#> 10 year      1997 sea_temp_c_NA !NA
#> # ... with 17,654 more rows
ggplot(gathered_ocean_imp_mean,
       aes(x = value,
           fill = value_NA)) +
  geom_histogram() +
  facet_grid(variable ~ variable_NA,
             scales = "free_x",
             switch = "y")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2018-08-13 by the reprex package (v0.2.0).
The function name should be gather_shadow. This function already exists, but is rarely used. The new class system defined in #189 would be very helpful here, which brings us to the next point.
gather_shadow should have nabular, data.frame, and shadow methods.
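One way the three methods could be laid out is with S3 dispatch. This is only a sketch: the name gather_shadow_sketch is hypothetical (chosen so it does not mask naniar's existing gather_shadow), the classes are attached by hand rather than by naniar, and the method bodies are placeholders that just report which method ran.

```r
# Hypothetical generic; not part of naniar.
gather_shadow_sketch <- function(data, ...) {
  UseMethod("gather_shadow_sketch")
}

# Data with shadow columns bound on: both halves would be gathered here.
gather_shadow_sketch.nabular <- function(data, ...) {
  "nabular method"
}

# A shadow matrix on its own: only the shadow columns would be gathered.
gather_shadow_sketch.shadow <- function(data, ...) {
  "shadow method"
}

# Fallback for a plain data frame with no shadow information.
gather_shadow_sketch.data.frame <- function(data, ...) {
  "data.frame method"
}

# Toy objects with the classes attached by hand (an assumption for
# illustration; real objects would come from naniar).
df  <- data.frame(x = c(1, NA))
nab <- structure(df, class = c("nabular", "data.frame"))
sha <- structure(df, class = c("shadow", "data.frame"))

gather_shadow_sketch(nab)
gather_shadow_sketch(sha)
gather_shadow_sketch(df)
```

Because dispatch walks the class vector from left to right, the more specific nabular and shadow methods are tried before the data.frame fallback.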
There should be options to leave certain variables in the dataframe untouched, for example the any_missing column that is created by add_label_shadow. This would involve accepting ... as an argument, quoting that input, and adding it to the end of the gather statements.
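A rough sketch of how columns passed through ... could be left untouched. Several things here are assumptions rather than anything from naniar: the function name shadow_gather_keep is hypothetical, shadow columns are detected by an "_NA" name suffix instead of which_are_shadow(), tidyr::pivot_longer() is swapped in for gather(), and the toy data is built by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: gather values and shadow columns into long form,
# leaving any columns named in `...` as they are.
shadow_gather_keep <- function(shadow_data, ...) {
  # Resolve the columns to keep; select() handles the quoting of `...`.
  keep_names <- names(dplyr::select(shadow_data, ...))

  # Assumption: shadow columns are the ones whose names end in "_NA".
  shadow_names <- setdiff(grep("_NA$", names(shadow_data), value = TRUE),
                          keep_names)
  value_names <- setdiff(names(shadow_data), c(shadow_names, keep_names))

  shadow_data %>%
    pivot_longer(all_of(value_names),
                 names_to = "variable",
                 values_to = "value") %>%
    pivot_longer(all_of(shadow_names),
                 names_to = "variable_NA",
                 values_to = "value_NA")
}

# Toy data standing in for the result of bind_shadow() followed by
# add_label_shadow().
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(NA, 5, 6),
  x_NA = factor(c("!NA", "NA", "!NA"), levels = c("!NA", "NA")),
  y_NA = factor(c("NA", "!NA", "!NA"), levels = c("!NA", "NA")),
  any_missing = c("Missing", "Missing", "Not Missing")
)

long <- shadow_gather_keep(toy, any_missing)
long
```

Keeping any_missing out of the gather matters because it is a character column, so it cannot be stacked into the same value column as the numeric variables.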
I spent a while trying to NOT use facet_grid, but you need to, otherwise you combine the different datasets. This smells like a bit of a leaky abstraction.
There should be a nice way to get only the variables and their imputed values into shape for this kind of visualisation. This means getting the visualisations on the diagonal - doing a filter where variable == variable_NA.
Some work so far on this:
gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  mutate(temp = paste0(variable,"_NA")) %>%
  filter(variable == temp)
OK so here is the progress on this:
library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>%
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  filter(variable_NA == paste0(variable,"_NA")) %>%
  ggplot(aes(x = value,
             fill = value_NA)) +
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I think the abstraction here would be to specify the variables that you want to focus on, and have the data filtered down to those.
Actually I just added that filtering step to the shadow_long function. This abstracts away a nice chunk of the code:
library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>%
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)
gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>
#>  1 year      1997 sea_temp_c_NA !NA
#>  2 year      1997 sea_temp_c_NA !NA
#>  3 year      1997 sea_temp_c_NA !NA
#>  4 year      1997 sea_temp_c_NA !NA
#>  5 year      1997 sea_temp_c_NA !NA
#>  6 year      1997 sea_temp_c_NA !NA
#>  7 year      1997 sea_temp_c_NA !NA
#>  8 year      1997 sea_temp_c_NA !NA
#>  9 year      1997 sea_temp_c_NA !NA
#> 10 year      1997 sea_temp_c_NA !NA
#> # ... with 17,654 more rows
gathered_ocean_imp_mean %>%
  ggplot(aes(x = value,
             fill = value_NA)) +
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2018-05-23 by the reprex package (v0.2.0).