njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

Represent relative missingness across time for many variables #254

Open mpaulacaldas opened 4 years ago

mpaulacaldas commented 4 years ago

Hi! Below is an example of a type of plot that I have often used when exploring the missingness pattern of longitudinal data. Basically, it gives a 'big picture' overview of the relative number of missing values per panel (in this case, country) across time and for many (or all) variables. It can be read 'horizontally' to identify the overall time frame where most variable information is available, or 'vertically' to identify variables with little information across time.

I don't know if my use case is very common, but it is somewhat related to #188, and I thought I would share it in case it might inspire a future feature. It is also similar to gg_miss_fct() in that it uses a fill geom to represent relative missingness.

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(tidyr)

who_na_counts <- who %>%
  group_by(year) %>% 
  summarise_at(vars(-c(country:iso3)), ~ sum(is.na(.x))) %>% 
  ungroup() %>% 
  pivot_longer(
    -year, 
    names_to = "variable", 
    values_to = "n_country_missing"
  )
who_na_counts
#> # A tibble: 1,904 x 3
#>     year variable     n_country_missing
#>    <int> <chr>                    <int>
#>  1  1980 new_sp_m014                210
#>  2  1980 new_sp_m1524               210
#>  3  1980 new_sp_m2534               210
#>  4  1980 new_sp_m3544               210
#>  5  1980 new_sp_m4554               210
#>  6  1980 new_sp_m5564               210
#>  7  1980 new_sp_m65                 210
#>  8  1980 new_sp_f014                210
#>  9  1980 new_sp_f1524               210
#> 10  1980 new_sp_f2534               210
#> # … with 1,894 more rows

who_na_counts %>% 
  ggplot(aes(
    x = variable, 
    y = forcats::fct_rev(factor(year)), 
    fill = n_country_missing
  )) +
  geom_raster() +
  labs(fill = "Number of \ncountries with\nmissing values") +
  scale_fill_viridis_c() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
    axis.title = element_blank()
  )

Created on 2020-04-30 by the reprex package (v0.3.0)

njtierney commented 4 years ago

Thank you so much for this!

I think it is roughly the same as gg_miss_fct() in this instance. Also note that you can get the same counts as your code with group_by() and miss_var_summary(), see below.

Although, your code was substantially faster than mine!

So this reminds me to take a look at how I've implemented this as there are some speed gains to be had here!

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(tidyr)
library(naniar)

who_na_counts <- who %>%
  group_by(year) %>% 
  summarise_at(vars(-c(country:iso3)), ~ sum(is.na(.x))) %>% 
  ungroup() %>% 
  pivot_longer(
    -year, 
    names_to = "variable", 
    values_to = "n_country_missing"
  )

who_na_counts
#> # A tibble: 1,904 x 3
#>     year variable     n_country_missing
#>    <int> <chr>                    <int>
#>  1  1980 new_sp_m014                210
#>  2  1980 new_sp_m1524               210
#>  3  1980 new_sp_m2534               210
#>  4  1980 new_sp_m3544               210
#>  5  1980 new_sp_m4554               210
#>  6  1980 new_sp_m5564               210
#>  7  1980 new_sp_m65                 210
#>  8  1980 new_sp_f014                210
#>  9  1980 new_sp_f1524               210
#> 10  1980 new_sp_f2534               210
#> # … with 1,894 more rows

naniar_who_na_counts <- who %>% 
  group_by(year) %>% 
  miss_var_summary()

naniar_who_na_counts
#> # A tibble: 2,006 x 4
#> # Groups:   year [34]
#>     year variable     n_miss pct_miss
#>    <int> <chr>         <int>    <dbl>
#>  1  1980 new_sn_m014     212      100
#>  2  1980 new_sn_m1524    212      100
#>  3  1980 new_sn_m2534    212      100
#>  4  1980 new_sn_m3544    212      100
#>  5  1980 new_sn_m4554    212      100
#>  6  1980 new_sn_m5564    212      100
#>  7  1980 new_sn_m65      212      100
#>  8  1980 new_sn_f014     212      100
#>  9  1980 new_sn_f1524    212      100
#> 10  1980 new_sn_f2534    212      100
#> # … with 1,996 more rows

gg_miss_fct(who, year)

Created on 2020-05-08 by the reprex package (v0.3.0)
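
The two tables above differ in row count (1,904 versus 2,006). The gap of 102 rows matches 3 extra variables over 34 years, which is what you would expect if miss_var_summary() also summarises country, iso2 and iso3, the columns that the summarise_at() call excludes. The sketch below drops those rows, checks that the remaining counts agree, and times both approaches with system.time(). It assumes dplyr 1.0.0 or later for across(); the object names (manual, via_naniar, comparison) are illustrative only.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(naniar)

# Manual counts per year and variable (`who` ships with tidyr).
time_manual <- system.time(
  manual <- who %>%
    group_by(year) %>%
    summarise(across(-c(country:iso3), ~ sum(is.na(.x))), .groups = "drop") %>%
    pivot_longer(-year, names_to = "variable", values_to = "n_country_missing")
)

# naniar counts, minus the identifier columns the manual version skips.
time_naniar <- system.time(
  via_naniar <- who %>%
    group_by(year) %>%
    miss_var_summary() %>%
    ungroup() %>%
    filter(!variable %in% c("country", "iso2", "iso3"))
)

# Line the two up by year and variable, then compare the counts.
comparison <- inner_join(manual, via_naniar, by = c("year", "variable"))
all(comparison$n_country_missing == comparison$n_miss)

# Elapsed seconds for each approach; values depend on machine and versions.
c(manual = time_manual[["elapsed"]], naniar = time_naniar[["elapsed"]])
```

A single system.time() call is a rough measure. Repeating each approach several times would give a steadier comparison.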

A few differences:

In terms of additions to naniar, I could add an option for gg_miss_fct() to show the number of missings (similar to gg_miss_var()), but I'm not sure this counts as a new plot. I think your plot is great, but generalising the differences between your plot and gg_miss_fct() might be a challenge.

Let me know what you think, happy to try and improve gg_miss_fct if you think there is something missing (no pun intended).

Thanks again!

mpaulacaldas commented 4 years ago

Oh I hadn't thought about putting miss_var_summary() with gg_miss_fct()! Thank you so much for the pointer!

Regarding gg_miss_fct(), no need to add an extra argument, I can make do with the % missing. I think I used the count in my example because I was adapting some old code, but generally I find the % missing more useful.
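
For reference, a minimal sketch of the original plot using percentages instead of counts. It assumes dplyr 1.0.0 or later for across(), and the column name pct_country_missing is made up for illustration. Since each row of who within a year is one country, mean(is.na(.x)) gives the share of countries with a missing value.

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(tidyr)

# Percentage of rows (countries) with a missing value, per year and variable.
who_na_pct <- who %>%
  group_by(year) %>%
  summarise(across(-c(country:iso3), ~ 100 * mean(is.na(.x))), .groups = "drop") %>%
  pivot_longer(
    -year,
    names_to = "variable",
    values_to = "pct_country_missing"
  )

# Same layout as the count-based plot: variables on x, years (reversed) on y.
p <- ggplot(who_na_pct, aes(
  x = variable,
  y = forcats::fct_rev(factor(year)),
  fill = pct_country_missing
)) +
  geom_raster() +
  labs(fill = "% of countries\nwith missing\nvalues") +
  scale_fill_viridis_c(limits = c(0, 100)) +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
    axis.title = element_blank()
  )
p
```

Fixing the fill limits to 0 to 100 keeps the colour scale comparable if the same plot is drawn for another dataset.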

Although it wasn't very clear from my original message, I guess what I was trying to say is that it would be nice to see more visualisation options for longitudinal data in naniar (in the form of new gg_miss_*() functions, for example). I understand that this may not be a priority, but in case you decided to develop something in that direction, I thought I would share a graph that I have found useful. I agree though that this particular example is already very similar to the existing functionality of gg_miss_fct(), and not really worth developing.

(P.S. Glad to know my code helped!)