njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

Large tibble with very few non-missings shows 100% missing in miss_var_summary() #284

Closed jzadra closed 1 year ago

jzadra commented 3 years ago

I've been puzzling over why I was seeing a column with 100 percent missing data in a large tibble (~30,000 rows) made up of several tibbles combined with bind_rows(), even though one of the individual tibbles does not show 100% missing for that column.

After a bunch of wild goose chases, I realized that the issue was that miss_var_summary() (and probably similar naniar functions) was rounding the pct_miss column up.

library(tidyverse)
library(naniar)

df <- tibble(x = rep(NA_real_, 30000)) %>% 
  add_row(x = 0)

df %>% miss_var_summary()
#> # A tibble: 1 x 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <dbl>
#> 1 x         30000     100.

df %>% filter(!is.na(x))
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     0

In this example, the percent missing is actually 99.9967%.
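The figure can be checked by hand. The sketch below uses only base R and recomputes the percentage from the counts in the reprex (30,000 missing values out of 30,001 rows); the variable names are illustrative and not part of naniar.

```r
# Counts taken from the reprex above: 30,000 NAs plus one observed value.
n_miss  <- 30000
n_total <- n_miss + 1

# Percent missing, computed directly from the counts.
pct_miss <- 100 * n_miss / n_total
sprintf("%.4f", pct_miss)
#> [1] "99.9967"

# The same number derived from the data itself rather than from the counts.
x <- c(rep(NA_real_, n_miss), 0)
all.equal(100 * mean(is.na(x)), pct_miss)
#> [1] TRUE
```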

All that to say, I'd like to suggest that the edge cases of near-zero and near-100 percent missing not be rounded, to avoid this confusion.

njtierney commented 3 years ago

Thanks for the reprex! I can reproduce this.

This is actually an issue with how tibble prints: if we coerce to a data.frame, we get the full value back.

library(tidyverse)
library(naniar)

df <- tibble(x = rep(NA_real_, 30000)) %>% 
  add_row(x = 0)

df %>% miss_var_summary() 
#> # A tibble: 1 x 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <dbl>
#> 1 x         30000     100.

df %>% miss_var_summary() %>% as.data.frame()
#>   variable n_miss pct_miss
#> 1        x  30000 99.99667

Created on 2021-05-13 by the reprex package (v2.0.0)
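The difference between the two printouts comes down to how many significant figures each print method shows. As a rough base-R sketch (assuming the defaults of 3 significant figures for tibble printing via the `pillar.sigfig` option and 7 for base R via `getOption("digits")`), the stored value is the same in both cases and only its display changes:

```r
# The value that miss_var_summary() stores for the reprex above.
pct_miss <- 100 * 30000 / 30001

# Rounded to 3 significant figures, as tibble printing does by default.
signif(pct_miss, 3)
#> [1] 100

# Formatted with 7 significant figures, the base R default.
format(pct_miss, digits = 7)
#> [1] "99.99667"

# The underlying number is still strictly below 100.
pct_miss < 100
#> [1] TRUE
```

Setting `options(pillar.sigfig = 7)` should bring tibble printing in line with the data.frame output, at the cost of wider columns everywhere.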

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.5 (2021-03-31)
#>  os       macOS Big Sur 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Perth
#>  date     2021-05-13
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.1      2019-03-21 [1] standard (@0.2.1)
#>  backports     1.2.1      2020-12-09 [1] standard (@1.2.1)
#>  broom         0.7.5      2021-02-19 [1] CRAN (R 4.0.2)
#>  cellranger    1.1.0      2016-07-27 [1] standard (@1.1.0)
#>  cli           2.5.0      2021-04-26 [1] CRAN (R 4.0.2)
#>  colorspace    2.0-0      2020-11-11 [1] standard (@2.0-0)
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.2)
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.2)
#>  dbplyr        2.1.0      2021-02-03 [1] CRAN (R 4.0.2)
#>  digest        0.6.27     2020-10-24 [1] standard (@0.6.27)
#>  dplyr       * 1.0.6      2021-05-05 [1] CRAN (R 4.0.2)
#>  ellipsis      0.3.1      2020-05-15 [1] standard (@0.3.1)
#>  evaluate      0.14       2019-05-28 [1] standard (@0.14)
#>  fansi         0.4.2      2021-01-15 [1] CRAN (R 4.0.2)
#>  forcats     * 0.5.1      2021-01-27 [1] CRAN (R 4.0.2)
#>  fs            1.5.0      2020-07-31 [1] standard (@1.5.0)
#>  generics      0.1.0      2020-10-31 [1] standard (@0.1.0)
#>  ggplot2     * 3.3.3      2020-12-30 [1] CRAN (R 4.0.2)
#>  glue          1.4.2      2020-08-27 [1] standard (@1.4.2)
#>  gtable        0.3.0      2019-03-25 [1] standard (@0.3.0)
#>  haven         2.3.1      2020-06-01 [1] standard (@2.3.1)
#>  highr         0.8        2019-03-20 [1] standard (@0.8)
#>  hms           1.0.0      2021-01-13 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)
#>  httr          1.4.2      2020-07-20 [1] standard (@1.4.2)
#>  jsonlite      1.7.2      2020-12-09 [1] standard (@1.7.2)
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.2)
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.2)
#>  lubridate     1.7.10     2021-02-26 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1      2020-11-17 [1] standard (@2.0.1)
#>  modelr        0.1.8      2020-05-19 [1] standard (@0.1.8)
#>  munsell       0.5.0      2018-06-12 [1] standard (@0.5.0)
#>  naniar      * 0.6.0.9000 2020-12-23 [1] local
#>  pillar        1.6.0      2021-04-13 [1] CRAN (R 4.0.2)
#>  pkgconfig     2.0.3      2019-09-22 [1] standard (@2.0.3)
#>  purrr       * 0.3.4      2020-04-17 [1] standard (@0.3.4)
#>  R6            2.5.0      2020-10-28 [1] standard (@2.5.0)
#>  Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.2)
#>  readr       * 1.4.0      2020-10-05 [1] standard (@1.4.0)
#>  readxl        1.3.1      2019-03-13 [1] standard (@1.3.1)
#>  reprex        2.0.0      2021-04-02 [1] CRAN (R 4.0.2)
#>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.7        2021-02-19 [1] CRAN (R 4.0.2)
#>  rstudioapi    0.13       2020-11-12 [1] standard (@0.13)
#>  rvest         1.0.0      2021-03-09 [1] CRAN (R 4.0.2)
#>  scales        1.1.1      2020-05-11 [1] standard (@1.1.1)
#>  sessioninfo   1.1.1      2018-11-05 [1] standard (@1.1.1)
#>  stringi       1.5.3      2020-09-09 [1] standard (@1.5.3)
#>  stringr     * 1.4.0      2019-02-10 [1] standard (@1.4.0)
#>  styler        1.4.1      2021-03-30 [1] CRAN (R 4.0.2)
#>  tibble      * 3.1.1      2021-04-18 [1] CRAN (R 4.0.3)
#>  tidyr       * 1.1.3      2021-03-03 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.0      2020-05-11 [1] standard (@1.1.0)
#>  tidyverse   * 1.3.0      2019-11-21 [1] standard (@1.3.0)
#>  utf8          1.2.1      2021-03-12 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.0.2)
#>  visdat        0.5.3      2019-02-15 [1] CRAN (R 4.0.2)
#>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.0.3)
#>  xfun          0.22       2021-03-11 [1] CRAN (R 4.0.2)
#>  xml2          1.3.2      2020-04-23 [1] standard (@1.3.2)
#>  yaml          2.2.1      2020-02-01 [1] standard (@2.2.1)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```

I'll cross-post this to tibble; this isn't ideal behaviour. Thanks for reporting.

krlmlr commented 3 years ago

Black hat: more power

library(tidyverse)
library(naniar)

N <- 30000000

df <- tibble(x = rep(NA_real_, N)) %>%
  add_row(x = 0)

df %>% miss_var_summary()
#> # A tibble: 1 × 3
#>   variable   n_miss pct_miss
#>   <chr>       <int>    <dbl>
#> 1 x        30000000     100.

df %>%
  miss_var_summary() %>%
  as.data.frame()
#>   variable   n_miss pct_miss
#> 1        x 30000000      100

df %>%
  miss_var_summary() %>%
  mutate(pct_miss = num(pct_miss, digits = trunc(log10(N) + 2)))
#> # A tibble: 1 × 3
#>   variable   n_miss     pct_miss
#>   <chr>       <int>    <num:.9!>
#> 1 x        30000000 99.999996667

df %>%
  miss_var_summary() %>%
  mutate(pct_miss = num(pct_miss, digits = trunc(log10(N) + 2))) %>%
  as.data.frame()
#>   variable   n_miss     pct_miss
#> 1        x 30000000 99.999996667

Created on 2021-08-01 by the reprex package (v2.0.0.9000)

pillar::num() (re-exported as tibble::num()) allows specifying an arbitrary number of digits or significant figures, as in this example.
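To see why a digits setting is needed at all here, and roughly how many decimal places it takes, the base-R sketch below works through the arithmetic for N = 30,000,000. The helper names (`gap`, `min_digits`, `thread_digits`) are illustrative only. It also shows why even the data.frame printout above reports 100: base R's default of 7 significant figures is no longer enough at this size.

```r
N <- 30000000
pct_miss <- 100 * N / (N + 1)

# With 7 significant figures (base R's default) the value rounds to 100.
format(pct_miss, digits = 7)
#> [1] "100"

# Distance between the true percentage and 100, about 3.33e-06.
gap <- 100 / (N + 1)

# Fewest decimal places at which that gap becomes visible.
min_digits <- ceiling(-log10(gap))
min_digits
#> [1] 6

# The rule used in the comment above asks for a few more places than that.
thread_digits <- trunc(log10(N) + 2)
thread_digits
#> [1] 9

# Six decimals separate the value from 100; five do not.
formatC(pct_miss, format = "f", digits = min_digits)
#> [1] "99.999997"
formatC(pct_miss, format = "f", digits = min_digits - 1)
#> [1] "100.00000"
```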

krlmlr commented 3 years ago

You could also store pct_miss as a value between 0 and 1 with num(scale = 100).
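A minimal sketch of that idea, assuming pillar >= 1.6.0 is installed: the proportion is stored as-is and `scale = 100` only affects how it is displayed. The `label` and `digits` values are illustrative choices, not something naniar sets; as far as I can tell from the pillar documentation, `scale` is meant to be combined with `label`.

```r
library(pillar)

# Proportion missing for the large example: a value between 0 and 1.
prop_miss <- 30000000 / 30000001

# Display as a percentage with six decimals; the stored data stay a proportion.
pct_miss <- num(prop_miss, scale = 100, label = "%", digits = 6)
pct_miss

# The underlying value is untouched by the display settings.
vctrs::vec_data(pct_miss) < 1
#> [1] TRUE
```

Keeping the proportion as the stored value means downstream arithmetic does not have to remember whether a column was already multiplied by 100.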

njtierney commented 3 years ago

Oh that's awesome! Thanks so much, @krlmlr - I'll add this feature in the next release.

njtierney commented 1 year ago

OK, so here is the old way:

library(tidyverse)
library(naniar)

N <- 30000000

df <- tibble(x = rep(NA_real_, N)) %>%
  add_row(x = 0)

df %>% miss_var_summary()
#> # A tibble: 1 × 3
#>   variable   n_miss pct_miss
#>   <chr>       <int>    <dbl>
#> 1 x        30000000     100.

df %>%
  miss_var_summary() %>%
  as.data.frame()
#>   variable   n_miss pct_miss
#> 1        x 30000000      100

Created on 2023-04-10 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.3 (2023-03-15)
#>  os       macOS Ventura 13.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Australia/Hobart
#>  date     2023-04-10
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.3      2023-01-25 [1] CRAN (R 4.2.0)
#>  cellranger      1.1.0      2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.6.0      2023-01-09 [1] CRAN (R 4.2.0)
#>  colorspace      2.1-0      2023-01-23 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2      2022-09-29 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3      2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.3.0      2023-01-16 [1] CRAN (R 4.2.0)
#>  digest          0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.1.1      2023-03-22 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi           1.0.4      2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 1.0.0      2023-01-29 [1] CRAN (R 4.2.0)
#>  fs              1.6.1      2023-02-06 [1] CRAN (R 4.2.0)
#>  gargle          1.3.0      2023-01-30 [1] CRAN (R 4.2.0)
#>  generics        0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.1      2023-02-10 [1] CRAN (R 4.2.0)
#>  glue            1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0      2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1      2022-08-13 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1      2022-09-01 [1] CRAN (R 4.2.0)
#>  haven           2.5.1      2022-08-22 [1] CRAN (R 4.2.0)
#>  hms             1.1.2      2022-08-19 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.4      2022-12-07 [1] CRAN (R 4.2.0)
#>  httr            1.4.4      2022-08-17 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.4      2022-12-06 [1] CRAN (R 4.2.0)
#>  knitr           1.42       2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle       1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate       1.9.1      2023-01-24 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr          0.1.10     2022-11-11 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  naniar        * 1.0.0.9000 2023-04-10 [1] local
#>  pillar          1.8.1      2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache         0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3      2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1      2022-08-17 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.1.0      2023-03-14 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.20       2023-01-19 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  rvest           1.0.3      2022-08-19 [1] CRAN (R 4.2.0)
#>  scales          1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi         1.7.12     2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr       * 1.5.0      2022-12-02 [1] CRAN (R 4.2.0)
#>  styler          1.9.0      2023-01-15 [1] CRAN (R 4.2.0)
#>  tibble        * 3.2.1      2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.3.0      2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse     * 1.3.2      2022-07-18 [1] CRAN (R 4.2.0)
#>  timechange      0.2.0      2023-01-11 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0      2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8            1.2.3      2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs           0.6.1      2023-03-22 [1] CRAN (R 4.2.0)
#>  visdat          0.6.0      2023-02-02 [1] local
#>  withr           2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.37       2023-01-31 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3      2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.7      2023-01-23 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

And the new way:

library(tidyverse)
library(naniar)

N <- 30000000

df <- tibble(x = rep(NA_real_, N)) %>%
  add_row(x = 0)

df %>% miss_var_summary()
#> # A tibble: 1 × 3
#>   variable   n_miss pct_miss
#>   <chr>       <int>    <num>
#> 1 x        30000000     100.
df %>% miss_var_summary(digits = 6)
#> # A tibble: 1 × 3
#>   variable   n_miss  pct_miss
#>   <chr>       <int> <num:.6!>
#> 1 x        30000000 99.999997

Created on 2023-04-10 with reprex v2.0.2
