Closed jzadra closed 1 year ago
Thanks for the reprex! I can reproduce this.
This is actually an issue with how tibble prints: if we coerce to a data.frame, we get the value back out.
library(tidyverse)
library(naniar)
df <- tibble(x = rep(NA_real_, 30000)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 x 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000 100.
df %>% miss_var_summary() %>% as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000 99.99667
Created on 2021-05-13 by the reprex package (v2.0.0)
I'll cross-post this to tibble, since this isn't ideal behaviour. Thanks for reporting.
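For reference, the same rounding can be reproduced in base R without naniar or tibble. This is only a sketch, and it assumes tibble's default of three significant figures (the pillar.sigfig option) when printing doubles:

```r
# One non-missing value among 30001 rows, as in the reprex above
pct_miss <- 30000 / 30001 * 100

# Three significant figures collapse the value to 100
signif(pct_miss, 3)
#> [1] 100

# Seven significant digits (the data.frame print default) keep the difference
format(pct_miss, digits = 7)
#> [1] "99.99667"
```

The stored value is never changed; only the printed representation differs.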
library(tidyverse)
library(naniar)
N <- 30000000
df <- tibble(x = rep(NA_real_, N)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000000 100.
df %>%
miss_var_summary() %>%
as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000000 100
df %>%
miss_var_summary() %>%
mutate(pct_miss = num(pct_miss, digits = trunc(log10(N) + 2)))
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <num:.9!>
#> 1 x 30000000 99.999996667
df %>%
miss_var_summary() %>%
mutate(pct_miss = num(pct_miss, digits = trunc(log10(N) + 2))) %>%
as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000000 99.999996667
Created on 2021-08-01 by the reprex package (v2.0.0.9000)
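The digits expression in the mutate() call grows with the row count, so the number of decimals shown keeps up with larger data. A base R sketch of the same arithmetic, with formatC() standing in for num() purely for illustration (this is an assumption, not how tibble formats internally):

```r
# Decimals suggested by the expression used in the reprex
digits_for <- function(n) trunc(log10(n) + 2)

n <- c(30000, 30000000)
pct_miss <- n / (n + 1) * 100   # one non-missing row added to n missing rows

digits_for(n)
#> [1] 6 9

# Fixed-notation formatting with that many decimals
shown <- mapply(
  function(p, d) formatC(p, digits = d, format = "f"),
  pct_miss, digits_for(n)
)
shown
#> [1] "99.996667"    "99.999996667"
```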
pillar::num() (reexported as tibble::num()) allows specifying arbitrary digits or significant figures, as in this specific example.
You could also store pct_miss as a value between 0 and 1 and display it with num(scale = 100).
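A minimal sketch of the scale option, assuming the proportion is computed by hand rather than by miss_var_summary(). The prop_miss column name and the p variable are made up for this example, and label = "%" is added so the column header shows that a multiplier is applied:

```r
library(tibble)

# Proportion missing stored on the 0-1 scale
p <- 30000000 / 30000001

# scale = 100 only affects how the column is displayed;
# the stored value stays between 0 and 1
summary_tbl <- tibble(
  variable = "x",
  prop_miss = num(p, digits = 6, scale = 100, label = "%")
)

summary_tbl
```

Keeping the stored value on the 0-1 scale means downstream arithmetic does not need to undo a multiplication by 100.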
Oh that's awesome! Thanks so much, @krlmlr - I'll add this feature in the next release.
OK, so here is the old way:
library(tidyverse)
library(naniar)
N <- 30000000
df <- tibble(x = rep(NA_real_, N)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000000 100.
df %>%
miss_var_summary() %>%
as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000000 100
Created on 2023-04-10 with reprex v2.0.2
And the new way:
library(tidyverse)
library(naniar)
N <- 30000000
df <- tibble(x = rep(NA_real_, N)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <num>
#> 1 x 30000000 100.
df %>% miss_var_summary(digits = 6)
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <num:.6!>
#> 1 x 30000000 99.999997
Created on 2023-04-10 with reprex v2.0.2
I've been puzzling over why I was seeing a column with 100 percent missing data in a large tibble (~30,000 rows) made up of several tibbles combined with bind_rows(), despite one of the individual tibbles not showing 100% missing for that column.
After a bunch of wild goose chases, I realized that miss_var_summary() (and probably similar naniar functions) was rounding the pct_miss column up. In this example, the percent missing is actually 99.9967%.
All that to say, I'd like to suggest that pct_miss not be rounded in the edge cases of near-zero and near-100 percent missing, to avoid this confusion.
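One way to picture the suggestion is the base R sketch below. format_pct() is a hypothetical helper, not an existing naniar or tibble function: it adds decimals until a value strictly between 0 and 100 no longer displays as 0 or 100.

```r
# Hypothetical helper: format percentages so that inexact values
# never display as exactly 0 or 100.
format_pct <- function(pct, digits = 1, max_digits = 12) {
  vapply(pct, function(p) {
    d <- digits
    out <- formatC(p, digits = d, format = "f")
    # Add decimals while the rounded text sits on a boundary
    # that the underlying value has not actually reached
    while (!is.na(p) && d < max_digits && p > 0 && p < 100 &&
           as.numeric(out) %in% c(0, 100)) {
      d <- d + 1
      out <- formatC(p, digits = d, format = "f")
    }
    out
  }, character(1))
}

format_pct(c(0, 50, 100 * 30000 / 30001, 100))
#> [1] "0.0"    "50.0"   "99.997" "100.0"
```

Exact 0 and 100 keep the default number of decimals, so only the near-boundary cases get extra precision.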