njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

miss_var_summary returns the wrong percentage #255

Closed njtierney closed 1 year ago

njtierney commented 4 years ago

For example:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(naniar)
airquality %>% 
  group_by(Month) %>% 
  miss_var_summary()
#> # A tibble: 25 x 4
#> # Groups:   Month [5]
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <dbl>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0  
#>  4     5 Temp          0      0  
#>  5     5 Day           0      0  
#>  6     6 Ozone        21     70  
#>  7     6 Solar.R       0      0  
#>  8     6 Wind          0      0  
#>  9     6 Temp          0      0  
#> 10     6 Day           0      0  
#> # … with 15 more rows

Created on 2020-05-13 by the reprex package (v0.3.0)

It should instead be:

# A tibble: 25 x 4
   Month variables n_miss pct_miss
   <int> <chr>      <int>    <dbl>
 1     5 Ozone          5     3.27
 2     5 Solar.R        4     2.61
 3     5 Wind           0     0   
 4     5 Temp           0     0   
 5     5 Day            0     0   
 6     6 Ozone         21    13.7 
 7     6 Solar.R        0     0   
 8     6 Wind           0     0   
 9     6 Temp           0     0   
10     6 Day            0     0   

njtierney commented 4 years ago

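The discrepancy comes from pct_miss being computed relative to the number of rows in each group, not the total number of rows in the data. For Month 5, Ozone has 5 missing values out of 31 rows (16.1%), whereas 5 out of all 153 rows of airquality is 3.27%. I'm not sure whether this is actually a problem.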
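To make the two denominators concrete, here is a minimal base R sketch for the Ozone column only. It does not call naniar; the object names (`n_miss_by_month`, `rows_by_month`, `pct_within_group`, `pct_of_total`) are illustrative and not part of any package.

```r
# Count missing Ozone values and rows per Month (airquality ships with base R).
n_miss_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)
rows_by_month   <- as.vector(table(airquality$Month))

# Denominator 1: rows within each group (matches the reprex output above).
pct_within_group <- 100 * n_miss_by_month / rows_by_month

# Denominator 2: all rows in the data (matches the "should instead be" table).
pct_of_total <- 100 * n_miss_by_month / nrow(airquality)

# Compare the two for Months 5 and 6.
round(pct_within_group[c("5", "6")], 1)
round(pct_of_total[c("5", "6")], 2)
```

Which denominator is right depends on the question being asked: the within-group figure describes how incomplete each month is, while the whole-data figure describes each month's share of the overall missingness.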

njtierney commented 1 year ago

I think that this is fine.