ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

skimr only shows the first three characters of a factor level when showing the top_counts #307

Closed AndreaPi closed 6 years ago

AndreaPi commented 6 years ago

Hi all,

not sure if this is intended or not: I couldn't find it in the documentation, but I may have missed it. When skimming a factor, the skimmer top_counts shows only the first three characters of the levels with the top count. This isn't helpful when the first three characters are the same for all levels, and can lead to confusion. See:

library(skimr)
#> Warning: package 'skimr' was built under R version 3.4.4

foo <- structure(c(33L, 1L, 5L, 27L, 18L, 20L, 31L, 7L, 25L, 6L, 2L, 
                   11L, 11L, 12L, 2L, 36L, 8L, 32L, 22L, 26L, 26L, 18L, 11L, 4L, 
                   21L, 26L, 20L, 1L, 5L, 36L, 28L, 21L, 22L, 37L, 36L, 30L, 14L, 
                   36L, 13L, 7L, 21L, 8L, 33L, 24L, 4L, 1L, 34L, 18L, 17L, 27L, 
                   24L, 24L, 23L, 31L, 19L, 6L, 13L, 20L, 22L, 14L, 23L, 16L, 23L, 
                   31L, 16L, 1L, 35L, 24L, 33L, 35L, 9L, 27L, 4L, 18L, 10L, 30L, 
                   29L, 18L, 18L, 37L, 21L, 15L, 2L, 28L, 17L, 24L, 18L, 10L, 2L, 
                   3L, 31L, 35L, 9L, 28L, 27L, 1L, 23L, 21L, 34L, 25L), 
                 .Label = c("zb-025", "ZB-048", "zb-051", "ZB-053", "zb-060", 
                            "zb-064", "ZB-080", "ZB-092", "ZB-101", "ZB-104", 
                            "ZB-106", "ZB-136", "zb-147", "ZB-155", "ZB-156",
                            "ZB-158", "zb-175", "zb-182", "ZB-188", "ZB-198", 
                            "zb-205", "ZB-216", "ZB-224", "Zb-228", "ZB-238", 
                            "ZB-240", "ZB-255", "ZB-259", "ZB-262", "ZB-264", 
                            "ZB-269", "ZB-275", "ZB-277", "ZB-282", "ZB-309", 
                            "zb-355", "zb-361"), class = "factor")

skim(foo)
#> Skim summary statistics
#> 
#> Variable type: factor 
#>  variable missing complete   n n_unique                     top_counts
#>       foo       0      100 100       37 zb-: 7, zb-: 5, zb-: 5, Zb-: 5
#>  ordered
#>    FALSE
sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 16299)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
#> [3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Italian_Italy.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bindrcpp_0.2 skimr_1.0.1 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.14     knitr_1.20       bindr_0.1        magrittr_1.5    
#>  [5] tidyselect_0.2.2 R6_2.2.2         rlang_0.2.0.9000 stringr_1.2.0   
#>  [9] dplyr_0.7.4      tools_3.4.3      htmltools_0.3.6  yaml_2.1.14     
#> [13] rprojroot_1.2    digest_0.6.13    assertthat_0.2.0 tibble_1.4.1    
#> [17] purrr_0.2.3      tidyr_0.7.2      glue_1.1.1       evaluate_0.10.1 
#> [21] rmarkdown_1.8    stringi_1.1.6    compiler_3.4.3   pander_0.6.1    
#> [25] pillar_1.0.1     backports_1.1.1  pkgconfig_2.0.1

As you can see, at first glance one could think the top counts are zb-: 7, zb-: 5, zb-: 5 and Zb-: 5. Instead, these are the first three characters of the top-count levels, followed by their counts. However, since the first three characters are the same for all levels, the skim summary doesn't let me see which levels have the top counts. What about adding an option that lets me choose whether level names in top_counts are shown as abbreviations or at full length? The default could be abbreviations, if you are opinionated about that, but I would be able to choose.

Of course, since level names can be arbitrarily long, this would mean one or more line breaks could appear in the skim output. That doesn't seem like a problem to me. What do you think?
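
To make the request concrete, here is a minimal base R sketch of the difference between abbreviated and full-length labels. It does not use skimr internals; the small factor `x` and the variables `full` and `short` are made up for illustration, and the "level: count" formatting only imitates the look of top_counts.

```r
# Hypothetical small factor whose levels all share the prefix "zb-"
x <- factor(c("zb-025", "zb-025", "zb-025", "zb-051", "zb-051", "zb-060"))

# Count each level and keep the two most frequent ones
counts <- sort(table(x), decreasing = TRUE)
top <- head(counts, 2)

# Full-length labels: each level stays identifiable
full <- paste0(names(top), ": ", as.integer(top), collapse = ", ")

# Labels cut to three characters: every level collapses to "zb-"
short <- paste0(substr(names(top), 1, 3), ": ", as.integer(top), collapse = ", ")

print(full)   # "zb-025: 3, zb-051: 2"
print(short)  # "zb-: 3, zb-: 2"
```

With the abbreviated form, the counts survive but the levels they belong to cannot be told apart, which is the situation in the skim output above.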

michaelquinn32 commented 6 years ago

Hi Andrea!

That option exists when you customize formatting. https://github.com/ropenscilabs/skimr/blob/3605be99e6a170e57bbe75bda992c6b1c13bdcfc/R/formats.R#L12

Best wishes, Michael

AndreaPi commented 6 years ago

Thanks a lot @michaelquinn32 ! However, either skim_format has some issues or I haven't understood how to use it. According to the help, this should show 4 characters in factor levels:

# Show 4-character names in factor levels
skim_format(.levels = list(nchar = 4))

However, I get an error:

library(skimr)
#> Warning: package 'skimr' was built under R version 3.4.4

foo <- structure(c(33L, 1L, 5L, 27L, 18L, 20L, 31L, 7L, 25L, 6L, 2L, 
                   11L, 11L, 12L, 2L, 36L, 8L, 32L, 22L, 26L, 26L, 18L, 11L, 4L, 
                   21L, 26L, 20L, 1L, 5L, 36L, 28L, 21L, 22L, 37L, 36L, 30L, 14L, 
                   36L, 13L, 7L, 21L, 8L, 33L, 24L, 4L, 1L, 34L, 18L, 17L, 27L, 
                   24L, 24L, 23L, 31L, 19L, 6L, 13L, 20L, 22L, 14L, 23L, 16L, 23L, 
                   31L, 16L, 1L, 35L, 24L, 33L, 35L, 9L, 27L, 4L, 18L, 10L, 30L, 
                   29L, 18L, 18L, 37L, 21L, 15L, 2L, 28L, 17L, 24L, 18L, 10L, 2L, 
                   3L, 31L, 35L, 9L, 28L, 27L, 1L, 23L, 21L, 34L, 25L), 
                 .Label = c("zb-025", "ZB-048", "zb-051", "ZB-053", "zb-060", 
                            "zb-064", "ZB-080", "ZB-092", "ZB-101", "ZB-104", 
                            "ZB-106", "ZB-136", "zb-147", "ZB-155", "ZB-156",
                            "ZB-158", "zb-175", "zb-182", "ZB-188", "ZB-198", 
                            "zb-205", "ZB-216", "ZB-224", "Zb-228", "ZB-238", 
                            "ZB-240", "ZB-255", "ZB-259", "ZB-262", "ZB-264", 
                            "ZB-269", "ZB-275", "ZB-277", "ZB-282", "ZB-309", 
                            "zb-355", "zb-361"), class = "factor")

skim_format(.levels = list(nchar = 4))
skim(foo)
#> Error in substr(names(x), 1, options$formats$.levels$max_char): invalid substring arguments

sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 16299)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
#> [3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Italian_Italy.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] skimr_1.0.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.14     assertthat_0.2.0 dplyr_0.7.4      digest_0.6.13   
#>  [5] rprojroot_1.2    R6_2.2.2         backports_1.1.1  magrittr_1.5    
#>  [9] evaluate_0.10.1  pillar_1.0.1     rlang_0.2.0.9000 stringi_1.1.6   
#> [13] bindrcpp_0.2     rmarkdown_1.8    tools_3.4.3      stringr_1.2.0   
#> [17] pander_0.6.1     glue_1.1.1       purrr_0.2.3      yaml_2.1.14     
#> [21] compiler_3.4.3   pkgconfig_2.0.1  htmltools_0.3.6  bindr_0.1       
#> [25] knitr_1.20       tidyselect_0.2.2 tibble_1.4.1

michaelquinn32 commented 6 years ago

Hi Andrea!

Instead, try

skim_format(.levels = list(max_char = 4))

Here's an example:

library(skimr)
skim_format(.levels = list(max_char = 4))
skim(iris, Species)
#> Skim summary statistics
#>  n obs: 150 
#>  n variables: 5 
#> 
#> Variable type: factor 
#>  variable missing complete   n n_unique
#>   Species       0      150 150        3
#>                           top_counts ordered
#>  seto: 50, vers: 50, virg: 50, NA: 0   FALSE

Best wishes, Michael
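
For readers wondering why the `nchar` spelling produced "invalid substring arguments", one plausible reading of the error message is that the formatter looks up `max_char`, finds nothing, and passes `NULL` to `substr()`. The base R sketch below reproduces that behaviour under this assumption; `opts` is a hypothetical stand-in for the stored format options, not skimr's actual internal object.

```r
# Hypothetical stand-in for the stored options after
# skim_format(.levels = list(nchar = 4)): the key is "nchar", not "max_char"
opts <- list(.levels = list(nchar = 4))

# Looking up a key that is not in the list returns NULL
stop_at <- opts$.levels$max_char
print(is.null(stop_at))

# substr() rejects a NULL stop position, so capture the error message
res <- tryCatch(
  substr("zb-025", 1, stop_at),
  error = function(e) conditionMessage(e)
)
print(res)

# With the key that the error message refers to, truncation works
opts_ok <- list(.levels = list(max_char = 4))
print(substr("zb-025", 1, opts_ok$.levels$max_char))  # "zb-0"
```

If this reading is right, any misspelled key inside `.levels` would fail the same way, so it is worth checking the option names against the formats source linked above.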