ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Printing more rows than default max has unexpected behavior #734

Open szimmer opened 1 year ago

szimmer commented 1 year ago

When a summary is long, I sometimes append print(n=bignumber) so that every row prints, but this doesn't behave as I would expect. Below, I would expect Example 1 and Example 2 to print the same thing, but they don't. This matters more when there are even more rows, as in Examples 3 and 4. It seems n is being ignored altogether.

Here's a reprex to show:

library(skimr)
#> Warning: package 'skimr' was built under R version 4.2.3
library(tidyverse)

sum_cut <- diamonds %>% group_by(cut) %>% skim(carat) %>% yank("numeric")

sum_cut_color <- diamonds %>% group_by(cut, color) %>% skim(carat) %>% yank("numeric")

sum_cut #example 1

Variable type: numeric

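| skim_variable | cut | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| carat | Fair | 0 | 1 | 1.05 | 0.52 | 0.22 | 0.70 | 1.00 | 1.20 | 5.01 | ▇▂▁▁▁ |
| carat | Good | 0 | 1 | 0.85 | 0.45 | 0.23 | 0.50 | 0.82 | 1.01 | 3.01 | ▇▆▂▁▁ |
| carat | Very Good | 0 | 1 | 0.81 | 0.46 | 0.20 | 0.41 | 0.71 | 1.02 | 4.00 | ▇▃▁▁▁ |
| carat | Premium | 0 | 1 | 0.89 | 0.52 | 0.20 | 0.41 | 0.86 | 1.20 | 4.01 | ▇▆▁▁▁ |
| carat | Ideal | 0 | 1 | 0.70 | 0.43 | 0.20 | 0.35 | 0.54 | 1.01 | 3.50 | ▇▂▁▁▁ |
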
sum_cut %>% print(n=50) # example 2
#> 
#> ── Variable type: numeric ──────────────────────────────────────────────────────
#>   skim_variable cut    n_mis…¹ compl…²  mean    sd   p0  p25  p50  p75 p100 hist
#> 1 carat         Fair         0       1 1.05  0.516 0.22 0.7  1    1.2  5.01 ▇▂▁…
#> 2 carat         Good         0       1 0.849 0.454 0.23 0.5  0.82 1.01 3.01 ▇▆▂…
#> 3 carat         Very …       0       1 0.806 0.459 0.2  0.41 0.71 1.02 4    ▇▃▁…
#> 4 carat         Premi…       0       1 0.892 0.515 0.2  0.41 0.86 1.2  4.01 ▇▆▁…
#> 5 carat         Ideal        0       1 0.703 0.433 0.2  0.35 0.54 1.01 3.5  ▇▂▁…
#> # … with abbreviated variable names ¹​n_missing, ²​complete_rate

sum_cut_color # example 3

Variable type: numeric

| skim_variable | cut | color | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| carat | Fair | D | 0 | 1 | 0.92 | 0.41 | 0.25 | 0.70 | 0.90 | 1.01 | 3.40 | ▆▇▁▁▁ |
| carat | Fair | E | 0 | 1 | 0.86 | 0.36 | 0.22 | 0.55 | 0.90 | 1.01 | 2.04 | ▇▇▇▂▁ |
| carat | Fair | F | 0 | 1 | 0.90 | 0.42 | 0.25 | 0.60 | 0.90 | 1.01 | 2.58 | ▇▇▂▁▁ |
| carat | Fair | G | 0 | 1 | 1.02 | 0.49 | 0.23 | 0.70 | 0.98 | 1.07 | 2.60 | ▅▇▂▂▁ |
| carat | Fair | H | 0 | 1 | 1.22 | 0.55 | 0.33 | 0.90 | 1.01 | 1.51 | 4.13 | ▇▃▂▁▁ |
| carat | Fair | I | 0 | 1 | 1.20 | 0.52 | 0.41 | 0.88 | 1.01 | 1.50 | 3.02 | ▇▇▃▂▁ |
| carat | Fair | J | 0 | 1 | 1.34 | 0.73 | 0.30 | 0.90 | 1.03 | 1.69 | 5.01 | ▇▃▁▁▁ |
| carat | Good | D | 0 | 1 | 0.74 | 0.36 | 0.23 | 0.42 | 0.70 | 1.00 | 2.04 | ▇▅▅▁▁ |
| carat | Good | E | 0 | 1 | 0.75 | 0.38 | 0.23 | 0.41 | 0.70 | 1.00 | 3.00 | ▇▅▁▁▁ |
| carat | Good | F | 0 | 1 | 0.78 | 0.37 | 0.23 | 0.49 | 0.71 | 1.01 | 2.67 | ▇▆▁▁▁ |
| carat | Good | G | 0 | 1 | 0.85 | 0.43 | 0.23 | 0.50 | 0.90 | 1.01 | 2.80 | ▇▇▂▁▁ |
| carat | Good | H | 0 | 1 | 0.91 | 0.50 | 0.25 | 0.51 | 0.90 | 1.09 | 3.01 | ▇▇▂▁▁ |
| carat | Good | I | 0 | 1 | 1.06 | 0.58 | 0.30 | 0.70 | 1.00 | 1.50 | 3.01 | ▇▆▃▂▁ |
| carat | Good | J | 0 | 1 | 1.10 | 0.54 | 0.28 | 0.71 | 1.02 | 1.50 | 3.00 | ▇▇▅▂▁ |
| carat | Very Good | D | 0 | 1 | 0.70 | 0.37 | 0.23 | 0.40 | 0.61 | 1.00 | 2.58 | ▇▅▁▁▁ |
| carat | Very Good | E | 0 | 1 | 0.68 | 0.38 | 0.20 | 0.37 | 0.57 | 0.94 | 2.51 | ▇▆▁▁▁ |
| carat | Very Good | F | 0 | 1 | 0.74 | 0.39 | 0.23 | 0.40 | 0.70 | 1.01 | 2.48 | ▇▇▂▁▁ |
| carat | Very Good | G | 0 | 1 | 0.77 | 0.42 | 0.23 | 0.40 | 0.70 | 1.02 | 2.52 | ▇▆▂▁▁ |
| carat | Very Good | H | 0 | 1 | 0.92 | 0.50 | 0.23 | 0.47 | 0.90 | 1.20 | 3.00 | ▇▇▂▁▁ |
| carat | Very Good | I | 0 | 1 | 1.05 | 0.55 | 0.24 | 0.70 | 1.00 | 1.50 | 4.00 | ▇▆▂▁▁ |
| carat | Very Good | J | 0 | 1 | 1.13 | 0.56 | 0.24 | 0.71 | 1.06 | 1.51 | 2.74 | ▇▇▆▃▁ |
| carat | Premium | D | 0 | 1 | 0.72 | 0.40 | 0.20 | 0.40 | 0.58 | 1.01 | 2.57 | ▇▅▂▁▁ |
| carat | Premium | E | 0 | 1 | 0.72 | 0.41 | 0.20 | 0.38 | 0.58 | 1.00 | 3.05 | ▇▃▁▁▁ |
| carat | Premium | F | 0 | 1 | 0.83 | 0.42 | 0.20 | 0.43 | 0.76 | 1.04 | 3.01 | ▇▆▂▁▁ |
| carat | Premium | G | 0 | 1 | 0.84 | 0.48 | 0.23 | 0.40 | 0.76 | 1.12 | 3.01 | ▇▆▂▁▁ |
| carat | Premium | H | 0 | 1 | 1.02 | 0.54 | 0.23 | 0.51 | 1.01 | 1.30 | 3.24 | ▇▇▃▁▁ |
| carat | Premium | I | 0 | 1 | 1.14 | 0.61 | 0.23 | 0.59 | 1.14 | 1.54 | 4.01 | ▇▇▃▁▁ |
| carat | Premium | J | 0 | 1 | 1.29 | 0.61 | 0.30 | 0.81 | 1.25 | 1.70 | 4.01 | ▇▇▃▁▁ |
| carat | Ideal | D | 0 | 1 | 0.57 | 0.30 | 0.20 | 0.33 | 0.50 | 0.71 | 2.75 | ▇▂▁▁▁ |
| carat | Ideal | E | 0 | 1 | 0.58 | 0.31 | 0.20 | 0.33 | 0.50 | 0.72 | 2.28 | ▇▂▁▁▁ |
| carat | Ideal | F | 0 | 1 | 0.66 | 0.37 | 0.23 | 0.35 | 0.53 | 0.90 | 2.45 | ▇▃▁▁▁ |
| carat | Ideal | G | 0 | 1 | 0.70 | 0.41 | 0.23 | 0.34 | 0.54 | 1.03 | 2.54 | ▇▃▂▁▁ |
| carat | Ideal | H | 0 | 1 | 0.80 | 0.49 | 0.23 | 0.36 | 0.70 | 1.11 | 3.50 | ▇▅▁▁▁ |
| carat | Ideal | I | 0 | 1 | 0.91 | 0.55 | 0.23 | 0.41 | 0.74 | 1.22 | 3.22 | ▇▃▂▁▁ |
| carat | Ideal | J | 0 | 1 | 1.06 | 0.58 | 0.23 | 0.54 | 1.03 | 1.41 | 3.01 | ▇▆▃▂▁ |

sum_cut_color %>% print(n=50) # example 4
#> 
#> ── Variable type: numeric ──────────────────────────────────────────────────────
#>    skim_…¹ cut color n_mis…² compl…³  mean    sd   p0   p25   p50  p75 p100 hist
#>  1 carat   Fa… D           0       1 0.920 0.405 0.25 0.7   0.9   1.01 3.4  ▆▇▁…
#>  2 carat   Fa… E           0       1 0.857 0.365 0.22 0.552 0.9   1.01 2.04 ▇▇▇…
#>  3 carat   Fa… F           0       1 0.905 0.419 0.25 0.6   0.9   1.01 2.58 ▇▇▂…
#>  4 carat   Fa… G           0       1 1.02  0.493 0.23 0.7   0.98  1.07 2.6  ▅▇▂…
#>  5 carat   Fa… H           0       1 1.22  0.548 0.33 0.9   1.01  1.51 4.13 ▇▃▂…
#>  6 carat   Fa… I           0       1 1.20  0.522 0.41 0.885 1.01  1.50 3.02 ▇▇▃…
#>  7 carat   Fa… J           0       1 1.34  0.734 0.3  0.905 1.03  1.68 5.01 ▇▃▁…
#>  8 carat   Go… D           0       1 0.745 0.363 0.23 0.42  0.7   1    2.04 ▇▅▅…
#>  9 carat   Go… E           0       1 0.745 0.381 0.23 0.41  0.7   1    3    ▇▅▁…
#> 10 carat   Go… F           0       1 0.776 0.370 0.23 0.49  0.71  1.01 2.67 ▇▆▁…
#> 11 carat   Go… G           0       1 0.851 0.433 0.23 0.5   0.9   1.01 2.8  ▇▇▂…
#> 12 carat   Go… H           0       1 0.915 0.498 0.25 0.51  0.9   1.09 3.01 ▇▇▂…
#> 13 carat   Go… I           0       1 1.06  0.576 0.3  0.7   1     1.5  3.01 ▇▆▃…
#> 14 carat   Go… J           0       1 1.10  0.537 0.28 0.71  1.02  1.5  3    ▇▇▅…
#> 15 carat   Ve… D           0       1 0.696 0.369 0.23 0.4   0.61  1    2.58 ▇▅▁…
#> 16 carat   Ve… E           0       1 0.676 0.378 0.2  0.37  0.57  0.94 2.51 ▇▆▁…
#> 17 carat   Ve… F           0       1 0.741 0.389 0.23 0.4   0.7   1.01 2.48 ▇▇▂…
#> 18 carat   Ve… G           0       1 0.767 0.418 0.23 0.4   0.7   1.02 2.52 ▇▆▂…
#> 19 carat   Ve… H           0       1 0.916 0.503 0.23 0.467 0.9   1.2  3    ▇▇▂…
#> 20 carat   Ve… I           0       1 1.05  0.552 0.24 0.7   1.00  1.5  4    ▇▆▂…
#> 21 carat   Ve… J           0       1 1.13  0.556 0.24 0.71  1.06  1.51 2.74 ▇▇▆…
#> 22 carat   Pr… D           0       1 0.722 0.397 0.2  0.4   0.58  1.01 2.57 ▇▅▂…
#> 23 carat   Pr… E           0       1 0.718 0.410 0.2  0.38  0.58  1    3.05 ▇▃▁…
#> 24 carat   Pr… F           0       1 0.827 0.420 0.2  0.43  0.76  1.04 3.01 ▇▆▂…
#> 25 carat   Pr… G           0       1 0.841 0.480 0.23 0.4   0.755 1.12 3.01 ▇▆▂…
#> 26 carat   Pr… H           0       1 1.02  0.544 0.23 0.51  1.01  1.3  3.24 ▇▇▃…
#> 27 carat   Pr… I           0       1 1.14  0.614 0.23 0.59  1.14  1.54 4.01 ▇▇▃…
#> 28 carat   Pr… J           0       1 1.29  0.614 0.3  0.81  1.25  1.7  4.01 ▇▇▃…
#> 29 carat   Id… D           0       1 0.566 0.299 0.2  0.33  0.5   0.71 2.75 ▇▂▁…
#> 30 carat   Id… E           0       1 0.578 0.313 0.2  0.33  0.5   0.72 2.28 ▇▂▁…
#> 31 carat   Id… F           0       1 0.656 0.375 0.23 0.35  0.53  0.9  2.45 ▇▃▁…
#> 32 carat   Id… G           0       1 0.701 0.411 0.23 0.34  0.54  1.03 2.54 ▇▃▂…
#> 33 carat   Id… H           0       1 0.800 0.487 0.23 0.36  0.7   1.11 3.5  ▇▅▁…
#> 34 carat   Id… I           0       1 0.913 0.554 0.23 0.41  0.74  1.22 3.22 ▇▃▂…
#> 35 carat   Id… J           0       1 1.06  0.582 0.23 0.54  1.03  1.41 3.01 ▇▆▃…
#> # … with abbreviated variable names ¹​skim_variable, ²​n_missing, ³​complete_rate

Created on 2023-04-03 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/New_York
#>  date     2023-04-03
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.2)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  base64enc       0.1-3   2015-07-28 [1] CRAN (R 4.2.0)
#>  broom           1.0.1   2022-08-29 [1] CRAN (R 4.2.2)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.2)
#>  cli             3.4.1   2022-09-23 [1] CRAN (R 4.2.2)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.2)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.2)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.2)
#>  dbplyr          2.2.1   2022-06-27 [1] CRAN (R 4.2.2)
#>  digest          0.6.30  2022-10-18 [1] CRAN (R 4.2.2)
#>  dplyr         * 1.1.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.2)
#>  evaluate        0.18    2022-11-07 [1] CRAN (R 4.2.2)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.2)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.2)
#>  forcats       * 0.5.2   2022-08-19 [1] CRAN (R 4.2.2)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.2)
#>  gargle          1.2.1   2022-09-08 [1] CRAN (R 4.2.2)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.2)
#>  ggplot2       * 3.4.0   2022-11-04 [1] CRAN (R 4.2.2)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.2)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.2)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.2)
#>  gtable          0.3.1   2022-09-01 [1] CRAN (R 4.2.2)
#>  haven           2.5.1   2022-08-22 [1] CRAN (R 4.2.2)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.2)
#>  hms             1.1.2   2022-08-19 [1] CRAN (R 4.2.2)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.2)
#>  httr            1.4.4   2022-08-17 [1] CRAN (R 4.2.2)
#>  jsonlite        1.8.4   2022-12-06 [1] CRAN (R 4.2.2)
#>  knitr           1.40    2022-08-24 [1] CRAN (R 4.2.2)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
#>  lubridate       1.9.0   2022-11-06 [1] CRAN (R 4.2.2)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.2)
#>  modelr          0.1.9   2022-08-19 [1] CRAN (R 4.2.2)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.2)
#>  pillar          1.8.1   2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.2)
#>  purrr         * 1.0.1   2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.2.2)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.2  2022-11-11 [1] CRAN (R 4.2.2)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.2)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.2)
#>  readxl          1.4.1   2022-08-17 [1] CRAN (R 4.2.2)
#>  repr            1.1.6   2023-01-26 [1] CRAN (R 4.2.3)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.2)
#>  rstudioapi      0.14    2022-08-22 [1] CRAN (R 4.2.2)
#>  rvest           1.0.3   2022-08-19 [1] CRAN (R 4.2.2)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.2.2)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.2)
#>  skimr         * 2.1.5   2022-12-23 [1] CRAN (R 4.2.3)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       * 1.5.0   2022-12-02 [1] CRAN (R 4.2.2)
#>  styler          1.8.1   2022-11-07 [1] CRAN (R 4.2.2)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.2)
#>  tidyr         * 1.3.0   2023-01-24 [1] CRAN (R 4.2.2)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.2)
#>  timechange      0.1.1   2022-11-04 [1] CRAN (R 4.2.2)
#>  tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.2)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs           0.5.2   2023-01-23 [1] CRAN (R 4.2.2)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun            0.34    2022-10-18 [1] CRAN (R 4.2.2)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.2)
#>  yaml            2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#>
#>  [1] C:/Program Files/R/R-4.2.2/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

elinw commented 1 year ago

Thanks for this report. Can you clarify what you mean by printing "the same thing"?

sum_cut and sum_cut_color are both class

[1] "one_skim_df" "tbl_df" "tbl"
[4] "data.frame"

sum_cut has 5 rows while sum_cut_color has 35. sum_cut has 12 columns; sum_cut_color has 13.

So you wouldn't expect them to be identical when they print because they are not identical objects.

Can you clarify what you mean by expecting them to print the same thing?
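
For readers following along, the class and dimension checks described in this comment can be reproduced with a short sketch like the one below. It reuses the objects from the reprex and assumes skimr, dplyr, and ggplot2 (for the diamonds data) are installed; the dimensions in the comments are the ones reported in this thread, not independently measured.

```r
library(skimr)
library(dplyr)

# diamonds ships with ggplot2; load only the dataset.
data(diamonds, package = "ggplot2")

sum_cut <- diamonds %>% group_by(cut) %>% skim(carat) %>% yank("numeric")
sum_cut_color <- diamonds %>% group_by(cut, color) %>% skim(carat) %>% yank("numeric")

# Both objects share the same class vector, so the same methods apply to each.
class(sum_cut)
class(sum_cut_color)

# Reported in this thread as 5 x 12 and 35 x 13 respectively.
dim(sum_cut)
dim(sum_cut_color)
```
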

szimmer commented 1 year ago

One prints as a nice HTML table and one does not. The n parameter of print() also seems to be ignored.

elinw commented 1 year ago

I think it's the use of the print() function. I don't see how the two outputs should be impacted by the value of n. When I use n=3 I get 3 rows, so I think the parameter is not ignored. However, it does seem like knit_print.one_skim_df() is not being used.

However, using knit_print() instead of print() seems to solve that. See:

https://rpubs.com/elinw/1024699

@michaelquinn32 It seems like something is going wrong with the print dispatch? Or is calling print explicitly essentially an override?
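
As a rough illustration of the dispatch question, the sketch below uses a toy class (hypothetical, not part of skimr) with one method for print() and one for knitr's knit_print() generic. It only shows that the two generics are independent, so an explicit print() call reaches the console-style method; it makes no claim about how skimr's own methods are implemented.

```r
library(knitr)

# Hypothetical toy class, used only to illustrate dispatch; it is not part of skimr.
toy <- structure(list(label = "demo"), class = "toy_summary")

# Console-style method, reached whenever print() is called explicitly.
print.toy_summary <- function(x, ...) {
  cat("plain console output for ", x$label, "\n", sep = "")
  invisible(x)
}

# Rich method, reached through knitr's knit_print() generic.
knit_print.toy_summary <- function(x, ...) {
  asis_output(paste0("**rich output for ", x$label, "**"))
}

plain <- capture.output(print(toy)) # goes through print.toy_summary
rich <- knit_print(toy)             # goes through knit_print.toy_summary
```

Under this reading, piping a result into print() in a knitted document would be expected to yield the plain console rendering, which matches the difference between the examples in the original report.
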

elinw commented 1 year ago

I did notice that things seemed to go a bit wild with several hundred variables all of the same type. Maybe we need to be thinking about pagination.
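
One possible shape for pagination, sketched in base R on a plain data frame: `paginate_rows()` and its `page_size` argument are hypothetical names, not skimr functions, and mtcars stands in for a long summary because subsetting a skimr object may not preserve its class and attributes.

```r
# Hypothetical helper: split a data frame into pages of at most `page_size` rows.
paginate_rows <- function(df, page_size = 10) {
  stopifnot(is.data.frame(df), page_size >= 1)
  if (nrow(df) == 0) {
    return(list())
  }
  # Assign each row index to a page number: 1, 1, ..., 2, 2, ...
  page_id <- ceiling(seq_len(nrow(df)) / page_size)
  lapply(split(seq_len(nrow(df)), page_id), function(rows) df[rows, , drop = FALSE])
}

# mtcars is only a stand-in for a long summary table.
pages <- paginate_rows(mtcars, page_size = 10)

for (i in seq_along(pages)) {
  cat("Page", i, "of", length(pages), "\n")
  print(pages[[i]])
}
```
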