ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr
1.1k stars 79 forks source link

Using skim_with() and focus() together returns columns in unexpected order #668

Closed mvanaman closed 2 years ago

mvanaman commented 3 years ago

I created the following my_skim and then used it in conjunction with focus:

# create my_skim
my_skim <- skim_with(numeric = sfl(p25 = NULL, p75 = NULL, min = min, max = max), append = TRUE)

# use with focus
iris %>% 
my_skim %>% 
focus(
skim_variable, 
n_missing,
complete_rate, 
numeric.mean, 
numeric.sd, 
numeric.min, 
numeric.p50, 
numeric.max, 
numeric.hist
)

Based on the behavior of dplyr::select(), the columns should be in the order they are listed. If my output replicates, you'll actually see the following order: skim_variable n_missing complete_rate mean sd p50 hist min max

I'm wondering if it is true that the focus() function ignores ordering in some cases, or if I might have overlooked something in my code?

Thanks for your time, and thanks for creating and maintaining such an intuitive package! Matthew

michaelquinn32 commented 2 years ago

Thanks for pointing that out! And sorry for the delay getting back to you.

The issue is the print method, which as currently implemented doesn't necessarily respect the order of columns in the skim_df. Some of this is intentional, but this is clear a bad interaction with focus.

iris %>% 
  my_skim %>% 
  focus(
    skim_variable, n_missing, complete_rate,  numeric.mean, numeric.sd, 
    numeric.min, numeric.p50, numeric.max, numeric.hist
  )
#> ── Data Summary ────────────────────────
#>                            Values    
#> Name                       Piped data
#> Number of rows             150       
#> Number of columns          5         
#> _______________________              
#> Column type frequency:               
#>   factor                   1         
#>   numeric                  4         
#> ________________________             
#> Group variables            None      
#> 
#> ── Variable type: factor ────────────────────────
#>   skim_variable n_missing complete_rate
#> 1 Species               0             1
#> 
#> ── Variable type: numeric ───────────────────────
#>   skim_variable n_missing complete_rate  mean    sd   p50 hist    min   max
#> 1 Sepal.Length          0             1  5.84 0.828  5.8  ▆▇▇▅▂   4.3   7.9
#> 2 Sepal.Width           0             1  3.06 0.436  3    ▁▆▇▂▁   2     4.4
#> 3 Petal.Length          0             1  3.76 1.77   4.35 ▇▁▆▇▂   1     6.9
#> 4 Petal.Width           0             1  1.20 0.762  1.3  ▇▁▇▅▃   0.1   2.5

That's the printed output, but the order of the columns in the data have changed, FWIW.

iris %>% 
  my_skim %>% 
  focus(
    skim_variable, n_missing, complete_rate,  numeric.mean, numeric.sd, 
    numeric.min, numeric.p50, numeric.max, numeric.hist
  ) %>%
  str()                                                                                            
#> skim_df [5 × 10] (S3: skim_df/tbl_df/tbl/data.frame)                                               
#>  $ skim_type    : chr [1:5] "factor" "numeric" "numeric" "numeric" ...                             
#>  $ skim_variable: chr [1:5] "Species" "Sepal.Length" "Sepal.Width" "Petal.Length" ...              
#>  $ n_missing    : int [1:5] 0 0 0 0 0                                                              
#>  $ complete_rate: num [1:5] 1 1 1 1 1                                                              
#>  $ numeric.mean : num [1:5] NA 5.84 3.06 3.76 1.2                                                  
#>  $ numeric.sd   : num [1:5] NA 0.828 0.436 1.765 0.762                                             
#>  $ numeric.min  : num [1:5] NA 4.3 2 1 0.1                                                         
#>  $ numeric.p50  : num [1:5] NA 5.8 3 4.35 1.3                                                      
#>  $ numeric.max  : num [1:5] NA 7.9 4.4 6.9 2.5                                                     
#>  $ numeric.hist : chr [1:5] NA "▆▇▇▅▂" "▁▆▇▂▁" "▇▁▆▇▂" ...                                         
#>  - attr(*, "data_rows")= int 150                                                                   
#>  - attr(*, "data_cols")= int 5                                                                     
#>  - attr(*, "df_name")= chr "`.`"                                                                   
#>  - attr(*, "dt_key")= logi NA
#>  - attr(*, "groups")= list()
#>  - attr(*, "base_skimmers")= chr [1:2] "n_missing" "complete_rate"
#>  - attr(*, "skimmers_used")=List of 2
#>   ..$ numeric: chr [1:8] "mean" "sd" "p0" "p50" ...
#>   ..$ factor : chr [1:3] "ordered" "n_unique" "top_counts"                                                                                  

What hasn't changed is the ordering of the skimmers_used attribute. This is what we'd need to fix.

Part of this is a documentation issue too. The base skimmers are always the first in the output, and we need to do a better job of communicating that.

elinw commented 2 years ago

I think it is happening here https://github.com/ropensci/skimr/blob/840b6237597e3da7c8970a6581086ef73c561678/R/skim_with.R#L280

elinw commented 2 years ago

The attributes keep the order for skimmers_used from the original my_skim. I'm not sure why this would b the case but maybe we need to update those to the order in focus? Yes what's happening is that in partition()it uses the attributes, which are in the wrong order.

elinw commented 2 years ago

@michaelquinn32 @mvanaman I created a PR for this. Can you take a look and see if it could be more elegant? Also try some more test cases. If you think it works I'll add news, tests, documentation.

mvanaman commented 2 years ago

Sorry about the delayed response. @michaelquinn32, do you mean that placing the base_skimmers first needs to stay where it is - regardless of their ordering in focus - but that the function should be updated so that the custom skimmers my_skim do respect ordering?

If that's the idea, I tried out @elinw's solution and it seems to accomplish exactly that as far as I can tell!

michaelquinn32 commented 2 years ago

@mvanaman My concerns are mostly about internals, and I'm very glad to see that Elin's approach is working! I should be able to take a stab at simplifying it later in the week.

As Elin mentioned above, any other relevant test cases would be much appreciated.

michaelquinn32 commented 2 years ago

Fixed by #670