ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

`as.list` method for use with `writexl::write_xlsx` #665

Open jbkunst opened 3 years ago

jbkunst commented 3 years ago

Hi everyone, thanks so much for this package.

I would like to know your opinion about implementing an as.list method (or something similar) for the skim_df class.

For example, the following function transforms a skim_df object into a list, so you can export all of its elements (the summary plus partition(x)) to an Excel file using writexl::write_xlsx and get something like this:

image

 library(skimr)

 sk <- skim(head(iris))

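 # Convert a skim_df into a plain named list: one summary table
 # plus one tibble per column type.
 as.list.skim_df <- function(x, ...){

   # Summary block as a two-column tibble with blank headers
   tsummary <- as.data.frame(summary(x))
   tsummary <- tibble::as_tibble(tsummary)
   tsummary <- dplyr::select(tsummary, 1, 3)
   tsummary <- setNames(tsummary, c("",""))

   # One table per column type, converted to plain tibbles
   tdetails <- skimr::partition(x)
   tdetails <- purrr::map(tdetails, tibble::as_tibble)

   out <- c(list(summary = tsummary), tdetails)

   out

 }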

 as.list(sk)
#> $summary
#> # A tibble: 9 x 2
#>   ``                           ``          
#>   <fct>                        <fct>       
#> 1 "Name"                       "head(iris)"
#> 2 "Number of rows "            "6"         
#> 3 "Number of columns "         "5"         
#> 4 "_______________________ "   " "         
#> 5 "Column type frequency: "    " "         
#> 6 "  factor"                   "1"         
#> 7 "  numeric"                  "4"         
#> 8 "________________________  " " "         
#> 9 "Group variables"            "None"      
#> 
#> $factor
#> # A tibble: 1 x 6
#>   skim_variable n_missing complete_rate ordered n_unique top_counts            
#>   <chr>             <int>         <dbl> <lgl>      <int> <chr>                 
#> 1 Species               0             1 FALSE          1 set: 6, ver: 0, vir: 0
#> 
#> $numeric
#> # A tibble: 4 x 11
#>   skim_variable n_missing complete_rate  mean     sd    p0   p25   p50   p75
#>   <chr>             <int>         <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Sepal.Length          0             1 4.95  0.288    4.6  4.75  4.95  5.07
#> 2 Sepal.Width           0             1 3.38  0.343    3    3.12  3.35  3.58
#> 3 Petal.Length          0             1 1.45  0.138    1.3  1.4   1.4   1.48
#> 4 Petal.Width           0             1 0.233 0.0816   0.2  0.2   0.2   0.2 
#> # ... with 2 more variables: p100 <dbl>, hist <chr>

 # so you can do:
 # writexl::write_xlsx(as.list(sk), path = "/some/path/excel_file.xlsx")

Created on 2021-06-02 by the reprex package (v2.0.0)

michaelquinn32 commented 3 years ago

Thanks for the suggestion!

We handle a lot of this within reshape.R. Have a look: https://github.com/ropensci/skimr/blob/master/R/reshape.R

We could expand partition with an argument include.summary to do something like that. We would need to be a little careful about the partition/bind round trip behavior, but I think this is reasonable.

@elinw, what do you think?

elinw commented 3 years ago

I was thinking about this. If the goal is to add the summary as a list element specifically so that it can be written to an Excel file, I think it's better to start from that goal rather than from the solution of putting the summary into a list with everything else. We should think about what the general use case for a list of skim objects would be.

First, there are other packages that support writing to Excel files (e.g. openxlsx), so we should try to be general enough to support all of them. Second, in terms of data, just one thought: why would we put the summary in the same sheet with everything else? It might make more sense to default to one partition per worksheet and then put the summary on its own sheet. It might even make sense to have two different methods, write.skim_df() and write.summary_skim_df().

michaelquinn32 commented 3 years ago

Yes @elinw I would really like a generic for this too from other packages. Should we

Then we could make changes on our side as well to accommodate this.

elinw commented 2 years ago

Because we are talking about lists of skimmed objects here, we should also be conscious of any interaction with purrr (#671) and print.

elinw commented 2 years ago

This actually works fine for me:

writexl::write_xlsx(partition(skim(iris)), path = "irisskim.xls")

I thought that would also work for openxlsx::write.xlsx, but it expects data frames.

elinw commented 2 years ago

Okay, actually ... the list is a skim_list and not a list, but if you add "list" as a class, openxlsx also works, though with warnings.

Warning messages:
1: In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  row names were found from a short variable and have been discarded
2: In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  row names were found from a short variable and have been discarded

elinw commented 2 years ago

@michaelquinn32 Are you fine with adding the "list" class in a second slot of skim_list objects? Do you see any downside?

michaelquinn32 commented 2 years ago

FWICT, reading and writing with writexl and readxl is working fine, as long as you partition first. https://colab.research.google.com/drive/1osM9l78MtsR8af_XONn-utauGtDEzCAm?usp=sharing

I don't think we need to inherit from a list with a skim_list. I believe that it's always implied. https://stackoverflow.com/questions/19607652/why-doesnt-classdata-frame-show-list-inheritance

iris %>%
  skim() %>%
  partition() %>%
  is.list()
#> TRUE

One change I would like to see, though, is for the updated summaries to be included as a frame in the skim_list.

elinw commented 2 years ago

But for openxlsx it isn't working, because it expects a list. I tried adding "list" to the class and it didn't work as expected (it put the whole skim data frame in both tabs). Coercing to a list works fine, though, so I think we should simply add documentation.
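For reference, the "coerce to a plain list" workaround might look roughly like the sketch below. It reuses the tibble::as_tibble conversion from the first post; the `sheets` and `path` names and the temporary file are only for illustration, and the openxlsx line is left commented out as an untested alternative.

```r
library(skimr)

# Split the skim_df into one table per column type (a skim_list).
parts <- partition(skim(iris))

# lapply() returns a plain named list; as_tibble() drops the skimr
# classes from each element, as in the as.list.skim_df() function above.
sheets <- lapply(parts, tibble::as_tibble)

# Illustrative output location; any writable .xlsx path would do.
path <- tempfile(fileext = ".xlsx")

# A named list of data frames is written as one worksheet per element.
writexl::write_xlsx(sheets, path = path)

# Assumed alternative for openxlsx, not run here:
# openxlsx::write.xlsx(sheets, file = path)
```

The list names (the column types) are what end up as the worksheet names.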

michaelquinn32 commented 2 years ago

We should open a bug with openxlsx on this.

There also seems to be an issue with the output that they're writing. Instead of preserving the list structure from skimr, they seem to be collapsing everything into a single table. See here: https://colab.research.google.com/drive/1osM9l78MtsR8af_XONn-utauGtDEzCAm#scrollTo=KDyDfpdCnonk

michaelquinn32 commented 2 years ago

It's also worth noting that list() is not really a class in S3; it's a type. That's why a data frame returns TRUE for is.list(). That's mostly explained by this cryptic message.

Here, we describe the so called “S3” classes (and methods). For “S4” classes (and methods), see
‘Formal classes’ below.

Many R objects have a class attribute, a character vector giving the names of the classes from
which the object inherits. (Functions oldClass and oldClass<- get and set the attribute, which can
also be done directly.)

If the object does not have a class attribute, it has an implicit class, notably "matrix", "array",
"function" or "numeric" or the result of typeof(x) (which is similar to mode(x)), but for type "language"
and mode "call", where the following extra classes exist for the corresponding function calls: if,
while, for, =, <-, (, {, call.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/class.html
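A small base R sketch of the distinction described above. The `demo_list` class is a made-up stand-in for something like skim_list, not a skimr class.

```r
# A data frame is stored as a list, but "list" is not in its class attribute.
df <- data.frame(a = 1:3)
class(df)             # "data.frame"
typeof(df)            # "list"
is.list(df)           # TRUE: is.list() checks the type, not the class attribute
inherits(df, "list")  # FALSE: inherits() only looks at class(df)

# Hypothetical stand-in for a classed list such as a skim_list.
parts <- structure(list(a = 1, b = 2), class = "demo_list")
is.list(parts)           # TRUE
inherits(parts, "list")  # FALSE
class(unclass(parts))    # "list": the implicit class once the attribute is removed
```

So code that tests is.list() accepts these objects, while code that tests inherits(x, "list") or dispatches on class does not.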

elinw commented 1 year ago

Okay, so it's almost a year later. The good thing is that I tried openxlsx with partition and it works, so that is one problem taken care of.

The issues I see are

  1. Getting the summary into a frame structure
  2. Documenting the use of partition() for this purpose

elinw commented 11 months ago

Just one update on this: openxlsx2 now exists and is more tidyverse-focused.