ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Replace as.data.frame to improve performance #674

Closed: hms1 closed this 2 years ago

hms1 commented 2 years ago

This very small PR replaces as.data.frame() with as_tibble() inside the reshape_skimmed() function. This alone has a significant impact on performance, particularly for large datasets: as.data.frame() is known to be slow (see here), and combined with the increased number of calls to reshape_skimmed() as variables are added, it leads to an exponential degradation in performance with dataset size.
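For context, the change itself amounts to a one-line substitution. A minimal sketch of the idea (illustrative only; the variable names are hypothetical and the real body of reshape_skimmed() in skimr differs):

# Illustrative only: the coercion step before and after this PR.
# before: out <- as.data.frame(skimmed)   # slow, called once per variable
# after:
out <- tibble::as_tibble(skimmed)          # much cheaper coercion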

To demonstrate, build a moderately sized dataset:


library(tidyverse)
library(stringi)
library(skimr)

number_rows <- 100000
coltypes <- c(rep("numeric", 333), rep("date", 333), rep("char", 333))

large_test_df <- bind_cols(
  map(seq_along(coltypes), function(col) {
    if (coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0, 1, 2, 3), 1))
    if (coltypes[col] == "date")    content <- sample(seq(as.Date("1950-01-01"), Sys.Date(), by = "day"), number_rows, replace = TRUE)
    if (coltypes[col] == "char")    content <- sample(stri_rand_strings(500, sample(c(1, 3, 5), 1)), number_rows, replace = TRUE)
    tibble(!!sym(str_c("col", col)) := content)
  })
)

system.time(full_version <- skimr::skim(large_test_df))
   user  system elapsed 
 36.603   1.491  38.426 

Running the same call on the previous develop branch is almost an order of magnitude slower:

   user  system elapsed 
243.928  28.356 272.908 

The difference becomes greater as variables are added.
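For anyone wanting to reproduce the comparison, one approach is to install each branch in turn and time the same call. The PR branch reference below is an assumption; substitute the actual fork and branch, and restart R between installs so the new build is loaded:

# Old path: develop branch before this PR.
remotes::install_github("ropensci/skimr", ref = "develop")
system.time(skimr::skim(large_test_df))

# New path: the PR branch (hypothetical ref; substitute the real one).
remotes::install_github("hms1/skimr", ref = "patch-as-tibble")
system.time(skimr::skim(large_test_df))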

A further rework/simplification of the skim_by_type methods could speed things up even more.

hms1 commented 2 years ago

I noticed that there is a failing test, but it seems to be caused by fcdf240.

elinw commented 2 years ago

@michaelquinn32 What do you think? as.data.frame() is actually also less preferred than data.frame(). Since we use tibbles essentially throughout, it probably makes sense to go all in.
@hms1 Yes, we know there is a test issue (you can even see it in the main build).
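To see the relative coercion costs directly, here is a standalone sketch (not part of the PR) comparing the three approaches on a wide, named list of columns:

library(bench)
library(tibble)

# 1000 numeric columns of 100 rows each, as a named list.
x <- setNames(replicate(1000, runif(100), simplify = FALSE),
              paste0("col", 1:1000))

bench::mark(
  as_data_frame = as.data.frame(x),
  data_frame    = data.frame(x),
  as_tibble     = as_tibble(x),
  check = FALSE  # outputs differ in class, so skip equality checks
)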

michaelquinn32 commented 2 years ago

Can you update your branch to PR against the current version of develop and try the checks again?

Otherwise, we can look to patch the tests to resolve the issue.

Thanks!

michaelquinn32 commented 2 years ago

Also, mind updating the NEWS and adding yourself as a contributor in the DESCRIPTION? Really appreciate you taking the effort to sort these things out!

elinw commented 2 years ago

While I agree with this minor change with meaningful impact, I'm still wondering the following: if the data has thousands of columns and the time is growing exponentially with the number of columns, would it make sense to chunk? A rough sketch of what that could look like is below. I don't know if the growth really is exponential, but I feel like this is a use case we didn't really consider. For example, print() doesn't (I don't think) have pagination.
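A hypothetical sketch of chunking from the caller's side (skim_chunked is not an existing skimr function):

# Skim a wide data frame in blocks of columns, then row-bind the summaries.
skim_chunked <- function(df, chunk_size = 100) {
  groups <- split(names(df), ceiling(seq_along(df) / chunk_size))
  chunks <- lapply(groups, function(cols) skimr::skim(df[cols]))
  dplyr::bind_rows(chunks)  # note: this drops the skim_df class/attributes
}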

By the way, in my testing I used only 100 rows in order to isolate the impact of many columns. It might also make sense to look at whether the types (or number of types) of the columns have an impact; a sketch for probing the scaling follows.
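One way to probe the growth rate along those lines: fix the row count at 100, vary only the number of columns (all numeric, to control for type effects), and check how elapsed time scales:

# If elapsed time roughly quadruples when the column count doubles,
# the growth is quadratic rather than exponential.
cols <- c(50, 100, 200, 400, 800)
elapsed <- sapply(cols, function(p) {
  df <- as.data.frame(matrix(runif(100 * p), nrow = 100))
  system.time(skimr::skim(df))[["elapsed"]]
})
data.frame(cols, elapsed)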

Anyway, I think we should start a separate issue about increasing speed.