ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr
1.1k stars 79 forks source link

Replace as.data.frame to improve performance #678

Closed hms1 closed 2 years ago

hms1 commented 2 years ago

This PR supersedes #674 to cleanly rebase and add better benchmarking.

This very small PR replaces as.data.frame with as_tibble inside the reshape_skimmed function. This has a significant impact on performance, particularly for large datasets: as.data.frame is known to be slow (see here) and in combination with increased calls to reshape_skimmed as variables are added leads to an exponential degradation in performance with dataset size.

image

code ```R library(tidyverse) library(stringi) library(skimr) library(microbenchmark) #devtools::install_github("ropensci/skimr@develop") #### Make some wide test data rows <- c(100,1000,10000) cols <- c(30,90,300,600,900,1200) results <- list() for(r in 1:length(rows)){ results[[r]] <- list() for(c in 1:length(cols)){ number_rows <- rows[r] coltypes <- c(rep("numeric",ceiling(cols[c]/3)), rep("date",ceiling(cols[c]/3)), rep("char", ceiling(cols[c]/3))) large_test_df <- bind_cols( map(seq_len(length(coltypes)), function(col){ if(coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0,1,2,3), 1)) if(coltypes[col] == "date") content <- sample(seq(as.Date('1950-01-01'), Sys.Date(), by="day"), number_rows, replace = T) if(coltypes[col] == "char") content <- sample(stri_rand_strings(500, sample(c(1,3,5), 1)), number_rows, replace = T) tibble(!!sym(str_c("col",col)) := content) })) reshape_skimmed_df <- function (column, skimmed, groups) { delim_name <- paste0(column, "_", NAME_DELIMETER) out <- dplyr::select(as.data.frame(skimmed), !!!groups, tidyselect::starts_with(delim_name, ignore.case = FALSE)) set_clean_names(out) } reshape_skimmed_tibble <- function (column, skimmed, groups){ delim_name <- paste0(column, "_", NAME_DELIMETER) out <- dplyr::select(tidyr::as_tibble(skimmed), !!!groups, tidyselect::starts_with(delim_name, ignore.case = FALSE)) set_clean_names(out) } results[[r]][[c]] <- microbenchmark( { assignInNamespace("reshape_skimmed", `environment<-`(reshape_skimmed_df, environment(skimr:::build_results)), ns="skimr") full_version_2 <- skim(large_test_df) }, { assignInNamespace("reshape_skimmed", `environment<-`(reshape_skimmed_tibble, environment(skimr:::build_results)), ns="skimr", envir = environment(skimr:::build_results)) full_version <- skim(large_test_df) }, times = 3) } } ```

Using as_tibble the bottleneck of reshaping data column-wise is removed and time now scales linearly. Furthermore number of rows has minimal effect on the overall time taken.

hms1 commented 2 years ago

@elinw @michaelquinn32 apologies for the multiple PRs - struggled to rebase from the forked repo and ended up with a mess. This one is now passing checks though and added some better benchmarking which is quite interesting.

elinw commented 2 years ago

@hms1 Thank you! This is really great documentation of the PR.

elinw commented 2 years ago

The tests are failing because of something with the runner, but it has nothing to do with this change, it's not getting far enough in running to actually test. So I'm going to go ahead and merge this to the develop branch. I read something about this issue https://github.com/r-hub/rhub/issues/500 .

michaelquinn32 commented 2 years ago

Thanks again for the fix Henry!