jthomasmock / gtExtras

A Collection of Helper Functions for the gt Package.
https://jthomasmock.github.io/gtExtras/

Issues with large data set #107

Closed · roaldarbol closed this 10 months ago

roaldarbol commented 11 months ago

Prework

This is kind of both a bug and a feature request in one - I apologise for the weird mashup, but I think it fits together. :-) Working on really large data means some of the operations become quite performance heavy, making gtExtras unsuitable, as it is not optimized for performance. That's sad, as the functionality is super useful! So here are a few recommendations - happy to discuss further.

Description

With large data, the string buffer used during rendering runs out of space (the internal `paste()` call exceeds R's 2^31-1 byte limit, see the error below). In general, large speed-ups should also be possible.

Reproducible example

Unfortunately I can't make a reprex with my data set as it's way too large. But here is what breaks it.

> df_labia |> 
+   select(where(not_all_na)) |> 
+   gt_plt_summary()
Error in paste(buffer[seq_len(position)], collapse = "") : 
  result would exceed 2^31-1 bytes
In addition: Warning message:
Computation failed in `stat_bin()`
Caused by error in `bin_breaks_width()`:
! `binwidth` must be positive 
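
For a sense of scale, something along these lines is in the right ballpark (a made-up stand-in, not my actual data, and it may not hit the exact same 2^31-1 byte limit on every machine):

library(dplyr)
library(gtExtras)

# Hypothetical stand-in: ~1,000,000 rows of continuous variables,
# roughly the scale of the real data set (which I cannot share)
df_big <- data.frame(
  a = rnorm(1e6),
  b = runif(1e6),
  c = rnorm(1e6, mean = 50, sd = 5)
)

df_big |>
  gt_plt_summary()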

Proposal

  1. My main proposal is to make use of the {collapse} package from the fastverse. You can have it mask dplyr functions, which gives roughly a 10x speed-up (possibly more on larger data); benchmarks are in their latest blog post (https://sebkrantz.github.io/Rblog/2023/10/17/releasing-collapse-2-0-blazing-fast-joins-reshaping-and-enhanced-r/). I have no affiliation with the project, but I think it is an amazing way to keep the tidyverse style whilst providing, as they say, the performance of running close to the metal. A rough sketch of what this could look like is shown after this list.

  2. Secondly - and I'm not quite sure exactly how this would be carried out - the data could be downsampled for the graphs; this seems to relate specifically to the issue I've encountered. I am mainly focusing on continuous variables in my current data set. All the summary values (NAs, mean, median, sd) can still be computed from the original data, but maybe we don't need to base the histogram on 1,000,000 values. To keep performance high, maybe it would be possible to cap the number of observations the plots are based on and then use slice() to filter - an implementation could look something like this:

    # Keep at most `max_rows` evenly spaced rows (dplyr assumed to be loaded)
    max_rows <- 10000L
    df_rows <- nrow(data)
    rows_diff <- df_rows - max_rows
    n_rows_to_keep <- if_else(rows_diff > 0, max_rows, df_rows)
    # Evenly spaced row indices across the full data, rounded to integers
    keep_rows <- seq(from = 1, to = df_rows, length.out = n_rows_to_keep) |>
      round()
    data |>
      slice(keep_rows)
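
To make proposal 1 a bit more concrete, here is a rough, untested sketch of how the per-column summaries could be computed with collapse's fast statistical functions. I haven't looked at the gt_plt_summary() internals, so the shape of this is just a guess; collapse 2.0 can also mask the dplyr verbs directly if that is preferred.

library(collapse)

# Hypothetical example data; the real data set has ~1,000,000 rows of
# continuous variables
df <- data.frame(x = rnorm(1e6), y = rnorm(1e6, mean = 5, sd = 2))

# A rough stand-in for the per-column summary step: collapse's fast
# functions operate column-wise on a data frame and return named
# vectors, so the whole summary is a handful of vectorised calls
col_summary <- data.frame(
  column = names(df),
  n_miss = nrow(df) - fnobs(df),  # missing values per column
  mean   = fmean(df),             # fast column means
  median = fmedian(df),           # fast column medians
  sd     = fsd(df)                # fast column standard deviations
)

col_summary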

Session info

You can find my session info in #106.

jthomasmock commented 10 months ago

Howdy! I'm not planning to further expand handling of very large datasets at this time.

With substantially large data, you can typically sample a fraction and still get a good overview of the data.

df <- tibble(
  x = rnorm(n = 1e7, mean = 100, sd = 10)
)

df |> 
  slice_sample(n = 1e3) |> 
  gt_plt_summary()

df |>
  slice_sample(n = 1e4) |>
  gt_plt_summary()

[screenshots: gt_plt_summary() output for the 1e3-row and 1e4-row samples]

jthomasmock commented 10 months ago

I've added a warning for datasets larger than 100,000 rows - thanks!
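
For reference, the idea is roughly along these lines (an illustrative sketch only; the actual implementation in gtExtras may differ):

# Hypothetical sketch of a row-count guard before building the summary;
# threshold taken from the comment above, names and wording are made up
check_large_data <- function(data, threshold = 100000) {
  if (nrow(data) > threshold) {
    warning(
      "Data has more than ", threshold, " rows. ",
      "Consider sampling, e.g. dplyr::slice_sample(data, n = 10000), ",
      "before calling gt_plt_summary() for better performance.",
      call. = FALSE
    )
  }
  invisible(data)
}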

roaldarbol commented 10 months ago

Completely fair! A warning should do it too - thanks! :-)