roaldarbol closed this issue 1 year ago
Howdy! I'm not planning to expand support for handling very large datasets at this time.
With substantially large data, you can typically sample a fraction and still get a good overview of the data.
```r
library(dplyr)    # for tibble() and slice_sample()
library(gtExtras)

# 10 million normally distributed values
df <- tibble(
  x = rnorm(n = 1e7, mean = 100, sd = 10)
)

# A random sample of 1,000 rows is usually enough for an overview
df |>
  slice_sample(n = 1e3) |>
  gt_plt_summary()

# Or 10,000 rows for a finer-grained picture
df |>
  slice_sample(n = 1e4) |>
  gt_plt_summary()
```
I've added a warning for datasets larger than 100,000 rows - thanks!
Completely fair! A warning should do it too - thanks! :-)
Prework
This is kind of both a bug and a feature request in one - I apologise for that weird mashup, but I think it fits together. :-) Working with really large data means some of the operations are quite performance-heavy, making {gtExtras} unsuitable, as it is not optimised for performance. That's sad, as the functionality is super useful! So here are a few recommendations - happy to discuss further.
Description
With large data, the buffer runs out of space. In general, large speed-ups could be possible.
Reproducible example
Unfortunately I can't make a reprex with my data set, as it's way too large. But the sketch below shows the kind of call that breaks it.
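This is only illustrative - a single large numeric column standing in for my actual data (same scale, ~1e7 rows):

```r
library(dplyr)
library(gtExtras)

# ~10 million rows; roughly the scale of my real data
big_df <- tibble(x = rnorm(n = 1e7, mean = 100, sd = 10))

# Summarising the full data is where it falls over
big_df |>
  gt_plt_summary()
```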
Proposal
My main proposal is to make use of the {collapse} package from the fastverse. You can have it mask dplyr functions, which results in roughly a 10x speed-up (possibly more on larger data). Benchmarks can be found in their latest blog post (https://sebkrantz.github.io/Rblog/2023/10/17/releasing-collapse-2-0-blazing-fast-joins-reshaping-and-enhanced-r/). I have no affiliation with the project, but I think it is an amazing way to keep the tidyverse style whilst providing the performance of running close to the metal, as they say.
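To give a feel for it, here is a rough sketch using collapse's fast statistical functions (the masking option is my understanding of their docs - check there for details):

```r
library(collapse)

x <- rnorm(1e7)

# Drop-in fast replacements for common summaries, implemented in C/C++
fmean(x)    # instead of mean(x)
fmedian(x)  # instead of median(x)
fsd(x)      # instead of sd(x)
fnobs(x)    # number of non-missing observations

# Masking (as I understand it): setting this *before* library(collapse)
# makes collapse provide dplyr-style verbs (select, group_by,
# summarise, ...) backed by its fast implementations.
# options(collapse_mask = "manip")
```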
Secondly, and I'm not quite sure how exactly this would be carried out, it could help to somehow downsample the data for the graphs - this seems to relate specifically to the issue I've encountered. I am mainly focusing on continuous variables in my current data set. All the summary values (NAs, mean, median, sd) can be computed from the original data - but maybe we don't need to base the histogram on 1,000,000 values. To keep performance high, maybe it would be possible to set a maximum number of observations to base it on, and then use slice() to filter - an implementation could look something like this:
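Just a sketch: the helper name is made up (not part of {gtExtras}), and I use dplyr::slice_sample() rather than a manual slice() for brevity:

```r
# Hypothetical helper: cap the number of rows the plotting code sees,
# while summary statistics still use the full data.
downsample_for_plots <- function(data, max_n = 1e4) {
  if (nrow(data) > max_n) {
    dplyr::slice_sample(data, n = max_n)
  } else {
    data
  }
}

# e.g. inside gt_plt_summary(), the histogram input could become:
# plot_data <- downsample_for_plots(data, max_n = max_plot_obs)
```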
Session info
You can find my session info in #106.