sharlagelfand opened this issue 3 years ago
Just looking into this a bit. It seems like 1000 is likely our absolute cutoff (videos below): it performs pretty decently and does well in small multiples, at least in the ones I tried! 500-750 perform even better, so it just depends on the tradeoff between choppiness and showing more of the data. I think recording the video itself made it even choppier; it does do better when no one's watching :)
library(dplyr)
library(ggplot2)
library(datamations)

diamonds_no_y <- diamonds %>%
  select(-y)

nrow(diamonds_no_y)
# [1] 53940
# Way too much!

# 2000 is too choppy
diamonds_sample_2000 <- diamonds_no_y %>%
  sample_n(2000)

"diamonds_sample_2000 %>% group_by(cut) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

# 1000 seems decent (and is a nice round number)
diamonds_sample_1000 <- diamonds_no_y %>%
  sample_n(1000)

"diamonds_sample_1000 %>% group_by(cut) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

"diamonds_sample_1000 %>% group_by(cut, color, clarity) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

# 750 and 500 perform really well otherwise
diamonds_sample_750 <- diamonds_no_y %>%
  sample_n(750)

"diamonds_sample_750 %>% group_by(cut) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

"diamonds_sample_750 %>% group_by(cut, color, clarity) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()
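
For trying more candidate sizes quickly, here's a minimal sketch (assuming the same diamonds_no_y from above; the set.seed() call is only there so reruns draw the same rows):

# Sketch: eyeball several candidate sizes in one pass; each print()
# renders a datamation in the viewer, so step through them one at a time
set.seed(2021)
for (n in c(500, 750, 1000, 2000)) {
  name <- paste0("diamonds_sample_", n)
  assign(name, sample_n(diamonds_no_y, n))
  print(datamation_sanddance(
    paste0(name, " %>% group_by(cut) %>% summarise(mean = mean(price))")
  ))
}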
Some links on sampling in this issue: https://github.com/tidyverse/dplyr/issues/549
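Side note: in current dplyr (>= 1.0.0), sample_n() is superseded by slice_sample(), so the equivalent draw would be:

# slice_sample() is the current dplyr equivalent of sample_n()
diamonds_sample_1000 <- diamonds_no_y %>%
  slice_sample(n = 1000)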
As per #50, we can't quite handle 3000 points, so we should warn people to downsample (and if they don't, downsample ourselves with a warning). I'll do some experimenting first to see what the cutoff for downsampling should be.
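A minimal sketch of that guard, assuming a hypothetical helper (downsample_if_needed() and its max_points default are placeholders, not datamations API, pending the cutoff experiments):

# Hypothetical guard: warn and downsample when a dataset exceeds what
# the animation handles well; the 1000-row default is a placeholder
downsample_if_needed <- function(data, max_points = 1000) {
  if (nrow(data) <= max_points) {
    return(data)
  }
  warning(
    "Data has ", nrow(data), " rows; datamations get choppy above ",
    max_points, " points. Downsampling to ", max_points, " rows."
  )
  dplyr::sample_n(data, max_points)
}

# e.g. the full diamonds data (53,940 rows) would come back as
# 1000 rows plus a warning
small_diamonds <- downsample_if_needed(ggplot2::diamonds)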