microsoft / datamations

https://microsoft.github.io/datamations/

Downsample data #51

Open sharlagelfand opened 3 years ago

sharlagelfand commented 3 years ago

As per #50, we can't quite handle 3000 points - so we should warn people to downsample (and if they don't, downsample ourselves with a warning).

I'll do some experimenting to see what the cutoff for downsampling is first.

sharlagelfand commented 3 years ago

Just looking into this a bit. Seems like 1000 is likely our absolute cutoff (videos below): it performs pretty decently and does well in small multiples, at least the ones I tried! 500 - 750 perform even better, so it just depends on the tradeoff between choppiness and showing more of the data. I think recording the video itself made this even choppier - it does do better when no one's watching :)

library(dplyr)
library(ggplot2)
library(datamations)

diamonds_no_y <- diamonds %>%
  select(-y)

nrow(diamonds_no_y)
# [1] 53940
# Way too much!

# 2000 is too choppy
diamonds_sample_2000 <- diamonds_no_y %>%
  sample_n(2000)

"diamonds_sample_2000 %>% group_by(cut) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/15895337/120232362-6fc7b380-c221-11eb-9074-3159c3f85abc.mov

# 1000 seems decent (and is a nice round number)
diamonds_sample_1000 <- diamonds_no_y %>%
  sample_n(1000)

"diamonds_sample_1000 %>% group_by(cut) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/15895337/120232370-735b3a80-c221-11eb-88a3-7df0cf3f82d0.mov

"diamonds_sample_1000 %>% group_by(cut, color, clarity) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/15895337/120232378-75bd9480-c221-11eb-9b9d-ff9d4f15cd0b.mov

# 750 and 500 perform really well otherwise
diamonds_sample_750 <- diamonds_no_y %>%
  sample_n(750)

"diamonds_sample_750 %>% group_by(cut) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

"diamonds_sample_750 %>% group_by(cut, color, clarity) %>% summarise(mean = mean(price))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/15895337/120232382-781fee80-c221-11eb-80c9-dbec33c51848.mov
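
For reference, the "downsample ourselves with a warning" behaviour proposed at the top of this issue could look roughly like the sketch below. This is only an illustration, not the package's implementation: `downsample_with_warning()` and its `max_rows` argument are hypothetical names, the default of 1000 is just the cutoff that looked reasonable in the experiments above, and it uses base R sampling so it runs without any packages.

```r
# Hypothetical helper, NOT part of datamations: caps the number of rows
# and warns whenever it has to sample on the user's behalf.
downsample_with_warning <- function(data, max_rows = 1000) {
  n <- nrow(data)

  # Small enough already: return the data untouched, no warning
  if (n <= max_rows) {
    return(data)
  }

  warning(
    "Data has ", n, " rows; randomly sampling ", max_rows,
    " rows so the animation stays smooth. ",
    "Downsample the data yourself to control which rows are kept.",
    call. = FALSE
  )

  # Sample row indices without replacement; sorting keeps the original row order
  keep <- sort(sample.int(n, size = max_rows))
  data[keep, , drop = FALSE]
}

# Usage with a built-in dataset (mtcars has 32 rows)
small <- suppressWarnings(downsample_with_warning(mtcars, max_rows = 10))
nrow(small)
```

Calling `set.seed()` before the helper would make the sample reproducible. Whether the cutoff should be user-configurable is a separate design question this sketch leaves open.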

sharlagelfand commented 3 years ago

Some links on sampling in this issue: https://github.com/tidyverse/dplyr/issues/549