microsoft / datamations

https://microsoft.github.io/datamations/

Allow more than one summarized value #53

Open sharlagelfand opened 3 years ago

sharlagelfand commented 3 years ago

Right now we only support one summarized value, e.g.

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary))

Maybe in the future we could think about how multiple operations (or summarizing multiple variables) could work, e.g.

small_salary %>% group_by(Degree) %>% summarize(mean = mean(Salary), median = median(Salary))

jhofman commented 2 years ago

at least three cases to think about:

  1. multiple summarize outputs that come from different variables, e.g. summarize(mean_weight = mean(weight), mean_mpg = mean(mpg))
  2. multiple summarize outputs that come from the same variable and are meant as x-y in the final frame, e.g. summarize(mean = mean(Salary), median = median(Salary))
  3. multiple summarize outputs that come from the same variable and are NOT meant as x-y in the final frame, e.g. summarize(mu = mean(Salary), se = sd(Salary) / sqrt(n()))
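
To make the three cases concrete, here is a minimal sketch in plain dplyr. It uses the built-in mtcars data as a stand-in (an assumption, since small_salary ships with datamations), and the names case1, case2 and case3 are only for illustration. It shows what summarize() returns in each case, not how datamations would animate it.

```r
library(dplyr)

# Case 1: two outputs that come from different variables
case1 <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_weight = mean(wt), mean_mpg = mean(mpg))

# Case 2: two outputs from the same variable, meant as x-y in the final frame
case2 <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(mpg), median = median(mpg))

# Case 3: two outputs from the same variable, NOT meant as x-y
# (a mean and its standard error; n() is dplyr's group-size helper)
case3 <- mtcars %>%
  group_by(cyl) %>%
  summarize(mu = mean(mpg), se = sd(mpg) / sqrt(n()))
```

In all three cases the result has one row per group and one column per summarized value, so the data shape alone does not distinguish case 2 from case 3.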

let's think about only two outputs from summarize for now because more than two makes our heads hurt.

let's start with case 1: @sharlagelfand will make some static snapshots of how this could look, with an initial scatter plot that shows all the points, which then collapse to the summarized points.
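
One way to think about the collapse is as a start and end position for every point. The sketch below is only an illustration under assumptions: it uses mtcars as stand-in data, and the column names x_start, y_start, x_end and y_end are hypothetical, not part of datamations.

```r
library(dplyr)

# For each raw point, keep its scatter position (start) and attach the
# position of its group's summarized point (end).
collapse_frames <- mtcars %>%
  group_by(cyl) %>%
  mutate(
    x_start = wt,
    y_start = mpg,
    x_end = mean(wt),   # group mean, repeated for every row in the group
    y_end = mean(mpg)
  ) %>%
  ungroup() %>%
  select(cyl, x_start, y_start, x_end, y_end)
```

Every row in a group shares the same end position, so animating from start to end would move all of a group's points onto a single summarized point.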

perhaps the way to deal with more than two is to do facets for each outcome?
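
A rough sketch of the facet idea, assuming mtcars as stand-in data and three arbitrary outcomes: reshape the summary to long form with tidyr and give each outcome its own panel. The names summary_long and p are only for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Summarize three outcomes per group, then reshape to one row per
# (group, outcome) pair so each outcome can get its own facet.
summary_long <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), mean_wt = mean(wt), mean_hp = mean(hp)) %>%
  pivot_longer(-cyl, names_to = "outcome", values_to = "value")

# One panel per outcome; free y scales because the outcomes have different units.
p <- ggplot(summary_long, aes(x = factor(cyl), y = value)) +
  geom_point() +
  facet_wrap(vars(outcome), scales = "free_y")
```

This puts the grouping variable back on the x-axis, so it fits more than two outcomes better than the x-y layout used for exactly two.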

sharlagelfand commented 2 years ago

Just switching back to penguins here, but here's how case 1 could look in terms of the scatter and summary frames:

[Screenshots: the scatter frame and the summary frame]

From this code:

library(dplyr)
library(ggplot2)
library(palmerpenguins)

theme_set(theme_minimal())

# Scatter frame: every penguin as one point, colored by island
penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = island)) +
  geom_point() +
  coord_cartesian(xlim = c(30, 60), ylim = c(12, 23))

# Summary frame: one point per island at the mean bill length and depth,
# drawn with the same axis limits so the two frames line up
penguins %>%
  group_by(island) %>%
  summarise(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    mean_bill_depth = mean(bill_depth_mm, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = mean_bill_length, y = mean_bill_depth, color = island)) +
  geom_point() +
  coord_cartesian(xlim = c(30, 60), ylim = c(12, 23))

For the infogrid, we'd normally show island on the x-axis, but here we'd have to make the call (likely from the fact that island is mapped only to color, not to x, in the ggplot2 code) to represent it only in color, like so:

[Screenshot: infogrid with island represented only in color]
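
The "appears only in color, not in x" check could be sketched by inspecting the plot's top-level mapping. This is a guess at one possible approach, not how datamations does it: aesthetics_for is a hypothetical helper, iris stands in for penguins, and mappings set inside individual layers are ignored.

```r
library(ggplot2)

# Hypothetical helper: which top-level aesthetics are mapped to `variable`?
aesthetics_for <- function(plot, variable) {
  # Turn each mapped expression into its label, e.g. "Species"
  mapped <- vapply(plot$mapping, rlang::as_label, character(1))
  names(mapped)[mapped == variable]
}

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# ggplot2 stores the color aesthetic under the name "colour"
group_aes <- aesthetics_for(p, "Species")
color_only <- length(group_aes) == 1 && group_aes %in% c("colour", "color")
```

If color_only is TRUE, the grouping variable would be represented only in color rather than placed on the x-axis.
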
jhofman commented 2 years ago

great point about not using the x axis for island in a setting like this.

i think this approach makes sense. it would be great to prototype it in gemini @giorgi-ghviniashvili