microsoft / datamations

https://microsoft.github.io/datamations/

Flow for multiple group by + summarise steps #55

Open sharlagelfand opened 3 years ago

sharlagelfand commented 3 years ago

Want to test whether it's possible to do group_by -> summarise -> group_by -> summarise (or e.g. group_by -> summarise -> summarise). @jhofman will provide an example.
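For reference, a minimal sketch of the group_by -> summarise -> summarise flow in plain dplyr, independent of datamations. The `toy` data, its column names, and the `per_team` / `per_player` names are made up for illustration. It relies on `summarise()` dropping only the last grouping variable when `.groups = "drop_last"` (the default behaviour when each group returns one row), so the second `summarise()` still runs per `id`.

```r
library(dplyr)

# Hypothetical toy data: two players, two teams, a few seasons of hits.
toy <- tibble(
  id   = c("a", "a", "a", "b", "b", "b"),
  team = c("X", "X", "Y", "X", "Y", "Y"),
  h    = c(10, 20, 30, 40, 50, 60)
)

# First summarise(): one row per id/team; the result stays grouped by `id`.
per_team <- toy %>%
  group_by(id, team) %>%
  summarise(mean_h = mean(h), .groups = "drop_last")

# Second summarise(): no new group_by() needed, it collapses within `id`.
per_player <- per_team %>%
  summarise(median_h = median(mean_h))
```

Inserting an explicit `group_by(id)` between the two steps would give the same result here; the two-summarise form just leans on the leftover grouping.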

jhofman commented 3 years ago

@sharlagelfand: There were too many observations in the bike data, so here's an artificial but hopefully still interesting one: take a few famous baseball players, compute their batting average for each year they played, noting the team they played for, and then look at their median batting average over the time with that team.

library(dplyr)
library(ggplot2)

plyr::baseball %>%
  # keep three well-known players
  filter(id %in% c("ruthba01", "cobbty01", "hornsro01")) %>%
  # batting average for each player, team, and year
  group_by(id, team, year) %>%
  summarize(ba = h / ab) %>%
  # median of those yearly averages for each player and team
  group_by(id, team) %>%
  summarize(median_ba = median(ba)) %>%
  ggplot(aes(x = id, y = median_ba, color = team)) +
  geom_point(position = position_dodge(width = 0.25)) +
  labs(x = "Player", y = "Median batting average over time with each team")

I don't love the styling of this plot, but perhaps it's enough to get started with?

[image: plot of median batting average over time with each team, by player, colored by team]

sharlagelfand commented 3 years ago

Thanks @jhofman! This actually brings up another question: how to handle summary operations that combine multiple variables, e.g. ba = h / ab. Right now we don't have a way to show the distributions of two variables, or how the relationship between them derives a new variable. I'll create an issue for that, and see if we can come up with an example that just does multiple steps without running into the "derived from multiple variables" case for now.

jhofman commented 3 years ago

noting two things:

  1. funny enough, batting averages are a good example of where Simpson's paradox pops up, because the number of at-bats differs from season to season (see here).
  2. the first group-by + summarize in the example i created was kind of silly; it could just be a mutate.
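Both points can be sketched with a small example. The `seasons` data below is made up for two hypothetical players (`p1`, `p2`), not real statistics, and it assumes one row per player-season, which is what lets `mutate()` stand in for the first group_by + summarize. Comparing the per-season averages with the pooled average (total hits over total at-bats) shows the paradox: `p2` has the higher average in each season, while `p1` has the higher average overall.

```r
library(dplyr)

# Made-up season lines for two hypothetical players (not real statistics).
seasons <- tibble(
  id   = c("p1", "p1", "p2", "p2"),
  year = c(1, 2, 1, 2),
  h    = c(12, 183, 104, 45),
  ab   = c(48, 582, 411, 140)
)

# One row per player-season already, so mutate() derives the season average
# without a group_by() + summarise() step.
by_season <- seasons %>%
  mutate(ba = h / ab)

# Pooled average per player: total hits over total at-bats.
pooled <- seasons %>%
  group_by(id) %>%
  summarise(ba = sum(h) / sum(ab))
```

The reversal comes from the unequal at-bat counts: each player's pooled average is dominated by the season in which they batted most.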

jhofman commented 3 years ago

Snoozing this until we make progress on #62 for multiple variable manipulations.