microsoft / datamations

https://microsoft.github.io/datamations/

(How) can we parse and handle a ggplot command at the end of a pipeline? #40

Closed jhofman closed 3 years ago

jhofman commented 3 years ago

Right now we're sort of implicitly assuming that grouping variables become faceting variables, which is reasonable and will generalize. But what if someone wants control over this? More generally, we want to "respect" the final plot that they generate and have the steps leading up to that reflect this.

To illustrate, imagine the same data analysis pipeline, but with three different plotting commands at the end. Right now we'd show the same datamation for each, but in theory they should end in different frames (and so should also contain different frames leading up to that).

Degree on the x, Work as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_wrap(~ Work)

vs

Degree on the x, Work and Degree as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_grid(Degree ~ Work)

vs

Degree on the x, Work as (dodged) color, no facet

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary, color = Work)) +
  geom_point(position = position_dodge(width = 0.25))

This will require a bunch of thinking and probably some hacking of ggproto objects, but let's do the thinking before the hacking.

sharlagelfand commented 3 years ago

Just rendered versions to refer to:

Degree on the x, Work as facets

library(ggplot2)
library(dplyr)
library(datamations)

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_wrap(~ Work)

Degree on the x, Work and Degree as facets

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_grid(Degree ~ Work)

Degree on the x, Work as (dodged) color, no facet

small_salary_data %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary, color = Work)) +
  geom_point(position = position_dodge(width=0.25))

sharlagelfand commented 3 years ago

Started looking into this, some notes:

library(datamations)
library(dplyr)
library(ggplot2)
library(rlang)

pipeline <- "small_salary %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_wrap(~ Work)"

Parse the pipeline - it splits by %>%, so all of the plotting is in the last element

pipeline_steps <- pipeline %>%
  parse_pipeline()

pipeline_steps
#> [[1]]
#> small_salary
#> 
#> [[2]]
#> group_by(Degree, Work)
#> 
#> [[3]]
#> summarize(mean_salary = mean(Salary, na.rm = TRUE))
#> 
#> [[4]]
#> ggplot(aes(x = Degree, y = mean_salary)) + geom_point() + facet_wrap(~Work)
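
For reference, a minimal sketch of how a split like this could be done on the parsed expression. This is an assumption for illustration, not how `parse_pipeline()` is actually implemented, and `split_pipeline()` is a hypothetical name. The wrinkle is that `%>%` binds tighter than `+`, so the parsed pipeline is a `+` call at the top level and the ggplot layers have to be re-attached to the last step:

```r
library(rlang)

# Hypothetical helper (not part of datamations): split a parsed pipeline into
# its %>% steps, keeping any `+ layer()` calls attached to the final step.
split_pipeline <- function(expr) {
  if (is_call(expr, "%>%")) {
    # lhs %>% rhs: split the left side further, the right side is one step
    c(split_pipeline(expr[[2]]), list(expr[[3]]))
  } else if (is_call(expr, "+", n = 2)) {
    # (pipeline) + layer: split the pipeline, then glue the layer back
    # onto its last step so the plotting code stays together
    steps <- split_pipeline(expr[[2]])
    n <- length(steps)
    steps[[n]] <- call2("+", steps[[n]], expr[[3]])
    steps
  } else {
    list(expr)
  }
}

# Nothing is evaluated here, so the names do not need to exist
steps <- split_pipeline(parse_expr(
  "d %>% group_by(g) %>% ggplot(aes(x = g)) + geom_point() + facet_wrap(~ w)"
))

length(steps)
#> [1] 3
```

Working on the expression rather than the string avoids problems with `%>%` appearing inside a string or a nested call, at the cost of handling the operator precedence by hand.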

Evaluate each of the data states (except the last, which contains plotting) - need this for each of the stages animated

data_states <- pipeline_steps[1:3] %>%
  datamations:::snake(envir = global_env())

Now actually evaluate the entire pipeline (with the plotting) to get the plot object

p <- pipeline %>%
  parse_expr() %>%
  eval()

p

And we can get aspects of plotting from the plot object itself (rather than from the code used to generate it, since it can be written in so many different ways)

# Extracting x and y variables
p$mapping$x %>%
  rlang::quo_name()
#> [1] "Degree"

# Extracting facets
# Some combination of this:
p$facet$params$facets %>%
  names()
#> [1] "Work"

# Doesn't actually say whether this is a row or column facet, but we can combine information
bp <- ggplot_build(p)
bp$layout$layout
#>   PANEL ROW COL     Work SCALE_X SCALE_Y
#> 1     1   1   1 Academia       1       1
#> 2     2   1   2 Industry       1       1

# This shows that there's only 1 row (and 2 cols) so could figure out faceting info that way
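
The row/column check described in that last comment could be sketched like this. It uses a toy data frame (made-up values, standing in for the summarized salary data) and assumes the `ROW` and `COL` columns of the built layout shown above:

```r
library(ggplot2)

# Toy stand-in for the summarized data; the values are made up
df <- data.frame(
  Degree = c("Masters", "PhD", "Masters", "PhD"),
  Work = c("Academia", "Academia", "Industry", "Industry"),
  mean_salary = c(85, 90, 91, 93)
)

p <- ggplot(df, aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_wrap(~ Work)

# One row per panel, with its position in the facet grid
layout <- ggplot_build(p)$layout$layout

n_rows <- length(unique(layout$ROW))
n_cols <- length(unique(layout$COL))

# More than one column and a single row suggests a column facet, and vice versa
c(rows = n_rows, cols = n_cols)
```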

The code in the ggreverse package will probably be super helpful for figuring these bits out.

Then we can create e.g. a list with the x variable, y variable, facets, colours, etc., and use those to create the specs, rather than basing it on the # of groups. If they don’t supply ggplot2 code, we can fall back to the defaults that exist now: col facet -> row facet -> colors.
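
A rough sketch of what that list could look like. Both helpers are hypothetical (not datamations functions), and the field names are assumptions; the facet lookups rely on the internal `p$facet$params` structure used above, which may differ between ggplot2 versions:

```r
library(ggplot2)

# Hypothetical helper: summarise a ggplot object as a list of variable names.
# Only global aesthetics from ggplot(aes(...)) are read; layer-level aes()
# calls are ignored in this sketch.
extract_plot_spec <- function(p) {
  mapping <- vapply(p$mapping, rlang::as_label, character(1))
  params <- p$facet$params
  list(
    x = unname(mapping["x"]),
    y = unname(mapping["y"]),
    colour = unname(mapping["colour"]),      # NA when no colour is mapped
    wrap_facets = names(params[["facets"]]), # set by facet_wrap()
    row_facets = names(params[["rows"]]),    # set by facet_grid()
    col_facets = names(params[["cols"]])     # set by facet_grid()
  )
}

# Hypothetical fallback when no ggplot2 code is supplied: assign grouping
# variables to roles in the order col facet -> row facet -> colour.
default_roles <- function(group_vars) {
  roles <- c("col_facets", "row_facets", "colour")
  n <- min(length(group_vars), length(roles))
  setNames(as.list(group_vars[seq_len(n)]), roles[seq_len(n)])
}

# Toy stand-in for the summarized data; the values are made up
df <- data.frame(
  Degree = c("Masters", "PhD", "Masters", "PhD"),
  Work = c("Academia", "Academia", "Industry", "Industry"),
  mean_salary = c(85, 90, 91, 93)
)

grid_spec <- extract_plot_spec(
  ggplot(df, aes(x = Degree, y = mean_salary)) +
    geom_point() +
    facet_grid(Degree ~ Work)
)

fallback <- default_roles(c("Degree", "Work"))
```

Reading from the plot object rather than the code means `facet_grid(Degree ~ Work)` and `facet_grid(rows = vars(Degree), cols = vars(Work))` should end up with the same spec.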

jhofman commented 3 years ago

this is awesome @sharlagelfand!

to make next steps concrete, let's try to use this to add a ggplot command to get a version of the degree+work plot used in the paper, with workplace as facet and degree as the x variable

sharlagelfand commented 3 years ago

Looks like this example works pretty well right out of the box!

https://user-images.githubusercontent.com/15895337/118675960-05e2ef00-b7c9-11eb-9179-d818afdb2952.mov

One thing that's off is the X axis labels - @giorgi-ghviniashvili, from the specs it looks like all of the X values, X breaks, and X labels are 1,2 - do you know why it's not lining up?

giorgi-ghviniashvili commented 3 years ago

@sharlagelfand The "Plot Salary within each group" spec does not have scale.domain.

Please add it and you will get this:

[two images]

sharlagelfand commented 3 years ago

Ah thanks @giorgi-ghviniashvili!

sharlagelfand commented 3 years ago

Quite happy to say I have these examples working!!

Degree on the x, Work as facets

library(dplyr)
library(ggplot2)
library(datamations)

"small_salary %>%
    group_by(Degree, Work) %>%
    summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
    ggplot(aes(x = Degree, y = mean_salary)) +
    geom_point() +
    facet_grid(~ Work)" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/15895337/118870106-2dfb4c80-b8b4-11eb-96fb-04c88705122f.mov

Degree on the x, Work and Degree as facets

"small_salary %>%
  group_by(Degree, Work) %>%
  summarize(mean_salary = mean(Salary, na.rm = TRUE)) %>%
  ggplot(aes(x = Degree, y = mean_salary)) +
  geom_point() +
  facet_grid(Degree ~ Work)" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/15895337/118872511-a2cf8600-b8b6-11eb-99be-c94f22fd284f.mov

This one looks a bit odd; we may want to fine-tune the color and location of the infogrid a bit, but it does work!

jhofman commented 3 years ago

We have a great start here. @sharlagelfand will create a new issue to make sure we explain the limitations of this functionality and/or pop up corresponding warnings or error messages.

sharlagelfand commented 3 years ago

#86 is the issue for explaining these limitations, so closing this now!