microsoft / datamations

https://microsoft.github.io/datamations/

Handle derivations from multiple variables #62

Closed sharlagelfand closed 2 years ago

sharlagelfand commented 3 years ago

From @jhofman's example in #55, we don't have a way right now to handle a variable derived from multiple other variables, e.g.

plyr::baseball %>%
  filter(id == "ruthba01" | id == "cobbty01" | id == "hornsro01") %>%
  group_by(id, team, year) %>%
  summarize(ba = h / ab)

The summarize step computes ba which is derived from both h and ab, so we'd need some way to show distributions of both of those variables and how they interact to create the single derived variable.

jhofman commented 3 years ago

This is a great point, and probably higher priority than the multiple group-by and summarize steps. It applies for mutate on multiple columns as well.

We’ll probably have to discuss and sketch this out interactively, but one idea is that we could show a scatter plot of hits vs. at bats where each row is a point, then summarize to the result (batting average = hits / at bats) as a vertical jitter plot of the results.

It’s limited to only two variables and might not be clear that division is happening (instead of other possible transformations), but could be worth a shot?

sharlagelfand commented 3 years ago

I think having a scatter plot makes sense, for sure!

The steps could go:

initial data

Screen Shot 2021-05-26 at 12 13 35 PM

grouped by team

doing team first because it looks "better" on the columns than ID does... we'll have a mechanism for controlling which variable goes in columns versus in rows, but should think about how we can decide which to show first (shows columns first right now)

Screen Shot 2021-05-26 at 12 13 42 PM

grouped by team, id

Screen Shot 2021-05-26 at 12 13 48 PM

grouped by team, id, year

missing an infogrid here, but my thought is that year would be colour ONLY and not on the x-axis (as we normally do with the third grouping variable), since the x-axis is "taken" by one of the variables

scatterplot of both variables

Screen Shot 2021-05-26 at 12 14 06 PM

plot of derived variable

Screen Shot 2021-05-26 at 12 16 17 PM

median of derived variable

optional, but the final step in #55 - and color fades off here because the last grouping variable is "dropped" off

Screen Shot 2021-05-26 at 12 16 24 PM
jhofman commented 3 years ago

this is a neat idea, just recording some notes from our call.

if we go this route, is there any way to indicate what type of function is being executed, for instance the difference between addition or division? And what about when this happens in a summarize vs. a mutate? (Probably we implicitly show the mutate first and then the aggregation?) And what about more than two variables?

a. mutate(z = x + y)
b. mutate(z = x / y)
c. summarize(z = sum(x + y))
d. summarize(z = sum(x / y))
e. summarize(z = sum(x) / sum(y))
f. mutate(d = a + b + c)

question: c. could equivalently be written as summarize(z = sum(x) + sum(y)); would we want to show that as visually different even though the result is the same?
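A quick check of the equivalence raised in this question, as a minimal base R sketch. The vectors `x` and `y` hold made-up values chosen only for illustration; they are not project data.

```r
# Toy vectors (assumed values, purely illustrative)
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Case c as written: sum the element-wise sums
sum_of_sums <- sum(x + y)

# The rewritten form: add the two column sums
sums_added <- sum(x) + sum(y)

# Both routes give the same number, so any visual difference between them
# would reflect the order of operations, not the result
same_result <- isTRUE(all.equal(sum_of_sums, sums_added))
```

The two forms differ only in which intermediate values exist along the way, which is the part a datamation would have to decide whether to show.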

jhofman commented 3 years ago

@dggoldst we chatted about this a bit but need to brainstorm more. let's discuss.

Ideas include:

jhofman commented 3 years ago

@dggoldst makes the point that even a one-variable transformation (like C to F degrees) could be difficult to understand

@sharlagelfand points out that we'd need to know the order of operations "inside" of tidyverse functions to distinguish between different mutate or summarize operations (e.g., sum(x) / sum(y) vs. sum(x / y)). there must be an existing way to get this, as the parser needs to know it.

@giorgi-ghviniashvili mentions that it would be cool to think about more complicated statistical operations (like a regression); probably a separate issue, maybe a custom animation

do we know of any existing work that looks at animating basic algebraic operations (addition, multiplication, division, subtraction)? even if we did, there are plenty of operations that wouldn't have such a custom animation, but maybe it's worth having the most common ones? certainly people will have user-defined functions that we have no chance of animating.

another option would be to think of these operations as having "sub steps", with intermediate data frames, and updating the table to show those sub steps and pointing the user's attention to this while the plot might remain the same for those sub-steps. so for instance, summarize(z = sum(x) / sum(y)) would have a two-row intermediate data frame, but summarize(z = sum(x / y)) an N-row intermediate data frame. the question is what the plot would be doing during this time. :)
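To make the "sub steps" idea concrete, here is a small base R sketch of the intermediate data frames for the two cases mentioned. The data values and the names `steps_e` and `steps_d` are assumptions for illustration, not anything datamations produces.

```r
# Toy data (assumed values)
x <- c(2, 4, 6)
y <- c(1, 2, 4)

# summarize(z = sum(x) / sum(y)): the intermediate frame has two rows,
# one per column total, and the final step divides them
steps_e <- data.frame(step = c("sum(x)", "sum(y)"),
                      value = c(sum(x), sum(y)))
z_e <- steps_e$value[1] / steps_e$value[2]

# summarize(z = sum(x / y)): the intermediate frame has N rows,
# one ratio per input row, and the final step adds them up
steps_d <- data.frame(x = x, y = y, ratio = x / y)
z_d <- sum(steps_d$ratio)
```

With these values the two pipelines also produce different results (12/7 versus 5.5), so the intermediate tables differ in both shape and outcome.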

to take the opposite view: when you do have more than two input variables or a custom function (or one without a custom animation), what do you show on the plot?

maybe you always show the input and output to a function but never try to visualize the internals of the function. so for instance, w/ C to F degrees, we wouldn't show multiplication by 9/5 and addition of 32. we'd just show that some input column (C) becomes some output column (F), and annotate the operation that got us from one to the other.

could any "complex" operation be broken down into a set of simpler ones, like summarize(z = mean(x + y)) is actually mutate(z' = x + y) %>% summarize(z = mean(z'))?
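A rough base R sketch of that decomposition, under the assumption of a small grouped toy data frame (the column names `g`, `x`, `y` and `z_prime` are hypothetical). It only checks that the one-step and two-step versions agree; it says nothing about how datamations would parse them.

```r
# Toy grouped data (assumed values)
df <- data.frame(g = c("a", "a", "b", "b"),
                 x = c(1, 2, 3, 4),
                 y = c(10, 20, 30, 40))

# One step: the equivalent of group_by(g) %>% summarize(z = mean(x + y))
one_step <- tapply(df$x + df$y, df$g, mean)

# Two steps: mutate(z_prime = x + y), then summarize(z = mean(z_prime))
df$z_prime <- df$x + df$y
two_step <- tapply(df$z_prime, df$g, mean)

# The simpler two-step pipeline reproduces the "complex" one-step result
decomposition_matches <- isTRUE(all.equal(one_step, two_step))
```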

test for ourselves: if we saw addition vs. division on an x-y scatter, would we even know the difference?

z-scores are another example that are super complicated to show internals of group_by(state) %>% mutate(z = (x - mean(x)) / sd(x))
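The internals of the grouped z-score can be spelled out as explicit intermediate columns. This is a base R sketch with made-up data; `group_mean` and `group_sd` are hypothetical column names used to expose the hidden steps.

```r
# Toy data (assumed values)
df <- data.frame(state = c("NY", "NY", "NY", "CA", "CA", "CA"),
                 x = c(1, 2, 3, 10, 20, 30))

# Hidden step 1: each row needs its group's mean
df$group_mean <- ave(df$x, df$state, FUN = mean)

# Hidden step 2: each row needs its group's standard deviation
df$group_sd <- ave(df$x, df$state, FUN = sd)

# Final step: center and scale within the group,
# i.e. group_by(state) %>% mutate(z = (x - mean(x)) / sd(x))
df$z <- (df$x - df$group_mean) / df$group_sd
```

Even this small case has two group-level summaries feeding a row-level transform, which is what makes the internals hard to animate.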

ran out of time, but we're going to try some toy examples. ideas: batting data, weather data, standardize z-score by group

sharlagelfand commented 2 years ago

Have been thinking a bit about this "black box" idea, how to show the variables "flowing in" to a transformation, the transformation itself, then the output. We could start with a generic black box just with variables:

Datamations-52

But I'm not entirely sure if it's clear how this is working - for example, in a group, there would be a vector of X values and a vector of Y values, e.g. x = [1, 2] and y = [3, 4], which then is translated via z = sum(x/y) = sum([1/3, 2/4]) and out comes the result, z = 0.83.

So maybe after the "generic" black box above, we could show an example with one group's data (but not sure how this would scale to more variables, a lot of data, etc....)

Datamations-53 3

These could come after the distributions of x and y (if there's only two variables, we could still show them - or if there's more, after the group by steps) and afterwards, show the distribution of z and continue along with the actual datamation.

Just a start of thoughts!

sharlagelfand commented 2 years ago

And to address the idea of "test for ourselves: if we saw addition vs. division on an x-y scatter, would we even know the difference?" - here's an x and a y:

Screen Shot 2021-08-04 at 11 41 30 AM

And some transformations of them on the y axis - it's all the basics (addition, subtraction, multiplication, division) - can we distinguish them? How quickly / easily?

Screen Shot 2021-08-04 at 11 42 18 AM
giorgi-ghviniashvili commented 2 years ago

I think that handling these kinds of transformations with scatterplots or bar/line charts will make them more ambiguous than just annotating them and showing only the input and output.

For me, a proper way of animating these kinds of derivations is to build custom graphs that clearly explain what's going on.

Examples:

image

https://www.intmath.com/cg3/sincos-d3.php

https://twitter.com/freyaholmer/status/1202648662049996801

https://bl.ocks.org/patrickwalls/f8c5b58fab206f1a6ce8

I don't know how to generalize all the possible custom functions though.

jhofman commented 2 years ago

@giorgi-ghviniashvili will work on visually prototyping the addition vs. division test case that @sharlagelfand mentioned in https://github.com/microsoft/datamations/issues/62#issuecomment-892766005

giorgi-ghviniashvili commented 2 years ago

DIVISION:

I made an example converting car MPG to liters per 100 km, a very common calculation for car owners. The calculation is very simple: divide ~235 by mpg. Here is the gif. Note that the two quantities are inversely related.
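The conversion itself can be sketched in a few lines of base R. The constant 235.215 is the usual factor for US miles per gallon; the function name and the sample `mpg` values are assumptions for illustration.

```r
# 1 mpg (US) corresponds to about 235.215 liters per 100 km
MPG_TO_L_PER_100KM <- 235.215

# One-variable transformation of the form z = a / x
mpg_to_l_per_100km <- function(mpg) MPG_TO_L_PER_100KM / mpg

# Assumed sample values, evenly spaced in mpg
mpg <- c(10, 20, 30, 40)
l_per_100km <- mpg_to_l_per_100km(mpg)
```

Because the relationship is an inverse one, evenly spaced mpg values map to unevenly spaced consumption values, which is the effect the animation has to convey.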

gemini

vega

giorgi-ghviniashvili commented 2 years ago

ADDITION:

x1 = ax + b, where a = 1; b = 4; vega

adding

x1 = ax + b, where a = 3; b = 4; adding

jhofman commented 2 years ago

Let's try the bmi example.

You're given weight in lbs, height in feet, and want to compute body mass index, which is (mass in kg)/(height in m)^2.

So we need to convert mass = weight / 2.2, meters = feet / 3.28 (both approximate), and then do the funky formula above.
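A minimal base R sketch of the BMI pipeline as separate steps, assuming the rounded conversion factors of 2.2 pounds per kilogram and 3.28 feet per meter. The function name `bmi_from_imperial` is hypothetical.

```r
# Rounded conversion factors (approximations)
LB_PER_KG <- 2.2
FT_PER_M <- 3.28

bmi_from_imperial <- function(weight_lb, height_ft) {
  mass_kg <- weight_lb / LB_PER_KG    # rescale step 1: lbs to kg
  height_m <- height_ft / FT_PER_M    # rescale step 2: feet to meters
  mass_kg / height_m^2                # two-variable step: kg / m^2
}

# Example: 220 lb and 6.56 ft are roughly 100 kg and 2 m
example_bmi <- bmi_from_imperial(220, 6.56)
```

The two rescalings are one-variable transformations, and only the last line combines both inputs.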

@giorgi-ghviniashvili will prototype visually to see what this might look like. presumably it starts with a scatter plot of weight and height and finishes with another plot, possibly a histogram of bmis.

giorgi-ghviniashvili commented 2 years ago

Hi, done the bmi example.

source spec: scatter

target spec: histogram

946c3a31-8444-430f-bfc3-f43d9e772ccc

Explanation:

Looks good to me. What do you think?

jhofman commented 2 years ago

Let's add some steps here:

  1. First we'll see the rescalings of each variable: lbs goes to kg, ft goes to meters.
  2. Then we'll see id on the y axis and bmi on the x axis as points.
  3. Then they'll get binned and stacked, similar to a quantile dot plot.
  4. Finally we'll switch to bars instead of stacked points (although maybe we just omit this step?)
giorgi-ghviniashvili commented 2 years ago

Here it is:

https://user-images.githubusercontent.com/6615532/146373213-51b36c2c-dedd-43c2-ba28-d1848837fe96.mov

jhofman commented 2 years ago

@giorgi-ghviniashvili: let's see what options there are in terms of rescaling the axes before translating the points.

i'm also curious to compare this to something where we show side-by-side plots, original on the left and transformed on the right. we can discuss more about that next time.

jhofman commented 2 years ago

@willdebras, can you start to look into parsing mutate calls, and more generally parsing a mutate or a summarize to figure out which variables are present and whether there are 1, 2, 3, or more than 3 such variables?
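One way this kind of parsing could look in base R is sketched below. The helper `parse_derivation` is hypothetical and handles only a single named argument; it is not the datamations implementation. It relies on `all.vars()`, which returns the variable names in an expression while skipping function names.

```r
# Hypothetical helper: inspect a captured mutate()/summarize() call
parse_derivation <- function(call) {
  args <- as.list(call)[-1]        # drop the verb itself
  expr <- args[[1]]                # right-hand side of the first argument
  inputs <- all.vars(expr)         # variables used, excluding function names
  list(verb = as.character(call[[1]]),
       new_var = names(args)[1],
       inputs = inputs,
       n_inputs = length(inputs))
}

# Two-variable derivation from the top of this issue
parsed_ba <- parse_derivation(quote(summarize(ba = h / ab)))

# One-variable derivation
parsed_log <- parse_derivation(quote(mutate(logSal = log(Salary))))
```

Note that `all.vars()` reports every symbol it finds, so an expression such as `Salary * runif(nrow(small_salary))` would also list `small_salary` as an input; a real parser would need to filter against the data frame's column names.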

from there we can start with an animation for 1 variable that does the simple transformations giorgi shows here:

https://github.com/microsoft/datamations/issues/62#issuecomment-982411233 https://github.com/microsoft/datamations/issues/62#issuecomment-982394685

it'll be interesting to think about when we don't have an explicit scatter plot (as in the above examples).

would be fun to demonstrate log transformations, squaring, etc., which come up often.

willdebras commented 2 years ago

With some work I have generated some simple parsing where we can generate mappings for mutate that pluck out a mutation function/expression and the name of the variable created in the mutation step. Still need to think through how to handle multiple mutations, but we should be able to support single mutations rather easily with a spec step of variables and values associated with gemini IDs and just transitioning on them as we do with any steps. The generation of data states works entirely for these types of calculations now, so this will work easily out of the box with not a ton of work on Giorgi's side.

This week I will work on parsing for multiple mutations or mutations that involve multiple other variables to pass to mappings/specs, so we can think through how the UI should look for these.

Need to conceptualize in the UI and specs:

willdebras commented 2 years ago

Working on producing specs for mutation and have some decent progress. I have uploaded a set of specs that includes one vega state for a mutation step.

They are saved here: https://github.com/microsoft/datamations/tree/specs_mutate/sandbox/mutations

They work pretty seamlessly in our current codebase and I have it set up to account for the ordering of groupings, i.e. plotting differently if the mutation occurs in or out of group.

Currently for single mutations following a grouping:

"small_salary %>%
  group_by(Degree) %>%
  mutate(logSal = log(Salary)) %>%
  summarise(mean = mean(logSal))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/37971596/155918899-6b0d0d84-46f3-454e-a27f-2e39f958dfe2.mp4

For single mutations before a grouping:

"small_salary %>%
  mutate(logSal = log(Salary)) %>%
  group_by(Degree) %>%
  summarise(mean = mean(logSal))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/37971596/155918908-d3ae76e8-fed6-40ac-a65b-4731dc208f67.mp4

I think it would be good to hop on a call to discuss what we want these mutations to look like so we can dive more into the specs, @giorgi-ghviniashvili and @jhofman this week.

Right now it just plots the new variable without a data state for the variable it is a function of. I think the next step here would be to detect the variable the new variable is a function of (already have some code built in for this detection), plot that old variable, then plot the new variable?

In this example,

In the meantime, I need to iron out some bugs. This work parses mutations built from various function calls pretty well, but has some issues with mathematical formulas. I should have additional progress on this later this week.

jhofman commented 2 years ago

this is awesome!

for the groupby-mutate-summarize, let's add in a spec that shows salary before the log transform and let's remove the redundant spec that shows log salary as a scatter twice in a row.

for the mutate-groupby-summarize, let's add in a spec that shows salary before the log transform and let's remove the spec that shows the second info grid of points being grouped, and instead jump straight to the scatterplot being grouped.

jhofman commented 2 years ago

@giorgi-ghviniashvili, for two-variable mutates, can you make it so that it's similar to the bmi example except points just get vertically stacked in one group with jitter?

that would make it natural to jump to group-bys and whatnot from there.

so in the video linked above, that would mean going from 0:05 to 0:07 but having the final frame be rotated 90 degrees w/ bmi on the y-axis and then a jittered scatter centered around one point on the x axis

willdebras commented 2 years ago

This has been updated now, so that the following occurs:

  1. All mutate calls create two sets of specs: one for a detected "basis" variable (or expression) from which the new variable is created, and one for the new variable itself, so that we can see the transformation into the new variable
  2. Mutates occurring directly before a groupby cause the exclusion of the group_by infogrid, so it just transitions the plotted variable into group facets
  3. Mutates occurring directly before a summarize exclude the plotting of the variable (since mutate is already doing that).

Here is an example of a basic mutation:

"small_salary %>%
  mutate(salaryLogged = log(Salary)) %>%
  summarise(mean = mean(salaryLogged))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/37971596/156951021-680e492a-3175-47fd-91eb-e27474bb895d.mp4

Here is an example of a mutation occurring before a group_by statement:

"small_salary %>%
  mutate(salaryLogged = log(Salary)) %>%
  group_by(Degree) %>%
  summarise(mean = mean(salaryLogged))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/37971596/156951097-e093c633-a0a3-4b6a-a2dc-90f06b758d55.mp4

Here is an example of a mutation occurring after a group_by before a summarize statement:

"small_salary %>% group_by(Degree) %>% mutate(salaryLogged = log(Salary)) %>% summarise(mean = mean(salaryLogged))" %>% datamation_sanddance()

https://user-images.githubusercontent.com/37971596/156951144-f86b608c-db78-4f53-8a02-646556515025.mp4

Also added support this weekend for expressions that require multiple variables. Still can think more about if we want to plot these in different ways, but here is an example of a simple interaction between two variables:

"small_salary %>%
  mutate(new_salary = Salary * runif(nrow(small_salary))) %>%
  group_by(Degree) %>%
  summarise(mean = mean(new_salary))" %>%
  datamation_sanddance()

https://user-images.githubusercontent.com/37971596/156951338-51f698bd-bc37-4ce4-a11f-c0319b9df021.mp4

I guess we should think about whether these transitions are clear. In the log example it is a little hard to tell that much is happening, since the values are not drastically changing in relation to each other; just the scale on the axis is shifting a lot. Do we want to fix the range (of the y axis) in some way?

jhofman commented 2 years ago

This looks great overall.

In the second example (mutation_before_group), it looks like there's some shift in the x axis ticks for Master and PhD. @willdebras, can you check if this is on the R side x domain settings? Ideally we want the x axis ticks to just remain in one place.

The log example is interesting but I agree that it's a bit hard to track. Maybe for that a scale_y_log10() would be more clear, because you'd see the panel grid with a different spacing (logarithmic instead of linear). Let's try that? Curious to see how Gemini handles the scale transition. @willdebras and @giorgi-ghviniashvili, can you coordinate on this?

Unclear what we'll do for other transforms that don't have corresponding special axis types, but let's put that on the back burner.

The log operation gives me an idea that we might create a teaching example out of it, specifically that log-average-exponentiate acts a lot like a median (or geometric mean). Let's all think a bit about an example that shows this---maybe some income data?
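A tiny base R sketch of the log-average-exponentiate idea, using made-up values with one large outlier. The numbers are assumptions chosen so the arithmetic is easy to follow, not real income data.

```r
# Toy "income" values with one large outlier (assumed)
income <- c(1, 10, 100)

# Plain average: pulled up by the outlier
arithmetic_mean <- mean(income)

# Log, average, then undo the log: this is the geometric mean
geometric_mean <- 10^mean(log10(income))

# Median for comparison
income_median <- median(income)
```

For these values the geometric mean and the median coincide at 10 while the arithmetic mean is 37. For real data the geometric mean and median will generally only be close, not equal.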

One final thing, the third video (group_before_mutation) has something interesting going on from 8 seconds to 10 seconds: there's a shift in the x-axis jitter for points before and after log scaling. You could imagine that we fix the x position and only allow y to move from the transform. I think this would be easier to follow but could leave us with a hard to read situation after the transform if there's lots of overlap. One idea would be to just make these sequential, but let's hold off on that for now.

giorgi-ghviniashvili commented 2 years ago

@giorgi-ghviniashvili, for two-variable mutates, can you make it so that it's similar to the bmi example except points just get vertically stacked in one group with jitter?

@jhofman here is the video. Moved bmi to y axis and jittered.

https://user-images.githubusercontent.com/6615532/157266897-3789a606-52fb-4b68-bca7-61d8446b1592.mov

willdebras commented 2 years ago

In the second example (mutation_before_group), it looks like there's some shift in the x axis ticks for Master and PhD

Checking in on this issue:

It looks like this actually occurs currently in our normal summarize specs as well, but it also looks like the domain is correct for these. They scale from 0 to 3 for both states. For example, this call:

"small_salary %>%
  group_by(Degree) %>%
  summarise(mean = mean(Salary))" %>%
  datamation_sanddance()

We get two states that look like this:

image

image

You can see in the second frame these x labels (Masters and PhD) are closer together.

It looks like both of these should be correct in the specs I pass, though. They have a value of 1 and 2 with a domain of [0,3]:

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-mean-R.json#L809

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-mean-R.json#L809

@giorgi-ghviniashvili, could you take a look into these (or we can just discuss on our call tomorrow)? Is there anything you can think of that would make these wider on the first frame? Is it due to automatic spacing to fit all the dots or jitter or something like that?

giorgi-ghviniashvili commented 2 years ago

@willdebras I see what the issue is. The main branch is a bit old. Lots of changes were made in the other branch, which has not been merged into main yet. The latest changes are in the custom_animations_hack_before branch, and if I test the code in that branch, it works correctly:

https://user-images.githubusercontent.com/6615532/157636862-74e486b0-fe56-41d8-b21d-742939f258cf.mov

So I think it will be automatically fixed after we merge all the work done for custom animations.

jhofman commented 2 years ago

this looks good but we still need to fix the x axis labels from 7 seconds to 10 seconds, added issue #143 to track this separately.

there's also something off about the final means---the bars look right (at 90ish and 88ish), but the points are off (82 and 84). maybe the summary values are being computed incorrectly? and after that the point at 84 goes up to 94 (instead of zooming).

looks like this was a bad hand-coded spec, should be fixed now.

let's continue to work on the log transform which we verified should work w/ vanilla gemini.

willdebras commented 2 years ago

Confirmed in our check-in that the above were bad hand coded specs.

I have also added some specs where a log scale is defined here for the second frame of the mutation:

https://github.com/microsoft/datamations/blob/scales-transform/sandbox/mutations/mutate-scale-R.json#L1327

I have also added code that detects the mutation basis function (i.e. an explicit function call like log, numeric operators like * or +, or just raw numeric values). It conditionally changes the scale type when log is detected. It shouldn't be hard to accommodate other transforms though, e.g. if we wanted a power or sqrt scale when we detect ^, sqrt or 1/.
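A hedged sketch of what detecting the basis function and picking a scale type might look like in base R. The helpers `basis_function` and `scale_type_for` are hypothetical, and the mapping to the scale type names "log", "sqrt", "pow" and "linear" is an assumption about the target spec, not the code in the linked branch.

```r
# Hypothetical helper: name of the outermost function or operator, if any
basis_function <- function(expr) {
  if (is.call(expr)) as.character(expr[[1]]) else NA_character_
}

# Hypothetical mapping from the detected basis to an axis scale type
scale_type_for <- function(expr) {
  fun <- basis_function(expr)
  if (is.na(fun)) return("linear")   # bare variable or raw numeric value
  switch(fun,
         log = ,
         log2 = ,
         log10 = "log",              # explicit log calls get a log scale
         sqrt = "sqrt",
         "^" = "pow",
         "linear")                   # +, -, *, / and anything unrecognised
}

log_scale <- scale_type_for(quote(log(Salary)))
pow_scale <- scale_type_for(quote(Salary^2))
```

This only looks at the outermost call, so a nested expression such as `log(Salary) + 1` would be classified by `+` and fall back to a linear scale.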

I am running into the issues we experienced on call though where we couldn't get the output to have that transformed scale when we manually added the scale type parameter.

@giorgi-ghviniashvili, I was hoping you could take a look at these specs I linked and let me know if you see any issues with them. The line of code where the scale type is defined is linked above.

giorgi-ghviniashvili commented 2 years ago

@willdebras tested this spec.

For the log scales, when having very small domain, it looks like it is not logarithmic.

This is original spec, where y.scale.domain is very small:

image

If we remove the domain:

image

I think that the log scale's domain should be big enough to see the difference between axis ticks.

This is jittered log in datamations:

image

Let me know how you think it should look within this small domain.

jhofman commented 2 years ago

Let's test this by adding an outlier point to the salary data: someone who makes 150.

# one animation where we look at the mean of raw values
small_salary %>%
  summarize(mean_salary = mean(Salary))

# another animation where we look at the mean of logged values
small_salary %>%
  mutate(log_salary = log10(Salary)) %>%
  summarize(mean_log_salary = mean(log_salary))

@willdebras, can you generate some specs and send them to @giorgi-ghviniashvili?

if we really want to stress test things, we can exponentiate back after the summarize in the second pipe:

... %>%
  mutate(exp_mean_log_salary = 10^mean_log_salary)  # inverse of log10()

An interesting point here is that there's nothing here that says "go back to a linear scale". This can be on the back burner but just noting it here.

willdebras commented 2 years ago

I'm realizing now I'm not sure the log scale on the chart will have an effect on the logged values.

Take this outlier example:

'small_salary %>%
  mutate(log_salary = log10(Salary)) %>%
  summarize(mean_log_salary = mean(log_salary))' %>% datamation_sanddance()

Outlier here:

image

Logged scale and logged values:

image

since the values are logged, they all stay relatively close, despite the outlier. Vega will not render the differences here.

Even log10(150_000) is ~5. We could keep the values, then log the scale, and then mutate the values if we want to show an effect? That might add confusion though.

jhofman commented 2 years ago

@willdebras, let's generate something with a true log-normal distribution and see how log(x) looks.

let's also look at other one variable transformations, for instance: squaring something (x^2), inverting something (1/x) and a linear transformation (ax + b). and what happens if these are within a summarize, e.g., summarize(mean_x_sq = mean(x^2))?

then we can look at two variable transformations and see if they i) look reasonable themselves and ii) can be differentiated between each other:

a. mutate(z = x + y)
b. mutate(z = x / y)
c. summarize(z = sum(x + y))
d. summarize(z = sum(x / y))
e. summarize(z = sum(x) / sum(y))

note that c could also be sum(x) + sum(y), curious if we end up parsing those similarly

p.s. what happens when two-variable mutates are within a summarize? :)

jhofman commented 2 years ago

@giorgi-ghviniashvili, can you try the example at the very top here to see how this all works with multiple facets?

can you try both the summarize and mutate versions of that? for the summarize it should look like sharla's final frame here.

plyr::baseball %>%
  filter(id == "ruthba01" | id == "cobbty01" | id == "hornsro01") %>%
  group_by(id, team, year) %>%
  summarize(ba = h / ab)

for the mutate it should look like the next-to-last one.

plyr::baseball %>%
  filter(id == "ruthba01" | id == "cobbty01" | id == "hornsro01") %>%
  group_by(id, team, year) %>%
  mutate(ba = h / ab)

also can you sync with @willdebras in case the multi-variable versions above need tweaking. in particular, if we don't have any summary operation, let's end multiple variable mutates with a beeswarm all in one column instead of going to binned/stacked (so basically the 5 second mark here)

p.s. you can export the baseball data as follows in R: plyr::baseball %>% write.csv('baseball.csv')

willdebras commented 2 years ago

I added a base vignette with some examples above. Here is a quick link to view it, so you don't have to pull -> checkout branch -> knit:

https://rpubs.com/willdebras/datamations_mutations

In short,

Also, a random aside: I am working on adding a new UI to the demo site to include the ability to add mutations and filter calls. If I have time in the next 7 days after finalizing mutations, updating vignettes and docs, etc., I will probably try to get this tied to the backend and online:

image

giorgi-ghviniashvili commented 2 years ago

@giorgi-ghviniashvili, can you try the example at the very top here to see how this all works with multiple facets?

@willdebras is it possible to generate specs for these from datamations instead of doing it manually?

image
jhofman commented 2 years ago

seems like the summarize with two variables is breaking the baseball example for @giorgi-ghviniashvili . we can change to a mutate for the two variables, and then possibly a summarize afterwards with just one variable. that should do the trick?

also, @giorgi-ghviniashvili can you take a look at the multivariate examples that @willdebras created here? right now they're only showing the first variable (salary in this case) and then the final transformed variable (newVar). (this may simply be a case of @willdebras updating the spec that gets passed?)

the new desired behavior should be this: show all the points in an infogrid, then go to a scatter plot with salary on the x and salaryTwo on the y, then have the transformation to show newVar stacked in a single column jittered, then the summarize.

side note, @willdebras needs to merge in some recent changes to the branch that has the rpub doc, hopefully we'll see axis ticks changing more smoothly, etc.

willdebras commented 2 years ago

Working on updating specs today for @giorgi-ghviniashvili with multiple variable values on that first transformation frame.

For now, I have merged in changes from main and updated the examples vignette here:

https://rpubs.com/willdebras/datamations_mutations

willdebras commented 2 years ago

@jhofman, @giorgi-ghviniashvili, here is an example of the multiple variables:

https://user-images.githubusercontent.com/37971596/160218250-3c15f526-8faf-40e6-9a13-3c6596177a6e.mp4

small_salary <- dplyr::mutate(
  small_salary,
  SalaryTwo = runif(nrow(small_salary), min = 60, max = 110)
)
"small_salary %>%
  mutate(newVar = Salary + SalaryTwo) %>%
  group_by(Degree) %>%
  summarize(mean = mean(newVar))" %>%
  datamation_sanddance()

It has some funky transitions, but I think is a good start. I output the specs, @giorgi-ghviniashvili if you have any suggestions of edits needed for smooth transitions:

https://github.com/microsoft/datamations/blob/specs_mutate/sandbox/mutations/mutation_specs_multiple_variables-R.json

giorgi-ghviniashvili commented 2 years ago

@willdebras I updated the spec with these:

https://user-images.githubusercontent.com/6615532/160590263-30387d3d-7afb-4221-bf0f-2980f05c06e3.mov

giorgi-ghviniashvili commented 2 years ago

@willdebras I am on the main branch and this error shows. Still not merged?

image
jhofman commented 2 years ago

this is great progress!

let's make a few tweaks:

@willdebras, can you add some multiple variable examples to the rpub notebook?

side note: multiple mutations will only show the first, others will be hidden. but doesn't throw an error.

on multivariable mutations inside a summarize, could we hack it by calling prep_specs_mutate and then summarizing on the resulting variable?

note: mutates and summarizes that create new variables and then use them in the same call are probably trouble here, e.g. summarize(n = n(), se = sd(x) / sqrt(n)), but this is okay for now.

giorgi-ghviniashvili commented 2 years ago

@willdebras , I fixed the jittered spec. Updated d3-force simulation parameters to spread out circles more. Merged to main.

image
willdebras commented 2 years ago

Awesome, this is great. It is reflected in the most updated examples:

https://user-images.githubusercontent.com/37971596/160846458-69d84ae7-b56f-480c-b180-c27567ae1887.mp4

I have yet to add the correct gemini ids to the first data spec (this is actually somewhat tough on my end, as it requires knowledge of the mutations and outcome variables defined in the mapping to then sort the data and assign ids). I did however fix the axes definitions and remove splitField and xAxisLabels.

willdebras commented 2 years ago

This file has been updated with all we talked about regarding spacing adjustments, additional tweaks to the title that @jhofman mentions above, and the mentioned warning on multiple new variables defined:

https://rpubs.com/willdebras/mutations

jhofman commented 2 years ago

this all looks good. on the limitations of a single mutate and no mutation inside a summarize: the warning in the vignette looks good, but perhaps we should make this an error instead, and add a similar error for a mutate inside a summarize if possible.

willdebras commented 2 years ago

Done. Errors for expressions in summary functions:

Unable to parse the summary function. \n Error is likely due to passing a mutation in the summary function. \n Consider adding a mutate step above and then calling the summary function on the output.

Also errors instead of warns for multiple mutations:

Datamations currently only supports a single mutation call for visualization. Edit your pipeline to only include a single mutation necessary for the visualization.