microsoft / datamations

https://microsoft.github.io/datamations/

Add visualization of binary variables to existing Datamations demo #138

Closed msftbozo closed 2 years ago

msftbozo commented 2 years ago

Since we completed the visualization of binary variables, it would be great to add a dataset with binary variables alongside the existing demo datasets (penguins/salaries) and datamate it.

jhofman commented 2 years ago

Let's see if we can use what's in #98 to make a nice example here using the baseball data.

If it looks good we can add it to the docs or a vignette or something.

willdebras commented 2 years ago

I will have this prototyped to show Thursday.

msftbozo commented 2 years ago

that's fantastic thanks @willdebras

willdebras commented 2 years ago

Had to make some minor bugfixes to get binary variables working. Across environments (the renv environment, containerized 4.1 with fresh installs, and 4.0.5 with fresh installs), datamation_sanddance() was erroring for binary variables because of a util call that generates tooltip specs. This call expected gemini IDs even when it was passed already-aggregated data. The function drops non-grouping columns to calculate unique fields, but it tried to drop gemini_ids even when that column didn't exist.

This commit removes this field only if it exists:

https://github.com/microsoft/datamations/commit/924ce7d79731e655be2ec47a94273dc52e71490b
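
The guard described here can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `drop_gemini_ids` and the toy data frames are hypothetical, and it assumes dplyr >= 1.0.0 for `any_of()`.

```r
library(dplyr)

# Hypothetical helper (not the package's actual function): drop the id column
# only when it is present, so already-aggregated data passes through untouched.
drop_gemini_ids <- function(data, id_col = "gemini_ids") {
  data %>% select(-any_of(id_col))
}

# Toy inputs: one already aggregated (no ids), one still carrying ids.
aggregated <- data.frame(player = c("Derek Jeter", "David Justice"), n = c(48, 411))
with_ids <- data.frame(player = c("Derek Jeter", "David Justice"), gemini_ids = c(1L, 2L))

names(drop_gemini_ids(aggregated))  # columns unchanged, no error
names(drop_gemini_ids(with_ids))    # gemini_ids removed
```

Unlike `select(-gemini_ids)`, which errors when the column is missing, `any_of()` silently skips names that are not present.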

A basic datamation with this binary variable from the following call:

"df %>%
  group_by(player, year) %>%
  summarise(mean = mean(is_hit))"  %>% datamation_sanddance()

https://user-images.githubusercontent.com/37971596/152364946-59bbb032-8276-4ad0-a182-c5395df2f578.mp4

I'll provide more examples of variants of this in the next day or two.

willdebras commented 2 years ago

Also, @jhofman/@giorgi-ghviniashvili, do you have suggestions on best practice for showcasing/recording these? This one is pretty low res and has my cursor over top of it :)

jhofman commented 2 years ago

@willdebras: did you record with the native screencapture tool? that usually works pretty well for me, but @giorgi-ghviniashvili may have some other tricks up his sleeve!

giorgi-ghviniashvili commented 2 years ago

@willdebras on mac, I was using QuickTime Player, which records as a .mov file. But lately GitHub had an issue embedding a player for .mov files. I don't know if they've resolved it yet.

willdebras commented 2 years ago

A couple of examples of Simpson's paradox with the baseball data for review tomorrow, additionally showcasing passing ggplot2 code to datamation_sanddance()

# datamation #1:
# jeter has a higher batting average than justice overall
'df %>%
  group_by(player) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()))  %>%
  ggplot(aes(x = player, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se)) +
  labs(x = "",
       y = "Batting average")' %>% datamation_sanddance()

datamation_jeter_1

# datamation #2:
# but justice has a higher batting average than jeter within each year
'df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()) ) %>%
  ggplot(aes(x = year, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se),
                  position = position_dodge(width = 0.25)) +
  labs(x = "",
       y = "Batting average")' %>% datamation_sanddance()
  #geom_bar(stat = "identity", position = "dodge")

datamation_jeter_2
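
To check the two claims numerically, here is a small sketch that rebuilds an at-bat-level `df` from per-season counts. The counts are an assumption: they are the commonly cited Jeter/Justice figures for 1995 and 1996, not values read from the package's dataset, whose full contents are not shown in this thread.

```r
library(dplyr)

# Assumed per-season counts (commonly cited figures, not from the package data).
counts <- data.frame(
  player  = c("Derek Jeter", "Derek Jeter", "David Justice", "David Justice"),
  year    = c("1995", "1996", "1995", "1996"),
  hits    = c(12, 183, 104, 45),
  at_bats = c(48, 582, 411, 140)
)

# Expand to one row per at-bat with a binary is_hit column.
idx <- rep(seq_len(nrow(counts)), times = counts$at_bats)
df <- counts[idx, c("player", "year")]
df$is_hit <- unlist(Map(function(h, n) rep(c(1, 0), times = c(h, n - h)),
                        counts$hits, counts$at_bats))

# Overall: Jeter is 195/630, Justice is 149/551.
overall <- df %>%
  group_by(player) %>%
  summarize(batting_average = mean(is_hit))

# Within each year the ordering flips.
by_year <- df %>%
  group_by(year, player) %>%
  summarize(batting_average = mean(is_hit), .groups = "drop")

overall
by_year
```

The reversal comes from the unequal at-bat counts: most of Jeter's at-bats fall in 1996, the higher-average year, while most of Justice's fall in 1995.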

willdebras commented 2 years ago

https://user-images.githubusercontent.com/37971596/153227351-1f6703c0-8ce6-4d03-ab43-effe5861d5d4.mp4

jhofman commented 2 years ago

@willdebras: looks better without the ggplot code at the end. we should probably modify things at some point so the ggplot-ed version looks better, but let's focus on this default version for now.

'df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()) )' %>%
  datamation_sanddance()

@giorgi-ghviniashvili for some reason we're seeing the points reshuffle between 11 seconds and 14 seconds. would be great to eliminate that and just have the points go from all solid to only the hits being solid. can you take a look at this? seems like a sorting issue on the point ids.

let's try two versions:

  1. points are pre-sorted by is_hit at 11 seconds so that there's no visual resorting, things just hop to 14 seconds
  2. points are not pre-sorted, is_hit lights up after 11 seconds as part of "plot is_hit within each group", and then they get rearranged as in 14 seconds in.

jhofman commented 2 years ago

@willdebras, can you continue to work on this and also play with the two different sortings mentioned in the above comment?

willdebras commented 2 years ago

I've merged a PR adding the baseball dataset, the fixes to the binary variables, and some added documentation.

Working through the sorting methods mentioned.

willdebras commented 2 years ago

It looks like sorting here works for a single groupby, but not for two. This is definitely something to fix on the R side. It seems that when plotting summarize calls with multiple groups for binary variables, we miss appending the gemini IDs somewhere, so the points get rearranged.

From the baseball example, you can see the values field should have a gemini_id field with an array of ids, but doesn't:

"meta": { "parse": "grid", "axes": true, "description": "Plot is_hit within each group", "splitField": "year" }, "data": { "values": [ { "player": "David Justice", "year": "1995", "is_hit": 1, "n": 104 },
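
A quick way to spot steps like this is to check every entry in `data.values` for the id field. The sketch below uses plain R lists as a stand-in for the parsed JSON; the helper name `has_gemini_ids` and the id values are made up for illustration, and only the field names are taken from the fragment above.

```r
# Stand-in for one parsed step of the spec (structure mirrors the fragment above).
step_missing <- list(
  meta = list(description = "Plot is_hit within each group", splitField = "year"),
  data = list(values = list(
    list(player = "David Justice", year = "1995", is_hit = 1, n = 104)
  ))
)

# Same step with an id array attached (ids are invented for this example).
step_ok <- step_missing
step_ok$data$values[[1]]$gemini_id <- 1:104

# Hypothetical check: TRUE only if every value carries the id field.
has_gemini_ids <- function(step, id_field = "gemini_id") {
  all(vapply(step$data$values, function(v) id_field %in% names(v), logical(1)))
}

has_gemini_ids(step_missing)  # FALSE
has_gemini_ids(step_ok)       # TRUE
```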

willdebras commented 2 years ago

After some debugging, I have updated the R end to ensure gemini IDs are in every summarize step, and they look correctly sorted by gemini ID in the specs I send along. We can take a look on the call tomorrow, but this might need more digging into the frontend to figure out whether this step is plotted in the correct order.

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-binary-R.json

jhofman commented 2 years ago

> I've merged a PR adding the baseball dataset, the fixes to the binary variables, and some added documentation.
>
> Working through the sorting methods mentioned.

this is great. let's change the group by on the second example to be group_by(year, player) so that it's easier to see the Simpson's paradox bit pop up.

@willdebras pointed out that there's still some funny re-grouping going on. for instance, look at jeter in 1995---it goes from two columns to one column for some reason. it should stay fixed in one column.

@giorgi-ghviniashvili, can you take a look when you're back?

willdebras commented 2 years ago

Here are the screenshots of the funky regrouping. We can see that despite the sort on gemini ids in both steps now, we still get a rearrange. Only happens for two groupbys.

image

image

giorgi-ghviniashvili commented 2 years ago

@willdebras can you send the json specs for these funky groupings?

willdebras commented 2 years ago

Sure thing, these are saved here, @giorgi-ghviniashvili:

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-binary-R.json

You can see these now have been updated to have gemini ID in all steps and have appropriate sorting. Still get the funky grouping though.

giorgi-ghviniashvili commented 2 years ago

@jhofman @willdebras I checked it, and the funky groupings happen because the third (rows = 35) and fourth (rows = 50) frames have different numbers of rows. We did this on purpose to avoid overlaps:

image

If we comment out this code, all specs will have the same number of rows and no funky grouping will happen, but in some cases, when there are too many points, they will overlap.

jhofman commented 2 years ago

sounds like we should apply this rule to all specs, find the one with the largest number of rows, and use that row count throughout to avoid this. @giorgi-ghviniashvili, can you give it a try?
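
A rough sketch of that rule, written in R for consistency with the rest of the thread even though the actual change belongs in the JavaScript frontend. The `rows` field and the list layout are hypothetical stand-ins for the real spec structure; the values 35 and 50 are the frame row counts reported in the previous comment.

```r
# Hypothetical stand-ins for the third and fourth frames.
specs <- list(
  list(description = "frame 3", rows = 35),
  list(description = "frame 4", rows = 50)
)

# Find the largest row count across all specs and apply it to each one,
# so the grid layout stays fixed from frame to frame.
use_max_rows <- function(specs) {
  max_rows <- max(vapply(specs, function(s) s$rows, numeric(1)))
  lapply(specs, function(s) {
    s$rows <- max_rows
    s
  })
}

aligned <- use_max_rows(specs)
vapply(aligned, function(s) s$rows, numeric(1))
```

Using the maximum rather than the minimum keeps the overlap protection for the densest frame while giving every other frame the same grid.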

giorgi-ghviniashvili commented 2 years ago

@jhofman fixed the funky animation as per your suggestion.

https://user-images.githubusercontent.com/6615532/157499215-6d0b24f5-b917-46a6-a041-ae212df297c0.mov

jhofman commented 2 years ago

looks good! small thing: from 11 to 13 seconds the stroke width looks like it's changing. @giorgi-ghviniashvili can you track down where and @willdebras can you update on the R side?

let's change our default so that the initial circles are empty (indicating zeros) w/ strokewidth and then once we split into groups and introduce the outcome (hit or no hit), we color in the "1"s (hit). @willdebras, hopefully this is just an R side change?

at some point this might conflict w/ explicit ggplot2 commands, we can cross that bridge when we get to it ...

willdebras commented 2 years ago

@giorgi-ghviniashvili, do you know if there are unpushed (to main) changes related to the binary variables on your end? I'm wondering because in the recording above, at ~13s, we get the plotting of the dots fully filled, then it transitions to make these hollow at ~15s.

This staggered transition isn't in place in the main repo or in the example in the docs (auto generated from main):

https://microsoft.github.io/datamations/articles/Examples.html#binary-variables

It seems to go straight to the stroke with transparent fill.

From my understanding these are different animationFrames, but not different specs, right?

https://github.com/microsoft/datamations/blob/custom_animations/sandbox/custom_animations/custom-animations-binary-R.json#L174

There is only one set of specs titled "Plot is_hit within each group." Do I need to add to these specs or is this staggered effect of the plotting of the 1 v. 0 something you have in js code that isn't in main yet?

giorgi-ghviniashvili commented 2 years ago

@willdebras the main branch is up to date with regard to the binary variables. The same issue is present in this example as well: the circle size increases because of the stroke property in the "is_hit" spec, but it isn't visually noticeable in this example because there are few circles.

In general, the issue is that the previous spec does not have a "stroke" property, which is why it looks thin. In this example, it has stroke-width: 1 but no stroke, so the stroke-width is ignored.

image

Solution A: add a stroke encoding to the previous specs if there is a color encoding, using the same field as the color (but this one did not really work well).

Solution B: reduce the circle size in the "is_hit" spec. I used size = 22:

image

But previous spec should be changed to:

image

https://user-images.githubusercontent.com/6615532/158161818-33de6fbd-d475-4705-94ff-7d16a232be2d.mov

See this commit

jhofman commented 2 years ago

@willdebras has included this in the updated shiny app, related to #129