Closed msftbozo closed 2 years ago
Let's see if we can use what's in #98 to make a nice example here using the baseball data.
If it looks good we can add it to the docs or a vignette or something.
I will have this prototyped to show Thursday.
that's fantastic thanks @willdebras
Had to make some minor bugfixes to get binary variables working. Previously, across environments (the renv environment, a containerized 4.1 with fresh installs, and 4.0.5 with fresh installs), datamation_sanddance() was erroring for binary variables because of a util call that generates tooltip specs. This call expected gemini IDs despite being passed already-aggregated data. To calculate unique fields, the function removed columns that weren't grouping columns, but it also tried to remove gemini_ids even when that column didn't exist.
This commit removes this field only if it exists:
https://github.com/microsoft/datamations/commit/924ce7d79731e655be2ec47a94273dc52e71490b
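For anyone skimming, here is a minimal base R sketch of the "remove only if it exists" pattern. It is not the code from the commit: the helper name drop_if_present() and the toy data frames are made up for illustration.

```r
# Hypothetical helper (not the package's actual util): drop the named columns
# only when they are present, so already-aggregated data passes through
# untouched instead of erroring.
drop_if_present <- function(data, cols) {
  data[, setdiff(names(data), cols), drop = FALSE]
}

# Toy inputs for illustration only.
aggregated <- data.frame(player = c("a", "b"), mean = c(0.25, 0.31))
raw <- data.frame(player = c("a", "b"), is_hit = c(1, 0), gemini_ids = c(1, 2))

names(drop_if_present(aggregated, "gemini_ids"))  # no gemini_ids column, nothing to drop
names(drop_if_present(raw, "gemini_ids"))         # gemini_ids column is dropped
```

With dplyr, select(-any_of("gemini_ids")) gives the same tolerant behaviour.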
A basic datamation with this binary variable from the following call:
"df %>%
  group_by(player, year) %>%
  summarise(mean = mean(is_hit))" %>% datamation_sanddance()
I'll provide more examples of variants of this in the next day or two.
Also, @jhofman/@giorgi-ghviniashvili, do you have suggestions on best practice for showcasing/recording these? This one is pretty low res and has my cursor over top of it :)
@willdebras: did you record with the native screencapture tool? that usually works pretty well for me, but @giorgi-ghviniashvili may have some other tricks up his sleeve!
@willdebras on Mac, I was using QuickTime Player, which records a .mov file. Lately, though, GitHub has had an issue embedding a player for .mov files. I don't know whether that has been resolved yet.
A couple of examples of Simpson's paradox with the baseball example for review tomorrow, additionally showcasing passing ggplot2 code to datamation_sanddance()
# datamation #1:
# jeter has a higher batting average than justice overall
'df %>%
  group_by(player) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n())) %>%
  ggplot(aes(x = player, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se)) +
  labs(x = "",
       y = "Batting average")' %>% datamation_sanddance()
# datamation #2:
# but justice has a higher batting average than jeter within each year
'df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n())) %>%
  ggplot(aes(x = year, y = batting_average, color = player)) +
  geom_pointrange(aes(ymin = batting_average - se,
                      ymax = batting_average + se),
                  position = position_dodge(width = 0.25)) +
  labs(x = "",
       y = "Batting average")' %>% datamation_sanddance()
#geom_bar(stat = "identity", position = "dodge")
@willdebras: looks better without the ggplot code at the end. we should probably modify things at some point so the ggplot-ed version looks better, but let's focus on this default version for now.
'df %>%
  group_by(player, year) %>%
  summarize(batting_average = mean(is_hit),
            se = sqrt(batting_average * (1 - batting_average) / n()))' %>%
  datamation_sanddance()
@giorgi-ghviniashvili for some reason we're seeing the points reshuffle between 11 seconds and 14 seconds. would be great to eliminate that and just have the points go from all solid to only the hits being solid. can you take a look at this? seems like a sorting issue on the point ids.
let's try two versions:
@willdebras, can you continue to work on this and also play with the two different sortings mentioned in the above comment?
I've merged a PR adding the baseball dataset, the fixes to the binary variables, and some added documentation.
Working through the sorting methods mentioned.
It looks like sorting here works for a single group_by, but not for two. This is definitely something to fix on the R side. It seems that when plotting summarize calls with multiple groups for binary variables, we miss appending the gemini IDs somewhere, so the points get rearranged.
From the baseball example, you can see the values field should have a gemini_id field with an array of ids, but doesn't:
"meta": {
  "parse": "grid",
  "axes": true,
  "description": "Plot is_hit within each group",
  "splitField": "year"
},
"data": {
  "values": [
    { "player": "David Justice", "year": "1995", "is_hit": 1, "n": 104 },
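A quick way to catch this on the R side is to check a parsed spec before sending it along. This is only a sketch: has_gemini_ids() is a hypothetical helper, the spec is a hand-built nested list mirroring the excerpt above, and the ids are illustrative.

```r
# Hypothetical check: TRUE only if every record in spec$data$values
# carries a gemini_id entry.
has_gemini_ids <- function(spec) {
  values <- spec[["data"]][["values"]]
  length(values) > 0 &&
    all(vapply(values, function(v) !is.null(v[["gemini_id"]]), logical(1)))
}

# Hand-built stand-in for the parsed spec shown above (no gemini_id present).
missing_ids <- list(
  meta = list(parse = "grid", axes = TRUE, splitField = "year"),
  data = list(values = list(
    list(player = "David Justice", year = "1995", is_hit = 1, n = 104)
  ))
)

# Same spec with an illustrative array of ids attached.
with_ids <- missing_ids
with_ids$data$values[[1]]$gemini_id <- c(1, 2, 3)

has_gemini_ids(missing_ids)
has_gemini_ids(with_ids)
```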
After some debugging, I have updated the R side to ensure gemini IDs are present in every summarize step. The specs I send along look correctly sorted by gemini ID, so the reshuffle is still unexplained. We can take a look on the call tomorrow, but this might need more digging into the frontend to figure out whether this step is plotted in the correct order.
this is great. let's change the group by on the second example to be group_by(year, player) so that it's easier to see the simpson's paradox bit pop up.
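For reference, the paradox can be checked numerically outside of datamations. The sketch below uses the commonly cited 1995/1996 hit and at-bat counts for Jeter and Justice as an assumption. The dataset merged into the package may be laid out differently, so the df here is rebuilt by hand in base R.

```r
# Commonly cited counts (assumed here, not read from the package dataset).
counts <- data.frame(
  player  = c("Derek Jeter", "Derek Jeter", "David Justice", "David Justice"),
  year    = c("1995", "1996", "1995", "1996"),
  hits    = c(12, 183, 104, 45),
  at_bats = c(48, 582, 411, 140)
)

# Expand to one row per at-bat with a binary is_hit column.
df <- do.call(rbind, lapply(seq_len(nrow(counts)), function(i) {
  data.frame(
    player = counts$player[i],
    year   = counts$year[i],
    is_hit = rep(c(1, 0), times = c(counts$hits[i],
                                    counts$at_bats[i] - counts$hits[i]))
  )
}))

# Overall: Jeter is ahead. Within each year: Justice is ahead.
overall <- aggregate(is_hit ~ player, data = df, FUN = mean)
by_year <- aggregate(is_hit ~ year + player, data = df, FUN = mean)

overall
by_year
```

Grouping by year first puts the two players side by side within each year, which is where the reversal shows up.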
@willdebras pointed out that there's still some funny re-grouping going on. for instance, look at jeter in 1995---it goes from two columns to one column for some reason. it should stay fixed in one column.
@giorgi-ghviniashvili, can you take a look when you're back?
Here are the screenshots of the funky regrouping. We can see that despite the sort on gemini ids in both steps now, we still get a rearrange. Only happens for two groupbys.
@willdebras can you send the JSON specs for these funky groupings?
Sure thing, these are saved here, @giorgi-ghviniashvili:
You can see these now have been updated to have gemini ID in all steps and have appropriate sorting. Still get the funky grouping though.
@jhofman @willdebras I checked it, and the funky groupings happen because the third (rows = 35) and fourth (rows = 50) frames have different numbers of rows. We did this on purpose to avoid overlaps:
If we comment out this code, all specs will have the same number of rows and no funky grouping will happen, but in some cases, when there are too many points, they will overlap.
sounds like we should apply this rule to all specs, find the one with the most number of rows, and use that throughout to avoid this. @giorgi-ghviniashvili, can you give it a try?
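Roughly, the idea looks like the sketch below. It is written in R only to illustrate the logic, since the actual change would live in the frontend JS. The rows field and the use_max_rows() helper are hypothetical stand-ins for however the row count is stored per frame.

```r
# Hypothetical helper: find the largest row count across all frames and
# apply it to every frame, so the grid layout does not change between them.
use_max_rows <- function(specs) {
  max_rows <- max(vapply(specs, function(s) s[["rows"]], numeric(1)))
  lapply(specs, function(s) {
    s[["rows"]] <- max_rows
    s
  })
}

# Toy frames using the row counts reported above (35 and 50).
frames <- list(
  list(description = "third frame", rows = 35),
  list(description = "fourth frame", rows = 50)
)

fixed <- use_max_rows(frames)
vapply(fixed, function(s) s[["rows"]], numeric(1))
```

Using the maximum keeps the overlap protection for the densest frame while giving every frame the same grid.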
@jhofman fixed the funky animation as per your suggestion.
https://user-images.githubusercontent.com/6615532/157499215-6d0b24f5-b917-46a6-a041-ae212df297c0.mov
looks good! small thing: from 11 to 13 seconds the stroke width looks like it's changing. @giorgi-ghviniashvili can you track down where and @willdebras can you update on the R side?
let's change our default so that the initial circles are empty (indicating zeros) w/ strokewidth and then once we split into groups and introduce the outcome (hit or no hit), we color in the "1"s (hit). @willdebras, hopefully this is just an R side change?
at some point this might conflict w/ explicit ggplot2 commands, we can cross that bridge when we get to it ...
@giorgi-ghviniashvili, do you know if there are unpushed (to main) changes related to the binary variables on your end? I'm wondering because in the recording above, at ~13s, we get the plotting of the dots fully filled, then it transitions to make these hollow at ~15s.
This staggered transition isn't in place in the main repo or in the example in the docs (auto generated from main):
https://microsoft.github.io/datamations/articles/Examples.html#binary-variables
It seems to go straight to the stroke with transparent fill.
From my understanding these are different animationFrames, but not different specs, right?
There is only one set of specs titled "Plot is_hit within each group." Do I need to add to these specs or is this staggered effect of the plotting of the 1 v. 0 something you have in js code that isn't in main yet?
@willdebras the main branch is up to date with regard to the binary variables. The same issue is present in this example as well: the circle size increases because of the stroke property in the "is_hit" spec, but it is not visually noticeable in this example because there are few circles.
In general, the issue is that the previous spec does not have a "stroke" property, and that's why it looks thin. In this example it has stroke-width: 1 but no stroke, so the stroke-width is ignored.
Solution A:
add a stroke encoding to the previous specs if there is a color encoding, using the same field as the color (but this one did not really work well).
Solution B:
reduce the circle size in the "is_hit" spec. I used size = 22:
But the previous spec should be changed to:
https://user-images.githubusercontent.com/6615532/158161818-33de6fbd-d475-4705-94ff-7d16a232be2d.mov
See this commit
@willdebras has included this in the updated shiny app, related to #129
Since we completed the visualization of binary variables, it would be great to add a dataset with binary variables to the existing (penguins/salaries) datasets and datamate it.