microsoft / datamations

https://microsoft.github.io/datamations/

Decide where calculation of infogrid and jitter coordinates happens #23

Closed jhofman closed 3 years ago

jhofman commented 3 years ago

Should this be on the side of the "base language" (R) or the "rendering language" (javascript)?

If we can push this to the rendering language we'd save duplicated effort when porting to new base languages (like Python), but it doesn't seem like there's a natural way to do this in vega, so we'd have to roll our own?

jhofman commented 3 years ago

I poked around and found some examples of information grids and dot plots in vegalite. The information grid version could be promising, with simpler shapes for each point (e.g., dots instead of animal shapes!).

It looks like vega supports beeswarms, but vegalite is not there yet, although there's discussion of handling jitter and offsetting.

jhofman commented 3 years ago

@giorgi-ghviniashvili discovered that there's a transform field in vega specifications that seems to work in the gemini editor. maybe we can use this on the salary data for a beeswarm plot?

giorgi-ghviniashvili commented 3 years ago

@jhofman the transform field works in vega, but not in vega-lite. Here is discussion. And here

Since @sharlagelfand generates vega-lite specs, gemini needs to convert vega-lite to vega specs.

So to use the transform field for force layout, we need the spec already compiled to vega, with gemini's name fields in place. We also need to know which specs are vega (ready for gemini as-is) and which are vega-lite (and need converting to vega).

So to sum up, to use force and jitter we should use vega.

jhofman commented 3 years ago

reviving this given a conversation with @sharlagelfand earlier today about where the x/y coordinates for infogrids should be calculated---on the base language (e.g., R) side, or on the rendering (e.g., javascript) side.

as per #32, given a set of x/y coordinates for points in each facet, @giorgi-ghviniashvili has a way to fake facets in a vegalite spec that is gemini friendly.

currently @sharlagelfand is using a bunch of logic on the base language (r) side to get the x/y coordinates for each point in the infogrid inside of each facet.

i'd propose that we shift this into the rendering (javascript) side, to make things more portable to different base languages. this way the base language would basically pass over something that says "make a (possibly faceted) infogrid plot with the following counts (in each facet)".

if there was a vegalite spec for infogrids, that would be great. but that doesn't seem to exist. is it worth doing a lightweight fake of a vegalite spec /as if/ such a spec existed, and then converting it to x/y points on the rendering (javascript) side?

for instance, could an infogrid spec that says show the counts for island (as row) and species (as column) look as simple as this?

{
  "facet": {"row": {"field": "island"},
            "col": {"field": "species"}},
  "data": {
    "values": [
      {"island": "Biscoe",
       "species": "Adelie",
       "n": 10
      },
      ...
    ]
  },
  "spec": {
    "mark": "infogrid",
    "encoding": {
      "count": {"field": "n"}
    }
  }
}

@giorgi-ghviniashvili, curious to hear your thoughts. we can discuss next call if helpful.

giorgi-ghviniashvili commented 3 years ago

@jhofman Shifting to client side makes sense to me as well. We can write some helper functions per layout and generate x/y coordinates. I think we can run d3-force in background before rendering to generate collided x/y coordinates and then plot with vega.

One question about the n, does it mean that n = 10, is 10 element grid? A matrix 3 X 4 for example?

jhofman commented 3 years ago

> One question about the n, does it mean that n = 10, is 10 element grid? A matrix 3 X 4 for example?

yes, exactly. that field is meant to specify the number of observations (dots) in each group.

giorgi-ghviniashvili commented 3 years ago

@jhofman I will write this script tomorrow, good idea

giorgi-ghviniashvili commented 3 years ago

@jhofman here is the function that calculates grid layout. It also adjusts axis domains to position the grid with some padding.

here is demo.
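For readers without access to the linked function, here is a minimal sketch of the idea, not the actual implementation: turn a count `n` into grid coordinates and pad the axis domains so the grid does not touch the plot edges. The name `gridLayout`, the near-square shape, the 1-based coordinates, and the `padding` default are all assumptions made for illustration.

```javascript
// Hypothetical helper (not the linked function): lay out n points on a
// near-square grid and return padded axis domains.
function gridLayout(n, padding = 1) {
  const cols = Math.max(1, Math.ceil(Math.sqrt(n)));
  const rows = Math.max(1, Math.ceil(n / cols));
  const points = [];
  for (let i = 0; i < n; i++) {
    points.push({
      gemini_id: i + 1,               // assumed id scheme: 1..n
      x: (i % cols) + 1,              // 1-based column index
      y: rows - Math.floor(i / cols), // fill from the top row down
    });
  }
  return {
    points,
    xDomain: [1 - padding, cols + padding],
    yDomain: [1 - padding, rows + padding],
  };
}

// n = 10 gives 4 columns x 3 rows, with the bottom row partly filled.
const layout = gridLayout(10);
console.log(layout.xDomain, layout.yDomain);
```

The domains could then be copied into the x and y scales of the generated spec so that every frame uses the same extent.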

giorgi-ghviniashvili commented 3 years ago

@jhofman I also implemented jitter on js side using d3-force:

image

giorgi-ghviniashvili commented 3 years ago

Downside of d3-force is that it needs to wait for the force layout to compute its coordinates by collision detection.

For small datasets, it is quick, but for larger ones it might need some time to compute it.

giorgi-ghviniashvili commented 3 years ago

You can take a look at the code.

sharlagelfand commented 3 years ago

Hi @giorgi-ghviniashvili I've started putting together some "fake" specs that could be passed into your infogrid calculation function, they are all here.

The idea is that the data contains n, which tells you how many points are in each group (or overall if there is no grouping), and the spec contains information on the column and row facets, and the colour variable (whatever is present for that frame). Is this a format that works for you to read, convert the data to give actual x and y points, with sufficient info on the grouping (facets and colors)? (I've also changed id to .id, since that's a convention for "system" names in R - not sure how that might affect gemini code or if it needs to in fact be id.)

@jhofman had the idea that we could put mark = "infogridpoint" in the spec as a way to indicate that these are specs that need to be processed, but would then have the mark updated to be "point" (which is actually used by vega lite). Does this format make sense to you?

If this works, I think we could do something similar for the jitter points (in the summarize step) by putting mark = "jitterpoint" or similar, to indicate the points need to be jittered.

giorgi-ghviniashvili commented 3 years ago

@jhofman @sharlagelfand I created a generalized version of animation engine:

It detects which specs need to be parsed and which specs need to fake facets, or both.

We have some conventions:

It is very simple to set up the app:

const app = App({
    specUrls: [
      dataUrl + "01-ungrouped.json",
      dataUrl + "02-column-facet.json",
      dataUrl + "03-column-row-facet.json",
      dataUrl + "04-column-row-facet-color.json",
      dataUrl + "05-jitter.json",
      dataUrl + "06-summary.json",
    ]
})

app.play()

Here is live demo

P.S. jitter needs a bit more thinking.

sharlagelfand commented 3 years ago

Looks awesome! 🎉 The specs are updated with the conventions @giorgi-ghviniashvili and I discussed, and I'm working on setting up tests within the R code to ensure the specs are created correctly!

Here's an example (close, but still pseudo-code-y) of the conventions:

For the initial infogrid, with no facets. The meta specifies parse: grid and there is only one data value, with the n. mark and encoding live at the root of the JSON, not within spec.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "meta": {
    "parse": "grid"
  },
  "data": {
    "values": {
      "n": 344
    }
  },
  "mark": "point",
  "encoding": {
    "x": "x",
    "y": "y"
  }
} 

For an infogrid with facets - again, the meta specifies parse: grid and there is one data value per grouping combination, with n. mark and encoding live within spec (always the case when there are facets).

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "meta": {
    "parse": "grid"
  },
  "data": {
    "values": [
      {
        "column": "col1",
        "row": "row1",
        "n": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "n": 3
      }
    ]
  },
  "facet": {
    "column": "column",
    "row": "row"
  },
  "spec": {
    "mark": "point",
    "encoding": {
      "x": "x",
      "y": "y"
    }
  }
} 
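As a rough sketch of what the JS side might do with a spec like this (an illustration under assumptions, not the project's code): expand each group's `n` into one row per point, carry the facet fields along, and generate `gemini_id` values. The function name `expandCounts` and the per-group near-square layout are hypothetical.

```javascript
// Hypothetical sketch: expand one row per group (with count n) into one
// row per point, keeping the facet fields and generating gemini_id.
function expandCounts(values) {
  let nextId = 1;
  const points = [];
  for (const group of values) {
    const { n, ...facetFields } = group;
    const cols = Math.max(1, Math.ceil(Math.sqrt(n)));
    for (let i = 0; i < n; i++) {
      points.push({
        ...facetFields,
        gemini_id: nextId++,         // ids are unique across all groups
        x: (i % cols) + 1,           // position inside the group's grid
        y: Math.floor(i / cols) + 1,
      });
    }
  }
  return points;
}

const expanded = expandCounts([
  { column: "col1", row: "row1", n: 1 },
  { column: "col2", row: "row1", n: 3 },
]);
console.log(expanded.length); // one row per point
```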

For data that needs jittering - the meta specifies parse: jitter and there is a value for every data point, along with gemini_id (not required in the grid case, since the JS code generates it). There is facet and spec as before since that is still the same.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "meta": {
    "parse": "jitter"
  },
  "data": {
    "values": [
      {
        "column": "col1",
        "row": "row1",
        "value": 1,
        "gemini_id": 1,
        "x": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "value": 5,
        "gemini_id": 2,
        "x": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "value": 6,
        "gemini_id": 3,
        "x": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "value": 3,
        "gemini_id": 4,
        "x": 1
      }
    ]
  },
  "facet": {
    "column": "column",
    "row": "row"
  },
  "spec": {
    "mark": "point",
    "encoding": {
      "x": "x",
      "y": "value"
    }
  }
} 

For the final summary data, meta is empty indicating no parsing needs to be done. The value field contains the summarised value (e.g. the mean in this case).

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "meta": [],
  "data": {
    "values": [
      {
        "column": "col1",
        "row": "row1",
        "value": 1,
        "gemini_id": 1,
        "x": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "value": 4.66,
        "gemini_id": 2,
        "x": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "value": 4.66,
        "gemini_id": 3,
        "x": 1
      },
      {
        "column": "col2",
        "row": "row1",
        "value": 4.66,
        "gemini_id": 4,
        "x": 1
      }
    ]
  },
  "facet": {
    "column": "column",
    "row": "row"
  },
  "spec": {
    "mark": "point",
    "encoding": {
      "x": "x",
      "y": "value"
    }
  }
} 
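To make the conventions concrete, here is a small sketch of how a renderer could decide what to do with each spec. It is an assumption-laden illustration: `parseKind`, `needsFakeFacets`, and `planSpec` are hypothetical names, and treating `[]`, `{}`, or a missing `meta` as "nothing to parse" is a guess at how an empty value might be serialized.

```javascript
// Hypothetical dispatcher for the conventions above.
function parseKind(spec) {
  const meta = spec.meta;
  // An empty meta may show up as [] (as in the summary example), {} or be absent.
  if (!meta || Array.isArray(meta) || typeof meta !== "object") return "none";
  return meta.parse === "grid" || meta.parse === "jitter" ? meta.parse : "none";
}

// Faceted specs keep mark/encoding inside `spec` next to a `facet` entry.
function needsFakeFacets(spec) {
  return Boolean(spec.facet && spec.spec);
}

function planSpec(spec) {
  return { parse: parseKind(spec), fakeFacets: needsFakeFacets(spec) };
}

console.log(planSpec({ meta: { parse: "grid" }, data: { values: { n: 344 } }, mark: "point" }));
```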
giorgi-ghviniashvili commented 3 years ago

@jhofman @sharlagelfand see new version here

sharlagelfand commented 3 years ago

Next steps for me are to integrate this generalized code into R code so there is a working htmlwidget that takes the pipeline, generates the "fake" specs, then the JS code converts these into real specs, does the animating, and it's all returned in the RStudio viewer!

It looks like the JS code can take the specs as an array so I'll see if I can figure out passing that :)

sharlagelfand commented 3 years ago

@giorgi-ghviniashvili just looking at the code here and trying to integrate into the htmlwidgets framework, is it possible to define play() globally and have it take an ID, like I changed it to do here?

I think the current method doesn't work because the app object is only actually defined within the htmlwidget's visualization container (e.g. in #vis) so when the button is pressed and app.play() is activated, it produces an error that app is not found. I think it would probably work better with the htmlwidgets format to pass an ID to play() instead and it works on that div.

giorgi-ghviniashvili commented 3 years ago

@sharlagelfand of course it is possible to define it globally, but wouldn't it be possible to just define app globally, and then access subsequent properties on it? Can you point me to the code where you are trying to integrate it?

sharlagelfand commented 3 years ago

@giorgi-ghviniashvili I think it's better to define play() globally with an ID and app just within the widget - if there is more than one widget on a page, then there would be more than one app, right? I think it's safer to use the ID of the div to play

sharlagelfand commented 3 years ago

Also - I think how the framework works, you need to define app within the widget because that's how you pass the specs to it. I'm not sure how it would work to define it globally
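A sketch of the "global play(), per-widget app" idea, under the assumption that each widget can register its app under the ID of its container div. The names `apps`, `registerApp`, and the stand-in app objects are hypothetical; only `play()` and the idea of passing an ID come from the discussion.

```javascript
// Hypothetical registry: one app per widget container id.
const apps = new Map();

// Each widget would call this once, after building its own app.
function registerApp(id, app) {
  apps.set(id, app);
}

// A single global entry point that looks the app up by container id.
function play(id) {
  const app = apps.get(id);
  if (!app) throw new Error(`No app registered for "${id}"`);
  return app.play();
}

// Two widgets on one page, each with its own (stand-in) app.
registerApp("vis-1", { play: () => "playing vis-1" });
registerApp("vis-2", { play: () => "playing vis-2" });
console.log(play("vis-2"));
```

With this shape a button only needs to know its container's ID, and several widgets can coexist on one page without sharing a global `app`.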

giorgi-ghviniashvili commented 3 years ago

@sharlagelfand please take a look at this commit

jhofman commented 3 years ago

@giorgi-ghviniashvili the updates on the x jitter look great.

for the y jitter, as per @sharlagelfand's ggplot example in #33 (see here), wondering if we could use one of these beeswarm plugins for d3 instead of force collide?

sharlagelfand commented 3 years ago

thank you @giorgi-ghviniashvili! can you clarify what the change is on that commit? Good to go on passing an id to play() now? It also looks like you removed the ability to add specs as an array, I was planning on testing that out (otherwise we have to e.g. write them to a temp file)

giorgi-ghviniashvili commented 3 years ago

@sharlagelfand I kept the ability to pass an array, but instead of passing it to app, I pass it to init. Check the app.js file. Yes, play is not global; I tried to mimic your widget js file.

sharlagelfand commented 3 years ago

Thanks @giorgi-ghviniashvili! Can confirm that the play() button works.

I'm a bit confused about the array because before there were two arguments for specs, specUrls and specs, where I could pass either URLs or an array. Now there is only specUrls, so it wasn't clear to me that an array could be passed too? Keep in mind I don't know JS super well so I can't tell if your code is just taking a single argument but detects whether it's an array or files :)

I'm not having luck with passing an array - but passing an array like this should work?

giorgi-ghviniashvili commented 3 years ago

@sharlagelfand updated it to support spec array, like this:

image
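For illustration only (this is not the code in the screenshot), one way a single argument could accept either URLs or in-memory specs is to tag each item by type while preserving frame order. The function name `normalizeSpecSources` and the `kind`/`value` shape are assumptions.

```javascript
// Hypothetical helper: accept a URL, a spec object, or an array mixing
// both, and tag each entry so later code knows whether to fetch it.
function normalizeSpecSources(input) {
  const items = Array.isArray(input) ? input : [input];
  return items.map((item) => {
    if (typeof item === "string") return { kind: "url", value: item };
    if (item && typeof item === "object") return { kind: "spec", value: item };
    throw new TypeError("Expected a URL string or a spec object");
  });
}

const sources = normalizeSpecSources([
  "01-ungrouped.json",
  { meta: { parse: "grid" }, data: { values: { n: 344 } }, mark: "point" },
]);
console.log(sources.map((s) => s.kind));
```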

sharlagelfand commented 3 years ago

thanks @giorgi-ghviniashvili!! That's working for me!

giorgi-ghviniashvili commented 3 years ago

@jhofman

I also reversed it and got this:

image

But when I use this function for our dataset, it looks like this:

image


@sharlagelfand's ggplot example fixes coords in both directions, x and y, but also spreads points out within the inner group of sex. These examples are not doing what we want. They just strictly fix x coordinates and stack circles on the y axis.

jhofman commented 3 years ago

Quick note from this morning's call: we figured out that regular force collide for jitter looks much better when the y-axis is auto-scaled to cover the min and max range of the data instead of starting from zero. (Not clear if this is the vegalite default or if this was happening from some missing/NA y values in the data.)

Anyway, the plan is that @giorgi-ghviniashvili will have the code auto-scale the y axis to the range of the data.
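A minimal sketch of that plan, assuming the spec shape from the jitter example (values in `data.values`, the y encoding possibly nested under `spec`): compute the min and max of the y field, skipping missing values, and write them into the y scale's `domain`. The helper name `fitYDomain` and the default field name are hypothetical.

```javascript
// Hypothetical helper: return a copy of the spec whose y scale spans
// the observed range of `field`, ignoring missing/NA values.
function fitYDomain(spec, field = "value") {
  let min = Infinity;
  let max = -Infinity;
  for (const row of spec.data.values) {
    const v = row[field];
    if (typeof v !== "number" || !Number.isFinite(v)) continue; // skip NA
    if (v < min) min = v;
    if (v > max) max = v;
  }
  if (min > max) return spec; // no usable values, leave the spec alone

  const inner = spec.spec || spec; // faceted specs nest mark/encoding
  const y = inner.encoding.y;
  // Accept the shorthand string form used in the pseudo-specs as well.
  const yDef = typeof y === "string" ? { field: y, type: "quantitative" } : { ...y };
  yDef.scale = { ...(yDef.scale || {}), domain: [min, max] };
  const encoding = { ...inner.encoding, y: yDef };
  return spec.spec ? { ...spec, spec: { ...inner, encoding } } : { ...spec, encoding };
}
```

Vega-Lite scales also have a `zero` property, so setting `"scale": {"zero": false}` on the y encoding may be a lighter-weight alternative that lets Vega-Lite pick the domain itself.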

Also, is there an external force or charge that can be used with force collide to "center" things in the beeswarm way?
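On the d3-force side, the library does ship positioning forces (`forceX`, `forceY`) that can be combined with `forceCollide` to pull nodes toward a target coordinate. As a dependency-free alternative, and explicitly not a d3-force or plugin implementation, here is a sketch of a deterministic beeswarm placement: each point keeps its y value, and its x offset is the position closest to the center line that does not overlap any already placed circle. The function name `beeswarm`, the shared pixel units for values and radius, and the small tolerance are assumptions.

```javascript
// Hypothetical beeswarm sketch: y stays at the data value, x is the
// offset nearest to the center (0) that avoids overlapping circles.
function beeswarm(values, radius = 1) {
  const d = 2 * radius; // minimum center-to-center distance
  const eps = 1e-9;     // tolerance for floating-point round-off
  const order = values.map((y, i) => ({ y, i })).sort((a, b) => a.y - b.y);
  const placed = [];
  const xs = new Array(values.length);
  for (const { y, i } of order) {
    // Only circles within one diameter vertically can collide.
    const near = placed.filter((p) => Math.abs(p.y - y) < d);
    // Candidates: the center line, or just touching a neighbor on either side.
    const candidates = [0];
    for (const p of near) {
      const dx = Math.sqrt(d * d - (p.y - y) ** 2);
      candidates.push(p.x - dx, p.x + dx);
    }
    candidates.sort((a, b) => Math.abs(a) - Math.abs(b));
    const x = candidates.find((c) =>
      near.every((p) => (p.x - c) ** 2 + (p.y - y) ** 2 >= d * d - eps)
    );
    placed.push({ x, y });
    xs[i] = x;
  }
  return xs; // x offsets, in the same order as the input values
}

console.log(beeswarm([1, 1, 1, 10], 1));
```

Each point is compared against all previously placed points, so the cost grows roughly quadratically with the number of points per group, but the result needs no iterative simulation.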