vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License

Should aggregates be grouped by fields that occur in conditions? #6045

Open jakevdp opened 4 years ago

jakevdp commented 4 years ago

This is a follow-up to an Altair user question that stems from a potentially confusing aspect of the grammar.

TLDR: In the VL grammar, aggregates specified in encodings are implicitly grouped by other encodings. Should they also be grouped by conditions that appear in those encodings?
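As a rough illustration of the implicit grouping described above, here is a minimal Python sketch. It is not how the Vega-Lite compiler is written: the function name `implied_groupby` and the `bin_` key prefix are made up for this example, and it assumes that only channels with a `field` and no `aggregate` contribute grouping keys.

```python
# Simplified model of how an encoding block implies a groupby.
# Assumption: any channel with a "field" and no "aggregate" contributes a
# grouping key. The real compiler handles more cases than this sketch.

def implied_groupby(encoding):
    """Return the grouping keys implied by the non-aggregated field channels."""
    keys = []
    for definition in encoding.values():
        if "aggregate" in definition or "field" not in definition:
            continue  # aggregated and value-only channels add no key
        field = definition["field"]
        # Binned fields group by bin rather than by raw value.
        key = f"bin_{field}" if definition.get("bin") else field
        if key not in keys:
            keys.append(key)
    return keys


field_color = {
    "x": {"field": "Miles_per_Gallon", "bin": True, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {"field": "Cylinders", "type": "ordinal"},
}
conditional_color = {
    "x": {"field": "Miles_per_Gallon", "bin": True, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {
        "condition": {"test": "datum.Cylinders < 5", "value": "steelblue"},
        "value": "darkorange",
    },
}

print(implied_groupby(field_color))        # ['bin_Miles_per_Gallon', 'Cylinders']
print(implied_groupby(conditional_color))  # ['bin_Miles_per_Gallon']
```

Under this model, a color channel that only carries a condition and a value has no `field`, so it adds nothing to the grouping.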

Consider this chart:

{
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "x": {"field": "Miles_per_Gallon", "bin": true, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {"field": "Cylinders", "type": "ordinal"}
  }
}

[Image: rendered chart]

Now suppose the user wants to highlight cars with fewer than 5 cylinders. They might look at the docs and try replacing the color encoding with a condition:

{
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "x": {"field": "Miles_per_Gallon", "bin": true, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {
      "condition": {"test": "datum.Cylinders < 5", "value": "steelblue"},
      "value": "darkorange"
    }
  }
}

[Image: rendered chart]

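This clearly does not have the desired effect, because the count aggregate is no longer grouped by Cylinders: each MPG bin collapses into a single aggregated row, and that row no longer carries a Cylinders value for the test to examine. For users unfamiliar with the details of how aggregates are computed in VL, it's quite difficult to debug why this is happening.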
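To make the failure mode concrete, here is a small Python sketch of the aggregation step. The rows are hypothetical toy data standing in for `data/cars.json`, and `count_by` and `mpg_bin` are illustrative helpers, not Vega-Lite APIs. The point is only that aggregated records carry the grouping keys and the count, and nothing else.

```python
from collections import Counter

# Hypothetical toy rows standing in for data/cars.json.
cars = [
    {"Miles_per_Gallon": 31, "Cylinders": 4},
    {"Miles_per_Gallon": 33, "Cylinders": 4},
    {"Miles_per_Gallon": 32, "Cylinders": 6},
    {"Miles_per_Gallon": 14, "Cylinders": 8},
]


def mpg_bin(row):
    """Assign a row to a 10-wide MPG bin (bin width chosen for illustration)."""
    return row["Miles_per_Gallon"] // 10 * 10


def count_by(rows, key_funcs):
    """Count rows per group; output records hold only the keys and the count."""
    counts = Counter(tuple(f(r) for f in key_funcs.values()) for r in rows)
    return [dict(zip(key_funcs, key), count=n) for key, n in counts.items()]


# Grouping implied by the conditional-color spec: only the binned x field.
without_detail = count_by(cars, {"bin_mpg": mpg_bin})

# Grouping once Cylinders is part of the groupby (e.g. via a detail channel).
with_detail = count_by(
    cars, {"bin_mpg": mpg_bin, "Cylinders": lambda r: r["Cylinders"]}
)

for record in without_detail:
    print(record, "has Cylinders:", "Cylinders" in record)
for record in with_detail:
    print(record, "has Cylinders:", "Cylinders" in record)
```

In the first result no record has a `Cylinders` key, so a test such as `datum.Cylinders < 5` has nothing to compare against; in the second, every record does.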

One easy remedy is to explicitly add a detail encoding, so that the counts are appropriately grouped:

{
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "x": {"field": "Miles_per_Gallon", "bin": true, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "detail": {"field": "Cylinders", "type": "ordinal"},
    "color": {
      "condition": {"test": "datum.Cylinders < 5", "value": "steelblue"},
      "value": "darkorange"
    }
  }
}

[Image: rendered chart]

A more complete approach would probably be to apply a calculate transform and encode the color by that field:

{
  "data": {"url": "data/cars.json"},
  "transform": [
    {"calculate": "datum.Cylinders < 5 ? '< 5' : '≥ 5'", "as": "Cylinders"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Miles_per_Gallon", "bin": true, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {"field": "Cylinders", "type": "nominal"}
  }
}

[Image: rendered chart]

But this is probably more suited to a polished, final chart than to quick and dirty data exploration.

Would it make sense to change the grammar such that aggregates specified in encodings will also group by fields that appear in conditional statements? In other words, should we treat fields referenced in conditional expressions as if they are included in the detail encoding? Or if that is too invasive, perhaps log a warning when an aggregate elides a field that's referenced in an expression?
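For the warning variant, something along these lines might be enough. This is only a sketch in Python, not a proposal for the actual compiler code: the regex is a crude stand-in for a real expression parser, it assumes fields are referenced as `datum.name` or `datum['name']`, and the names `fields_in_expression` and `missing_groupby_fields` are invented for the example.

```python
import logging
import re

# Assumption: fields appear as datum.name or datum['name'] / datum["name"].
# A regex only approximates what a real expression parser would find.
DATUM_REF = re.compile(
    r"""datum(?:\.([A-Za-z_$][\w$]*)|\[\s*['"]([^'"]+)['"]\s*\])"""
)


def fields_in_expression(expr):
    """Return field names referenced through datum in an expression string."""
    return [dotted or bracketed for dotted, bracketed in DATUM_REF.findall(expr)]


def missing_groupby_fields(encoding, groupby):
    """Find fields used in string condition tests that are not grouped by."""
    missing = []
    for definition in encoding.values():
        condition = definition.get("condition")
        conditions = condition if isinstance(condition, list) else [condition]
        for cond in conditions:
            test = cond.get("test") if isinstance(cond, dict) else None
            if not isinstance(test, str):
                continue  # structured predicates are not handled in this sketch
            for field in fields_in_expression(test):
                if field not in groupby and field not in missing:
                    missing.append(field)
    return missing


encoding = {
    "x": {"field": "Miles_per_Gallon", "bin": True, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {
        "condition": {"test": "datum.Cylinders < 5", "value": "steelblue"},
        "value": "darkorange",
    },
}

# The groupby list is passed in by hand here; "bin_Miles_per_Gallon" is a
# made-up key name for the binned x field.
missing = missing_groupby_fields(encoding, ["bin_Miles_per_Gallon"])
for field in missing:
    logging.warning("condition references %r, which is dropped by the aggregate", field)
```

The same list could instead be appended to the groupby, which would correspond to treating those fields as if they were in the detail encoding.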

kanitw commented 4 years ago

It's probably reasonable to add any field referenced in the test to the groupby.

If anything, I think we should make the following spec work, since it names the field in a field predicate rather than in an expression string:

{
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "x": {"field": "Miles_per_Gallon", "bin": true, "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"},
    "color": {
      "condition": {"test": {"field": "Cylinders", "lt": 5}, "value": "steelblue"},
      "value": "darkorange"
    }
  }
}

However, implementing such parsing for expression strings would perhaps make the code too complicated.
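The difference in effort can be sketched in a few lines of Python. With a structured predicate the field name is available directly, including under `and` / `or` / `not` compositions, while an expression string would need a parser. The function name `fields_in_predicate` is hypothetical, and the sketch deliberately returns nothing for strings.

```python
def fields_in_predicate(test):
    """Collect field names from a structured predicate.

    Handles field predicates ({"field": ..., "lt": ...}) and logical
    compositions ({"and": [...]}, {"or": [...]}, {"not": ...}).
    Expression strings are skipped: extracting fields from them would
    require parsing the expression.
    """
    if not isinstance(test, dict):
        return []
    if "field" in test:
        return [test["field"]]
    fields = []
    for op in ("and", "or"):
        for operand in test.get(op, []):
            fields += fields_in_predicate(operand)
    if "not" in test:
        fields += fields_in_predicate(test["not"])
    return fields


print(fields_in_predicate({"field": "Cylinders", "lt": 5}))
print(fields_in_predicate({"and": [
    {"field": "Cylinders", "lt": 5},
    {"not": {"field": "Origin", "equal": "USA"}},
]}))
print(fields_in_predicate("datum.Cylinders < 5"))  # string test: nothing extracted
```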