vega / vega-lite

A concise grammar of interactive graphics, built on Vega.
https://vega.github.io/vega-lite/
BSD 3-Clause "New" or "Revised" License

boxplot doesn't work with column encoding #4156

Closed iliatimofeev closed 5 years ago

iliatimofeev commented 6 years ago

boxplot doesn't work with column encoding and facet. The result is Error: Undefined data set name: "data_1" (see editor).

ijlyttle commented 6 years ago

I know this is likely overkill, but just to note that the same problem exists for row.

domoritz commented 6 years ago

The issue seems to not be in the boxplot logic.

domoritz commented 6 years ago

Here is a normalized spec that shows the issue:

```json
{
  "data": {
    "values": [
      {"homework_done": false, "session_time_m": 2, "session_hour": 1},
      {"homework_done": false, "session_time_m": 0, "session_hour": 2}
    ]
  },
  "$schema": "https://vega.github.io/schema/vega-lite/v3.0.0.json",
  "facet": {"column": {"type": "nominal", "field": "session_hour"}},
  "spec": {
    "layer": [
      {
        "transform": [
          {
            "aggregate": [
              {"op": "q1", "field": "session_time_m", "as": "lower_box_session_time_m"},
              {"op": "q3", "field": "session_time_m", "as": "upper_box_session_time_m"},
              {"op": "median", "field": "session_time_m", "as": "mid_box_session_time_m"},
              {"op": "min", "field": "session_time_m", "as": "min_session_time_m"},
              {"op": "max", "field": "session_time_m", "as": "max_session_time_m"}
            ],
            "groupby": ["homework_done"]
          },
          {
            "calculate": "datum.upper_box_session_time_m - datum.lower_box_session_time_m",
            "as": "iqr_session_time_m"
          },
          {
            "calculate": "min(datum.upper_box_session_time_m + datum.iqr_session_time_m * 1.5, datum.max_session_time_m)",
            "as": "upper_whisker_session_time_m"
          },
          {
            "calculate": "max(datum.lower_box_session_time_m - datum.iqr_session_time_m * 1.5, datum.min_session_time_m)",
            "as": "lower_whisker_session_time_m"
          }
        ],
        "layer": [
          {
            "mark": {"type": "rule", "style": "boxplot-rule"},
            "encoding": {
              "y": {"field": "lower_whisker_session_time_m", "type": "quantitative", "title": "session_time_m"},
              "y2": {"field": "lower_box_session_time_m", "type": "quantitative"},
              "x": {"field": "homework_done", "type": "nominal", "title": "homework_done"}
            }
          },
          {
            "mark": {"type": "rule", "style": "boxplot-rule"},
            "encoding": {
              "y": {"field": "upper_box_session_time_m", "type": "quantitative", "title": "session_time_m"},
              "y2": {"field": "upper_whisker_session_time_m", "type": "quantitative"},
              "x": {"field": "homework_done", "type": "nominal", "title": "homework_done"}
            }
          },
          {
            "mark": {"type": "bar", "size": 14, "style": "boxplot-box"},
            "encoding": {
              "y": {"field": "lower_box_session_time_m", "type": "quantitative", "title": "session_time_m"},
              "y2": {"field": "upper_box_session_time_m", "type": "quantitative"},
              "x": {"field": "homework_done", "type": "nominal", "title": "homework_done"}
            }
          },
          {
            "mark": {"color": "white", "type": "tick", "size": 14, "orient": "horizontal", "style": "boxplot-median"},
            "encoding": {
              "y": {"field": "mid_box_session_time_m", "type": "quantitative", "title": "session_time_m"},
              "x": {"field": "homework_done", "type": "nominal", "title": "homework_done"}
            }
          }
        ]
      },
      {
        "transform": [
          {
            "window": [
              {"op": "q1", "field": "session_time_m", "as": "lower_box_session_time_m"},
              {"op": "q3", "field": "session_time_m", "as": "upper_box_session_time_m"}
            ],
            "frame": [null, null],
            "groupby": ["homework_done"]
          },
          {
            "filter": "(datum.session_time_m < datum.lower_box_session_time_m - 1.5 * (datum.upper_box_session_time_m - datum.lower_box_session_time_m)) || (datum.session_time_m > datum.upper_box_session_time_m + 1.5 * (datum.upper_box_session_time_m - datum.lower_box_session_time_m))"
          }
        ],
        "mark": {"type": "point", "style": "boxplot-outliers"},
        "encoding": {
          "y": {"field": "session_time_m", "type": "quantitative"},
          "x": {"field": "homework_done", "type": "nominal", "title": "homework_done"}
        }
      }
    ]
  }
}
```

domoritz commented 6 years ago

Here is a small example

```json
{
  "data": {
    "values": [
      {
        "homework_done": false,
        "session_time_m": 2,
        "session_hour": 1
      },
      {
        "homework_done": false,
        "session_time_m": 0,
        "session_hour": 2
      }
    ]
  },
  "$schema": "https://vega.github.io/schema/vega-lite/v3.0.0.json",
  "facet": {
    "column": {
      "type": "nominal",
      "field": "session_hour"
    }
  },
  "spec": {
    "layer": [
      {
        "transform": [
          {
            "aggregate": [
              {
                "op": "median",
                "field": "session_time_m",
                "as": "mid_box_session_time_m"
              }
            ],
            "groupby": [
              "homework_done"
            ]
          }
        ],
        "layer": [
          {
            "mark": {
              "type": "tick"
            },
            "encoding": {
              "y": {
                "field": "mid_box_session_time_m",
                "type": "quantitative"
              },
              "x": {
                "field": "homework_done",
                "type": "nominal"
              }
            }
          }
        ]
      },
      {
        "transform": [
          {
            "window": [
            ],
            "groupby": [
              "homework_done"
            ]
          }
        ],
        "mark": {
          "type": "point"
        },
        "encoding": {
          "y": {
            "field": "session_time_m",
            "type": "quantitative"
          },
          "x": {
            "field": "homework_done",
            "type": "nominal"
          }
        }
      }
    ]
  }
}
```

domoritz commented 6 years ago

Hmm, this doesn't look fun.

[Screenshot: 2018-09-21 at 23:45:17]

domoritz commented 6 years ago

Hmm, weird. We have a data_1 after facet but somehow Vega doesn't find it. I thought that worked.

domoritz commented 6 years ago

Ahh, the problem is the scales. We have a scale at the top-level spec, but it reads data from data_1, which is defined in an inner scope. What's weird is that I thought we were making a copy of the dataflow for this reason. When I change the domain to use data_3, it works.
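To make the scoping problem concrete, here is a minimal Python sketch that walks a compiled Vega spec and reports scales whose domain refers to a dataset that is not visible at the scale's own level. It assumes a simplified model of Vega scoping (a dataset is visible in the group that defines it and in nested groups), and the function name `find_unresolved_scale_data` and the tiny example spec are hypothetical illustrations, not Vega-Lite compiler code or output.

```python
def find_unresolved_scale_data(spec, visible=frozenset()):
    """Return (scale name, dataset name) pairs that cannot be resolved in scope."""
    # Datasets defined at this level are visible here and in nested groups.
    visible = visible | {d["name"] for d in spec.get("data", [])}
    problems = []
    for scale in spec.get("scales", []):
        domain = scale.get("domain")
        if not isinstance(domain, dict):
            continue  # literal domains such as [0, 1] reference no data
        # A domain may name a dataset directly or list references under "fields".
        refs = [domain] + list(domain.get("fields", []))
        for ref in refs:
            if isinstance(ref, dict) and "data" in ref and ref["data"] not in visible:
                problems.append((scale["name"], ref["data"]))
    for mark in spec.get("marks", []):
        if mark.get("type") == "group":
            inner = set(visible)
            facet = mark.get("from", {}).get("facet")
            if facet:
                inner.add(facet["name"])  # the facet dataset exists only inside the group
            problems += find_unresolved_scale_data(mark, frozenset(inner))
    return problems


# Hypothetical, heavily reduced spec: the top-level scale reads data_1,
# but data_1 is only defined inside the faceted group.
spec = {
    "data": [{"name": "source_0"}],
    "scales": [{
        "name": "y",
        "domain": {"fields": [
            {"data": "data_1", "field": "mid_box_session_time_m"},
            {"data": "source_0", "field": "session_time_m"},
        ]},
    }],
    "marks": [{
        "type": "group",
        "from": {"facet": {"name": "facet", "data": "source_0", "groupby": ["session_hour"]}},
        "data": [{"name": "data_1", "source": "facet"}],
        "marks": [],
    }],
}

print(find_unresolved_scale_data(spec))  # [('y', 'data_1')]
```

A check like this only flags the symptom. The actual fix discussed in this thread is to derive the scale domain from a copy of the dataflow that lives at the top level.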

domoritz commented 6 years ago

I'll keep looking later.

domoritz commented 6 years ago

🎉

[Screenshot: 2018-09-25 at 18:53:27]

So, the issue seems to be that we didn't correctly treat window aggregates as aggregates. Now the chart just needs a bit more data.

domoritz commented 6 years ago

Here is another example that doesn't work

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "description": "A vertical 1D box plot showing median, min, and max in the US population distribution of age groups in 2000.",
  "data": {"url": "data/population.json"},
  "mark": "boxplot",
  "encoding": {
    "y": {
      "field": "people",
      "type": "quantitative",
      "axis": {"title": "population"}
    },
    "column": {
      "field": "sex",
      "type": "ordinal"
    }
  }
}
```

domoritz commented 6 years ago

Hmm, why is people in the domain here?

```json
  "scales": [
    {
      "name": "y",
      "type": "linear",
      "domain": {
        "fields": [
          {"data": "data_1", "field": "lower_whisker_people"},
          {"data": "data_1", "field": "lower_box_people"},
          {"data": "data_1", "field": "upper_box_people"},
          {"data": "data_1", "field": "upper_whisker_people"},
          {"data": "data_1", "field": "mid_box_people"},
          {"data": "data_3", "field": "people"}
        ]
      },
      "range": [{"signal": "child_height"}, 0],
      "nice": true,
      "zero": true
    }
  ],
```

domoritz commented 6 years ago

Ahh, people is for outliers. We need to use a window to calculate an aggregate and then filter with it. The right thing for the scale is to be derived from a clone of the dataflow that is hoisted to the top. We do this for normal aggregates so let's see why this isn't happening for window aggregates.
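The difference between the two transforms can be sketched in plain Python. This is an illustration only: the helper names and sample values are made up, and the quartile method used here (`statistics.quantiles` with `method="inclusive"`) is an assumption that may not match Vega's q1/q3 exactly. An aggregate collapses the rows into one summary row, while a window with an unbounded frame keeps every row and attaches the summary to it, which is what the outlier filter needs.

```python
from statistics import quantiles

# Made-up sample values; the last one is an obvious outlier.
rows = [{"people": v} for v in [1, 2, 3, 4, 5, 6, 7, 8, 100]]


def quartiles(values):
    # Assumption: inclusive quartiles; Vega's q1/q3 may interpolate differently.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def aggregate(rows, field):
    """Like an aggregate transform: one summary row, the raw field is gone."""
    q1, q3 = quartiles([r[field] for r in rows])
    return [{"lower_box": q1, "upper_box": q3}]


def window(rows, field):
    """Like a window transform over the whole group: every row is kept."""
    q1, q3 = quartiles([r[field] for r in rows])
    return [{**r, "lower_box": q1, "upper_box": q3} for r in rows]


def outliers(rows, field):
    """Filter windowed rows with the 1.5 * IQR rule from the normalized spec."""
    result = []
    for r in window(rows, field):
        iqr = r["upper_box"] - r["lower_box"]
        too_low = r[field] < r["lower_box"] - 1.5 * iqr
        too_high = r[field] > r["upper_box"] + 1.5 * iqr
        if too_low or too_high:
            result.append(r[field])
    return result


print(len(aggregate(rows, "people")))  # 1
print(len(window(rows, "people")))     # 9
print(outliers(rows, "people"))        # [100]
```

Because the outlier layer plots the raw people values that survive the filter, that field legitimately ends up in the y scale domain alongside the aggregated box fields.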

domoritz commented 6 years ago

Here is a small spec that shows the error even when I fix the push down logic.

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "url": "data/population.json"
  },
  "facet": {
    "column": {
      "field": "sex",
      "type": "ordinal"
    }
  },
  "spec": {
    "layer": [
      {
        "transform": [
          {
            "aggregate": [
              {
                "op": "min",
                "field": "people",
                "as": "min_people"
              }
            ],
            "groupby": []
          }
        ],
        "mark": {
          "type": "tick",
          "style": "boxplot-rule"
        },
        "encoding": {
          "y": {
            "field": "min_people",
            "type": "quantitative"
          }
        }
      },
      {
        "transform": [
          {
            "window": [
              {
                "op": "q1",
                "field": "people",
                "as": "lower_box_people"
              }
            ],
            "groupby": []
          }
        ],
        "mark": {
          "type": "point",
          "style": "boxplot-outliers"
        },
        "encoding": {
          "y": {
            "field": "people",
            "type": "quantitative"
          }
        }
      }
    ]
  }
}
```

domoritz commented 6 years ago

@invokesus had a hunch that the bug may be caused by https://github.com/vega/vega-lite/pull/4029. However, going back to dad69556d doesn't seem to fix the issue with https://github.com/vega/vega-lite/issues/4156#issuecomment-424560258, but it does fix https://github.com/vega/vega-lite/issues/4156#issuecomment-423722449. So maybe https://github.com/vega/vega-lite/pull/4175 at least partially resolves the issue.

domoritz commented 6 years ago

This example works before the transform merging but not after:

```json
{
  "data": {
    "values": [
      {
        "homework_done": false,
        "session_time_m": 2,
        "session_hour": 1
      },
      {
        "homework_done": false,
        "session_time_m": 0,
        "session_hour": 2
      }
    ]
  },
  "$schema": "https://vega.github.io/schema/vega-lite/v3.0.0.json",
  "facet": {
    "column": {
      "type": "nominal",
      "field": "session_hour"
    }
  },
  "spec": {
    "layer": [
      {
        "transform": [
          {
            "aggregate": [
              {
                "op": "median",
                "field": "session_time_m",
                "as": "mid_box_session_time_m"
              }
            ],
            "groupby": []
          }
        ],
        "mark": {
          "type": "tick"
        },
        "encoding": {
          "y": {
            "field": "mid_box_session_time_m",
            "type": "quantitative"
          }
        }
      },
      {
        "transform": [
          {
            "window": [],
            "groupby": []
          }
        ],
        "mark": {
          "type": "point"
        },
        "encoding": {
          "y": {
            "field": "session_time_m",
            "type": "quantitative"
          }
        }
      }
    ]
  }
}
```

For some reason, this spec doesn't work in either case

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "url": "data/population.json"
  },
  "facet": {
    "column": {
      "field": "sex",
      "type": "ordinal"
    }
  },
  "spec": {
    "layer": [
      {
        "transform": [
          {
            "aggregate": [
              {
                "op": "min",
                "field": "people",
                "as": "min_people"
              }
            ],
            "groupby": []
          }
        ],
        "mark": {
          "type": "tick",
          "style": "boxplot-rule"
        },
        "encoding": {
          "y": {
            "field": "min_people",
            "type": "quantitative"
          }
        }
      },
      {
        "transform": [
          {
            "window": [],
            "groupby": []
          }
        ],
        "mark": {
          "type": "point"
        },
        "encoding": {
          "y": {
            "field": "people",
            "type": "quantitative"
          }
        }
      }
    ]
  }
}
```

domoritz commented 6 years ago

Wow, so with dad69556d the dataflow looks like

[Screenshot: 2018-09-25 at 20:21:23]

and with the latest dom/window-dataflow

[Screenshot: 2018-09-25 at 20:30:20]

So something is very wrong here. I'm going to wait for @invokesus to fix https://github.com/vega/vega-lite/pull/4175 and see whether that resolves the problem.

https://github.com/vega/vega-lite/pull/4177 still seems like a good idea so I'll leave it open.

domoritz commented 5 years ago

https://github.com/vega/vega-lite/pull/4177 and https://github.com/vega/vega-lite/pull/4175 will fix this.

Phew, this was one of the hardest debugging sessions I've done. It took me three days, with some really weird behavior in between. However, it exposed a few separate bugs that are all fixed now, and we have tests and helper tools to make sure we can catch this class of bugs much more easily.