scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Improved support for binned data in Vega-Lite #49

Closed: domoritz closed this issue 5 years ago

domoritz commented 5 years ago

Vega-Lite 3 comes with improved support for prebinned data. You can find an example at https://vega.github.io/vega-lite/docs/bin.html#binned. I think this could be relevant here.

I'm one of the authors of Vega-Lite and if there is anything we can help with, please let me know.

jpivarski commented 5 years ago

Wow, that's great! I'm looking forward to using it (and simplifying some of the logic in histbook).

Is there an option to produce the sort of "skyline" outlines that are popular in physics, as opposed to filled areas or bar graphs with lots of vertical lines? (Like on the histbook GitHub page?)

domoritz commented 5 years ago

Can you use a line chart with a step function?

You might also be interested in https://github.com/dhuppenkothen/altaircorner/blob/master/README.md from @dhuppenkothen. There is some overlap, and she has shown some plots that use lines for histograms. Maybe she can share some of the screenshots here.

jpivarski commented 5 years ago

That's what we currently do (line with step), though the first and last bins have to be handled specially to make them drop to zero on the edges, and logarithmic y axes are a problem. That's what I meant about simplifying the logic in histbook: we currently put in fake data points and augment zero bins to get the right picture using a line with a step function.
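For readers unfamiliar with the workaround, here is a minimal sketch of the kind of fake-point padding described above. It is not histbook's actual code: the function name, the `x`/`y`/`order` field names, and the choice of step-after interpolation are all assumptions made for illustration.

```python
def skyline_points(edges, counts):
    """Return rows for a step-after line that outlines a histogram.

    edges: n + 1 bin edges; counts: n bin contents.
    Hypothetical helper, not part of histbook or Vega-Lite.
    """
    if len(counts) == 0 or len(edges) != len(counts) + 1:
        raise ValueError("need n >= 1 counts and exactly n + 1 edges")
    rows = [(edges[0], 0)]                 # fake point: start on the axis
    rows += list(zip(edges[:-1], counts))  # real points: one per bin, at its left edge
    rows.append((edges[-1], counts[-1]))   # fake point: repeat the last count so the last bin gets a flat top
    rows.append((edges[-1], 0))            # fake point: drop back to the axis
    # "order" records the intended drawing sequence, since two rows share
    # the same x at each end of the outline.
    return [{"x": x, "y": y, "order": i} for i, (x, y) in enumerate(rows)]
```

Only the middle rows are real data; the other three exist purely for the picture, and the zero-valued ones are exactly the points that cannot be placed on a logarithmic y axis.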

If you like, I can give you examples to stress-test your binned data visualization. Can you show me a minimal example in the Vega Editor with embedded data that I can change to show you the kinds of corner cases we typically encounter in our field?

domoritz commented 5 years ago

Ahh, I understand now. I guess what you would want is a step function that puts the step in the middle between consecutive points. Then you could just replace a bar chart with a line.

Vega-Lite can't do anything about this but you could file an issue in Vega about this and once it's supported there it would work in Vega-Lite as well.

You can open the example in the docs in the editor. https://vega.github.io/vega-lite/docs/bin.html#binned

jpivarski commented 5 years ago

I'll do that: this new syntax would allow us to write Vega-Lite files that plot what we mean, rather than trickery with fake data points. I'd rather wait for the right solution than continue hacking.

In the meantime, here's an example of an issue with log scales. With binned data, it's easy to have a bin with a count of zero, and the scale should adjust to balance the positive values, rather than showing a blank canvas. Data from particle physics Monte Carlo generators are often weighted by positive and negative weights, so in this case, it's also possible to get a bin with a negative count (from a statistical fluctuation). For this to be useful in particle physics, it should set the dependent variable scale from the positive bin values, not all bin values. Can that be done in Vega-Lite without upstreaming it to Vega?
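As a stopgap on the data-producing side, the scale limits can be computed from the positive bins only and written into the spec explicitly. This is a minimal sketch under that assumption; the function name and the `pad` factor are hypothetical, and it does not change how Vega-Lite itself picks a domain.

```python
def positive_log_domain(counts, pad=2.0):
    """Return [low, high] for a log axis using only positive counts.

    Zero and negative bins (e.g. from negatively weighted Monte Carlo
    events) are ignored. Returns None if no bin is positive.
    """
    positive = [c for c in counts if c > 0]
    if not positive:
        return None
    # Leave some headroom above and below on the log axis.
    return [min(positive) / pad, max(positive) * pad]
```

The returned pair could then be supplied as the `domain` of the y scale, next to `"type": "log"`, in the same way the x scale domain is set by hand elsewhere in this thread.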

For that matter, where do I submit these issues? Here (for Vega)?

domoritz commented 5 years ago

> For that matter, where do I submit these issues? Here (for Vega)?

Yes

jpivarski commented 5 years ago

Actually, this example with lines looks pretty good. The end-point issues are that the line doesn't go to zero at the left and right (minor visual nicety) and that the last bin isn't shown with a horizontal line (major).

I didn't find any Vega (not Vega-Lite) types for dealing with binned data. The Vega-Lite encoding/bin = binned translates to Vega; why not translate to a line with that one extra point at the end so that the last bin is shown? Should I raise that as a Vega-Lite issue instead? If it's to be a Vega issue, what should I be asking for, a "step-mid" option for mark = line?

domoritz commented 5 years ago

> why not translate to a line with that one extra point at the end so that the last bin is shown?

Excellent question. The reason is that Vega-Lite is agnostic to the data. It actually never looks at the data, so there is no clean way for Vega-Lite to add a data point.

> If it's to be a Vega issue, what should I be asking for, a "step-mid" option for mark = line?

I think that makes sense. If you describe what the goal is, @jheer might have a suggestion I haven't thought of.

domoritz commented 5 years ago

> and the scale should adjust to balance the positive values, rather than showing a blank canvas

Can you elaborate? Do you want https://github.com/vega/vega/issues/1277?

domoritz commented 5 years ago

I think the prebinned data support is really only relevant for bar charts. For line charts we probably want to support specs like this one:

{
  "data": {
    "values": [
      {"binned": 8,  "count": 7},
      {"binned": 10, "count": 1},
      {"binned": 12, "count": 71},
      {"binned": 14, "count": 127},
      {"binned": 16, "count": 94},
      {"binned": 18, "count": 54},
      {"binned": 20, "count": 17},
      {"binned": 22, "count": 5},
      {"binned": 24, "count": 5}
    ]
  },
  "mark": {"type": "line", "interpolate": "step-middle", "point": true},
  "encoding": {
    "x": {
      "field": "binned",
      "type": "quantitative",
      "scale": {"domain": [5, 26]}
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "scale": {"type": "log"}
    }
  }
}

jpivarski commented 5 years ago

> Can you elaborate? Do you want vega/vega#1277?

No, that's different. This would be to exclude data points with non-positive counts when the log scale is present. I've been trying to get it to work by fiddling with the compiled Vega in the Editor. There's something promising here called "signals" but I haven't gotten it to work as an example.

In the bar chart rendering, I'd want any data point with non-positive count to simply be excluded. In the line rendering, I'd want it to look like the line goes to -inf, which is perhaps more difficult. (If -inf is not a legal coordinate, then knowledge of the window range would be necessary.)
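For the line case, one way to approximate "goes to -inf" when the window range is known is to substitute a value below the visible minimum for every non-positive bin. This is only a sketch of that idea; the function name and the factor of ten are assumptions, and whether the renderer then clips the segment cleanly at the bottom of the plot is not verified here.

```python
def floor_for_log(counts, domain_min):
    """Replace non-positive counts with a value below the visible log range.

    domain_min: lowest y value shown on the log axis (must be > 0).
    Hypothetical helper; the substituted value is for drawing only and
    must not be read back as data.
    """
    if domain_min <= 0:
        raise ValueError("a log axis needs a positive lower limit")
    floor = domain_min / 10.0  # one decade below the visible window
    return [c if c > 0 else floor for c in counts]
```

The drawback is the same as with the other fake points: the plotted values no longer equal the published bin contents.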

You're faster than me; I'll commit and then try the example you just sent.

jpivarski commented 5 years ago

The only problem with your line-chart example is that it doesn't show where the first bin starts and the last bin ends.

I'm willing to give up on the "skyline" visualization if the log scale issue can be fixed and I can remove the gap between bins. The reason for the skyline is that we often have hundreds of bins, and all of those vertical gaps are distracting. I see that the binning implementation is making rectangles. Without a gap, they'd exactly line up and we'd get solid color, which is recognizable as a distribution. There's a historical/aesthetic appeal to seeing an outline of that solid area, but the original motivation for that might have been the pen-and-ink based line printers that drew the first HIGZ histograms!

I found the answer to the gap problem: "mark": {"type": "bar", "binSpacing": 0}. And maybe I can put in some sort of "filter" for counts > 0 (though it would be nice if that were automatic for any scale that can't display non-positive values).

jpivarski commented 5 years ago

Aha! Maybe I have nothing to request.

{
  "data": {
    "values": [
      {"bin_start": 8,  "bin_end": 10, "count": 7},
      {"bin_start": 10, "bin_end": 12, "count": 0},
      {"bin_start": 12, "bin_end": 14, "count": 71},
      {"bin_start": 14, "bin_end": 16, "count": 127},
      {"bin_start": 16, "bin_end": 18, "count": 94},
      {"bin_start": 18, "bin_end": 20, "count": 54},
      {"bin_start": 20, "bin_end": 22, "count": 17},
      {"bin_start": 22, "bin_end": 24, "count": 5}
    ]
  },
  "transform": [{"filter": {"field": "count", "gt": 0}}],
  "mark": {"type": "bar", "binSpacing": 0},
  "encoding": {
    "x": {
      "field": "bin_start",
      "bin": "binned",
      "type": "quantitative",
      "axis": {
        "tickStep": 2
      }
    },
    "x2": {
      "field": "bin_end",
      "type": "quantitative"
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "scale": {"type": "log"}
    }
  }
}
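
A spec of this shape is straightforward to generate from bin edges and counts. The following is a minimal sketch, assuming plain Python lists as input; the function name and the `log_y` parameter are hypothetical, while the field names, `binSpacing`, `"bin": "binned"`, and the filter are copied from the spec above.

```python
import json


def binned_bar_spec(edges, counts, log_y=True):
    """Build a Vega-Lite spec for prebinned data as gap-free bars."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    values = [
        {"bin_start": lo, "bin_end": hi, "count": c}
        for lo, hi, c in zip(edges[:-1], edges[1:], counts)
    ]
    y = {"field": "count", "type": "quantitative"}
    spec = {
        "data": {"values": values},
        "mark": {"type": "bar", "binSpacing": 0},
        "encoding": {
            "x": {"field": "bin_start", "bin": "binned", "type": "quantitative"},
            "x2": {"field": "bin_end", "type": "quantitative"},
            "y": y,
        },
    }
    if log_y:
        # Non-positive counts cannot be drawn on a log axis, so drop them
        # from the rendering; data.values still lists every bin.
        y["scale"] = {"type": "log"}
        spec["transform"] = [{"filter": {"field": "count", "gt": 0}}]
    return spec


if __name__ == "__main__":
    print(json.dumps(binned_bar_spec([8, 10, 12, 14], [7, 0, 71]), indent=2))
```

Because the filter only affects rendering, the embedded table keeps the zero bin.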

domoritz commented 5 years ago

For the step-middle interpolation, I'd expect it to look like this:

[image: expected step-middle rendering]

domoritz commented 5 years ago

> Aha! Maybe I have nothing to request.

Sweet. I still think that a line would be nicer for log scales: people perceive the length of the bars, but on a log scale the bars do not start at 0, so their lengths can't be compared. I think asking for step-middle interpolation is still a good idea. One issue with the interpolator is that you need to know how far to extend the first and last data points.
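To illustrate the extension problem: if the x values are treated as bin centers, one plausible rule is to extend each end by half the distance to its neighbor. This is a sketch of that rule only, not of anything Vega implements; the function name is hypothetical.

```python
def step_middle_extent(centers):
    """Guess where the first bin starts and the last bin ends.

    Assumes the end bins are as wide as the spacing to their neighbors,
    which the centers alone cannot confirm.
    """
    if len(centers) < 2:
        raise ValueError("need at least two centers to infer a width")
    left = centers[0] - (centers[1] - centers[0]) / 2.0
    right = centers[-1] + (centers[-1] - centers[-2]) / 2.0
    return left, right
```

For evenly spaced centers this recovers the true edges, but for variable-width bins it is only a guess, which is why explicit bin edges in the data are more reliable.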

The cleanest solution for this project may be to use step-after and repeat the first point.

jpivarski commented 5 years ago

Duplicating the last point with step-before is what I had been doing (and then adding two more points at zero to make the lines drop down to zero as a final touch).

I want to use the new binned functionality because then the data are exactly what we mean: someone could pick up the Vega-Lite JSON and use it as a table of published values. For example, a theorist could fit their data to values published by an experimenter. That's what attracted me to Vega-Lite in the first place.
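As a sketch of what "use it as a table" could look like, the bin edges and counts can be read straight back out of a spec whose rows carry explicit start and end values. The function name is hypothetical, and the field names `bin_start`, `bin_end`, and `count` are assumed to match the binned example earlier in this thread.

```python
def table_from_spec(spec):
    """Recover (edges, counts) from a Vega-Lite spec with prebinned rows.

    Assumes inline data under spec["data"]["values"] and contiguous bins.
    """
    rows = sorted(spec["data"]["values"], key=lambda r: r["bin_start"])
    if not rows:
        raise ValueError("spec contains no bins")
    for prev, cur in zip(rows, rows[1:]):
        if prev["bin_end"] != cur["bin_start"]:
            raise ValueError("bins are not contiguous")
    edges = [r["bin_start"] for r in rows] + [rows[-1]["bin_end"]]
    counts = [r["count"] for r in rows]
    return edges, counts
```

No fake points have to be stripped out first, which is the advantage over the step-line encoding.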

I'll see if my colleagues can get used to the area-color representation. We can probably also put horizontal lines on the tops of the bars (lines between x and x2) to get them to stand out if the color is light.
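A rough sketch of that idea: wrap the bar spec in a layer and add a rule mark that reuses the same x, x2, and y encodings, so each bar gets a horizontal line along its top. The function name and the `color` parameter are hypothetical, and the result has not been checked in the Vega Editor.

```python
import copy


def with_bar_tops(spec, color="black"):
    """Return a layered spec: the original bars plus a rule on each bar top.

    Expects a single-view spec with "mark" and an "encoding" that has
    x, x2 and y channels. The input spec is not modified.
    """
    enc = spec["encoding"]
    bars = {"mark": copy.deepcopy(spec["mark"]), "encoding": copy.deepcopy(enc)}
    tops = {
        "mark": {"type": "rule", "color": color},
        # A rule with x, x2 and y is a horizontal segment from x to x2 at y.
        "encoding": {k: copy.deepcopy(enc[k]) for k in ("x", "x2", "y")},
    }
    # Keep data and transform at the top level so both layers share them.
    out = {k: copy.deepcopy(v) for k, v in spec.items() if k not in ("mark", "encoding")}
    out["layer"] = [bars, tops]
    return out
```

This marks the tops only; it does not draw the vertical segments between neighboring bins, so it is not a full skyline outline.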

Thanks for all your suggestions!

domoritz commented 5 years ago

Sounds good. There are no immediate TODOs for Vega-Lite then, right? If anything comes up, feel free to file an issue anytime. I want to support this project as much as I can.

jpivarski commented 5 years ago

No TODOs that I know of, unless you want me to request the "transform": [{"filter": {"field": "count", "gt": 0}}] to be automatically added when "scale": {"type": "log"}.

I've linked this conversation here: https://gitter.im/HSF/PyHEP-histogramming where a group of us in HEP are coordinating to build a better histogramming environment in Python.

domoritz commented 5 years ago

> unless you want me to request the "transform": [{"filter": {"field": "count", "gt": 0}}] to be automatically added when "scale": {"type": "log"}.

I think explicit is better than implicit here.
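In practice that leaves the job to whatever generates the spec. A minimal sketch of such a helper follows, assuming a single-view spec held as a Python dict; the function name and the `channel` parameter are hypothetical.

```python
def add_positive_filter(spec, channel="y"):
    """Return a copy of spec with an explicit "> 0" filter for a log channel.

    Leaves the spec untouched if the channel is not on a log scale, and
    never adds the same filter twice.
    """
    enc = spec.get("encoding", {}).get(channel, {})
    field = enc.get("field")
    if field is None or enc.get("scale", {}).get("type") != "log":
        return spec
    new_filter = {"filter": {"field": field, "gt": 0}}
    transforms = list(spec.get("transform", []))
    if new_filter not in transforms:
        transforms.append(new_filter)
    out = dict(spec)  # shallow copy; only "transform" is replaced
    out["transform"] = transforms
    return out
```

The filter still appears in the emitted JSON, so anyone reading the spec can see that non-positive bins were dropped from the rendering.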

domoritz commented 5 years ago

How do you like the step interpolation?

{
  "data": {
    "values": [
      {"binned": 6,  "count": 7},
      {"binned": 8,  "count": 7},
      {"binned": 10, "count": 1},
      {"binned": 12, "count": 71},
      {"binned": 14, "count": 127},
      {"binned": 16, "count": 94},
      {"binned": 18, "count": 54}
    ]
  },
  "mark": {"type": "line", "interpolate": "step", "point": true},
  "encoding": {
    "x": {
      "field": "binned",
      "type": "quantitative",
      "scale": {"domain": [5, 26]}
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "scale": {"type": "log"}
    }
  }
}

https://vega.github.io/editor/#/url/vega-lite/N4KABGBEAmCGAutIC4yghSA3WAbArgKYDOKYA2uBhMJAEYCWAdk4dGQGwA0mAxgPb4m8MgHYAvlyrVajFmzIAOHlAFCRqCVOo16zVu1QBGAAw9Ia4WSOTpGWfoXGATOcsawom9p0P5hsCMAFjdBKxctO105A2tuVTCPAE4g2x00PX9rZQT1MgBWVLsAXSo0qABbWAAnAGsyWngATwAHQjJIXGZ282Z4QmqW-lwEdtRIYn6WyHMhvrJ4aqJyyEImAWhmAHMGu0gAD130yAAzBkJcAMzYn2pIZraOgEd8WGEGRHgGLB6oqGJeHgxhloPwqswyOR8jxnBxiuI7OVME0jjpTudLh13DM-vdWsDIC83l9Pt9fscAUCGnjHuNcPwdgjqEywAjxEA

[image: rendering of the step-interpolated line chart above]

jpivarski commented 5 years ago

For one thing, it doesn't have the interaction with log scales that I'm looking for: turn the bin with 1 count into 0 counts and the scale is wrong. Introduce the "transform": [{"filter": {"field": "count", "gt": 0}}] and the graphic is highly misleading (because it makes a continuous line between the two bins surrounding the 0 bin).

As I see it, the complete list of issues with making a line chart represent binned data is:

Basically, we want it to look like the following. (Note the line-printer friendly fonts!)

Here's a more recent example (also from the HEP community):

If it was just one issue, I'd be more tempted to try to get it fixed. There are enough issues that it sounds like an uphill battle or worse, using the wrong tool.

Ideally, we'd like the mark: {type: bar} to support a new graphical mode, one that takes a union of all overlapping rectangles and draws a stroke around the outline. I know that can't be done at the Vega-Lite level; it most likely needs to be implemented in JavaScript. That's why I'm going to see if we can make do with the shaded region.

domoritz commented 5 years ago

Thank you for explaining the reasoning behind these requirements. Bars instead of lines seem to work really well, which is nice. It sounds like for lines you have special requirements and it makes sense to treat Vega-Lite more as a drawing library as you already did. Everything you describe seems expressible with some adjustments to the data (e.g. adding an extra datapoint).

> or worse, using the wrong tool.

I would hope that this is not the case, and I will work hard to make sure Vega-Lite makes sense as a tool for you. If Vega-Lite turns out not to be expressive enough, you could look into using Vega directly.

> Note the line-printer friendly fonts!

You can change the fonts if that's a concern.