scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License
109 stars 9 forks source link

Logscale not working properly #37

Closed imandr closed 6 years ago

imandr commented 6 years ago

This fragment plots empty histogram in log scale, but works fine in linear scale:

from histbook import Hist, bin, beside
from vega import VegaLite
import numpy as np

h = Hist(bin("n", 10, 0, 10))
h.fill(n=np.array([3,3,3]))
beside(h.step(), h.step(yscale={"type":"log"})).to(VegaLite)

visualization 2

If I fill it with [3,3,3,2,4], then the logscale plot is not empty, but it is incorrect.

visualization 1

Also, I noticed that logscale does not work well with errors enabled until all bins have non-zero counts.

jpivarski commented 6 years ago

I see what's going on here— this isn't going to be easy. Vega didn't expect any data to be non-positive in log mode, and if you give it any such points, the whole plot disappears. So I've been removing the zero and negative points before feeding it to Vega, but that causes the artifacts you see here: it draws a line between the two existing points, which is wrong.

Maybe I could try another tact: replace non-positive values with an extremely small value and set the axis range to not include that artificial value. At least then, the lines for empty histogram bins would go down.

jpivarski commented 6 years ago

Try the branch issue-37. I'm too uncertain of this change to put it into master just before getting on a plane.

I've removed the code that drops non-positive points when in log scale and set "clip": True to all marks. The "anchors" at the beginning and end of the step-line are set to numpy.nextafter(0, 1), rather than 0; this is the smallest positive number that can be represented by floating point.

On top of that, I don't have code yet to set the y domain. It can't be automatically determined from the data because Vega-Lite isn't smart enough to ignore non-positive points. You'd have to pass {"type": "log", "domain": [0.1, 100]} for now. I'll add that code later, but it will be invasive because I'll have to take a union of all overlaid plots. For now, you can set the y domain yourself to see if it's working for you.

Finally, I don't see it rendering correctly in VegaScope, only in the online Vega-Lite editor. I may need to upgrade the version of Vega-Lite that VegaScope uses by default. But you're using ipyvega in Jupyter Notebook, so it would be a different version for you anyway.

Log histograms are another one of those things where HEP has to lead the way! (I remember a statistician saying once that a histogram plotted on a logarithmic axis is "completely meaningless" because the area doesn't correspond to probability. :)

imandr commented 6 years ago

It's better, but still not quite right:

h = Hist(bin("n", 10, 0, 10))
h.fill(n=np.array([3, 3, 1]))
hmin, hmax = 0.0, 0.0
if h._content is not None:
    hmin = min(h._content[h._content > 0])
    hmax = max(h._content[h._content > 0])
beside(h.step(), h.step(yscale={"type":"log", "domain":[hmin*0.9, hmax*1.1]})).to(canvas)

visualization 2

imandr commented 6 years ago

And it still is not working with errors displayed:

h = Hist(bin("n", 10, 0, 10))
h.fill(n=np.array([3, 3, 1]))
hmin, hmax = 0.0, 0.0
if h._content is not None:
    hmin = min(h._content[h._content > 0])
    hmax = max(h._content[h._content > 0])
beside(h.step(error=True), 
       h.step(yscale={"type":"log", "domain":[hmin*0.9, hmax*1.1]}, error=True)).to(canvas)

visualization 3

Besides, this is unrelated to the logscale issue, but I think error bars should be drawn in the same color as the steps

imandr commented 6 years ago

As for the value of logscale in histograms, it serves 2 purposes:

  1. It lets you see low tails of very non-uniform distributions
  2. It lets you see if the distribution is exponential or normal or something else
jpivarski commented 6 years ago

Oh, I know the usefulness. I'm just complaining that an easy thing to get right hasn't been handled yet.

It could be that the smallest possible floating point number is getting lost in round-off. I'll try something based on the data themselves. If you find something that works in the Vega editor (clock on the "editor" link), let me know and I'll rewrite the histbook code to generate that JSON.

About the error bars, they're currently emulated with different layers, the recommendation in the current examples. There's been a lot of activity in Vega-Lite about making error bars a first class citizen in the language, at which point we can dump the emulated method that doesn't match colors.

imandr commented 6 years ago

Here is what works in the editor:

First determine the range of actual data:

hmin, hmax = 0.0, 0.0
if h._content is not None:
    hmin = min(h._content[h._content > 0])
    hmax = max(h._content[h._content > 0])

Then generate JSON like this with domain=[hmin 0.9, hmax 1.1] and replacing all 0 values with hmin*0.09

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "encoding": {
    "y": {
      "field": "a1",
      "scale": {"domain": [0.9, 2.2], "type": "log"},
      "type": "quantitative",
      "axis": {"title": "entries per bin"}
    },
    "x": {
      "field": "a0",
      "scale": {"zero": false},
      "type": "quantitative",
      "axis": {"title": "n"}
    }
  },
  "data": {
    "values": [
      {"a1": 0.09, "a0": 0},
      {"a1": 0.09, "a0": 1e-9},
      {"a1": 1, "a0": 1},
      {"a1": 0.09, "a0": 2},
      {"a1": 2, "a0": 3},
      {"a1": 0.09, "a0": 4},
      {"a1": 0.09, "a0": 5},
      {"a1": 0.09, "a0": 6},
      {"a1": 0.09, "a0": 7},
      {"a1": 0.09, "a0": 8},
      {"a1": 0.09, "a0": 9},
      {"a1": 0.09, "a0": 10}
    ]
  },
  "transform": [],
  "mark": {"type": "line", "clip": true, "interpolate": "step-before"}
}
jpivarski commented 6 years ago

Thanks! I was working offline and just jumped online to git push a solution (I don't know how long the connection will last).

Just like you, I ended up working around Vega-Lite by putting the values well below any data (even if you have weight=1e-10 and the histogram bin contents are tiny numbers). Then I also have to track all of the real y values to manually set the domain at 20% of min and max (you chose 10%). As it turns out, Vega-Lite picks round numbers beyond the min and max we give it.

Someday, Vega-Lite will natively support error bars and logarithmic scales with not-all-positive data. For now (and because users will surely encounter old versions of Vega-Lite), we have this work-around.

This will be in a separate branch until you've done a fair amount of plotting with it. On Monday, I'll go through the README tutorial again and make sure everything still looks okay before pushing it to master and a new release. (I really need a good way to have automated tests on graphical functions!)

Oh— I haven't tested overlays in this mode, though I would guess that it takes the union of the two domains (since it doesn't use them as strict edges...).

imandr commented 6 years ago

Unfortunately, overlays do not quite work:

h1 = Hist(bin("n", 10, 0, 10))
h1.fill(n=np.array([3,1,2]))
h2 = Hist(bin("n", 10, 0, 10))
h2.fill(n=np.array([5, 4, 4, 1]*100))
beside(
    h1.step(yscale={"type":"log"}),
    h2.step(yscale={"type":"log"}),
    overlay(h1.step(yscale={"type":"log"}), h2.step(yscale={"type":"log"}))
).to(canvas)

visualization 4

jpivarski commented 6 years ago

Instead of setting a baseline to be five times as far from the min as the difference between min and max (allowing you to see the baseline in the third plot), I've set it to the minimum possible 32-bit floating point number (as opposed to the minimum 64-bit floating point number, which had numerical instability). That number is 1e-45.

vegascope

Since these are histograms, the minimum nonzero value will usually be 1. The only exception is if the histogram is weighted: then you can see that artifact.

from histbook import *
Hist(bin("n", 10, 0, 10), fill=[5], weight=2e-45).step(yscale="log").vegascope()

vegascope 1

I think this is the best we can do.

imandr commented 6 years ago

Here is another issue with logscale. Sometimes it sets the high end of the domain too high:

visualization 5

code:

basic_cut_narrow.overlay("category").step("basic_cut_acceptance", yscale={"type":"log"}).to(VegaLite)
jpivarski commented 6 years ago

Most of that change is unrelated to your fix. The part that addresses this issue is line 713 of vega.py: I only give Vega-Lite a minimal y domain now, without padding, because Vega-Lite itself pads the domain.

imandr commented 6 years ago

Another thing with logscale. This time it is with line(). Here is the code and the output:

h2 = Hist(bin("n", 10, 0, 10))
h2.fill(n=np.array([2,2,2, 3,3, 5,5,5, 6]*100))
h2.line(yscale={"type":"log"}).to(VegaLite)

visualization 6

The lines drop down to the bottom is not right. I would rather see something like this:

download

I did this in the VegaEditor by replacing "zero" values with nulls:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "encoding": {
    "y": {
      "field": "a1",
      "scale": {"domain": [80.27415617602307, 2700], "type": "log"},
      "type": "quantitative",
      "axis": {"title": "entries per bin"}
    },
    "x": {
      "field": "a0",
      "scale": {"zero": false},
      "type": "quantitative",
      "axis": {"title": "n"}
    }
  },
  "data": {
    "values": [
      {"a1": null, "a0": 0.5},
      {"a1": null, "a0": 1.5},
      {"a1": 300, "a0": 2.5},
      {"a1": 200, "a0": 3.5},
      {"a1": null, "a0": 4.5},
      {"a1": 300, "a0": 5.5},
      {"a1": 100, "a0": 6.5},
      {"a1": null, "a0": 7.5},
      {"a1": null, "a0": 8.5},
      {"a1": null, "a0": 9.5}
    ]
  },
  "transform": [],
  "mark": {"type": "line", "clip": true}
}
imandr commented 6 years ago

Notice that the high end of the domain in the above JSON is too high. It does not need to be 2700.

jpivarski commented 6 years ago

Woah! It accepts nulls as "don't draw me?" That's great! I'll do that for line and marker, but leave the 1e-45 baseline for step and area.

jpivarski commented 6 years ago

Should be good now.