Closed imandr closed 6 years ago
I see what's going on here— this isn't going to be easy. Vega didn't expect any data to be non-positive in log mode, and if you give it any such points, the whole plot disappears. So I've been removing the zero and negative points before feeding it to Vega, but that causes the artifacts you see here: it draws a line between the two existing points, which is wrong.
Maybe I could try another tact: replace non-positive values with an extremely small value and set the axis range to not include that artificial value. At least then, the lines for empty histogram bins would go down.
Try the branch issue-37. I'm too uncertain of this change to put it into master just before getting on a plane.
I've removed the code that drops non-positive points when in log scale and set "clip": True
to all marks. The "anchors" at the beginning and end of the step-line are set to numpy.nextafter(0, 1)
, rather than 0; this is the smallest positive number that can be represented by floating point.
On top of that, I don't have code yet to set the y domain. It can't be automatically determined from the data because Vega-Lite isn't smart enough to ignore non-positive points. You'd have to pass {"type": "log", "domain": [0.1, 100]}
for now. I'll add that code later, but it will be invasive because I'll have to take a union of all overlaid plots. For now, you can set the y domain yourself to see if it's working for you.
Finally, I don't see it rendering correctly in VegaScope, only in the online Vega-Lite editor. I may need to upgrade the version of Vega-Lite that VegaScope uses by default. But you're using ipyvega in Jupyter Notebook, so it would be a different version for you anyway.
Log histograms are another one of those things where HEP has to lead the way! (I remember a statistician saying once that a histogram plotted on a logarithmic axis is "completely meaningless" because the area doesn't correspond to probability. :)
It's better, but still not quite right:
h = Hist(bin("n", 10, 0, 10))
h.fill(n=np.array([3, 3, 1]))
hmin, hmax = 0.0, 0.0
if h._content is not None:
hmin = min(h._content[h._content > 0])
hmax = max(h._content[h._content > 0])
beside(h.step(), h.step(yscale={"type":"log", "domain":[hmin*0.9, hmax*1.1]})).to(canvas)
And it still is not working with errors displayed:
h = Hist(bin("n", 10, 0, 10))
h.fill(n=np.array([3, 3, 1]))
hmin, hmax = 0.0, 0.0
if h._content is not None:
hmin = min(h._content[h._content > 0])
hmax = max(h._content[h._content > 0])
beside(h.step(error=True),
h.step(yscale={"type":"log", "domain":[hmin*0.9, hmax*1.1]}, error=True)).to(canvas)
Besides, this is unrelated to the logscale issue, but I think error bars should be drawn in the same color as the steps
As for the value of logscale in histograms, it serves 2 purposes:
Oh, I know the usefulness. I'm just complaining that an easy thing to get right hasn't been handled yet.
It could be that the smallest possible floating point number is getting lost in round-off. I'll try something based on the data themselves. If you find something that works in the Vega editor (clock on the "editor" link), let me know and I'll rewrite the histbook code to generate that JSON.
About the error bars, they're currently emulated with different layers, the recommendation in the current examples. There's been a lot of activity in Vega-Lite about making error bars a first class citizen in the language, at which point we can dump the emulated method that doesn't match colors.
Here is what works in the editor:
First determine the range of actual data:
hmin, hmax = 0.0, 0.0
if h._content is not None:
hmin = min(h._content[h._content > 0])
hmax = max(h._content[h._content > 0])
Then generate JSON like this with domain=[hmin 0.9, hmax 1.1] and replacing all 0 values with hmin*0.09
{
"$schema": "https://vega.github.io/schema/vega-lite/v2.json",
"encoding": {
"y": {
"field": "a1",
"scale": {"domain": [0.9, 2.2], "type": "log"},
"type": "quantitative",
"axis": {"title": "entries per bin"}
},
"x": {
"field": "a0",
"scale": {"zero": false},
"type": "quantitative",
"axis": {"title": "n"}
}
},
"data": {
"values": [
{"a1": 0.09, "a0": 0},
{"a1": 0.09, "a0": 1e-9},
{"a1": 1, "a0": 1},
{"a1": 0.09, "a0": 2},
{"a1": 2, "a0": 3},
{"a1": 0.09, "a0": 4},
{"a1": 0.09, "a0": 5},
{"a1": 0.09, "a0": 6},
{"a1": 0.09, "a0": 7},
{"a1": 0.09, "a0": 8},
{"a1": 0.09, "a0": 9},
{"a1": 0.09, "a0": 10}
]
},
"transform": [],
"mark": {"type": "line", "clip": true, "interpolate": "step-before"}
}
Thanks! I was working offline and just jumped online to git push a solution (I don't know how long the connection will last).
Just like you, I ended up working around Vega-Lite by putting the values well below any data (even if you have weight=1e-10
and the histogram bin contents are tiny numbers). Then I also have to track all of the real y values to manually set the domain at 20% of min and max (you chose 10%). As it turns out, Vega-Lite picks round numbers beyond the min and max we give it.
Someday, Vega-Lite will natively support error bars and logarithmic scales with not-all-positive data. For now (and because users will surely encounter old versions of Vega-Lite), we have this work-around.
This will be in a separate branch until you've done a fair amount of plotting with it. On Monday, I'll go through the README tutorial again and make sure everything still looks okay before pushing it to master and a new release. (I really need a good way to have automated tests on graphical functions!)
Oh— I haven't tested overlays in this mode, though I would guess that it takes the union of the two domains (since it doesn't use them as strict edges...).
Unfortunately, overlays do not quite work:
h1 = Hist(bin("n", 10, 0, 10))
h1.fill(n=np.array([3,1,2]))
h2 = Hist(bin("n", 10, 0, 10))
h2.fill(n=np.array([5, 4, 4, 1]*100))
beside(
h1.step(yscale={"type":"log"}),
h2.step(yscale={"type":"log"}),
overlay(h1.step(yscale={"type":"log"}), h2.step(yscale={"type":"log"}))
).to(canvas)
Instead of setting a baseline to be five times as far from the min as the difference between min and max (allowing you to see the baseline in the third plot), I've set it to the minimum possible 32-bit floating point number (as opposed to the minimum 64-bit floating point number, which had numerical instability). That number is 1e-45.
Since these are histograms, the minimum nonzero value will usually be 1. The only exception is if the histogram is weighted: then you can see that artifact.
from histbook import *
Hist(bin("n", 10, 0, 10), fill=[5], weight=2e-45).step(yscale="log").vegascope()
I think this is the best we can do.
Here is another issue with logscale. Sometimes it sets the high end of the domain too high:
code:
basic_cut_narrow.overlay("category").step("basic_cut_acceptance", yscale={"type":"log"}).to(VegaLite)
Most of that change is unrelated to your fix. The part that addresses this issue is line 713 of vega.py: I only give Vega-Lite a minimal y domain now, without padding, because Vega-Lite itself pads the domain.
Another thing with logscale. This time it is with line(). Here is the code and the output:
h2 = Hist(bin("n", 10, 0, 10))
h2.fill(n=np.array([2,2,2, 3,3, 5,5,5, 6]*100))
h2.line(yscale={"type":"log"}).to(VegaLite)
The lines drop down to the bottom is not right. I would rather see something like this:
I did this in the VegaEditor by replacing "zero" values with nulls:
{
"$schema": "https://vega.github.io/schema/vega-lite/v2.json",
"encoding": {
"y": {
"field": "a1",
"scale": {"domain": [80.27415617602307, 2700], "type": "log"},
"type": "quantitative",
"axis": {"title": "entries per bin"}
},
"x": {
"field": "a0",
"scale": {"zero": false},
"type": "quantitative",
"axis": {"title": "n"}
}
},
"data": {
"values": [
{"a1": null, "a0": 0.5},
{"a1": null, "a0": 1.5},
{"a1": 300, "a0": 2.5},
{"a1": 200, "a0": 3.5},
{"a1": null, "a0": 4.5},
{"a1": 300, "a0": 5.5},
{"a1": 100, "a0": 6.5},
{"a1": null, "a0": 7.5},
{"a1": null, "a0": 8.5},
{"a1": null, "a0": 9.5}
]
},
"transform": [],
"mark": {"type": "line", "clip": true}
}
Notice that the high end of the domain in the above JSON is too high. It does not need to be 2700.
Woah! It accepts nulls as "don't draw me?" That's great! I'll do that for line
and marker
, but leave the 1e-45 baseline for step
and area
.
Should be good now.
This fragment plots empty histogram in log scale, but works fine in linear scale:
If I fill it with [3,3,3,2,4], then the logscale plot is not empty, but it is incorrect.
Also, I noticed that logscale does not work well with errors enabled until all bins have non-zero counts.