scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Category color changes in stack() #38

Closed by imandr 6 years ago

imandr commented 6 years ago

I fill the histogram with a category axis in several iterations and I want to see it after each iteration, e.g.:

import numpy as np
from histbook import *

h = Hist(bin("x", 10, 0, 1), groupby("type"))

x = np.random.random((100,))
h.fill(type="A", x=x*x)
h.stack("type").area("x").to(canvas)

Here is the plot: visualization 5

Here comes second iteration:

x = np.random.random((100,))
h.fill(type="B", x=1-x*x)
h.stack("type").area("x").to(canvas)

visualization 6

Notice that the color of the type=A histogram has changed from blue to orange, and type=B is now blue. It would be good for subplots to keep their visual attributes from one redraw to the next.

Btw, it works correctly with overlay()

jpivarski commented 6 years ago

The colors are assigned by Vega-Lite, and there is some way to set them; that is a control I'll pass through to the user somehow. However, they're probably assigned in the order of the data points themselves. That order is not preserved by _content, which is a dict. Maybe the content for groupby should be an OrderedDict? A dict already introduces a performance bottleneck, and an OrderedDict could be worse.

Nothing would be broken by (optionally?) making the dict ordered. All of the checks are isinstance(x, dict). Maybe that should be a groupby constructor option.
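To illustrate the point about the isinstance checks, here is a minimal sketch using only the standard library. The `content` mapping is a hypothetical stand-in for the groupby storage, not histbook's actual `_content` layout. Note that on Python 3.7 and later a plain dict also preserves insertion order, so the distinction matters mainly on older interpreters.

```python
from collections import OrderedDict

# Hypothetical stand-in for a groupby content mapping: category -> bin counts.
content = OrderedDict()
content["B"] = [0, 1, 2]
content["A"] = [3, 4, 5]

# OrderedDict subclasses dict, so existing isinstance(x, dict) checks still pass.
is_dict = isinstance(content, dict)

# Iteration follows insertion order, not alphabetical order.
categories = list(content)
```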

In the farther future, I should probably replace the dict representation with something contiguous in memory. Probably the easiest way would be something based on ChunkedArrays from the new awkward-array library I'm developing.

If there are no issues with overlays and the log plots are looking okay, can I close the other issue and merge that branch into master? I'll consider this one closed when I add the OrderedDict option to groupby.

imandr commented 6 years ago

Log plots with overlays do not look ok. See my last comment on issue 37.

I meant to say that overlay preserves colors unlike stack. Or maybe I was just lucky.

imandr commented 6 years ago

I think you could add an attribute to a Hist with categorical axes: a list, for each categorical axis, of all the category values plotted so far. This list would be used only to set the order of the subplots inside stack() or overlay(), not for filling. If a category value is not yet in the list, it is simply appended.


class Hist:
    def __init__(self, *axes):
        # Display order of the category values seen so far, keyed by category axis name.
        self.CategoryOrders = {}

    def stack(self, axis_name):
        # Look up (or create) the remembered order for this axis.
        category_order = self.CategoryOrders.setdefault(axis_name, [])
        # values_for_axis and add_subplot are placeholders for histogram internals.
        for value in values_for_axis(axis_name):
            if value not in category_order:
                category_order.append(value)
        for value in category_order:
            add_subplot(value)

jpivarski commented 6 years ago

Wait a minute— I just realized that this is stack: you have the power to set the order by hand. The stack method has an order parameter:

>>> beside(h.stack("type", order=["A", "B"]).area("x"),
...        h.stack("type", order=["B", "A"]).area("x")).vegascope()

vegascope 2

An automated "sort stability" fix should tap into that mechanism. Stay tuned.

imandr commented 6 years ago

Using explicit order here is not always convenient because you either need to use all possible category values or all values filled into the histogram so far.

If you use all possible values, the legend will contain all of them even if the histogram has only some of them filled, which is not ideal:

visualization 3

So the best option is to use only the values filled into the histogram so far. For iterative display, though, that means using pretty much the same logic I described earlier, only outside of the histogram class.
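For concreteness, the bookkeeping outside the histogram class might look like the sketch below. `update_order` is a hypothetical helper, not part of histbook, and the lists of filled categories are written by hand because the thread does not show how to read them back from a Hist.

```python
def update_order(order, filled):
    """Append categories not seen before, keeping earlier positions fixed."""
    for value in filled:
        if value not in order:
            order.append(value)
    return order

order = []
update_order(order, ["A"])             # after the first fill
update_order(order, ["B", "A"])        # after the second fill, in arbitrary order
update_order(order, ["C", "A", "B"])   # after the third fill
```

The resulting list could then be passed to the existing parameter, as in h.stack("type", order=order), so that each category keeps its position, and presumably its color, between redraws.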

jpivarski commented 6 years ago

How about this?

>>> import numpy as np
>>> from histbook import *
>>> h = Hist(bin("x", 10, 0, 1), groupby("type", keeporder=True))   # new option
>>> x = np.random.random((10000,))
>>> h.fill(type="A", x=x*x)
>>> h.stack("type").area("x").vegascope()

vegascope

>>> x = np.random.random((10000,))
>>> h.fill(type="B", x=1-x*x)
>>> h.stack("type").area("x").vegascope()

vegascope 1

>>> x = np.random.random((10000,))
>>> h.fill(type="C", x=x*x)
>>> h.stack("type").area("x").vegascope()

vegascope 2

>>> x = np.random.random((10000,))
>>> h.fill(type="D", x=1-x*x)
>>> h.stack("type").area("x").vegascope()

vegascope 3

jpivarski commented 6 years ago

Yes— of course— it's a new property. I'll pass that through.

jpivarski commented 6 years ago

You know, I don't see anything wrong with the JSON serialization. Could it be that you're trying to deserialize old objects with new code?

The JSON format hasn't been made schema-evolving for archival purposes. I don't foresee frequent changes anyway; this one was a surprise.

imandr commented 6 years ago

Which branch is this fixed in? I am using "issue-37" and the colors are unstable.

jpivarski commented 6 years ago

It is in that branch and I observed stable colors when using the keeporder=True argument.

imandr commented 6 years ago

Ah yes, thanks!

imandr commented 6 years ago

I was wondering, maybe keeporder should always be True? I cannot think of a case where someone would actually want the colors to change.

jpivarski commented 6 years ago

If the plot isn't animated, the colors aren't "changing," and we'd have paid the penalty of an OrderedDict rather than a dict for nothing. The collection order also doesn't have a meaning when you combine histograms from different sources. The cases in which collection order matters are special, rather than the other way around.
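A small standard-library sketch of why collection order loses its meaning when results are combined. The `merge` function and the per-category counts are hypothetical illustrations, not histbook's own merging code.

```python
from collections import OrderedDict

def merge(first, second):
    """Sum per-category counts; keys from `first` come first, then new keys from `second`."""
    out = OrderedDict(first)
    for key, count in second.items():
        out[key] = out.get(key, 0) + count
    return out

source1 = OrderedDict([("A", 10), ("B", 5)])
source2 = OrderedDict([("B", 7), ("C", 2)])

ab = merge(source1, source2)   # category order: A, B, C
ba = merge(source2, source1)   # category order: B, C, A
```

Both results hold the same counts, but the category order depends only on which source happened to be merged first.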