scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Sensitivity to variable name? #23

Closed imandr closed 6 years ago

imandr commented 6 years ago

I have different results for these two cases, which differ only by the variable name, "z" vs. "e":

h = Hist(bin("z", 5, 0, 1000.0))
h.fill(z=np.array([123.0]))

h.step("z", error=True).to(canvas)

and

h = Hist(bin("e", 5, 0, 1000.0))
h.fill(e=np.array([123.0]))

h.step("e", error=True).to(canvas)

It works correctly with "z", but with "e", it uniformly fills all bins with 1.

jpivarski commented 6 years ago

e is a mathematical constant, approximately 2.718, which explains the dependence on the name. (pi would have done something similar.) However, this doesn't sound like the right behavior: there should at least be some kind of warning that the fields you think you're giving it aren't being interpreted as variable fields.

imandr commented 6 years ago

What would be the meaning of using a constant like e or pi in this context?

jpivarski commented 6 years ago

It would be part of a calculation. The strings are mathematical expressions that might use e and pi.

But I can see how this is an unintended consequence in the case you were attempting: if the whole expression resolves to a single constant, the user probably meant it as a field name. Less so if the constant were a literal number like "3".

To distinguish between "algebraically constant" and "literally constant," the expression nodes would have to carry more information about how they were derived. For instance, to know that you got 2.718... by typing "e" instead of typing all the digits manually, or even "e + 0" or "6 + e - 2*3" or even "x**2 + e - x**3/x". (Yes, it should reduce that.)
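As an illustration of "algebraically constant," here is a minimal sketch (not histbook's actual engine, which reduces expressions symbolically) that detects constancy numerically, by evaluating the expression at several random values of its variables and checking whether the result changes:

```python
import math
import random

def looks_constant(expr, var_names, trials=5):
    """Heuristic sketch: an expression is 'algebraically constant' if it
    evaluates to the same value for several random variable assignments."""
    env = {"e": math.e, "pi": math.pi, "__builtins__": {}}
    results = []
    for _ in range(trials):
        values = {name: random.uniform(1.0, 10.0) for name in var_names}
        results.append(eval(expr, env, values))
    first = results[0]
    return all(math.isclose(r, first) for r in results)

print(looks_constant("x**2 + e - x**3/x", ["x"]))  # True: always reduces to e
print(looks_constant("x**2 + e", ["x"]))           # False: depends on x
```

A symbolic engine would prove this exactly rather than sampling, but the sketch shows why "x**2 + e - x**3/x" and "e" pose the same ambiguity.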

One formal way to remove this ambiguity would be to remove all named constants from the default namespace. (You can add your own with the defs parameter, but then you know what you're doing.) Right now, there are only two: e and pi. To get these very useful numbers, there would have to be zero-argument functions e() and pi(), which looks a little weird, but there are other systems that do that.

I've been wanting to check this function namespace to see if it's compatible with numexpr, which would be a good feature to have. I'll look at how they handle e and pi.

imandr commented 6 years ago

Yes, I see what you mean. In my example, the expressions in lines 1 and 3 are legitimate expressions, which happen to evaluate to a constant.

h = Hist(bin("e", 5, 0, 1000.0))      # line 1
h.fill(e=np.array([123.0]))           # line 2

h.step("e", error=True).to(canvas)    # line 3

I wonder if it is possible to detect that the expression in bin() or step() evaluates to a constant and give some sort of warning, although I cannot think of a good way to deliver such a warning. An exception is too harsh. Printing to stderr or stdout can get lost or become annoying.
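For delivery, Python's warnings module sits between the two extremes: visible by default, shown once per call site rather than on every fill, and filterable by the user. A hedged sketch (check_axis_expression is a hypothetical helper, not part of histbook):

```python
import warnings

def check_axis_expression(expr, is_constant):
    """Hypothetical check: emit a warning, not an exception, when an
    axis expression reduces to a constant."""
    if is_constant:
        warnings.warn(
            "axis expression %r evaluates to a constant; "
            "did you mean a field name?" % expr,
            UserWarning,
            stacklevel=2,
        )

# Demonstrate that the warning is raised and catchable.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_axis_expression("e", is_constant=True)

print(len(caught))         # one warning recorded
print(caught[0].category)  # UserWarning
```

By default Python deduplicates repeated warnings from the same location, which addresses the "annoying" concern.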

What is really weird is the result of my code: the histogram gets filled with 1 in all bins.

jpivarski commented 6 years ago

Yeah, the filled result is unexpected; that's something I need to investigate. I explicitly remember thinking that I hadn't given enough thought to the case where an expression is a pure constant (though I hadn't foreseen that named constants like "e" would be confused with variables: my bad).

imandr commented 6 years ago

Maybe you could name constants in capitals, and/or use underscores, like PI or _PI_?

jpivarski commented 6 years ago

That's why I was thinking about making them zero-argument functions. If you had to type e() or pi(), there would be no ambiguity, and it kind of makes sense that they are functions of zero arguments.

I want to check first, though, what numexpr does. histbook should have some kind of synergy with numexpr: it's an existing language with about the same scope and target audience. I may even use it in the execution plan to avoid some costly intermediate arrays. And finally, the SciKit-HEP project has a nice "formulate" package that converts TTree::Draw syntax into numexpr syntax. Possibly, I could make histbook's expression language exactly equal to numexpr's, and that might help people in the long run. So I should check to see how they handle e and pi.

(If at all: "e" is "exp(1)" and "pi" is "acos(-1)", which would be easy to put in a defs.)

imandr commented 6 years ago

Yes, but if you use functions, wouldn't you want to know that pi() + pi() - pi() == pi() always? If they are functions, you cannot assume they always return the same value. For example, random() does not.

jpivarski commented 6 years ago

Actually, sin(1) + sin(1) - sin(1) would get algebraically reduced to sin(1) (and similar for other functions). Referential transparency is assumed, which would be broken if any nondeterministic functions, functions with side-effects, or functions dependent on state or context were allowed.

Thus, random() is not in the language. It would be easy enough for the user to introduce variables and fill them with numpy.random.*, which also lets them express "these two are the same as each other, but this third one is independent." Putting that part of the calculation into histbook's expression language wouldn't help anyone because the purpose is to find repeated subexpressions in a large set of histograms and compute them once. Anything that isn't a pure function would have to be recomputed and couldn't be a part of that optimization.
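The optimization described above (compute each repeated pure subexpression once across a set of histograms) can be sketched with a cache keyed by expression equality. This is an illustration of the principle, not histbook's implementation:

```python
import math

calls = {"n": 0}

def compute(expr, x):
    # Stand-in for evaluating a parsed subexpression; counts invocations.
    calls["n"] += 1
    return math.sin(x) ** 2

# Because pure expressions are referentially transparent, two histograms
# that both bin "sin(x)**2" can share a single computed result.
cache = {}

def evaluate(expr, x):
    if expr not in cache:
        cache[expr] = compute(expr, x)  # computed once per distinct expression
    return cache[expr]

a = evaluate("sin(x)**2", 1.5)  # computes
b = evaluate("sin(x)**2", 1.5)  # reuses the cached result
print(calls["n"])               # 1: the subexpression ran only once
```

Any impure function would poison such a cache, which is exactly why random() is excluded.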

That said, it would be possible to introduce such functions. I would have to make a new AST node that isn't equal to itself (__eq__ always returns False). That way, each random() would be treated as a whole new subexpression and would be guaranteed to get recomputed.
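The never-equal node is easy to demonstrate. Below, Call and ImpureCall are hypothetical names sketching the idea: pure calls compare structurally and get merged, while an impure node's __eq__ always returns False, so it is never deduplicated:

```python
class Call:
    """Sketch of a pure-function AST node: structurally equal calls merge."""
    def __init__(self, name, *args):
        self.name, self.args = name, args
    def __eq__(self, other):
        return isinstance(other, Call) and \
            (self.name, self.args) == (other.name, other.args)
    def __hash__(self):
        return hash((self.name, self.args))

class ImpureCall(Call):
    """Sketch of the proposed node for random(): equal to nothing,
    not even itself, so every occurrence is a fresh subexpression."""
    def __eq__(self, other):
        return False
    def __hash__(self):
        return id(self)

print(Call("sin", 1) == Call("sin", 1))              # True: computed once
print(ImpureCall("random") == ImpureCall("random"))  # False: recomputed
r = ImpureCall("random")
print(r == r)                                        # False, even to itself
```

Note that Python's == does not short-circuit on identity, so even r == r is False here, which is the desired behavior for a nondeterministic node.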

jpivarski commented 6 years ago

Reproduced:

% python -i -c 'import numpy as np; from histbook import *'
>>> h = Hist(bin("5", 5, 0, 1000.0))
>>> h.fill(e=np.array([123.0]))
>>> h.pandas()
                 count()  err(count())
5                                     
[-inf, 0.0)          1.0           1.0
[0.0, 200.0)         1.0           1.0
[200.0, 400.0)       1.0           1.0
[400.0, 600.0)       1.0           1.0
[600.0, 800.0)       1.0           1.0
[800.0, 1000.0)      1.0           1.0
[1000.0, inf)        1.0           1.0
{NaN}                1.0           1.0

Note that any constant will do this.

jpivarski commented 6 years ago

The actual bugginess of this has been resolved by #24:

>>> from histbook import *
>>> h = Hist(bin("3.1415926", 5, 0, 5))
>>> h.pandas()
             count()  err(count())
3.1415926                         
[-inf, 0.0)      0.0           0.0
[0.0, 1.0)       0.0           0.0
[1.0, 2.0)       0.0           0.0
[2.0, 3.0)       0.0           0.0
[3.0, 4.0)       0.0           0.0
[4.0, 5.0)       0.0           0.0
[5.0, inf)       0.0           0.0
{NaN}            0.0           0.0
>>> h.fill()
>>> h.pandas()
             count()  err(count())
3.1415926                         
[-inf, 0.0)      0.0           0.0
[0.0, 1.0)       0.0           0.0
[1.0, 2.0)       0.0           0.0
[2.0, 3.0)       0.0           0.0
[3.0, 4.0)       1.0           1.0
[4.0, 5.0)       0.0           0.0
[5.0, inf)       0.0           0.0
{NaN}            0.0           0.0

That is, an axis whose expression is a constant just fills like a constant. If it is bundled with a non-scalar fill, it gets filled once per entry:

>>> h = Book(one=Hist(bin("3.1415926", 5, 0, 5)), two=Hist(bin("x", 5, 0, 5)))
>>> h["one"].pandas()
             count()  err(count())
3.1415926                         
[-inf, 0.0)      0.0           0.0
[0.0, 1.0)       0.0           0.0
[1.0, 2.0)       0.0           0.0
[2.0, 3.0)       0.0           0.0
[3.0, 4.0)       0.0           0.0
[4.0, 5.0)       0.0           0.0
[5.0, inf)       0.0           0.0
{NaN}            0.0           0.0
>>> h.fill(x=[1, 2, 3])
>>> h["one"].pandas()
             count()  err(count())
3.1415926                         
[-inf, 0.0)      0.0      0.000000
[0.0, 1.0)       0.0      0.000000
[1.0, 2.0)       0.0      0.000000
[2.0, 3.0)       0.0      0.000000
[3.0, 4.0)       3.0      1.732051
[4.0, 5.0)       0.0      0.000000
[5.0, inf)       0.0      0.000000
{NaN}            0.0      0.000000
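As a sanity check on the table above: the err(count()) column is consistent with Poisson counting statistics, where the uncertainty on a bin holding N unweighted entries is sqrt(N). All three entries of fill(x=[1, 2, 3]) land in the [3.0, 4.0) bin, since the constant 3.1415926 falls there:

```python
import math

count = 3.0                        # all three fill entries in one bin
print(round(math.sqrt(count), 6))  # 1.732051, matching err(count()) above
```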

jpivarski commented 6 years ago

But this leaves the issue of user confusion over whether "e" is a constant or a variable name. I've got an idea for that.

jpivarski commented 6 years ago

Names like "e", which could be intended as 2.718281828459045 or could be intended as a field (e.g. fifth in a series starting with "a"), are interpreted as constants only if not specified in the fill. For instance:

>>> from histbook import *
>>> h = Hist(bin("e", 5, 0, 5))
>>> h.fill(e=[1, 2, 3])
>>> h.pandas()
             count()  err(count())
e                                 
[-inf, 0.0)      0.0           0.0
[0.0, 1.0)       0.0           0.0
[1.0, 2.0)       1.0           1.0
[2.0, 3.0)       1.0           1.0
[3.0, 4.0)       1.0           1.0
[4.0, 5.0)       0.0           0.0
[5.0, inf)       0.0           0.0
{NaN}            0.0           0.0
>>> h = Hist(bin("e", 5, 0, 5))
>>> h.fill()
>>> h.pandas()
             count()  err(count())
e                                 
[-inf, 0.0)      0.0           0.0
[0.0, 1.0)       0.0           0.0
[1.0, 2.0)       0.0           0.0
[2.0, 3.0)       1.0           1.0
[3.0, 4.0)       0.0           0.0
[4.0, 5.0)       0.0           0.0
[5.0, inf)       0.0           0.0
{NaN}            0.0           0.0

The first fill specifies "e", so it's a field name. The second fill doesn't specify it, so it's a constant. The complete list of "maybe constants" is:

maybeconstants = {"pi": numpy.pi, "Pi": numpy.pi, "e": numpy.e, "E": numpy.e,
                  "inf": numpy.inf, "Inf": numpy.inf, "infinity": numpy.inf,
                  "Infinity": numpy.inf, "nan": numpy.nan, "NaN": numpy.nan,
                  "Nan": numpy.nan}
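
The rule above can be sketched as a name-resolution function. This resolve() is a hypothetical illustration, not histbook's code: a name specified in the fill is always a field, and only otherwise does a maybe-constant resolve to its numeric value:

```python
import numpy

maybeconstants = {"pi": numpy.pi, "Pi": numpy.pi, "e": numpy.e, "E": numpy.e,
                  "inf": numpy.inf, "Inf": numpy.inf, "infinity": numpy.inf,
                  "Infinity": numpy.inf, "nan": numpy.nan, "NaN": numpy.nan,
                  "Nan": numpy.nan}

def resolve(name, fill_fields):
    """Hypothetical lookup: fill fields take precedence over constants."""
    if name in fill_fields:
        return ("field", name)
    if name in maybeconstants:
        return ("constant", maybeconstants[name])
    return ("field", name)

print(resolve("e", {"e"}))  # ('field', 'e'): specified in the fill
print(resolve("e", set()))  # ('constant', 2.718281828459045)
```

This keeps "e" usable both ways without any change to the expression syntax.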