scikit-hep / boost-histogram

Python bindings for the C++14 Boost::Histogram library
https://boost-histogram.readthedocs.io
BSD 3-Clause "New" or "Revised" License

support for `bh.loc(value) + N` #152

Closed HDembinski closed 4 years ago

HDembinski commented 4 years ago

This would just offset the index by N

henryiii commented 4 years ago

@jpivarski, @HDembinski, I'd like to rename the current internal .value here (since we need to add a new attribute anyway to support this). I'd like to make a slightly controversial choice of names, but I'll explain why I bring it (back) up.

I would like the internal name for the location in data coords to be .imag, and the name for the bin offset to be .real. This would be an implementation detail, but an explicit one; a Python complex number would also fill the duck-typed requirements and would be usable, for example, in interactive manipulation in the REPL/notebook. Anyone implementing loc would need to structure it this way.

bh.loc would still be a separate class, it would have a nice repr, it would not be based on complex, it would not support other complex calculations, etc. Complex numbers would just be allowed to substitute for it, primarily for interactive work (and might not even be mentioned in boost-histogram, but only in hist and the UHI site; they both implement UHI so it would still work in boost-histogram, though).

This is not my original complex number idea; a clear, typed tag is still used. And one of the main arguments against imaginary numbers was that the "real" part might be non-zero (since it's really a complex number). As soon as people saw bh.loc, though, they started asking for a + 1, that is, a real part!

What do you think? Otherwise, we need a name for the new property. Maybe offset.
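For readers following along, here is a minimal sketch of what "a location plus a bin offset" could look like. The class name `Loc`, the attribute names `value` and `offset`, and the helper `to_index` are assumptions made for illustration; this is not boost-histogram's actual implementation.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Loc:
    """Hypothetical stand-in for bh.loc: a data coordinate plus a bin offset."""
    value: float
    offset: int = 0

    def __add__(self, n: int) -> "Loc":
        # Adding an integer only shifts the bin offset, never the coordinate
        return Loc(self.value, self.offset + n)

    def __sub__(self, n: int) -> "Loc":
        return Loc(self.value, self.offset - n)


def to_index(edges, item):
    """Resolve a Loc (or a plain int) to a bin index for sorted bin edges."""
    if isinstance(item, Loc):
        # bisect_right - 1 finds the bin whose half-open range [lo, hi) holds value
        return bisect_right(edges, item.value) - 1 + item.offset
    return item  # plain integers are already bin indices


edges = [0.0, 1.0, 2.0, 3.0, 4.0]
print(to_index(edges, Loc(2.5)))      # 2
print(to_index(edges, Loc(2.5) + 1))  # 3
```

The offset is applied only after the coordinate has been converted to a bin index, which matches the "just offset the index by N" description at the top of the thread.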

HDembinski commented 4 years ago

I am against using .imag and .real even if it is an implementation detail.

1) For anyone who is not Henry, these are not intuitive names.
2) You call that an implementation detail, but it is public interface (does not start with _).
3) You would be very tempted to teach people to use complex numbers for indexing, despite the good arguments against this. It is an awkward abuse of complex numbers and it does not generalize, see 4).
4) Complex numbers don't work with axes that accept something different than numbers. We have a category axis which accepts strings.

HDembinski commented 4 years ago

PS: Implementation may inform design, but it is really easy to implement the functionality without relying on complex number arithmetic.

henryiii commented 4 years ago

We would not rely on anything from complex numbers. It would just allow complex numbers to duck type like loc so that we would have a fast way to work with numeric histograms in the REPL. But, fine, Jim wasn’t excited by the idea either (and he even likes the complex number shortcut).

To eventually add such a shortcut, we will just have to add an if type check. Fine.

jpivarski commented 4 years ago

I'm "okay with" the complex number shortcut if someone wants to use it as an alternative. I argued for the protocol being defined in terms of meaningful names, but allowing the UHI to also accept real and imag in their place (as an alternate spelling). A loc developer wouldn't have to implement real and imag; this would be a little extra logic in the UHI implementation.
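The "alternate spelling" idea can be sketched as a small helper on the consuming side. The function name `unpack_locator` and the protocol attribute names `value` and `offset` are hypothetical; the thread has not settled on names at this point.

```python
from types import SimpleNamespace


def unpack_locator(obj):
    """Return (data_value, bin_offset) from a loc-like object.

    Prefers the meaningful names `value` and `offset` (hypothetical protocol
    names) and accepts a Python complex as an alternate spelling.
    """
    if hasattr(obj, "value") and hasattr(obj, "offset"):
        return obj.value, obj.offset
    # Check the type rather than hasattr(obj, "imag"): plain ints and floats
    # also have .real and .imag, and they must stay ordinary bin indices.
    if isinstance(obj, complex):
        return obj.imag, int(obj.real)
    raise TypeError(f"not a locator: {obj!r}")


print(unpack_locator(SimpleNamespace(value=3.0, offset=1)))  # (3.0, 1)
print(unpack_locator(3j + 1))                                # (3.0, 1)
```

With this arrangement a loc implementer only provides the meaningful names; the extra branch for complex numbers lives in one place.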

Also, I think it's ambitious to expect any other libraries to implement UHI, but I like the idea that it's modularized enough to allow that possibility. Having to implement alternate spellings like real and imag would be a point against portability, but not a big point. Alternate spellings only get to be a problem when there are dozens of them, as a Lorentz vector protocol can (ROOT's API has a lot of alternate spellings, but it's not claiming to be an implementation-independent protocol). In such a case, it's hard to organize in such a way that everything the user expects is there without clogging the namespace. For UHI, however, just adding one alternate spelling doesn't get anywhere near that cognitive overload.

HDembinski commented 4 years ago

What I like about complex numbers is that they are short, but they come with a rich set of behaviors which do not all make sense. h[3j + 1: ...] can be interpreted in a meaningful way, but h[3j * 2j:...] cannot. We cannot catch these errors.
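The point about uncatchable errors can be checked directly in plain Python. Nothing here is library code; the "location" and "offset" readings in the comments are the interpretation proposed in this thread.

```python
a = 3j + 1   # reads as: location 3, offset 1
b = 3j * 2j  # product of two "locations", which has no meaning here

print(a)        # (1+3j)
print(b)        # (-6+0j)
print(b == -6)  # True
```

Once the product has been evaluated, the indexing code only receives a complex number with a zero imaginary part, so it has no way to tell that a meaningless operation produced it.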

henryiii commented 4 years ago

This is only intended for a shortcut on the command line or in scripts to quickly and readably input data coordinates.

h[3j + 1 : 5j, 0.5j : 2j]
vs.
h[bh.loc(3) + 1 : bh.loc(5), bh.loc(0.5) : bh.loc(2)]

However, this is only a usage shortcut for readability and typing, and not a replacement for library usage:

h[bh.loc(x) + y: bh.loc(z), bh.loc(a): bh.loc(b)]

With great power comes great responsibility (Spiderman) / you should not dull the knives of a great chef (paraphrased from Ruby). Someone wanting the checks in place should use bh.loc, that's what it's for - along with non-numerical axes, etc. Someone wanting to crop a numerical histogram for investigation in a notebook could use the complex shortcut.

HDembinski commented 4 years ago

Fine, I give up my resistance. The docs on complex numbers should quote "With great power comes great responsibility."

jpivarski commented 4 years ago

Last-ditch alternative: what if an integer is a bin index but a string of a float/int is a data-space location?

hist[3, "3.14"]         # and also
hist[3, "loc(3.14)"]    # and also
hist[3, bh.loc(3.14)]

instead of

hist[3, 3.14j]          # and also
hist[3, bh.loc(3.14)]

Because:

The way I imagine this would work in implementation is to have some rules that immediately translate a string into bh.loc. For instance, call eval on it in an environment that includes from bh import loc, project, rebin and from math import *, wrap it in a bh.loc if it's a float or int, and then proceed as normal. The only thing quoted strings could never be is a positional index, but users would use integers instead of strings for that. This could also be a succinct (and uniform!) way to allow "rebin(5)" without the pesky `bh.` prefix.
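A rough sketch of the translation rule described above, under stated assumptions: `Loc` and `Rebin` are hypothetical stand-ins for bh.loc and bh.rebin, and `normalize` is an invented helper name. It is meant to illustrate the proposal, not to show how any library does it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Loc:
    """Hypothetical stand-in for bh.loc."""
    value: float
    offset: int = 0

    def __add__(self, n):
        return Loc(self.value, self.offset + n)


@dataclass(frozen=True)
class Rebin:
    """Hypothetical stand-in for bh.rebin."""
    factor: int


# Names visible to the evaluated string: the tags plus everything in math
_ENV = {name: getattr(math, name) for name in dir(math) if not name.startswith("_")}
_ENV.update(loc=Loc, rebin=Rebin)


def normalize(item):
    """Translate a string index into a tag; pass everything else through."""
    if not isinstance(item, str):
        return item  # integers stay positional bin indices
    # eval runs arbitrary code; emptying __builtins__ is not a real sandbox,
    # so this is only reasonable for trusted, interactive input.
    result = eval(item, {"__builtins__": {}}, _ENV)
    if isinstance(result, (int, float)):
        return Loc(result)  # a bare number inside a string is a location
    return result


print(normalize(3))                # 3
print(normalize("3.14"))           # Loc(value=3.14, offset=0)
print(normalize("loc(3.14) + 1"))  # Loc(value=3.14, offset=1)
print(normalize("rebin(5)"))       # Rebin(factor=5)
```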

I use something similar in all the uproot methods that could take a number of entries (as an integer) or a data size (as a string like "4 GB"), for example entrysteps in tree.iterate; see also here. I've been happy with it: the difference in meaning between entrysteps=500 and entrysteps="50 kB" is obvious.

I wish I'd thought of this months ago on Gitter, but I didn't. Thoughts? Is it too late to be putting new proposals on the table?

HDembinski commented 4 years ago

I like it! Parsing the string is a bit more expensive, but this is negligible.

henryiii commented 4 years ago

I hated the proposal at first, then at point 4 I switched to just disliking it. Reasons, first matching your arguments:

  1. You have to match quotes, which is less natural than a suffix or prefix. User literals are suffixes in C++.
  2. IPython colors the j too (at least the notebook does) - that's the main use case.
  3. Strings have unwanted operations too. "2.3" + "1" is valid, and not what you want. Complex numbers have a few other operations that are not useful, but at least they behave like numbers.
  4. This is the one nice feature - but we can easily have string categories automatically accept string arguments, without adding this.

Using eval on input is ugly and I usually try to avoid it. If we call float on the strings, then that's better, but you lose the ability to do the "loc(2)+1" without advanced parsing, which is one feature people wanted (and is almost as many chars as doing it the right way). You also can't mix at all with other math calculations like 1/3j (without the eval).
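The string behaviors mentioned in this comment are easy to confirm in plain Python; the expression "loc(2)+1" is just the example string from the discussion.

```python
joined = "2.3" + "1"       # concatenation, not addition
added = float("2.3") + 1   # numeric only after an explicit conversion

try:
    float("loc(2)+1")
    parsed = True
except ValueError:
    parsed = False         # float() accepts number literals, not expressions

print(joined)  # 2.31
print(parsed)  # False
```

So a float-only rule keeps eval out of the picture, at the cost of the offset expression that people asked for.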

I think numerical indexing should be numbers. If someone sees h[2.3j], they will guess something is up. But if they see h["2.3"], it really looks like a string-based category, not a continuous histogram.

I just know that eventually you'd get users doing h[str(x)], which is so much worse than h[x*1j]. (h[complex(0,x)] is even worse, still, of course - but at least is still numerical).

I am not 100% against the proposal, I'm just pretty strongly still in the complex number camp.

henryiii commented 4 years ago

Another point:

HDembinski commented 4 years ago

So if we argue about ugliness, which is super objective, then let me reiterate that using complex numbers is super ugly and a terrible abuse that I find absolutely revolting.

jpivarski commented 4 years ago

This is partly cultural, what "looks normal." Strings of numbers look like Javascript to you, and I've always thought of Numpy's r_, ix_, ogrid, ... as a dark corner of Numpy. I don't know if it's widely used. The fact that Numpy is using complex numbers for r_ makes it feel even more like a dark corner.

But that's cultural—what feels normal—and Hans beat me to it by saying that it's not super objective.

henryiii commented 4 years ago

You find using a complex number as a 2 part number more revolting than using a string as a number? And eval'ing it? And having to write "loc(1) + 2" because "1" + 2 fails, because it is not a number?

HDembinski commented 4 years ago

Numpy's r_ is similarly revolting. How that got into numpy is a mystery to me.

HDembinski commented 4 years ago

Regexes are also strings, so yes, a string as a DSL seems less awkward to me.

henryiii commented 4 years ago

Regexes are for parsing strings, not numbers. Using a string for a string purpose is not that unreasonable. Using a string for a number is quite horrible, IMO.

HDembinski commented 4 years ago

> Regexes are for parsing strings, not numbers.

Besides the point. What matters is that we establish a little DSL here, and DSLs are usually implemented with parsed strings. What the DSL does is completely unrelated.

HDembinski commented 4 years ago

The fact that strings would normally not appear in __getitem__ makes it even more clear at the first glance that something special is happening. If I see a complex number there, I would be like "huh? WTF?"

HDembinski commented 4 years ago

I also like that I don't have to import my tags from some source, that is super awkward.

henryiii commented 4 years ago

I do not want to come up with a DSL, I want to come up with an EDSL, one that uses pure Python, not strings. DSLs have to be learned, and have custom rules, while EDSLs follow the rules of Python. Complex numbers have some nice behaviors for us (addition with a regular number), while strings don't, and have to take over the whole expression.

Why would a string not be a "WTF" too?

Why not take it fully and make a full custom DSL? Like h["[1,3) -> rebin 2"]? The idea is that if we stay in pure Python, users don't have to learn as much.

henryiii commented 4 years ago

We are not adding a shortcut method at all any time soon, partially to make sure we come up with a good method to do it. Avoiding importing tags is nice. If we use python's sum and complex numbers, the only tag to import is rebin, though.

HDembinski commented 4 years ago

> DSLs have to be learned, and have custom rules, while EDSLs follow the rules of Python.

Only you think that using complex numbers is intuitive in this context.

The disadvantage of string-based DSLs normally is that you don't get a syntax error immediately, but this does not matter here, because this is for interactive work anyway. If I mistype my commands, I will get the feedback immediately.

henryiii commented 4 years ago

I'm not against allowing a single string like "sum" as a shortcut (not the primary way to do it), but "rebin(2)" already is getting ugly, since the number inside is getting parsed. And "1.2" is really ugly.

We could allow bh.loc * 1.2 + 1, just a thought. Only saves one char though.
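One way the unit-like spelling could be supported is sketched below. The `_LocUnit` singleton and the `Loc` result type are hypothetical names chosen for the example; nothing here is confirmed as the library's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loc:
    """Hypothetical result type: data coordinate plus bin offset."""
    value: float
    offset: int = 0

    def __add__(self, n):
        return Loc(self.value, self.offset + n)


class _LocUnit:
    """Hypothetical singleton: callable like loc(1.2), usable like a unit."""

    def __call__(self, value):
        return Loc(value)

    def __mul__(self, value):
        return Loc(value)

    __rmul__ = __mul__  # lets `1.2 * loc` work as well as `loc * 1.2`


loc = _LocUnit()

print(loc * 1.2 + 1)  # Loc(value=1.2, offset=1)
print(1.2 * loc)      # Loc(value=1.2, offset=0)
print(loc(1.2) + 1)   # Loc(value=1.2, offset=1)
```

Because multiplication binds tighter than addition, `loc * 1.2 + 1` builds the location first and then shifts it by one bin, without any parentheses inside the brackets.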

HDembinski commented 4 years ago

> Why would a string not be a "WTF" too?

Because it is not a number, it is obviously different. A complex number is subtly different.

> Why not take it fully and make a full custom DSL? Like h["[1,3) -> rebin 2"]? The idea is that if we stay in pure Python, users don't have to learn as much.

I would be fine with that, but I think I like your idea to use slices better. It is an extension of something that you already know and use.

henryiii commented 4 years ago

Another huge disadvantage is your tooling cannot help you. Syntax highlighting, tab completion, etc.

jpivarski commented 4 years ago

This is a conflict of programming cultures (what's normal?) and h["[1, 3) -> rebin 2"] is getting into slippery slopes: no one suggested that. Even when I suggested "loc(3.14) + 1", I wasn't suggesting a new syntax, because it's Python expression syntax (eval; no statements). Part of the motivation for that was to ease the transition from "3.14" (fast) to "loc(3.14) + 1" (balance of fast and flexible) to bh.loc(3.14) + 1 (formal, full system).

You share Gordon's dislike of non-embedded DSLs, and Hans and I are not on the other end of the spectrum, in which large chunks of work are done in strings, but we're more open to a mix. The reason Hans mentioned regex is because it's an example of a successful non-embedded DSL that programmers have accepted. We know that regexes are in a different domain, but the fact that that domain is strings isn't the reason why the regexes themselves are allowed to be strings. For example, Perl has an embedded syntax.

I saw these "ease of use" arguments in the ROOT forum: people develop something on top of ROOT, think it makes life easier, and then want to integrate it into ROOT, and they have very different ideas about what makes things easy. (For example: a fully ASCII GUI for all the TBrowser and such.) That's an analogy. What I took away from those arguments is that "easy to use" is a very subjective thing. Sometimes communities coalesce with a common notion of a word like "Pythonic," but that's the exception.

boost-histogram has a lot to offer with less slick slicing syntax (I mean bh.loc). Perhaps we should get one, general method in the hands of users to see if they start asking for a faster way.

HDembinski commented 4 years ago

> Avoiding importing tags is nice. If we use python's sum and complex numbers, the only tag to import is rebin, though.

Using sum as a tag is ugly, as much as using complex numbers.

HDembinski commented 4 years ago

> Syntax highlighting, tab completion, etc.

You want tab completion in your slice? :)

HDembinski commented 4 years ago

> bh.loc * 1.2 + 1

This I also like, it was something I was thinking about, too. It seems natural in this context to use loc like a unit. Full string parsing "1.2" requires me to type less, however. You wouldn't be able to do "1.2" + 1, but I don't mind. I don't think that adding an offset of 1 is something that people will actually use a lot.

henryiii commented 4 years ago

In Python, item["1.2"] is a dictionary lookup. Anyone seeing that will expect "1.2" to be a category (and for category histograms, this is exactly what it will be!). You could not indicate a category with h[1.2j], as floating point numbers, much less complex numbers, are never* used as categories. So it does not overlap with a perfectly valid Python syntax.

*: Should never be

henryiii commented 4 years ago

sum as a tag is rather natural, I think - you are asking for the sum over an axis, and python has a built in sum. And bh.sum will be available so tab completion on bh will show it (which is where I would expect it to show up).

And, yes, I can implement tab completion for strings for categories in IPython - I've implemented the same thing in uproot already for branch names. I can't implement tab completion for all floating point numbers (what would that even mean?).

henryiii commented 4 years ago

I think using bh.loc * val is not bad at all - maybe that would alleviate the issues with complex numbers. I hate lots of matching (), especially inside [] - so this improved readability and typability might be enough. And we can always revisit and add a shortcut later.

HDembinski commented 4 years ago

Using sum as a string tag is natural, but not reusing an existing function as a tag. That is ugly and weird. I didn't complain about that so far, because it seemed like a necessary evil. The string approach solves all that.

henryiii commented 4 years ago

And by the time we decide it is useful, maybe Python will add a literal syntax, which would fix the problem. 3loc + 1 or something like that.

HDembinski commented 4 years ago

literal syntax would be preferred, but we then would still need a tag for sum and rebin. I don't share your dislike of eval at all. All the Python code you write is already evaluated on the fly by the interpreter. Now you pass a string explicitly to the interpreter. Why is that suddenly ugly? I find it quite natural, especially for Python.

henryiii commented 4 years ago

I want discoverability: I want bh.sum to exist. And if it does, it needs to be the built-in sum, or maybe the numpy sum. Therefore, that limits our choices.

Users wouldn't even have to know it is exactly the builtin sum.

HDembinski commented 4 years ago

I think Jim's suggestion is to have small types built into boost-histogram, which are discoverable. On top of that, strings would be accepted, which would be eval'd in a context where these tags are available.

HDembinski commented 4 years ago

Then you have your discoverability and I have a fast way to do transforms without importing those tags. And no complex numbers or misinterpreted np.sum or sum() are needed.

henryiii commented 4 years ago

We need to have a solution before we add the shortcut method, that can be added later. So we need to have sum before we can add the eval. The string method (if added) must be the shortcut method, not the only method.

HDembinski commented 4 years ago

> We need to have a solution before we add the shortcut method, that can be added later. So we need to have sum before we can add the eval. The string method (if added) must be the shortcut method, not the only method.

I don't see why string handling has to come after, but I am not against that. Implementing all this at once is not rocket science. Anyway, Jim's suggestion as I understand him is to have the tags anyway. What is revolting is to use existing Python objects as tags. I am not against implementing our own tags in some module.

henryiii commented 4 years ago

That still leaves us with bh.sum and its definition. Sticking it into a string and eval'ing it doesn't change anything other than provide a shortcut way to do it. Again, the shortcut method cannot be the only way to do it.

Why can't the sum we implement be the Python sum? If someone wants to do:

from boost_histogram import rebin, sum

They will find sum([1,2]) is now broken.

Note: one other option, which I would be okay with, would be for __call__ on our sum tag to just be the built-in sum. This at least keeps from breaking in the above case, but means you can't write ::sum without the above import - which I thought was nice to have for free, but is not consistent with rebin.
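The `__call__` option mentioned in the note can be sketched as follows. The class name `_SumTag` is hypothetical, and the module-level name `sum` deliberately shadows the builtin to mimic what `from boost_histogram import sum` would do under this proposal.

```python
import builtins


class _SumTag:
    """Hypothetical tag: usable as a slice step, still sums iterables when called."""

    def __call__(self, iterable, *args, **kwargs):
        # Delegate to the real builtin so sum([1, 2]) keeps working after import
        return builtins.sum(iterable, *args, **kwargs)

    def __repr__(self):
        return "sum"


sum = _SumTag()  # deliberately shadows the builtin in this namespace

print(sum([1, 2]))           # 3
step = slice(None, None, sum).step
print(step is sum)           # True
```

The trade-off raised in the comment still applies: with a dedicated tag object, `h[::sum]` only works after the tag has been imported, whereas reusing the builtin would work with no import at all.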

I'm not totally against the string shortcut method (though, then, 1 and "1" have different meanings, which I again find ugly and confusing)

HDembinski commented 4 years ago

from boost_histogram.tags import loc, sum, rebin

Yes, sum is clashing with builtin sum(), that's why it should not be imported at all, but passed to the slice as a string.

HDembinski commented 4 years ago

I don't want to prefix my tags with whatever namespace.

HDembinski commented 4 years ago

> though, then, 1 and "1" have different meanings, which I again find ugly and confusing

We need two different types anyway. This is better than using complex numbers to represent the second kind of type. I am also ok with a slightly less compact but more obvious string-based DSL, which really just eval's Python code with the tags in the locals. We could reject "1.2" and only allow "loc(1.2)" or "1.2 * loc" (I don't much care either way, heck we could support both simultaneously).

henryiii commented 4 years ago

We need to follow Python conventions. This was a problem with HistBook - it expected users to import *, and one of the things it imported was bin - oops. Python users may want to work in namespaces, and they may want to import things - we should make that work as smoothly as possible. I don't intend to use strings everywhere, and many other Python users will not either. We should use the language we are provided. A secondary way to jump out with eval'ed strings is not unreasonable, but should not be the only way or recommended way to work (unless it is unreasonable to do otherwise).

HDembinski commented 4 years ago

I am not disagreeing with you. Neither Jim nor I want to get rid of the tags completely. I already said that I am ok with tags that we implement in our little library in a local module. They should live in their own tag module, not in the top level boost_histogram. Then people can use them EDSL style. I would use them DSL style, because I am too lazy to import them and prefix them.

HDembinski commented 4 years ago

What I am strongly against is to use other external Python objects as tags, np.sum, builtin sum, complex numbers.

henryiii commented 4 years ago

Why not in the top level module? bh.loc is not unreasonable, but bh.tags.loc is. They also live in a submodule (currently bh.uhi), but they are intended for common use.