Closed HDembinski closed 5 years ago
@jpivarski, @HDembinski, I'd like to rename the current internal .value here (since we need to add a new one anyway to support). I'd like to make a slightly controversial choice of names, but I'll explain why I bring it (back) up.
I would like the internal name for the location in data coords to be .imag, and the name for the bin offset to be .real. This would be an implementation detail, but an explicit one; a Python complex number would also fill the duck-typed requirements and would be usable, for example, in interactive manipulation in the REPL/notebook. Anyone implementing loc would need to structure it this way.
bh.loc would still be a separate class, it would have a nice repr, it would not be based on complex, it would not support other complex calculations, etc. Complex numbers would just be allowed to substitute for it, primarily for interactive work (and might not even be mentioned in boost-histogram, but only in hist and the UHI site; they both implement UHI so it would still work in boost-histogram, though).
This is not my original complex number idea; a clear, typed tag is still used. And one of the main arguments against imaginary numbers was that the "real" part might be non-zero (since it's really a complex number). As soon as people saw bh.loc, though, they started asking for a + 1, that is, a real part!
What do you think? Otherwise, we need a name for the new property. Maybe offset.
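The duck typing proposed above can be sketched as follows. This is only an illustration: Loc is a hypothetical stand-in, not the real bh.loc class, and resolve is a made-up helper showing how an implementation might read the two attributes.

```python
class Loc:
    """Hypothetical stand-in for bh.loc (not the library's real class)."""

    def __init__(self, value, offset=0):
        self.imag = value   # location in data coordinates
        self.real = offset  # bin offset

    def __add__(self, n):
        # Adding an integer shifts the bin offset, not the data coordinate.
        return Loc(self.imag, self.real + n)

    def __repr__(self):
        return f"Loc({self.imag!r}, offset={self.real!r})"


def resolve(tag):
    """Hypothetical helper: read (data location, bin offset) from any object
    that exposes .imag and .real, including a plain Python complex."""
    return tag.imag, tag.real


from_tag = resolve(Loc(3) + 1)
from_complex = resolve(3j + 1)
print(from_tag, from_complex)
```

Under this assumption, the typed tag and the complex literal carry the same two pieces of information, which is all the proposal asks for.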
I am against using .imag and .real even if it is an implementation detail.
1) For anyone who is not Henry, these are not intuitive names.
2) You call that an implementation detail, but it is public interface (does not start with _).
3) You would be very tempted to teach people to use complex numbers for indexing, despite the good arguments against this. It is an awkward abuse of complex numbers and it does not generalize, see 4).
4) Complex numbers don't work with axes that accept something different than numbers. We have a category axis which accepts strings.
PS: Implementation may inform design, but it is really easy to implement the functionality without relying on complex number arithmetic.
We would not rely on anything from complex numbers. It would just allow complex numbers to duck type like loc so that we would have a fast way to work with numeric histograms in the REPL. But, fine, Jim wasn’t excited by the idea either (and he even likes the complex number shortcut)
To eventually add such a shortcut, we will just have to add an if type check. Fine.
I'm "okay with" the complex number shortcut if someone wants to use it as an alternative. I argued for the protocol being defined in terms of meaningful names, but allowing the UHI to also accept real and imag in their place (as an alternate spelling). A loc developer wouldn't have to implement real and imag; this would be a little extra logic in the UHI implementation.
Also, I think it's ambitious to expect any other libraries to implement UHI, but I like the idea that it's modularized enough to allow that possibility. Having to implement alternate spellings like real and imag would be a point against portability, but not a big point. Alternate spellings only get to be a problem when there are dozens of them, as a Lorentz vector protocol can (ROOT's API has a lot of alternate spellings, but it's not claiming to be an implementation-independent protocol). In such a case, it's hard to organize in such a way that everything the user expects is there without clogging the namespace. For UHI, however, just adding one alternate spelling doesn't get anywhere near that cognitive overload.
What I like about complex numbers is that they are short, but they come with a rich set of behaviors which do not all make sense. h[3j + 1: ...] can be interpreted in a meaningful way, but h[3j * 2j:...] cannot. We cannot catch these errors.
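A short illustration of that point, using only built-in complex arithmetic. The variable names are illustrative.

```python
ok = 3j + 1      # imaginary part 3 (data coordinate), real part 1 (bin offset)
bad = 3j * 2j    # legal arithmetic, but the product is purely real

print(ok.imag, ok.real)
print(bad, bad.imag)
```

The product carries no imaginary part at all, so an indexing implementation receiving it has no way to tell that a data coordinate was ever intended.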
This is only intended for a shortcut on the command line or in scripts to quickly and readably input data coordinates.
h[3j + 1 : 5j, 0.5j : 2j]
vs.
h[bh.loc(3) + 1 : bh.loc(5), bh.loc(0.5) : bh.loc(2)]
However, this is only a usage shortcut for readability and typing, and not a replacement for library usage:
h[bh.loc(x) + y: bh.loc(z), bh.loc(a): bh.loc(b)]
With great power comes great responsibility (Spiderman) / you should not dull the knives of a great chef (paraphrased from Ruby). Someone wanting the checks in place should use bh.loc, that's what it's for - along with non-numerical axes, etc. Someone wanting to crop a numerical histogram for investigation in a notebook could use the complex shortcut.
Fine, I give up my resistance. The docs on complex numbers should quote "With great power comes great responsibility.".
Last-ditch alternative: what if an integer is a bin index but a string of a float/int is a data-space location?
hist[3, "3.14"] # and also
hist[3, "loc(3.14)"] # and also
hist[3, bh.loc(3.14)]
instead of
hist[3, 3.14j] # and also
hist[3, bh.loc(3.14)]
Because:
[...] (" or '), which is not much more than one (j).
[...] j. My editor doesn't do that.)
[...] bh.loc would have a __add__ method, the subject of this issue thread). However, strings give us a lot of room to add syntax. In the proposal above, "3.14" is equivalent to "loc(3.14)" so that (a) quick scripts can gradually transition to formal ones and (b) you can do this: "loc(3.14) + 1". If Python ever gets advanced interpolation like Scala's, we'd be able to do loc"3.14" + 1.
The way I imagine this would work in implementation is to have some rules that immediately translate a string into bh.loc. For instance, call eval on it in an environment that includes from bh import loc, project, rebin and from math import *, wrap it in a bh.loc if it's a float or int, and then proceed as normal. The only thing quoted strings could never be is a positional index, but users would use integers instead of strings for that. This could also be a succinct (and uniform!) way to allow "rebin(5)" without the pesky "bh." prefix.
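A minimal sketch of the translation rule described above, under stated assumptions: Loc is a hypothetical stand-in for bh.loc, parse_index is a made-up helper name, and only loc plus the public names from math are placed in the eval environment (project and rebin are left out to keep the sketch short).

```python
import math


class Loc:
    """Hypothetical stand-in for bh.loc."""

    def __init__(self, value, offset=0):
        self.value = value
        self.offset = offset

    def __add__(self, n):
        return Loc(self.value, self.offset + n)


def parse_index(item):
    """Hypothetical helper: integers stay bin indices, strings are evaluated
    as Python expressions and bare numbers are wrapped in Loc."""
    if not isinstance(item, str):
        return item
    # Names visible inside the string: the math functions plus the loc tag.
    env = {name: getattr(math, name) for name in dir(math) if not name.startswith("_")}
    env["loc"] = Loc
    # eval runs arbitrary expressions; this is only reasonable for
    # interactive input typed by the user, never for untrusted strings.
    result = eval(item, {"__builtins__": {}}, env)
    if isinstance(result, (int, float)):
        return Loc(result)
    return result


plain = parse_index(3)
quick = parse_index("3.14")
formal = parse_index("loc(3.14) + 1")
print(plain, quick.value, formal.value, formal.offset)
```

With this rule, "3.14" and "loc(3.14)" end up as the same kind of object, which is what allows the gradual transition from quick to formal spelling.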
I use something similar in all the uproot methods that could take a number of entries (as an integer) or a data size (as a string like "4 GB"), for example entrysteps in tree.iterate; see also here. I've been happy with it: the difference in meaning between entrysteps=500 and entrysteps="50 kB" is obvious.
I wish I'd thought of this months ago on Gitter, but I didn't. Thoughts? Is it too late to be putting new proposals on the table?
I like it! Parsing the string is a bit more expensive, but this is negligible.
I hated the proposal at first, then at point 4 I switched to just disliking it. Reasons, first matching your arguments:
"2.3" + "1" is valid, and not what you want. Complex numbers have a few other operations that are not useful, but at least they behave like numbers.
Using eval on input is ugly and I usually try to avoid it. If we call float on the strings, then that's better, but you lose the ability to do the "loc(2)+1" without advanced parsing, which is one feature people wanted (and is almost as many chars as doing it the right way). You also can't mix at all with other math calculations like 1/3j (without the eval).
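The string behavior being objected to can be seen directly in plain Python. The variable names here are illustrative.

```python
concatenated = "2.3" + "1"      # str + str concatenates instead of adding
numeric = float("2.3") + 1      # explicit conversion restores arithmetic

try:
    "1" + 2                     # str + int is not defined at all
    mixed_fails = False
except TypeError:
    mixed_fails = True

print(repr(concatenated), numeric, mixed_fails)
```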
I think numerical indexing should be numbers. If someone sees h[2.3j], they will guess something is up. But if they see h["2.3"], it really looks like a string-based category, not a continuous histogram.
I just know that eventually you'd get users doing h[str(x)], which is so much worse than h[x*1j]. (h[complex(0,x)] is even worse, still, of course - but at least is still numerical).
I am not 100% against the proposal, I'm just pretty strongly still in the complex number camp.
Another point:
[...] r_, which has been around since I started using Numpy. Strings are not used as a replacement for numbers (and strings being converted to numbers magically looks like JavaScript).
So if we argue about ugliness, which is super objective, then let me reiterate that using complex numbers is super ugly and a terrible abuse that I find absolutely revolting.
This is partly cultural, what "looks normal." Strings of numbers look like JavaScript to you, and I've always thought of Numpy's r_, ix, ogrid, ... as a dark corner of Numpy. I don't know if it's widely used. The fact that Numpy is using complex numbers for r_ makes it feel even more like a dark corner.
But that's cultural—what feels normal—and Hans beat me to it by saying that it's not super objective.
You find using a complex number as a 2 part number more revolting than using a string as a number? And eval'ing it? And having to write "loc(1) + 2" because "1" + 2 fails, because it is not a number?
Numpy's r_ is similarly revolting. How that got into numpy is a mystery to me.
Regexes are also strings, so yes, a string as a DSL seems less awkward to me.
Regexes are for parsing strings, not numbers. Using a string for a string purpose is not that unreasonable. Using a string for a number is quite horrible, IMO.
Regexes are for parsing strings, not numbers.
Besides the point. What matters is that we establish a little DSL here, and DSLs are usually implemented with parsed strings. What the DSL does is completely unrelated.
The fact that strings would normally not appear in __getitem__ makes it even more clear at the first glance that something special is happening. If I see a complex number there, I would be like "huh? WTF?"
I also like that I don't have to import my tags from some source, that is super awkward.
I do not want to come up with a DSL, I want to come up with an EDSL, one that uses pure Python, not strings. DSLs have to be learned, and have custom rules, while EDSLs follow the rules of Python. Complex numbers have some nice behaviors for us (addition with a regular number), while strings don't, and have to take over the whole expression.
Why would a string not be a "WTF" too?
Why not take it fully and make a full custom DSL? Like h["[1,3) -> rebin 2"]? The idea is that if we stay in pure Python, users don't have to learn as much.
We are not adding a shortcut method at all any time soon, partially to make sure we come up with a good method to do it. Avoiding importing tags is nice. If we use python's sum and complex numbers, the only tag to import is rebin, though.
DSL's have to be learned, and have custom rules, while EDSLs follow the rules of Python.
Only you think that using complex numbers is intuitive in this context.
The disadvantage of string-based DSLs normally is that you don't get a syntax error immediately, but this does not matter here, because this is for interactive work anyway. If I mistype my commands, I will get the feedback immediately.
I'm not against allowing a single string like "sum" as a shortcut (not the primary way to do it), but "rebin(2)" already is getting ugly, since the number inside is getting parsed. And "1.2" is really ugly.
We could allow bh.loc * 1.2 + 1, just a thought. Only saves one char though.
Why would a string not be a "WTF" too?
Because it is not a number, it is obviously different. A complex number is subtly different.
Why not take it fully and make a full custom DSL? Like h["[1,3) -> rebin 2"]? The idea is that if we stay in pure in Python, users don't have to learn as much.
I would be fine with that, but I think I like your idea to use slices better. It is an extension of something that you already know and use.
Another huge disadvantage is your tooling cannot help you. Syntax highlighting, tab completion, etc.
This is a conflict of programming cultures (what's normal?) and h["[1, 3) -> rebin 2"] is getting into slippery slopes: no one suggested that. Even when I suggested "loc(3.14) + 1", I wasn't suggesting a new syntax, because it's Python expression syntax (eval; no statements). Part of the motivation for that was to ease the transition from "3.14" (fast) to "loc(3.14) + 1" (balance of fast and flexible) to bh.loc(3.14) + 1 (formal, full system).
You share Gordon's dislike of non-embedded DSLs, and Hans and I are not on the other end of the spectrum, in which large chunks of work are done in strings, but we're more open to a mix. The reason Hans mentioned regex is because it's an example of a successful non-embedded DSL that programmers have accepted. We know that regexes are in a different domain, but the fact that that domain is strings isn't the reason why the regexes themselves are allowed to be strings. For example, Perl has an embedded syntax.
I saw these "ease of use" arguments in the ROOT forum: people develop something on top of ROOT, think it makes life easier, and then want to integrate it into ROOT, and they have very different ideas about what makes things easy. (For example: a fully ASCII GUI for all the TBrowser and such.) That's an analogy. What I took away from those arguments is that "easy to use" is a very subjective thing. Sometimes communities coalesce with a common notion of a word like "Pythonic," but that's the exception.
boost-histogram has a lot to offer with less slick slicing syntax (I mean bh.loc). Perhaps we should get one general method in the hands of users to see if they start asking for a faster way.
Avoiding importing tags is nice. If we use python's sum and complex numbers, the only tag to import is rebin, though.
Using sum as a tag is ugly, as much as using complex numbers.
Syntax highlighting, tab completion, etc.
You want tab completion in your slice? :)
bh.loc * 1.2 + 1
This I also like, it was something I was thinking about, too. It seems natural in this context to use loc like a unit. Full string parsing "1.2" requires me to type less, however. You wouldn't be able to do "1.2" + 1, but I don't mind. I don't think that adding an offset of 1 is something that people will actually use a lot.
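The "loc as a unit" spelling could be sketched like this. Everything here is hypothetical: Loc and _LocUnit are illustrative names, not boost-histogram classes, and the real bh.loc may not support multiplication.

```python
class Loc:
    """Hypothetical data-coordinate tag with an optional bin offset."""

    def __init__(self, value, offset=0):
        self.value = value
        self.offset = offset

    def __add__(self, n):
        return Loc(self.value, self.offset + n)


class _LocUnit:
    """Hypothetical unit-like object: loc * 1.2 and loc(1.2) both build a Loc."""

    def __call__(self, value):
        return Loc(value)

    def __mul__(self, value):
        return Loc(value)

    __rmul__ = __mul__  # also allow 1.2 * loc


loc = _LocUnit()

tag = loc * 1.2 + 1   # multiplication binds first, then + 1 sets the offset
print(tag.value, tag.offset)
```

Because multiplication binds tighter than addition, the expression reads left to right without any parentheses inside the brackets.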
In Python, item["1.2"] is a dictionary lookup. Anyone seeing that will expect "1.2" to be a category (and for category histograms, this is exactly what it will be!). You could not indicate a category with h[1.2j], as floating point numbers, much less complex numbers, are never* used as categories. So it does not overlap with a perfectly valid Python syntax.
*: Should never be
sum as a tag is rather natural, I think - you are asking for the sum over an axis, and python has a built in sum. And bh.sum will be available so tab completion on bh will show it (which is where I would expect it to show up).
And, yes, I can implement tab completion for strings for categories in IPython - I've implemented the same thing in uproot already for branch names. I can't implement tab completion for all floating point numbers (what would that even mean?).
I think using bh.loc * val is not bad at all - maybe that would alleviate the issues with complex numbers. I hate lots of matching (), especially inside [] - so this improved readability and typability might be enough. And we can always revisit and add a shortcut later.
Using sum as a string tag is natural, but not reusing an existing function as a tag. That is ugly and weird. I didn't complain about that so far, because it seemed like a necessary evil. The string approach solves all that.
And by the time we decide it is useful, maybe Python will add a literal syntax, which would fix the problem. 3loc + 1 or something like that.
Literal syntax would be preferred, but we then would still need a tag for sum and rebin. I don't share your dislike of eval at all. All the Python code you write is already evaluated on the fly by the interpreter. Now you pass a string explicitly to the interpreter. Why is that suddenly ugly? I find it quite natural, especially for Python.
I want discoverability: I want bh.sum to exist. And if it does, it needs to be the built in sum, or maybe the numpy sum. Therefore, that limits our choices.
Users wouldn't even have to know it is exactly the builtin sum.
I think Jim's suggestion is to have small types built into boost-histogram, which are discoverable. On top of that, strings would be accepted, which would be eval'd in a context where these tags are available.
Then you have your discoverability and I have a fast way to do transforms without importing those tags. And no complex numbers or misinterpreted np.sum or sum() are needed.
We need to have a solution before we add the shortcut method, that can be added later. So we need to have sum before we can add the eval. The string method (if added) must be the shortcut method, not the only method.
We need to have a solution before we add the shortcut method, that can be added later. So we need to have sum before we can add the eval. The string method (if added) must be the shortcut method, not the only method.
I don't see why string handling has to come after, but I am not against that. Implementing all this at once is not rocket science. Anyway, Jim's suggestion as I understand him is to have the tags anyway. What is revolting is to use existing Python objects as tags. I am not against implementing our own tags in some module.
That still leaves us with bh.sum and its definition. Sticking it into a string and eval'ing it doesn't change anything other than provide a shortcut way to do it. Again, the shortcut method cannot be the only way to do it.
Why can't the sum we implement be the Python sum? If someone wants to do:
from boost_histogram import rebin, sum
They will find sum([1,2]) is now broken.
Note: one other option, which I would be okay with, would be for __call__ on our sum tag to just be the built-in sum. This at least keeps from breaking in the above case, but means you can't write ::sum without the above import - which I thought was nice to have for free, but is not consistent with rebin.
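The option in the note above could look roughly like this. _SumTag is a hypothetical class name, and the assignment to sum stands in for what an import of such a tag would bind; none of this is boost-histogram's actual code.

```python
import builtins


class _SumTag:
    """Hypothetical tag object: usable as a slice step, and calling it
    simply defers to the built-in sum."""

    def __call__(self, *args, **kwargs):
        return builtins.sum(*args, **kwargs)

    def __repr__(self):
        return "sum"


# Assumption: this is what importing the tag under the name "sum" would bind.
sum = _SumTag()

total = sum([1, 2])               # still behaves like the built-in
projected = slice(None, None, sum)  # equivalent to writing ::sum in an index
print(total, projected.step)
```

Shadowing the built-in no longer breaks ordinary calls, at the cost that ::sum only carries the tag once the name has been imported.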
I'm not totally against the string shortcut method (though, then, 1 and "1" have different meanings, which I again find ugly and confusing)
from boost_histogram.tags import loc, sum, rebin
Yes, sum is clashing with builtin sum(), that's why it should not be imported at all, but passed to the slice as a string.
I don't want to prefix my tags with whatever namespace.
though, then, 1 and "1" have different meanings, which I again find ugly and confusing
We need two different types anyway. This is better than using complex numbers to represent the second kind of type. I am also ok with a slightly less compact but more obvious string-based DSL, which really just eval's Python code with the tags in the locals. We could reject "1.2" and only allow "loc(1.2)" or "1.2 * loc" (I don't much care either way, heck we could support both simultaneously).
We need to follow Python conventions. This was a problem with HistBook - it expected users to import *, and one of the things it imported was bin - oops. Python users may want to work in namespaces, and they may want to import things - we should make that work as smoothly as possible. I don't intend to use strings everywhere, and many other Python users will not either. We should use the language we are provided. A secondary way to jump out with eval'ed strings is not unreasonable, but should not be the only way or recommended way to work (unless it is unreasonable to do otherwise).
I am not disagreeing with you. Neither Jim nor I want to get rid of the tags completely. I already said that I am ok with tags that we implement in our little library in a local module. They should live in their own tag module, not in the top level boost_histogram. Then people can use them EDSL style. I would use them DSL style, because I am too lazy to import them and prefix them.
What I am strongly against is to use other external Python objects as tags, np.sum, builtin sum, complex numbers.
Why not in the top level module? bh.loc is not unreasonable, but bh.tags.loc is. They also live in a submodule (currently bh.uhi), but they are intended for common use.
This would just offset the index by N