scikit-hep / boost-histogram

Python bindings for the C++14 Boost::Histogram library
https://boost-histogram.readthedocs.io
BSD 3-Clause "New" or "Revised" License
143 stars 21 forks source link

support for `bh.loc(value) + N` #152

Closed HDembinski closed 4 years ago

HDembinski commented 4 years ago

This would just offset the index by N

HDembinski commented 4 years ago

Pollution of namespace and consistency. You introduced a module for storage, too. You said once that you want all common things to be grouped in a module, and I agree. You got to have principles and not change your mind at every corner.

HDembinski commented 4 years ago

I don't much care whether the module is called bh.tags or bh.uhi as long as they sit in a separate module.

henryiii commented 4 years ago

Tags are needed directly in the indexing and are used constantly. Storage is used once when creating the histogram. Also, someone looking for storage will probably look in storage - it is like an enum, and you write storage= before it. Someone looking for rebin may not know it is a tag.

henryiii commented 4 years ago

They do sit in a separate module, but are also imported into boost_histogram since they are intended for heavy use.

HDembinski commented 4 years ago

You make no sense. Someone looking for tags will look into bh.tags, obviously. Whether something is used constantly or sparingly has no impact on whether it should sit at top level or not.

henryiii commented 4 years ago

Only we know they are tags. A normal user will just know that there is a "loc" they can use to describe a location.

The top level should have things used very often. We don't do np.ufuncts.sum, etc.

henryiii commented 4 years ago

It should be used sparingly, but I think this is a case where it can be used. Making this bh.tags will force users to write a bunch of import statements, or use your DSL.

HDembinski commented 4 years ago

They do sit in a separate module, but are also imported into boost_histogram since they are intended for heavy use.

A user will see sum in side bh and assume it is a function to sum histograms, while the actual function will sit in bh.algorithms.

henryiii commented 4 years ago

And I think we are talking about 3 items, loc, rebin, and sum.

Valid point: but we can make bh.sum do both.

HDembinski commented 4 years ago

We can make bh.sum do both.

No, that is ugly territory again. One well-defined use per class.

henryiii commented 4 years ago

(Which is the best argument I've seen for not making it the Python sum, by the way)

henryiii commented 4 years ago

It's the same use. Sum a histogram or sum one axis of a histogram. It should not have the same name if they were different uses.

henryiii commented 4 years ago

If they are not the same, let's not name it sum - problem solved.

HDembinski commented 4 years ago

No, once it is a tag for a EDSL, the other time it is a transforming function. These are two very distinctly different things.

HDembinski commented 4 years ago

If they are not the same, let's not name it sum - problem solved.

They should be both called sum. We cannot avoid name clashes anyway, that's why we have namespaces in the first place.

HDembinski commented 4 years ago

You just have to do import bh.tags as bt or something. Then you have your two-letter prefix. I don't see why you make such a fuss about this.

HDembinski commented 4 years ago

We can discuss this all day, I will not change my mind. You were also adamant on having the storages in a submodule, which I think now was a good choice. I reluctantly agreed, and now I am just using all the good arguments you used against me before.

henryiii commented 4 years ago

Then you have a massive list of imports at the top, which all have to be repeated/taught/remembered. And your syntax is broken, you can't use a imported name in an imported statement.

You can just write import numpy as np and be done with it. Likewise with pandas. We should be able to do the same - we have a smaller, tighter focus than either.

henryiii commented 4 years ago

We will not have sum mean two different things - you just said they are different and it's bad design to name them the same. How about remove? It does remove an axis, and for a profile, it's not exactly a sum. You can't remove an axis without doing something with it - combining it using the accumulators seems to make sense.

HDembinski commented 4 years ago

You are grasping at straws now. My syntax is not broken.

HDembinski commented 4 years ago

I am ok with sum meaning different things in different contexts. The context is symbolized by the namespace.

HDembinski commented 4 years ago

The point is that the very same object may not have two different behaviors. That is confusing, weird, and ugly. And very very unpythonic (it is perl-like)

henryiii commented 4 years ago

The point is that the very same object may not have two different behaviors. That is confusing, weird, and ugly. And very very unpythonic (it is perl-like)

Huh? There are lots of places in Python where this is done. type with one argument returns the class of an object. type with three arguments generates a new class. And type itself is actually the class of a class and not really a function at all. property is both a function and a decorator - while you may argue that is a side effect of the language, they provided ways to use it to add multiple functions, for setters and deleters, in both cases - with keywords when called as function, or with sub-decorators.

The idea is you proved an object that represents an idea, then make sure it works in any context that makes sense. sum(h) and h[:, sum] are clearly different contexts.

jpivarski commented 4 years ago

I had already tapped out and I suggest you guys do, too. You can't argue over what "simplicity," "elegance," "naturalness," etc. mean unless you're in the same frame of mind.

Here's a suggestion: implement the general case that does not try to save the user any keystrokes, get it in front of users, and respond to their suggestions. There are interview/focus group/survey techniques if you want to get formal about it. For now, let's not try to make it pretty—usability can come later.

henryiii commented 4 years ago

Sounds good. The only thing we need to do is rename project, and I like remove. We can work on simplifications, etc. later. We already showed bh.<tags> to a group of users at PyHEP, so let's not break that. We did say we were going to rename project, so that's okay to change.

I will rename bh.uhi to bh.tags, however, as that will be a less cryptic name for users.

henryiii commented 4 years ago

A thought I had on the drive home: the three other tags, overflow, underflow, and end, are valid (at least overflow is) on string categories, so this will break on string categories.loc(2) is also a valid string category - could be a really bad, hard to catch mistake if growth is on. Complex numbers was a small, highly focused addition that allowed one extra thing and nothing else. Using eval'd strings opens a pandora's box of problems.

I really don't like using strings to shortcut good design.

If users really want a shortcut, we can discuss this eventually. For now, no strings (or complex numbers, sadly).

HDembinski commented 4 years ago

It is quite simple to resolve this. Strings are passed literally in the category string axis. For other axes, the string is eval'd. Nothing changes if growth is on, because an axis can only grow by a fill operation, not by a reduction.

I really think that using strings is a much better design than using complex number and misusing external python objects as tags. And I want the shortcuts, because I don't want to import tags.

henryiii commented 4 years ago

Strings are a valid item in axes operations. Complex numbers are not. You cannot tell me what cell:

h["overflow", "overflow"]

returns (or sets, in a setting operation). If I have a 2D histogram with one category and one numeric, this would do different things per axis, in fact. This breaks type safety, hampers our ability to produce good error messages, and even affects users (like me) who do not want to touch string-based UHI; they still need to use strings in array indexing if they have category axes and may trigger this instead. "2" might be the category 2 bin, or either the second bin or the bin containing the value 2 if applied to the wrong axis (depending on what it ended up meaning)

I really find h[::"rebin(2)"] to be revolting and not Pythonic at all. And I can just see someone eventually writing h[::"rebin({x})".format(x)] or h[::"rebin(" + str(x) + ")]. Our tags (and complex numbers) do not fall apart so fast when you try to misuse them. Also, this fails with user-supplied functions. One of the key ideas of UHI is that users can write their own tags or get tags from a different library. But those tags are not in scope when you eval the string, so:

from mylib import specialbinning
h[::"specialbinning"]

fails.

The key design of UHI is that it is not a DSL, but an EDSL - users can use their knowledge of indexing numpy arrays and python object manipulations. We could have invented a full DSL like tree.Fill has, but we are trying to make a Pythonic way to index histograms so users have to learn less. I can easily show you examples in Numpy that I've been using for over 10 years where complex numbers are used. You may not personally like it, but it is used and will be familiar to at least some people. It doesn't open a host of issues like just replacing things with strings does. I can't show you an example in Numpy where it evals strings in indexing. Evaling strings in other libraries is a) almost always clearly delineated in some sort of method, and b) is related to, you know, evaluating things. Not indexing.

I think allowing bh.loc * v is fine for now. We will have a bh.sum (unless we select a different name), and I'll come up with an internal implementation - a user does not need to know what that implementation is. I think if we add shortcuts now, we may start getting lazy with the design and make it hard for someone who wants to use it correctly to use by importing. We will later discuss shortcuts again.

PS: If you do lots of work with highly array-wise (vectorized in matlab terms) computations, which I have done, generating grids is crucial and very common, which is why these shortcuts exist.

henryiii commented 4 years ago

Quick example:

h = histogram(regular(10, 0, 20), category(['1', '2', '3', 'NAN']))

# My proposal        # String proposal
h[3j, "2"] = 2;      h["3", "2"] = 2

It is clear that these are axes are different in my example, it is clear the first one is not just finding a "3" bin, and you can't accidentally index one as the other.

jpivarski commented 4 years ago

I really think it's time to drop this, only implement the longhand form, and bring up the possibilities for a shorthand form after the users have been using it. That would allow them to be a part of the decision.

At a certain point, which we've crossed, we know that arguments are not going to convince each other. (Giving up is not the same thing as being convinced.) There's a community of users out there who would naturally take to one thing more than another. For example, I thought the right thing to do with ROOT's C++ strings in uproot was to pass then through as bytes, because they're not encoded (and ROOT is old and international, so there might be weird encodings out there), but this was very much the wrong choice—users want strings to be str, not because they said so, but because dozens have stumbled over the mistake of assuming it. (Therefore, I have to change it in uproot 4.)

There are several of these things where I've had to change course because the users' expectations turned out differently than I expected. The trade-off is that it's hard to change an API that people are using (hence, I have to wait for uproot 4 to fix the strings), but if we're taking about a shortcut syntax, we can implement the longhand only, which is not likely to change, and hold trial sessions for the shortcuts (at tutorials, perhaps, after the longhand has been introduced). That way, there's a stable API and a provisional one (or two), so that users can be productive, basing their scripts on the stable API, but they can give meaningful feedback about shortcuts.