scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License
109 stars 9 forks source link

Broadcasting ? #24

Closed imandr closed 6 years ago

imandr commented 6 years ago

Imagine I have a histogram like this:

Hist(bin("x", 100, 0.0, 1.0), groupby("condition"))

and I have an array of x collected under condition="test". So I want to put these data into the histogram. To do that, I need to do something like this:

hist.fill(x=xarray, condition=["test"]*len(xarray))

I think it would be useful to adopt a simple "broadcasting" rule to allow this:

hist.fill(x=xarray, condition="test")

jpivarski commented 6 years ago

hist.fill(x=xarray, condition="'test'") (one more level of quoting) might already work. If not, it's related to the constants-handling bug (#23, I think).

imandr commented 6 years ago

No, that does not work and it generates this:

ValueError: array 'tt' has len 1 but other arrays have len 100

h = Hist(bin("a", 10, 0, 1), groupby("tt"))
h.fill(a=np.random.random((100,)), tt='"a"')
h.fill(a=np.random.random((100,)), tt='"b"')
h.stack("tt").area("a").to(canvas)
imandr commented 6 years ago

By the way, hist.fill(x=xarray, condition=["test"]*len(xarray)) seems to be very inefficient. It severely slows things down.

jpivarski commented 6 years ago

The ValueError you got was correct. The suggestion I have you was wrong. I meant to say

hist = Hist(bin("x", 100, 0.0, 1.0), groupby("'test'"))
hist.fill(x=xarray)

I put the single string in quotations in the wrong place. It's supposed to be a constant expression. Also, because of #23, I might need to fix constant expressions.

["test"]*len(xarray) would be very slow, but that's because it's creating a large Python list and converting it to a string-valued Numpy array before it ever enters histbook.

imandr commented 6 years ago

No, that is not the use case I meant.

I meant this:

h = Hist(bin("data", 10, 0, 1), groupby("condition"))
h.fill(data=np.random.random((100,)), condition="a")
h.fill(data=np.random.random((100,)), condition="b")
...

Real life example: I want to plot some distribution for a number of datasets. So I call fill() some number of times for dataset "a", then for dataset "b" and so on, and then I want to see stacked histogram with all the datasets together.

So I am not grouping by a constant. It is constant only for a single call to fill(). Next call to fill() will use another value.

jpivarski commented 6 years ago

I figured that out in the other issue. You want to use the Hist.group static method. (I'm trying to do all of this on a phone in a hotel room without wifi.)

imandr commented 6 years ago

I appreciate your help, Jim. Sorry you have to do this in such inhumane conditions :)

Maybe grouping would work too, but I got it figured out, implementing the "broadcasting" myself for now. Here is the fragment of real code which works:

h_all = Hist(hbin("NJets", 20, 0, 20), groupby("dataset"))

h_all_display = IPythonDisplay(
    h_all
        .stack("dataset")
        .area("NJets", width=400)
)

for dataset_name in Datasets[:5]:
    job = session.createJob(dataset_name, display=False)
    job.addHistogram(h_all, ["NJets", "dataset"], constants={"dataset":dataset_name})
    job.run()
    runtime = job.TFinish - job.TStart
    nevents = job.EventsProcessed
    nworkers = len(job.WorkerAddresses)
    h_all_display.update()
    print "%s: %d events processed using %d workers in %.1f seconds, %.2e events/second" % (
                    dataset_name, nevents, nworkers, runtime, float(nevents)/runtime)
jpivarski commented 6 years ago

I don't see the fill in this. (Maybe that happens remotely?)

One of the complaints about Histogrammar was that everything that appeared in a final plot had to come from a single dataset, but it's common to combine data from different Monte Carlo and data samples. It seemed unnatural to have to create new fields in the data that were piecewise constant to emulate what the physicists had in mind: bringing together data from different places.

So histbook has methods for adding information (like "group") in addition to removing information (like "select", "rebin", and "project").

imandr commented 6 years ago

Yes, the actual filling happens remotely and once in a while the Histbook histogram gets picked and shipped back and then cleared.

Here is the implementation of the broadcasting at the remote:

       ...
       # by this time my_dict contains "regular" data in ndarray's
       if my_dict:
                nitems = len(my_dict[my_dict.keys()[0]])
                for dn, dv in self.Constants.items():
                        my_dict[dn] = [dv]*nitems
        self.H.fill(**my_dict)
jpivarski commented 6 years ago
>>> from histbook import *
>>> h = Hist(bin("x", 5, 0, 5), groupby("condition"))
>>> h.pandas()
Empty DataFrame
Columns: [count(), err(count())]
Index: []
>>> h.fill(x=[1, 2, 3], condition="one")
>>> h.pandas()
                       count()  err(count())
condition x                                 
one       [-inf, 0.0)      0.0           0.0
          [0.0, 1.0)       0.0           0.0
          [1.0, 2.0)       1.0           1.0
          [2.0, 3.0)       1.0           1.0
          [3.0, 4.0)       1.0           1.0
          [4.0, 5.0)       0.0           0.0
          [5.0, inf)       0.0           0.0
          {NaN}            0.0           0.0
>>> h.fill(x=[4, 5], condition="two")
>>> h.pandas()
                       count()  err(count())
condition x                                 
one       [-inf, 0.0)      0.0           0.0
          [0.0, 1.0)       0.0           0.0
          [1.0, 2.0)       1.0           1.0
          [2.0, 3.0)       1.0           1.0
          [3.0, 4.0)       1.0           1.0
          [4.0, 5.0)       0.0           0.0
          [5.0, inf)       0.0           0.0
          {NaN}            0.0           0.0
two       [-inf, 0.0)      0.0           0.0
          [0.0, 1.0)       0.0           0.0
          [1.0, 2.0)       0.0           0.0
          [2.0, 3.0)       0.0           0.0
          [3.0, 4.0)       0.0           0.0
          [4.0, 5.0)       1.0           1.0
          [5.0, inf)       1.0           1.0
          {NaN}            0.0           0.0
jpivarski commented 6 years ago

Numerical values also get broadcasted:

>>> from histbook import *
>>> h = Hist(bin("x", 5, 0, 5), bin("y", 2, 0, 2, underflow=False, ove
rflow=False, nanflow=False))
>>> h.pandas()
                        count()  err(count())
x           y
[-inf, 0.0) [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[0.0, 1.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[1.0, 2.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[2.0, 3.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[3.0, 4.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[4.0, 5.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[5.0, inf)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
{NaN}       [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
>>> h.fill(x=[1, 2, 3], y=0)
>>> h.pandas()
                        count()  err(count())
x           y                                
[-inf, 0.0) [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[0.0, 1.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[1.0, 2.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[2.0, 3.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[3.0, 4.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[4.0, 5.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[5.0, inf)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
{NaN}       [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
>>> h.fill(x=[4, 5], y=1)
>>> h.pandas()
                        count()  err(count())
x           y                                
[-inf, 0.0) [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[0.0, 1.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
[1.0, 2.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[2.0, 3.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[3.0, 4.0)  [0.0, 1.0)      1.0           1.0
            [1.0, 2.0)      0.0           0.0
[4.0, 5.0)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      1.0           1.0
[5.0, inf)  [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      1.0           1.0
{NaN}       [0.0, 1.0)      0.0           0.0
            [1.0, 2.0)      0.0           0.0
imandr commented 6 years ago

I am trying to use broadcasting with groupby() and a string as the groupby axis value, and it is not working:

import histbook
from histbook import Hist, bin, groupby
import numpy as np

h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
h.fill(dataset="dataset1", NJets=np.arange(100))

this fragment gives me a warning and an error:

/home/ivm/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full(100, array('dataset1', 
      dtype='|S8')) will return an array of dtype('S8')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-6c35a986016c> in <module>()
      4 
      5 h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
----> 6 h.fill(dataset="dataset1", NJets=np.arange(100))
      7 
      8 h1 = Hist(bin("NJets", 10, 0, 10), bin("dataset", 100, 0, 100))

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in fill(self, arrays, **more)
    574 
    575             self._prefill()
--> 576             length = self._fill(arrays)
    577             self._postfill(arrays, length)
    578 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in _fill(self, arrays)
    151                     array = numpy.array(array)
    152                 if array.shape == ():
--> 153                     array = numpy.full(length, array)
    154 
    155                 if length != array.shape[0]:

/home/ivm/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.pyc in full(shape, fill_value, dtype, order)
    300             "in the future, full({0}, {1!r}) will return an array of {2!r}".
    301             format(shape, fill_value, array(fill_value).dtype), FutureWarning)
--> 302     multiarray.copyto(a, fill_value, casting='unsafe')
    303     return a
    304 

ValueError: could not convert string to float: dataset1

this works fine:

h1 = Hist(bin("NJets", 10, 0, 10), bin("dataset", 100, 0, 100))
h1.fill(dataset=13, NJets=np.arange(100))

my version is 1.2.0

jpivarski commented 6 years ago

Hmmm.

>>> from histbook import *
>>> import numpy as np
>>> h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
>>> h.fill(dataset="dataset1", NJets=np.arange(100))
>>> h.pandas()
                      count()  err(count())
dataset  NJets                             
dataset1 [-inf, 0.0)      0.0      0.000000
         [0.0, 1.0)       1.0      1.000000
         [1.0, 2.0)       1.0      1.000000
         [2.0, 3.0)       1.0      1.000000
         [3.0, 4.0)       1.0      1.000000
         [4.0, 5.0)       1.0      1.000000
         [5.0, 6.0)       1.0      1.000000
         [6.0, 7.0)       1.0      1.000000
         [7.0, 8.0)       1.0      1.000000
         [8.0, 9.0)       1.0      1.000000
         [9.0, 10.0)      1.0      1.000000
         [10.0, inf)     90.0      9.486833
         {NaN}            0.0      0.000000

My version is 1.2.0 also. What's your Numpy version?

>>> histbook.__version__
'1.2.0'
>>> np.__version__
'1.14.4'

This might imply a higher minimum Numpy version: right now I say install_requires = ["numpy>=1.8.0"].

imandr commented 6 years ago

import numpy numpy.version '1.11.3'

jpivarski commented 6 years ago

I found Numpy 1.11.3 on lxplus and tried it there. Indeed:

>>> numpy.full(10, "hello")
/afs/cern.ch/user/p/pivarski/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full(10, 'hello') will return an array of dtype('<U5')
  format(shape, fill_value, array(fill_value).dtype), FutureWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/afs/cern.ch/user/p/pivarski/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 302, in full
    multiarray.copyto(a, fill_value, casting='unsafe')
ValueError: could not convert string to float: 'hello'

I'll keep the Numpy requirement minimum where it is and put a work-around in this bit of code. I think this just affects everywhere numpy.full is called with a string argument.

imandr commented 6 years ago

I upgraded numpy to 1.14.5 and now the error is different:

import histbook
from histbook import Hist, bin, groupby
import numpy as np

print np.__version__

h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
h.fill("dataset1", NJets=np.arange(100))
1.14.5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-84a0ed9de926> in <module>()
      6 
      7 h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
----> 8 h.fill("dataset1", NJets=np.arange(100))
      9 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/hist.pyc in fill(self, arrays, **more)
    374 
    375             self._prefill()
--> 376             length = self._fill(arrays)
    377             self._postfill(arrays, length)
    378 

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/fill.pyc in _fill(self, arrays)
    118                 else:
    119                     try:
--> 120                         array = arrays[instruction.extern.value]
    121                     except KeyError:
    122                         if instruction.extern.value in histbook.expr.Expr.maybeconstants:

/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/util/__init__.pyc in __getitem__(self, n)
     38             return self._two[n]    # and it has precedence
     39         else:
---> 40             return self._one[n]    # self._one might only have __getitem__

TypeError: string indices must be integers, not str

Sorry. my bad. Call to h.fill() is wrong.

It works fine with numpy 1.14.5

jpivarski commented 6 years ago

I just patched master so that it works as shown in my comment in Numpy 1.14.4 and 1.11.3.

But your problem is that you're missing a keyword argument:

h.fill("dataset1", NJets=np.arange(100))

instead of

h.fill(dataset="dataset1", NJets=np.arange(100))

It's trying to interpret the positional (non-keyword) argument as a dict of arrays, a Pandas DataFrame, or a Spark DataFrame. I wanted to be loose about it so that anything with df["column"] → array would work in that slot (i.e. duck typing). But if the "df" is a string, it applies square-bracket-string-close-square-bracket to what is actually a string, hence "string indices must be integers, not str."