Closed imandr closed 6 years ago
hist.fill(x=xarray, condition="'test'")
(one more level of quoting) might already work. If not, it's related to the constants-handling bug (#23, I think).
No, that does not work and it generates this:
ValueError: array 'tt' has len 1 but other arrays have len 100
h = Hist(bin("a", 10, 0, 1), groupby("tt"))
h.fill(a=np.random.random((100,)), tt='"a"')
h.fill(a=np.random.random((100,)), tt='"b"')
h.stack("tt").area("a").to(canvas)
By the way, hist.fill(x=xarray, condition=["test"]*len(xarray)) seems to be very inefficient. It severely slows things down.
The ValueError
you got was correct. The suggestion I have you was wrong. I meant to say
hist = Hist(bin("x", 100, 0.0, 1.0), groupby("'test'"))
hist.fill(x=xarray)
I put the single string in quotations in the wrong place. It's supposed to be a constant expression. Also, because of #23, I might need to fix constant expressions.
["test"]*len(xarray)
would be very slow, but that's because it's creating a large Python list and converting it to a string-valued Numpy array before it ever enters histbook.
No, that is not the use case I meant.
I meant this:
h = Hist(bin("data", 10, 0, 1), groupby("condition"))
h.fill(data=np.random.random((100,)), condition="a")
h.fill(data=np.random.random((100,)), condition="b")
...
Real life example: I want to plot some distribution for a number of datasets. So I call fill() some number of times for dataset "a", then for dataset "b" and so on, and then I want to see stacked histogram with all the datasets together.
So I am not grouping by a constant. It is constant only for a single call to fill(). Next call to fill() will use another value.
I figured that out in the other issue. You want to use the Hist.group
static method. (I'm trying to do all of this on a phone in a hotel room without wifi.)
I appreciate your help, Jim. Sorry you have to do this in such inhumane conditions :)
Maybe grouping would work too, but I got it figured out, implementing the "broadcasting" myself for now. Here is the fragment of real code which works:
h_all = Hist(hbin("NJets", 20, 0, 20), groupby("dataset"))
h_all_display = IPythonDisplay(
h_all
.stack("dataset")
.area("NJets", width=400)
)
for dataset_name in Datasets[:5]:
job = session.createJob(dataset_name, display=False)
job.addHistogram(h_all, ["NJets", "dataset"], constants={"dataset":dataset_name})
job.run()
runtime = job.TFinish - job.TStart
nevents = job.EventsProcessed
nworkers = len(job.WorkerAddresses)
h_all_display.update()
print "%s: %d events processed using %d workers in %.1f seconds, %.2e events/second" % (
dataset_name, nevents, nworkers, runtime, float(nevents)/runtime)
I don't see the fill
in this. (Maybe that happens remotely?)
One of the complaints about Histogrammar was that everything that appeared in a final plot had to come from a single dataset, but it's common to combine data from different Monte Carlo and data samples. It seemed unnatural to have to create new fields in the data that were piecewise constant to emulate what the physicists had in mind: bringing together data from different places.
So histbook has methods for adding information (like "group") in addition to removing information (like "select", "rebin", and "project").
Yes, the actual filling happens remotely and once in a while the Histbook histogram gets picked and shipped back and then cleared.
Here is the implementation of the broadcasting at the remote:
...
# by this time my_dict contains "regular" data in ndarray's
if my_dict:
nitems = len(my_dict[my_dict.keys()[0]])
for dn, dv in self.Constants.items():
my_dict[dn] = [dv]*nitems
self.H.fill(**my_dict)
>>> from histbook import *
>>> h = Hist(bin("x", 5, 0, 5), groupby("condition"))
>>> h.pandas()
Empty DataFrame
Columns: [count(), err(count())]
Index: []
>>> h.fill(x=[1, 2, 3], condition="one")
>>> h.pandas()
count() err(count())
condition x
one [-inf, 0.0) 0.0 0.0
[0.0, 1.0) 0.0 0.0
[1.0, 2.0) 1.0 1.0
[2.0, 3.0) 1.0 1.0
[3.0, 4.0) 1.0 1.0
[4.0, 5.0) 0.0 0.0
[5.0, inf) 0.0 0.0
{NaN} 0.0 0.0
>>> h.fill(x=[4, 5], condition="two")
>>> h.pandas()
count() err(count())
condition x
one [-inf, 0.0) 0.0 0.0
[0.0, 1.0) 0.0 0.0
[1.0, 2.0) 1.0 1.0
[2.0, 3.0) 1.0 1.0
[3.0, 4.0) 1.0 1.0
[4.0, 5.0) 0.0 0.0
[5.0, inf) 0.0 0.0
{NaN} 0.0 0.0
two [-inf, 0.0) 0.0 0.0
[0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[2.0, 3.0) 0.0 0.0
[3.0, 4.0) 0.0 0.0
[4.0, 5.0) 1.0 1.0
[5.0, inf) 1.0 1.0
{NaN} 0.0 0.0
Numerical values also get broadcasted:
>>> from histbook import *
>>> h = Hist(bin("x", 5, 0, 5), bin("y", 2, 0, 2, underflow=False, ove
rflow=False, nanflow=False))
>>> h.pandas()
count() err(count())
x y
[-inf, 0.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[0.0, 1.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[1.0, 2.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[2.0, 3.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[3.0, 4.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[4.0, 5.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[5.0, inf) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
{NaN} [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
>>> h.fill(x=[1, 2, 3], y=0)
>>> h.pandas()
count() err(count())
x y
[-inf, 0.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[0.0, 1.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[1.0, 2.0) [0.0, 1.0) 1.0 1.0
[1.0, 2.0) 0.0 0.0
[2.0, 3.0) [0.0, 1.0) 1.0 1.0
[1.0, 2.0) 0.0 0.0
[3.0, 4.0) [0.0, 1.0) 1.0 1.0
[1.0, 2.0) 0.0 0.0
[4.0, 5.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[5.0, inf) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
{NaN} [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
>>> h.fill(x=[4, 5], y=1)
>>> h.pandas()
count() err(count())
x y
[-inf, 0.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[0.0, 1.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
[1.0, 2.0) [0.0, 1.0) 1.0 1.0
[1.0, 2.0) 0.0 0.0
[2.0, 3.0) [0.0, 1.0) 1.0 1.0
[1.0, 2.0) 0.0 0.0
[3.0, 4.0) [0.0, 1.0) 1.0 1.0
[1.0, 2.0) 0.0 0.0
[4.0, 5.0) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 1.0 1.0
[5.0, inf) [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 1.0 1.0
{NaN} [0.0, 1.0) 0.0 0.0
[1.0, 2.0) 0.0 0.0
I am trying to use broadcasting with groupby() and a string as the groupby axis value, and it is not working:
import histbook
from histbook import Hist, bin, groupby
import numpy as np
h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
h.fill(dataset="dataset1", NJets=np.arange(100))
this fragment gives me a warning and an error:
/home/ivm/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full(100, array('dataset1',
dtype='|S8')) will return an array of dtype('S8')
format(shape, fill_value, array(fill_value).dtype), FutureWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-6c35a986016c> in <module>()
4
5 h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
----> 6 h.fill(dataset="dataset1", NJets=np.arange(100))
7
8 h1 = Hist(bin("NJets", 10, 0, 10), bin("dataset", 100, 0, 100))
/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in fill(self, arrays, **more)
574
575 self._prefill()
--> 576 length = self._fill(arrays)
577 self._postfill(arrays, length)
578
/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.1.0-py2.7.egg/histbook/hist.pyc in _fill(self, arrays)
151 array = numpy.array(array)
152 if array.shape == ():
--> 153 array = numpy.full(length, array)
154
155 if length != array.shape[0]:
/home/ivm/anaconda2/lib/python2.7/site-packages/numpy/core/numeric.pyc in full(shape, fill_value, dtype, order)
300 "in the future, full({0}, {1!r}) will return an array of {2!r}".
301 format(shape, fill_value, array(fill_value).dtype), FutureWarning)
--> 302 multiarray.copyto(a, fill_value, casting='unsafe')
303 return a
304
ValueError: could not convert string to float: dataset1
this works fine:
h1 = Hist(bin("NJets", 10, 0, 10), bin("dataset", 100, 0, 100))
h1.fill(dataset=13, NJets=np.arange(100))
my version is 1.2.0
Hmmm.
>>> from histbook import *
>>> import numpy as np
>>> h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
>>> h.fill(dataset="dataset1", NJets=np.arange(100))
>>> h.pandas()
count() err(count())
dataset NJets
dataset1 [-inf, 0.0) 0.0 0.000000
[0.0, 1.0) 1.0 1.000000
[1.0, 2.0) 1.0 1.000000
[2.0, 3.0) 1.0 1.000000
[3.0, 4.0) 1.0 1.000000
[4.0, 5.0) 1.0 1.000000
[5.0, 6.0) 1.0 1.000000
[6.0, 7.0) 1.0 1.000000
[7.0, 8.0) 1.0 1.000000
[8.0, 9.0) 1.0 1.000000
[9.0, 10.0) 1.0 1.000000
[10.0, inf) 90.0 9.486833
{NaN} 0.0 0.000000
My version is 1.2.0 also. What's your Numpy version?
>>> histbook.__version__
'1.2.0'
>>> np.__version__
'1.14.4'
This might imply a higher minimum Numpy version: right now I say install_requires = ["numpy>=1.8.0"]
.
import numpy numpy.version '1.11.3'
I found Numpy 1.11.3 on lxplus and tried it there. Indeed:
>>> numpy.full(10, "hello")
/afs/cern.ch/user/p/pivarski/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py:301: FutureWarning: in the future, full(10, 'hello') will return an array of dtype('<U5')
format(shape, fill_value, array(fill_value).dtype), FutureWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/afs/cern.ch/user/p/pivarski/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 302, in full
multiarray.copyto(a, fill_value, casting='unsafe')
ValueError: could not convert string to float: 'hello'
I'll keep the Numpy requirement minimum where it is and put a work-around in this bit of code. I think this just affects everywhere numpy.full
is called with a string argument.
I upgraded numpy to 1.14.5 and now the error is different:
import histbook
from histbook import Hist, bin, groupby
import numpy as np
print np.__version__
h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
h.fill("dataset1", NJets=np.arange(100))
1.14.5
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-84a0ed9de926> in <module>()
6
7 h = Hist(bin("NJets", 10, 0, 10), groupby("dataset"))
----> 8 h.fill("dataset1", NJets=np.arange(100))
9
/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/hist.pyc in fill(self, arrays, **more)
374
375 self._prefill()
--> 376 length = self._fill(arrays)
377 self._postfill(arrays, length)
378
/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/fill.pyc in _fill(self, arrays)
118 else:
119 try:
--> 120 array = arrays[instruction.extern.value]
121 except KeyError:
122 if instruction.extern.value in histbook.expr.Expr.maybeconstants:
/home/ivm/anaconda2/lib/python2.7/site-packages/histbook-1.2.0-py2.7.egg/histbook/util/__init__.pyc in __getitem__(self, n)
38 return self._two[n] # and it has precedence
39 else:
---> 40 return self._one[n] # self._one might only have __getitem__
TypeError: string indices must be integers, not str
Sorry. my bad. Call to h.fill() is wrong.
It works fine with numpy 1.14.5
I just patched master so that it works as shown in my comment in Numpy 1.14.4 and 1.11.3.
But your problem is that you're missing a keyword argument:
h.fill("dataset1", NJets=np.arange(100))
instead of
h.fill(dataset="dataset1", NJets=np.arange(100))
It's trying to interpret the positional (non-keyword) argument as a dict of arrays, a Pandas DataFrame, or a Spark DataFrame. I wanted to be loose about it so that anything with df["column"] → array would work in that slot (i.e. duck typing). But if the "df" is a string, it applies square-bracket-string-close-square-bracket to what is actually a string, hence "string indices must be integers, not str."
Imagine I have a histogram like this:
Hist(bin("x", 100, 0.0, 1.0), groupby("condition"))
and I have an array of x collected under condition="test". So I want to put these data into the histogram. To do that, I need to do something like this:
hist.fill(x=xarray, condition=["test"]*len(xarray))
I think it would be useful to adopt a simple "broadcasting" rule to allow this:
hist.fill(x=xarray, condition="test")