scikit-hep / root_numpy

The interface between ROOT and NumPy
http://scikit-hep.org/root_numpy
BSD 3-Clause "New" or "Revised" License
133 stars 54 forks source link

Creating histograms with a dimension higher than 3 / multiplying 4D data with 2D weights #306

Closed gbesjes closed 7 years ago

gbesjes commented 7 years ago

Hi,

I've got a bunch of data for which I'd like to be able to construct histograms with a dimension greater than 3, to then be able to project on a 2D plane or one of the axes. (I'm aware that this is like a tree with 4 branches, but in that analogy the bin errors are discarded.) Is there an easy way to achieve that in root_numpy?

The reason that I'm looking for this solution is that I have data in the form A:B:X:Y, and a set of weights (with errors!) binned in A:B. I'm then interested in X:Y with weights applied, i.e. where a histogram-based multiplication has occurred. To give a practical example: think of an nJet histogram where you've got some weights binned in pT. That's easy: build a 2D histogram, and project. Now you'd like to see e.g. nJet:MET with a weight binned in pT and eta.

I'd be very curious to hear what's the easiest way to achieve that, other than some construed hacks. Solutions that I've already thought of include a constructing N 3-dimensional histograms, then projecting each of these to 1-dimensional histogram and stacking all of those behind each other to create a 2D histogram. Alternatively, looping over the array content and then picking a weight is also possible and trivial if there were no errors for A:B - but becomes a bit nastier as one has to re-implement error propagation.

ndawe commented 7 years ago

I think this just requires a bit of gymnastics on the numpy side:

import numpy as np
from numpy import random
import math
import matplotlib.pyplot as plt

# make some toy data (this could be from root2array)
data = np.empty(10000, dtype=[('pt', float), ('eta', float), ('met', float), ('njet', int)])
data['pt'] = random.lognormal(3, 1, size=data.shape[0])
data['eta'] = random.normal(0, 2, size=data.shape[0])
data['met'] = data['pt'] + np.random.normal(0, 3, size=data.shape[0])
data['njet'] = random.poisson(size=data.shape[0])

# create the 2D weights over pt and eta (could also be from a 2D histogram and hist2array)
pt_edges = np.linspace(10, 100, 11)
eta_edges = np.linspace(-5, 5, 11)
xx, yy = np.meshgrid(pt_edges, eta_edges)
binned_weights = (1 + 1./xx) * (1 / (np.exp(0.1 * np.abs(yy))))

# plot the weights
plt.figure()
ax = plt.contourf(pt_edges, eta_edges, binned_weights)
cbar = plt.colorbar(ax)
cbar.ax.set_ylabel("weight")
plt.xlabel("pt")
plt.ylabel("eta")
plt.tight_layout()
plt.savefig("weights.png")

# plot met, njet without weights
plt.figure()
plt.hist2d(data['njet'], data['met'], bins=(5, 10), range=((0, 5), (0, 100)))
plt.colorbar()
plt.xlabel("njet")
plt.ylabel("met")
plt.tight_layout()
plt.savefig("njet_met.png")

# plot met, njet with weights
# np.searchsorted is the key. This allows you to determine the indices in binned_weights for the data.
pt_idx = np.searchsorted(pt_edges, data['pt']) - 1
eta_idx = np.searchsorted(eta_edges, data['eta']) - 1

plt.figure()
plt.hist2d(data['njet'], data['met'], bins=(5, 10), range=((0, 5), (0, 100)), weights=binned_weights[pt_idx, eta_idx])
plt.colorbar()
plt.xlabel("njet")
plt.ylabel("met")
plt.tight_layout()
plt.savefig("njet_met_weighted.png")