gradhep / center

The center for all things differentiable analysis!

Smooth histograms — which methods are best? #3

Open phinate opened 4 years ago

phinate commented 4 years ago

Following up on the discussion that took place during the kick-off meeting: an interesting point came up about which methods are best suited to approximating a set of binned counts.

In the case where one has a neural network-based observable, one could apply the softmax function exp(nn / T) / sum(exp(nn / T)) (T = temperature hyperparameter) to the nn output, as one would do for a classification problem. This essentially puts every event into all bins, but weights its contribution to each bin by the corresponding softmaxed nn output (normalized to 1 per event). The histogram is then the elementwise sum of the softmaxed outputs over all events.
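As a rough sketch of how the softmax binning could look (assuming a network head with one output per bin; the function name and toy inputs below are purely illustrative):

import numpy as np

def softmax_histogram(logits, T=0.1):
    """Soft histogram from per-event bin logits.

    logits: shape (n_events, n_bins), e.g. the raw output of a network
            with one node per bin (illustrative setup, not a fixed API).
    T:      temperature; smaller T gives a sharper, more histogram-like assignment.
    """
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)      # per-event softmax, each row sums to 1
    return w.sum(axis=0)                   # elementwise sum over events -> soft counts

# toy usage: 1000 events, 5 bins of random "network outputs"
rng = np.random.default_rng(0)
soft_counts = softmax_histogram(rng.normal(size=(1000, 5)), T=0.5)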

An alternative is to take an approach based on a kernel density estimate (kde), which is defined by instantiating a distribution (kernel) centered on each data point (e.g. a normal distribution) and averaging their contributions. The smoothness of the resulting estimate is controlled with a 'bandwidth' hyperparameter, which scales the widths of the kernels. Once you choose a bandwidth, you can get 'counts' by integrating the pdf over a set of intervals (bins), or equivalently by using the cumulative distribution function.
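For concreteness, a minimal sketch of the density itself with Gaussian kernels (names here are illustrative; the binned 'counts' version via the CDF appears in the code further down the thread):

import numpy as np
from scipy.stats import norm

def kde_pdf(x, events, bandwidth):
    """Evaluate a Gaussian KDE built from `events` at the points `x`.

    One kernel (a normal pdf of width `bandwidth`) is centred on each
    event; their contributions are averaged to give the density estimate.
    """
    kernels = norm.pdf(x, loc=events[:, None], scale=bandwidth)  # (n_events, n_x)
    return kernels.mean(axis=0)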

Revisiting the case with the nn-based observable, one could make the network output a single value (regression net), and then define a kde over the set of outputs from a batch of events. The question is then: is there any clear advantage, in expressivity or in principle, to using a kde, softmax, or another method?

Another interesting question is where these approaches break down; e.g. using a kde in this way seems to be very sensitive to the bandwidth.

It's also worth noting that a kde may make more sense for general use as a candidate for smooth histogramming, but the case of the nn observable may be more nuanced.

philippwindischhofer commented 4 years ago

[...] which is defined by instantiating a distribution (kernel) centered around each data point [...]

This is a very pragmatic question and shows my ignorance about these things: Have you already tried to investigate the scaling behavior of your current KDE implementation (resource-wise)? Do you (need to) keep all the kernels of all data points in memory at the same time for this to work? (Put differently: "what happens if you fill a KDE-style histogram with 10M events?")

phinate commented 4 years ago

@philippwindischhofer In practice, our code looks something like:

import numpy as np
from scipy.stats import norm

events = np.array([x_1, x_2, ..., x_n])  # one observable value per event
bins = np.linspace(a, b, num_steps)      # bin edges

edge_hi = bins[1:]   # high edge for each bin
edge_lo = bins[:-1]  # low edge for each bin

# get cumulative counts from 'a' up to each lo and hi bin edge
# (events[:, None] broadcasts every event's kernel against every edge)
cdf_up = norm.cdf(edge_hi, loc=events[:, None], scale=bandwidth)
cdf_dn = norm.cdf(edge_lo, loc=events[:, None], scale=bandwidth)

# subtract to get each event's contribution to [lo, hi], then sum over events
counts = (cdf_up - cdf_dn).sum(axis=0)

Since this is essentially just applying norm.cdf twice in a vectorised way, I imagine there wouldn't be any issues with scaling (but I haven't tried!)

HDembinski commented 4 years ago

I think the use of KDEs is fine, but using them with a normal Poisson-based maximum-likelihood function as implemented in pyhf is mathematically wrong. Maximum-likelihood fits only produce unbiased estimates of the value and its uncertainty when the data in each bin are Poisson-distributed, which is not the case when you convert data into KDEs. If you add a single new event, it will increase the value not only in a single bin but also in neighboring bins. These additional correlations are not included in the fitted model.

Whether this is a problem in practice, I cannot say. The problem I describe may not affect the optimization significantly, but this point needs to be studied.

alexander-held commented 4 years ago

The bandwidth could be reduced throughout analysis optimization to slowly approach the discrete bin case. For the final inference step after the optimization, the KDE can be replaced by a traditional histogram.
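A minimal sketch of what such a schedule could look like (the geometric decay and the names here are illustrative assumptions, not a prescription from this thread):

def bandwidth_schedule(step, n_steps, bw_start=0.5, bw_end=0.01):
    """Illustrative: shrink the KDE bandwidth geometrically from
    bw_start to bw_end over the course of the optimization."""
    frac = step / max(n_steps - 1, 1)
    return bw_start * (bw_end / bw_start) ** frac

# bandwidths at the start, middle and end of a 1000-step optimization
print([bandwidth_schedule(s, 1000) for s in (0, 500, 999)])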

HDembinski commented 4 years ago

btw

events = np.array([x_1, x_2, ..., x_n])  # one observable value per event
bins = np.linspace(a, b, num_steps)      # bin edges

edge_hi = bins[1:]   # high edge for each bin
edge_lo = bins[:-1]  # low edge for each bin

# get cumulative counts from 'a' up to each lo and hi bin edge
cdf_up = norm.cdf(edge_hi, loc=events[:, None], scale=bandwidth)
cdf_dn = norm.cdf(edge_lo, loc=events[:, None], scale=bandwidth)

# subtract to get each event's contribution to [lo, hi], then sum over events
counts = (cdf_up - cdf_dn).sum(axis=0)

can be more efficiently written as

events = np.array([x_1, x_2, ..., x_n])  # one observable value per event
bins = np.linspace(a, b, num_steps)      # bin edges

# evaluate the cumulative distribution once at every bin edge
cdf = norm.cdf(bins, loc=events[:, None], scale=bandwidth)

# difference adjacent edges to get counts in [lo, hi], then sum over events
counts = np.diff(cdf, axis=1).sum(axis=0)
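This evaluates the CDF only once per bin edge instead of twice per bin (the inner edges are shared between neighbouring bins), and differencing adjacent edges gives the same bin contents.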