iris-hep / as-user-facing

Collecting information on the user-facing interface to the analysis system

Iterative reweighting of data-driven background estimation #5

Open masonproffitt opened 5 years ago

masonproffitt commented 5 years ago

Take an analysis that has two dimensions of signal and control regions. This is like the ABCD method but with more than one discriminating value per dimension:

A B C D E
F G H I J
K L M N O

where A, B, and C are the search regions, each representing a discrete orthogonal selection of the same signal but with different efficiencies. All other regions are for data-driven background estimation.

In order to search for a signal, data is compared to the estimated background in a binned histogram of an event-level kinematic variable. An analyst wants to use regions D and E to estimate the background in the signal regions A, B, and C, but the kinematics are correlated with the horizontal axis in the above table.

To account for this, the analyst iteratively reweights events in regions N and O so that key kinematic distributions converge to the distributions in K, L, and M. These weights will be applied to the events in regions I and J to make a prediction for and verify against regions F, G, and H. Finally, the weights are applied to regions D and E to get the background estimates in A, B, and C.
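
The region layout and the role of each row described above can be written down as a small lookup. This is only an illustrative sketch: the function names, the `ROW_ROLE` labels, and the (row, column) indexing are assumptions made for this example and are not taken from the analysis code.

```python
import string

# Layout from the table above: rows A-E, F-J, K-O.
N_ROWS, N_COLS = 3, 5
# Assumption for this sketch: the first three columns hold the regions being
# predicted, the last two columns hold the regions that get reweighted.
N_TARGET_COLS = 3

# Role of each row in the procedure (labels are hypothetical).
ROW_ROLE = {
    2: "derive weights",      # N, O reweighted to match K, L, M
    1: "validate weights",    # I, J predict F, G, H
    0: "estimate background", # D, E predict A, B, C
}


def region_label(row, col):
    """Map a (row, column) position in the grid to its letter, A through O."""
    if not (0 <= row < N_ROWS and 0 <= col < N_COLS):
        raise ValueError(f"no region at row={row}, col={col}")
    return string.ascii_uppercase[row * N_COLS + col]


def target_regions(row):
    """Regions in this row whose distributions are being predicted."""
    return [region_label(row, col) for col in range(N_TARGET_COLS)]


def source_regions(row):
    """Regions in this row whose events carry the weights."""
    return [region_label(row, col) for col in range(N_TARGET_COLS, N_COLS)]


# Walk through the rows in the order the procedure uses them.
for row in (2, 1, 0):
    print(ROW_ROLE[row], source_regions(row), "->", target_regions(row))
```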

The procedure:

  1. Loop through all preselected events in data and classify each as one of the categories A through O. If the event is in regions D, E, I, J, N, or O, the region is split into two subregions by another discrete observable (ignore all other regions for this procedure). Fill separate histograms for each subregion for each of several kinematic variables.
  2. For each of the kinematic variables, take ratios of the corresponding histogram bin contents of the same kinematic variable between several different subregions.
  3. Fit splines to these histogram ratios and store them.
  4. Loop through all events again, going through the same categorization and histogram-filling procedure, except first evaluate the spline values corresponding to the values of the kinematic variables of each event, multiply these spline values together, and use this product as the weight for filling the histograms for that event. Also store the weight for each event.
  5. Repeat step 4 several (~10) times, except multiplying the weight of each event in the previous iteration by the new spline product to get the new weight.
  6. Store the weights of the final iteration.
  7. Later on, load these weights from a file. Loop over events one more time, now only selecting events in A, B, C, D, or E. Use the loaded weights for D and E events to fill the background estimation histograms and compare to the distributions from the signal regions A, B, and C.
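
The core of steps 1 through 6 can be sketched in plain Python as follows. This is a simplified toy under several assumptions, not the analysis code: there is one "source" sample and one "target" sample instead of the full set of subregions, a piecewise-linear interpolation through the bin centres stands in for the spline fit, and all function names and the toy distributions are made up for illustration.

```python
import random
from bisect import bisect_right


def fill_histogram(values, weights, edges):
    """Weighted 1D histogram; values outside the edges are dropped."""
    counts = [0.0] * (len(edges) - 1)
    for value, weight in zip(values, weights):
        i = bisect_right(edges, value) - 1
        if 0 <= i < len(counts):
            counts[i] += weight
    return counts


def ratio_curve(target_counts, source_counts, edges):
    """Smooth the per-bin target/source ratio.

    Stand-in for the spline fit: linear interpolation between bin centres,
    held constant beyond the first and last centre.
    """
    centers = [0.5 * (lo + hi) for lo, hi in zip(edges, edges[1:])]
    ratios = [t / s if s > 0 else 1.0
              for t, s in zip(target_counts, source_counts)]

    def evaluate(x):
        if x <= centers[0]:
            return ratios[0]
        if x >= centers[-1]:
            return ratios[-1]
        i = bisect_right(centers, x) - 1
        frac = (x - centers[i]) / (centers[i + 1] - centers[i])
        return ratios[i] + frac * (ratios[i + 1] - ratios[i])

    return evaluate


def iterate_weights(source, target, edges_per_var, n_iter=10):
    """Return one weight per source event after n_iter reweighting passes."""
    # Start from a flat weight that matches the total yields.
    weights = [len(target) / len(source)] * len(source)
    ones = [1.0] * len(target)
    for _ in range(n_iter):
        # One ratio curve per kinematic variable, from the current weights.
        curves = []
        for k, edges in enumerate(edges_per_var):
            h_target = fill_histogram([e[k] for e in target], ones, edges)
            h_source = fill_histogram([e[k] for e in source], weights, edges)
            curves.append(ratio_curve(h_target, h_source, edges))
        # New weight = previous weight times the product of the curve values.
        updated = []
        for event, weight in zip(source, weights):
            product = 1.0
            for k, curve in enumerate(curves):
                product *= curve(event[k])
            updated.append(weight * product)
        weights = updated
    return weights


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy data: two independent variables per event, both in [0, 1).
rng = random.Random(0)
source = [(rng.random(), rng.random()) for _ in range(5000)]
target = [(max(rng.random(), rng.random()), min(rng.random(), rng.random()))
          for _ in range(5000)]
edges = [[i / 10 for i in range(11)]] * 2

weights = iterate_weights(source, target, edges)
print("source mean x, unweighted:", weighted_mean([e[0] for e in source], [1.0] * len(source)))
print("source mean x, reweighted:", weighted_mean([e[0] for e in source], weights))
print("target mean x:            ", weighted_mean([e[0] for e in target], [1.0] * len(target)))
```

In this toy the returned `weights` list plays the role of the stored per-event weights from step 6. Writing it to a file and reading it back for the final pass in step 7 is left out.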

cranmer commented 5 years ago

This kind of thing fits with the HistFitter use cases and with template fits, though those are usually 1-d. Is this technique used?

It's not clear to me that it would converge to the right thing and account for a generic correlated distribution.

The analysis systems should be flexible enough to do almost anything, but for our declarative specifications we should also make sure that they map to well-established approaches.

masonproffitt commented 5 years ago

There are fits (a very standard thing) involved, sure, but the important and somewhat less common thing here is the iteration. You need to be able to save information about each event and about the iteration as a whole, then run over all the data again with the information from the previous iteration as an input. I would think of this as more similar to training a machine learning algorithm and then using the training data for final classification.

In the limit of completely uncorrelated reweighting variables, it would be guaranteed to perfectly converge. In reality, ensuring convergence generally requires some tuning to avoid overcorrections. But it does work, assuming you pick good reweighting variables (the whole point is to get a small set of uncorrelated variables that nearly parameterize all the kinematics you care about). I don't know how established the method is elsewhere, but what I'm describing here is what we did in my previous ATLAS analysis: https://arxiv.org/abs/1804.06174 (see section 6.2).
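
One way to read the uncorrelated limit, as a sketch with two reweighting variables: the symbols $x$, $y$, $p_s$, $p_t$, $f$, and $g$ are notation assumed here, not taken from the paper. Suppose the variables are independent in both the reweighted (source) regions and the target regions, so that each joint density factorizes. Then the product of the two one-dimensional ratios is already the full density ratio, and a single pass with exact ratios reproduces the target:

```latex
p_s(x, y) = f_s(x)\, g_s(y), \qquad p_t(x, y) = f_t(x)\, g_t(y)

w(x, y) = \frac{f_t(x)}{f_s(x)} \cdot \frac{g_t(y)}{g_s(y)}
        = \frac{p_t(x, y)}{p_s(x, y)}
\quad\Longrightarrow\quad
w(x, y)\, p_s(x, y) = p_t(x, y)
```

When the variables are correlated, the joint densities no longer factorize, the product of one-dimensional ratios is only an approximation to $p_t / p_s$, and further iterations (with the tuning mentioned above) are needed.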