scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

Parameter settings for Asimov dataset #1029

Open alexander-held opened 4 years ago

alexander-held commented 4 years ago

Description

It would be convenient to be able to obtain the parameter settings for the Asimov dataset of a given model. This works already in some cases:

workspace = pyhf.Workspace(spec)
model = workspace.model()

obs_data = workspace.data(model)  # to get the observed data
sometimes_asimov_data = model.expected_data(model.config.suggested_init())  # see below

As long as no parameter inits are set under measurements / config / parameters in the workspace, this returns the Asimov dataset. Otherwise it picks up the specified inits and returns a pseudodataset; when fitted, the MLE will sit at the initial parameter settings specified in the workspace.

Is your feature request related to a problem? Please describe.

Such a feature would clean up Asimov dataset generation.

Describe the solution you'd like

A new ModelConfig method that returns the Asimov parameter settings.

Describe alternatives you've considered

There might already be a reasonably clean way to achieve this. Other alternatives: overriding workspace properties (removing inits) to build the Asimov dataset, or building the Asimov parameters manually from the workspace. One could, for example, sum over samples per region to get the Asimov data yields and add something similar to model.expected_auxdata (see comment below).

Relevant Issues and Pull Requests

discussion in #1017, maybe #881

Additional context

workspace example:

{
    "channels": [
        {
            "name": "region",
            "samples": [
                {
                    "data": [
                        4.0
                    ],
                    "modifiers": [
                        {
                            "data": [
                                0.2
                            ],
                            "name": "staterror_region",
                            "type": "staterror"
                        },
                        {
                            "data": null,
                            "name": "mu",
                            "type": "normfactor"
                        }
                    ],
                    "name": "signal"
                }
            ]
        }
    ],
    "measurements": [
        {
            "config": {
                "parameters": [
                    {
                        "inits": [1.2],
                        "name": "staterror_region"
                    }
                ],
                "poi": "mu"
            },
            "name": "meas"
        }
    ],
    "observations": [
        {
            "data": [
                5.0
            ],
            "name": "region"
        }
    ],
    "version": "1.0.0"
}

The following script will print [4.8 1.2]. The Asimov data (with auxdata) should be [4.0, 1.0].

import json
import pyhf

with open("ws.json") as f:
    spec = json.load(f)

ws = pyhf.Workspace(spec)
model = ws.model()
print(model.expected_data(model.config.suggested_init()))
alexander-held commented 4 years ago

I just came across model.nominal_rates, which returns the "actual" data part needed for the Asimov dataset. The missing piece is the aux data, something similar to model.expected_auxdata but at the respective MLE of the aux measurement instead of the initial value from the workspace.

A sum like

np.sum(model.nominal_rates, axis=1)

returns the Asimov yields per bin.
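As a self-contained illustration (pure NumPy, with a mock array standing in for model.nominal_rates; the (1, n_samples, 1, n_bins) layout is an assumption made for this sketch, not a documented pyhf guarantee):

```python
import numpy as np

# Mock standing in for model.nominal_rates: two samples ("signal" and
# "background") in one region with three bins; the (1, n_samples, 1, n_bins)
# layout is assumed here for illustration.
nominal_rates = np.array([[[[4.0, 3.0, 2.0]],    # signal
                           [[10.0, 8.0, 6.0]]]])  # background

# Summing over the sample axis gives the Asimov yield per bin.
asimov_yields = np.sum(nominal_rates, axis=1)[0][0]
print(asimov_yields)  # → [14. 11.  8.]
```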

alexander-held commented 4 years ago

One more update, what I'm looking for is already possible with current pyhf:

import numpy as np

def build_Asimov(model):
    # per-bin Asimov yields: sum nominal rates over samples
    asimov_data = np.sum(model.nominal_rates, axis=1)[0][0]
    # auxiliary data at its nominal (best-fit) values
    asimov_aux = model.config.auxdata
    return np.hstack((asimov_data, asimov_aux))

This is sufficient for me. Feel free to close this if you want, though there still may be value in having this more cleanly accessible via the public API.

alexander-held commented 4 years ago

For a given model (a pyhf.pdf.Model), something like the following provides what I would call the "Asimov" parameters. It might be useful to include such functionality within pyhf.

auxdata_pars_all = []
for parameter in model.config.auxdata_order:
    auxdata_pars_all += [parameter]*model.config.param_set(parameter).n_parameters

asimov_parameters = []
for parameter in model.config.par_order:
    # indices in auxdata list that match the current parameter
    aux_indices = [i for i, par in enumerate(auxdata_pars_all) if par == parameter]
    if aux_indices:
        # pick up best-fit value from auxdata
        inits = [aux for i, aux in enumerate(model.config.auxdata) if i in aux_indices]
    else:
        # pick up suggested inits (for normfactors)
        inits = model.config.param_set(parameter).suggested_init
    asimov_parameters += inits
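To check the mapping logic above without a real workspace, the same loop can be run against a mock config (all names and numbers here are hypothetical: one unconstrained normfactor mu with workspace init 1.5, and a two-bin staterror constrained by auxdata):

```python
# Minimal mock of the pieces of model.config used by the mapping above.
class MockParamSet:
    def __init__(self, n_parameters, suggested_init):
        self.n_parameters = n_parameters
        self.suggested_init = suggested_init

class MockConfig:
    auxdata_order = ["staterror_region"]    # constrained params with auxdata
    par_order = ["mu", "staterror_region"]  # full parameter order
    auxdata = [1.0, 1.0]                    # aux measurement centers (gammas)
    _sets = {
        "staterror_region": MockParamSet(2, [1.0, 1.0]),
        "mu": MockParamSet(1, [1.5]),       # hypothetical workspace init
    }
    def param_set(self, name):
        return self._sets[name]

config = MockConfig()

# Expand auxdata_order into one entry per underlying parameter.
auxdata_pars_all = []
for parameter in config.auxdata_order:
    auxdata_pars_all += [parameter] * config.param_set(parameter).n_parameters

asimov_parameters = []
for parameter in config.par_order:
    aux_indices = [i for i, par in enumerate(auxdata_pars_all) if par == parameter]
    if aux_indices:
        # constrained: take the auxdata (best-fit) values
        inits = [aux for i, aux in enumerate(config.auxdata) if i in aux_indices]
    else:
        # unconstrained (normfactor): fall back to suggested inits
        inits = config.param_set(parameter).suggested_init
    asimov_parameters += inits

print(asimov_parameters)  # → [1.5, 1.0, 1.0]
```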
lukasheinrich commented 3 years ago

hi @alexander-held ,

I'm just trying to see how to add this feature. Asimov parameters are only defined with respect to a given dataset (or rather, the Asimov data is generated from specific best-fit parameters in the first place).

We already have: pyhf.infer.calculators.generate_asimov_data and it could be expanded to provide an API like

m = pyhf.simplemodels.hepdata_like([5], [50], [7])
data = m.expected_data(m.config.suggested_init())

asimov_data, asimov_pars = pyhf.infer.calculators.generate_asimov_data(
    0.0,
    data,
    m,
    return_parameters=True
)

print(asimov_data)
# [55.         51.02040816]
print(asimov_pars)
# [0.         1.04951738]

but this comes at the cost of a fit (though that seems unavoidable for the general case)

alexander-held commented 3 years ago

Hi @lukasheinrich, I believe the existing generate_asimov_data has a slightly misleading name. I've seen it used for various things in practice, so I will try to outline the different datasets involved.

Asimov dataset

As per https://arxiv.org/abs/1007.1727:

We define the Asimov data set such that when one uses it to evaluate the estimators for all parameters, one obtains the true parameter values.

My interpretation is that the MLEs of all parameters of a model fitted to the Asimov dataset should match the MLEs of these parameters in the auxiliary measurement (that is how I interpret "true"). In particular, Gaussian-constrained NPs should be fitted to 0 and gammas to 1. For normalization factors (and shapefactors), the situation is less clear.

Normalization factors

Without a MLE defined pre-fit, there is no single choice for what the MLE of a normalization factor should be when fitting the Asimov dataset. In practice I think 0 and 1 are often desired values to build a "signal plus background" or "background-only" Asimov dataset. Since there is no way in the pyhf workspace to define nominal values for normalization factors, I am treating the initial values (via init) in cabinetry as the nominal normalization factors for the purpose of Asimov dataset generation.

Shapefactors

I think the same considerations as for normalization factors apply in general, but I have not tried this out yet. In practice, most users may want to stick with 1 as nominal value.

Generating the Asimov dataset

Ignoring the normalization factor issue for a moment, there is one Asimov dataset for a given fit model. I can build it with pyhf.Model.expected_data(asimov_parameters), with asimov_parameters constructed as in the script above. This is almost the same as using pyhf.Model.nominal_rates, with the exception that nominal_rates has no way to set nominal values for normalization factors. I think it is fair to call both resulting datasets "Asimov".

A common use of the Asimov dataset is analysis optimization in signal + background fits while blinded.

Realistic pseudo-data

The dataset returned by pyhf.infer.calculators.generate_asimov_data is what I would call "realistic pseudo-data". It is obtained from a fit to data, so realistic in that sense, but it is not Asimov. The reason I think "Asimov" is misleading is the following:

If the dataset were Asimov, the two fits should return exactly the same parameters. In practice they generally do not. In the first fit, parameters get pulled to a value that is a compromise between improving agreement with data and limiting the likelihood penalty from constraint terms. This might e.g. result in a pull of a constrained parameter to 0.8. When fitting a dataset built with that parameter at 0.8, the pull in a fit to that pseudo-data will in general be smaller than 0.8 (due to the trade-off between constraint term penalty and better agreement with pseudo-data).
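This effect can be seen in a minimal toy (all numbers hypothetical): a single Poisson bin with one Gaussian-constrained parameter, minimized by a grid scan rather than a real minimizer:

```python
import numpy as np

# Toy single-bin model: expected yield nu(alpha) = 50 * (1 + 0.2 * alpha),
# with alpha constrained by a unit-Gaussian auxiliary measurement at 0.
def nll(alpha, observed):
    nu = 50.0 * (1.0 + 0.2 * alpha)
    # Poisson term (up to a constant) plus the Gaussian constraint penalty.
    return nu - observed * np.log(nu) + 0.5 * alpha**2

alphas = np.linspace(-4.0, 4.0, 80001)

# Fit to "data" of 70 events: alpha is pulled, but the constraint term
# stops it short of the value (2.0) that would match the data exactly.
alpha_fit = alphas[np.argmin(nll(alphas, 70.0))]

# Build pseudo-data at the best-fit point and refit. The auxiliary
# measurement is still centered at 0, so the refit lands strictly between
# 0 and alpha_fit instead of reproducing it: the pseudo-dataset is not
# Asimov for the original model.
pseudo_data = 50.0 * (1.0 + 0.2 * alpha_fit)
alpha_refit = alphas[np.argmin(nll(alphas, pseudo_data))]

print(alpha_fit, alpha_refit)  # roughly 1.23 and 0.78
```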

An example use case for this pseudo-data is a realistic estimate of sensitivity in analyses with significant pulls. In a first step, a background-only fit to control regions can be performed to obtain pseudo-data that is more realistic than the Asimov dataset, and then signal can be added to this pseudo-data to estimate the discovery significance with a more realistic background estimate.

This pseudo-data is the Asimov dataset for a different fit model, namely one where the auxiliary data is shifted to be centered around the best-fit values from the fit to data. I am not sure whether this is easily possible with pyhf, but could be a nice feature.

I see this "realistic pseudo-dataset" frequently referred to as "Asimov", but find it misleading. I might be wrong here, and am happy to be corrected.