alexander-held opened this issue 4 years ago
I just came across `model.nominal_rates`, which returns the "actual" data part needed for the Asimov dataset. The missing piece is the aux data: something similar to `model.expected_auxdata`, but at the respective MLE of the aux measurement instead of the initial value from the workspace. A sum like `np.sum(model.nominal_rates, axis=1)` returns the Asimov yields per bin.
One more update: what I'm looking for is already possible with current `pyhf`:

```python
import numpy as np

def build_Asimov(model):
    # nominal yields per bin (summed over samples), plus auxdata
    asimov_data = np.sum(model.nominal_rates, axis=1)[0][0]
    asimov_aux = model.config.auxdata
    return np.hstack((asimov_data, asimov_aux))
```
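For example, with the simple model used further down in this thread (a usage sketch; the printed values are an assumption based on the numbers quoted later):

```python
import pyhf

m = pyhf.simplemodels.hepdata_like([5], [50], [7])
print(build_Asimov(m))  # nominal yields plus auxdata, e.g. [55. 51.02040816]
```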
This is sufficient for me. Feel free to close this if you want, though there still may be value in having this more cleanly accessible via the public API.
For a given `model: pyhf.pdf.Model`, something like the following provides what I would call the "Asimov" parameters. It might be useful to include such functionality within `pyhf`.

```python
# map each entry in the auxdata list to the parameter it belongs to
auxdata_pars_all = []
for parameter in model.config.auxdata_order:
    auxdata_pars_all += [parameter] * model.config.param_set(parameter).n_parameters

asimov_parameters = []
for parameter in model.config.par_order:
    # indices in auxdata list that match the current parameter
    aux_indices = [i for i, par in enumerate(auxdata_pars_all) if par == parameter]
    if aux_indices:
        # pick up best-fit value from auxdata
        inits = [aux for i, aux in enumerate(model.config.auxdata) if i in aux_indices]
    else:
        # pick up suggested inits (for normfactors)
        inits = model.config.param_set(parameter).suggested_init
    asimov_parameters += inits
```
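The resulting `asimov_parameters` can then be turned into the corresponding dataset (a usage sketch, assuming `model` is the same `pyhf.pdf.Model`):

```python
# expected_data returns the main data with the auxdata appended
asimov_data = model.expected_data(asimov_parameters)
```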
hi @alexander-held,
I'm just trying to see how to add this feature. Asimov parameters are only defined with respect to a given dataset (or rather, the Asimov data is generated from specific best-fit parameters in the first place).
We already have `pyhf.infer.calculators.generate_asimov_data`, and it could be expanded to provide an API like
```python
import pyhf

m = pyhf.simplemodels.hepdata_like([5], [50], [7])
data = m.expected_data(m.config.suggested_init())
asimov_data, asimov_pars = pyhf.infer.calculators.generate_asimov_data(
    0.0,
    data,
    m,
    return_parameters=True,  # the proposed addition
)
print(asimov_data)  # [55.         51.02040816]
print(asimov_pars)  # [0.         1.04951738]
```
but this comes at the cost of a fit (though that seems unavoidable for the general case)
Hi @lukasheinrich, I believe the existing `generate_asimov_data` has a slightly misleading name. I've seen it used for various things in practice, but I will try to outline the different datasets involved.
As per https://arxiv.org/abs/1007.1727:

> We define the Asimov data set such that when one uses it to evaluate the estimators for all parameters, one obtains the true parameter values.
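In symbols (a brief restatement of that definition, with $\nu_i$ denoting the expected yield in bin $i$ and $n_\mathrm{A}$ the Asimov dataset): each bin is set to its expectation value, so the estimators evaluated on the Asimov dataset return the assumed parameter values,

$$n_{i,\mathrm{A}} = E[n_i] = \nu_i(\boldsymbol{\theta}), \qquad \hat{\boldsymbol{\theta}}(n_\mathrm{A}) = \boldsymbol{\theta}.$$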
My interpretation of this is that the MLE of all parameters of a model fitted to the Asimov dataset should be equivalent to the MLE of these parameters in the auxiliary measurement (that is how I interpret "true"). In particular, Gaussian-constrained NPs should be fitted to 0 and gammas to 1. For normalization factors (and shapefactors), the situation is less clear.
Without an MLE defined pre-fit, there is no single choice for what the MLE of a normalization factor should be when fitting the Asimov dataset. In practice, I think 0 and 1 are often the desired values, to build a "signal plus background" or "background-only" Asimov dataset. Since there is no way in the `pyhf` workspace to define nominal values for normalization factors, I am treating the initial values (via `init`) in `cabinetry` as the nominal normalization factors for the purpose of Asimov dataset generation.
I think the same considerations apply to shapefactors as well, but I have not tried this out yet. In practice, most users may want to stick with 1 as the nominal value.
Ignoring the normalization factor issue for a second, there is one Asimov dataset for a given fit model. I can build it with `pyhf.Model.expected_data(asimov_parameters)`, obtaining `asimov_parameters` as in the snippet above. This is almost the same as using `pyhf.Model.nominal_rates`, with the exception that `nominal_rates` has no way to set nominal values for normalization factors. I think it is fair to call both resulting datasets "Asimov".
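A minimal sketch of the two constructions side by side (assuming `model` is a `pyhf.pdf.Model` and `asimov_parameters` was built as above):

```python
import numpy as np

# via parameters: normalization factors enter at their nominal/init values
asimov_from_pars = model.expected_data(asimov_parameters)

# via nominal rates: equivalent only if all normalization factors are at 1
asimov_from_rates = np.hstack(
    (np.sum(model.nominal_rates, axis=1)[0][0], model.config.auxdata)
)
```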
A common use of the Asimov dataset is analysis optimization in signal + background fits while blinded.
The dataset returned by `pyhf.infer.calculators.generate_asimov_data` is what I would call "realistic pseudo-data". It is obtained from a fit to data, so it is realistic in that sense, but it is not Asimov. The reason I think "Asimov" is misleading is the following: `pyhf.infer.calculators.generate_asimov_data` first fits the provided data, obtaining best-fit parameters $\hat{\theta}$, and then builds a new dataset corresponding to the best-fit model, using $\hat{\theta}$. If that dataset were Asimov, a fit to it should return exactly the same parameters as the fit to the original data. In practice the two fits generally do not agree. In the first fit, parameters get pulled to a value that is a compromise between improving agreement with data and limiting the likelihood penalty from constraint terms. This might e.g. result in a pull of a constrained parameter to 0.8. When fitting a dataset built with that parameter at 0.8, the pull will in general be smaller than 0.8 (due to the trade-off between the constraint term penalty and better agreement with the pseudo-data).
An example use case for this pseudo-data is a realistic estimate of sensitivity in analyses with significant pulls. In a first step, a background-only fit to control regions can be performed to obtain pseudo-data that is more realistic than the Asimov dataset, and then signal can be added to this pseudo-data to estimate the discovery significance with a more realistic background estimate.
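A hypothetical sketch of this workflow with existing `pyhf` APIs (assuming `model` and `data`, including auxdata, are defined; restricting the fit to control regions is left out for brevity):

```python
import pyhf

# background-only fit: fix the POI to 0 and fit the remaining parameters
pars = pyhf.infer.mle.fixed_poi_fit(0.0, data, model)
# pseudo-data from the best-fit model (main data plus shifted auxdata)
pseudo_data = model.expected_data(pars)
```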
This pseudo-data is the Asimov dataset for a different fit model, namely one where the auxiliary data is shifted to be centered around the best-fit values from the fit to data. I am not sure whether this is easily possible with `pyhf`, but it could be a nice feature.
I see this "realistic pseudo-dataset" frequently referred to as "Asimov", but find it misleading. I might be wrong here, and am happy to be corrected.
Description
It would be convenient to be able to obtain the parameter settings for the Asimov dataset of a given `model`. This works already in some cases, along the lines of the sketch below.
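A minimal sketch (the use of `suggested_init` here is an assumption based on the surrounding description, not a quote from the original report):

```python
# sketch: the suggested initial parameter values double as the Asimov
# parameter settings when the workspace sets no custom inits
asimov_parameters = model.config.suggested_init()
asimov_data = model.expected_data(asimov_parameters)  # main data + auxdata
```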
As long as no parameter `inits` are used in `measurements/config/parameters` in the workspace, this returns the Asimov dataset. Otherwise it picks up the specified inits and returns a pseudo-dataset; when fitted, the MLE will be at the initial parameter settings specified in the workspace.
Is your feature request related to a problem? Please describe.
Such a feature would clean up Asimov dataset generation.
Describe the solution you'd like
A new `ModelConfig` method that returns the Asimov parameter settings.
Describe alternatives you've considered
There might already be a reasonably clean way to achieve what I am looking for. Other alternatives: overriding workspace properties (removing inits) to build the Asimov dataset, or building the Asimov parameters manually from the workspace. One could for example sum over samples per region to get the Asimov data yields, and add something similar to `model.expected_auxdata` (see the comments above).
Relevant Issues and Pull Requests
discussion in #1017, maybe #881
Additional context
Workspace example: the following script will print `[4.8 1.2]`, while the Asimov data (with auxdata) should be `[4.0, 1.0]`.