scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

simplified likelihoods and pyhf #1359

Open WolfgangWaltenberger opened 3 years ago

WolfgangWaltenberger commented 3 years ago

Discussion - simplified likelihoods

To get the discussion started, let me open up this ticket. As we have already discussed a bit in other channels, for SModelS we would actually like to adopt pyhf also for the (CMS) simplified likelihoods (SL) we have in our database. To remind ourselves: the CMS simplified likelihoods (v1.0) were of the form "Poisson * Gaussian", where a single multivariate Gaussian was used to describe all nuisances, collapsed to one "enveloping" nuisance with a single big covariance matrix, and the Poissons (one per signal region) were used to model our counting variables, as always. Ideally, in the long run, we would kick out our own SL implementation and use pyhf for that task as well. We at SModelS could of course work on that from our side, as all the necessary functionality is already in place -- except for, possibly, performance tweaks within pyhf. But since you at pyhf seem to have similar ambitions anyway, you might just go ahead and implement all that is needed, or we team up, or whatevs. Let the discussion begin.
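Schematically (just the structure described above written out, with n_i the observed counts, s_i and b_i the signal and background expectations, theta_i the collapsed nuisance in signal region i, and Sigma the big covariance matrix):

L(mu, theta) = prod_i Pois(n_i | mu * s_i + b_i + theta_i) * N(theta | 0, Sigma)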

kratsg commented 3 years ago

Do you mind pointing at an example specification of the CMS simplified likelihood? I thought it was in the Combine card format (#344) but perhaps it was a custom format? There are pieces of pyhf which are somewhat HiFa JSON independent, so this is not the craziest idea... but it would be nice to see what things look like, so we can probably guide towards what the pyhf API is intended to support?

alexander-held commented 3 years ago

Since multivariate Gaussians are not part of the HistFactory model, I assume this would either require something like https://github.com/scikit-hep/pyhf/pull/1188 (if the pdf is still supposed to be built by pyhf), or the pdf could be built externally (e.g. via a tool that translates HistFactory workspace -> simplified likelihood) and fit via pyhf.infer.

WolfgangWaltenberger commented 3 years ago

> Do you mind pointing at an example specification of the CMS simplified likelihood? I thought it was in the Combine card format (#344) but perhaps it was a custom format? There are pieces of pyhf which are somewhat HiFa JSON independent, so this is not the craziest idea... but it would be nice to see what things look like, so we can probably guide towards what the pyhf API is intended to support?

Ok, so I personally do not have a Combine card -- in SModelS all we use is the covariance matrix plus the number of observed and expected events per signal region. Nick Wardle (@nucleosynthesis) is the person to ask. Of course he cannot give out the real-life Combine data card, but he might help guide the development here?

nucleosynthesis commented 3 years ago

In CMS, we never use Combine for the simplified likelihood (but we do use it to generate the inputs). In terms of pyhf, I think this just means putting in the parameterisation for the bin content and adding the Gaussian constraint term (I don't think one needs any fancy class, it's just a matrix multiplication of course).

Would be super nice if the code here: https://gitlab.cern.ch/SimplifiedLikelihood/SLtools can be re-used. In CMS we already have a way to go from Combine datacard -> model.py/HepData format -> SLtools, so if the intermediate format needs to be modified, it should be easy enough.

Even further, it would be really nice if pyhf could implement the simplification in the form as it is in the code above. It's more accurate than correlations alone and doesn't require much additional input.
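To make the "just a matrix mult" point concrete, here is a minimal sketch of the SL v1 negative log-likelihood in plain NumPy/SciPy (this is not the pyhf or SLtools API; the function name and inputs are illustrative):

import numpy as np
from scipy.stats import poisson

def sl_nll(mu, theta, n_obs, s, b, cov):
    # expected counts per signal region: signal + background + collapsed nuisance
    lam = mu * s + b + theta
    nll = -poisson.logpmf(n_obs, lam).sum()
    # multivariate Gaussian constraint term: indeed just a matrix multiplication
    # (dropping constants that do not depend on theta)
    nll += 0.5 * theta @ np.linalg.inv(cov) @ theta
    return nll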

WolfgangWaltenberger commented 3 years ago

To keep this thread going, trying to obtain a CMS-style simplified likelihood from a pyhf likelihood is still a desirable goal, IMHO. In SLs, we "collapse" all nuisances to a single nuisance described by a multivariate Gaussian. So we would need the correlations of the "collapsed" nuisance between the signal regions, not between the sources of the nuisances.

I see how I can get the latter, i.e. the correlations between the individual nuisances:

result, result_obj = pyhf.infer.mle.fit(
    asimov, pdf, return_result_obj=True, return_correlations=True, do_grad=True
)
correlations = result_obj.corr

But I still fail to see how I can get the correlations between the signal regions. Any ideas @kratsg @lukasheinrich @matthewfeickert ? Thinking wildly, I imagine that one can somehow extract how a given nuisance parameter affects a given signal region, and from this info plus the above "nuisance" correlation matrix I can fit the "signal region" covariance matrix between the signal regions? Something along these lines?
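One concrete first-order version of this would be plain linear error propagation (a sketch, not something established in this thread): with signal-region yields y_i(theta) and the Jacobian J_ij = dy_i/dtheta_j evaluated at the best-fit point,

cov(y) ≈ J Sigma_theta J^T

where Sigma_theta is the parameter covariance matrix from the fit. The MC-toy approach suggested in the next comment sidesteps computing J explicitly.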

JFYI, the document of the simplified likelihoods v1 is here: https://cds.cern.ch/record/2242860?ln=en

cheers

Wolfgang

nucleosynthesis commented 3 years ago

Hi Wolfgang,

I don't know specifically how one could do this in pyhf (syntax-wise), but another route (which is rather straightforward) would be to use the known correlations between nuisance parameters (from the fit), and how they affect each signal region, to generate the correlation (or rather covariance) between signal regions via MC toys. It should be easy to simply randomize the parameters (according to correlations = result_obj.corr) and evaluate the total expected yield in each SR across the toys (does that make sense?)

Note, it would also be great to include the 3rd moment diagonal vector a la SL v2 as an option too!

Best, Nick


WolfgangWaltenberger commented 3 years ago

Hey Nick,

yes, that's a good idea! I wouldn't mind working on this myself, but I currently do not know how to get information about how the nuisances affect the signal regions. For this I hope for some help from the pyhf devs. Then, yes, I could just throw toys. Agreed on the 3rd moment, I just wanted to keep things simple for starters ;)

Wolfgang

alexander-held commented 3 years ago

pyhf.pdf.Model.expected_data returns the model prediction (yields and optionally also auxiliary data) for a point in parameter space. Is this what you are looking for? The output can be split into regions via a small utility function like this: https://github.com/alexander-held/cabinetry/blob/d888ec054e080f474a12de6de375857631c55987/src/cabinetry/model_utils.py#L154-L173
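A minimal version of such a splitter, using only pyhf (model.config.channels and model.config.channel_nbins both exist in pyhf; the function itself is just a sketch):

def split_by_channel(model, prediction):
    """Split a flat expected_data(..., include_auxdata=False) result per channel."""
    split, start = {}, 0
    for channel in model.config.channels:
        nbins = model.config.channel_nbins[channel]
        split[channel] = prediction[start : start + nbins]
        start += nbins
    return split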

WolfgangWaltenberger commented 3 years ago

Thanks Alexander. I am thinking maybe I am too focused on using the correlation matrix. Maybe I should have MC sampling more in mind, as Nick said. So how can I create fake observations from a pyhf model? I could use these to estimate the signal-region covariance matrix from the sample covariance... Sorry for my ignorance about all this.

Wolfgang


alexander-held commented 3 years ago

Hi Wolfgang, here is an example with a simple model. I get best-fit parameter results + covariance from an MLE fit, and then sample parameters from a multivariate Gaussian. The model is subsequently evaluated for all sampled parameter points. This can be used, for example, to evaluate post-fit yield uncertainties via bootstrapping. I think it also allows the yield covariance calculation as described by Nick, though I am currently not completely sure that this is indeed the intended quantity:

import numpy as np
import pyhf

pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))

# set up a simple model
model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [62, 63] + model.config.auxdata

# MLE fit to get best-fit results and covariance matrix
result, result_obj = pyhf.infer.mle.fit(
    data, model, return_uncertainties=True, return_result_obj=True
)

# sample parameters from multivariate Gaussian and evaluate model
sampled_parameters = np.random.multivariate_normal(
    result_obj.minuit.values, result_obj.minuit.covariance, size=10000
)
model_predictions = [
    model.expected_data(p, include_auxdata=False) for p in sampled_parameters
]

yields = np.mean(model_predictions, axis=0)
yield_unc = np.std(model_predictions, axis=0)
print(f"model prediction: {yields} +/- {yield_unc}")

print(f"covariance:\n{np.cov(model_predictions, rowvar=False)}")
print(f"correlation:\n{np.corrcoef(model_predictions, rowvar=False)}")

output example:

model prediction: [62.07857537 63.12867452] +/- [6.56092723 6.24693401]
covariance:
[[43.05007114 20.9128427 ]
 [20.9128427  39.02808728]]
correlation:
[[1.         0.51019653]
 [0.51019653 1.        ]]
lukasheinrich commented 3 years ago

thanks @alexander-held - maybe we can convert this into a backend-independent language. What APIs would be needed? Additional kwargs like return_covariance?

alexander-held commented 3 years ago

Yes, return_covariance should be all that is needed (return_correlations already exists).
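A sketch of what a call could then look like (return_covariance is the kwarg being proposed here, i.e. hypothetical at the time of writing; only return_correlations exists):

fit_output = pyhf.infer.mle.fit(
    data,
    model,
    return_correlations=True,  # already exists
    return_covariance=True,    # proposed in this thread, not yet implemented
)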

nucleosynthesis commented 3 years ago

One minor thing to be careful of (and I'm not sure we need to care about it yet): when using the simplified likelihood (a la the CMS note), a distinction is made between the control and signal regions, in that the fit from which the nuisance parameters (and their correlations) are determined only uses data from the control regions. The reason is that the data in the signal region are used in the inference step, and this way one doesn't double count the data in the signal region (in CMS we call this "masking" parts of the likelihood, such that the expected bin yields can be evaluated without explicitly being included in the fit).

Now if Wolfgang needs the covariance between all regions then you can ignore this caveat.
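For reference, one way to implement this "masking" with pyhf would be to prune the signal-region channels from the workspace before the fit, so the parameter covariance is determined by the control regions alone. A sketch, assuming ws is the pyhf.Workspace for the analysis (pyhf.Workspace.prune exists; the channel names below are placeholders):

# drop the signal regions, fit only the control regions
cr_only_ws = ws.prune(channels=["SR1cut_cuts", "SR2cut_cuts"])
cr_model = cr_only_ws.model()
cr_data = cr_only_ws.data(cr_model)
# fit cr_model to cr_data, then evaluate the full model's SR yields with the
# sampled parameters (mapping fitted parameters back onto the full model
# requires matching parameter names)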


WolfgangWaltenberger commented 3 years ago

You are right, Nick, I need the covariance of the SRs only. But this shouldn't make a huge difference, should it? Anyway, some progress. I take SUSY-2018-04. It has two signal regions, SRlow with an expected yield of 6 +/- 1.7 (observed 10) and SRhigh with an expected yield of 10.2 +/- 3.3 (observed 7). See https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-04/tab_06.png

I patch the model with a signal at high stau masses, so the signal should not play much of a role:

jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/patch.DS_440_80_Staus.json > test.json

(I also tried pyhf patchset apply, but that failed for me.)

I retrieve the channel names:

In []: channelnames = [x["name"] for x in ws["channels"]]
Out[]: ['QCR1cut_cuts', 'QCR2cut_cuts', 'SR1cut_cuts', 'SR2cut_cuts', 'WCRcut_cuts']

From these names I expect to be interested in rows number 2 and 3 (0-indexed).

I run Alexander's code (thanks a lot!!) and look at the yields:

Out[]: array([ 309.42967087, 12.37700728, 246.4054857 , 7.0491261 , 1108.5018055 ])

Comparing the yields with the expectation values, it however seems like rows 1 and 3 are the signal-region yields -- they would look roughly right. In the covariance matrix I would expect numbers close to 1.7² ≈ 3 and 3.3² ≈ 10 somewhere along the diagonal. But I see much larger numbers.

In []: np.diag(np.cov(model_predictions, rowvar=False))
Out[]: array([2.51467088e+07, 1.24868271e+02, 2.18699298e+07, 1.49262233e+01, 1.45148174e+03])

(I assume the Poissonian errors of the counting variables do not enter here. But even if they do, I am still quite far off.)

So there is still something I do not yet understand, I guess. Anyway, thanks for all the help so far.

Cheers

Wolfgang

WolfgangWaltenberger commented 3 years ago

These would be the files.

Wolfgang

SL.zip

kratsg commented 3 years ago

Hi @WolfgangWaltenberger, use ws.channels or model.config.channels to get the right channel name ordering.

alexander-held commented 3 years ago

The yield uncertainties should indeed be recovered from the diagonal of this yield covariance matrix via np.sqrt(np.diag(np.cov(model_predictions, rowvar=False))). It works fine in the toy setup I posted above. Looking at the numbers in @WolfgangWaltenberger's example, the yields are

[ 309.42967087, 12.37700728, 246.4054857 , 7.0491261 , 1108.5018055 ]

and the variances

[2.51467088e+07, 1.24868271e+02, 2.18699298e+07, 1.49262233e+01, 1.45148174e+03]

are much too large to be compatible with those yields. Is it possible that the fit did not converge, and the parameter covariances reported from the fit are nonsensical?

The Poisson uncertainty component enters here as well; these numbers are the full post-fit uncertainty including statistical and systematic effects.

nucleosynthesis commented 3 years ago

Just to come back to an earlier point (not related to the technical part): Wolfgang, when I said you mustn't include the signal region bins in the fit, I meant that if you

1) include the signal region in the fit, and then obtain the covariance between signal regions (or whatever regions), and
2) use the covariance from 1) as input to the SL,

you will double count the data in the signal regions and not get the right thing. Thus in 1) you need to ignore the signal regions when fitting, to avoid the double count.

Hope it's clear, and sorry to interrupt the technical discussion,

Nick


WolfgangWaltenberger commented 3 years ago

@alexander-held result_obj.success is True, so it seems that MINUIT believes it converged. @nucleosynthesis thanks, and yes, you made yourself clear. What I mean is: assuming the yields in the signal regions are much smaller than the yields in the control regions, fitting the signal regions as well as the CRs should not have a big impact on the fit result, no?

Will keep playing around with the code snippet.

Wolfgang

alexander-held commented 3 years ago

Hi @WolfgangWaltenberger, even though the fit is reported as successful, at least in my version of it, using the files you linked above, I see that MINUIT reports that a parameter hit its boundary (the signal strength in this example, capped at 0). This will make the results unreliable (unrelated to this discussion: maybe such a case should not be flagged as a success?).

In this specific example there are some very strong anti-correlations between mu_QCD / mu_W and the gammas in two regions. This may make the estimation via sampling a bit unstable. I have not run any likelihood scans for the parameters; if the strongly correlated parameters are very non-Gaussian, then I think this approach will quickly run into limitations. For the WCRcut_cuts region, the yield of 1109 +/- 39 via sampling is fairly close to the 1099.63 ± 33.27 obtained via cabinetry.model_utils.calculate_stdev. The discrepancies in the other regions are much larger.
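For reference, the cabinetry cross-check mentioned above looks roughly like this (signature of calculate_stdev as in the cabinetry version linked earlier in this thread; treat as a sketch, not a verified call):

import numpy as np
import cabinetry

# Gaussian error propagation of parameter uncertainties through the model
stdev = cabinetry.model_utils.calculate_stdev(
    model,
    np.asarray(result_obj.minuit.values),
    np.asarray(result_obj.minuit.errors),
    np.asarray(result_obj.minuit.covariance.correlation()),
)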

WolfgangWaltenberger commented 3 years ago

Alright. I allowed for negative signal strengths in the fit (as I use it only to extract the covariance matrix), sampled a bit longer, assumed that the order of the yields and covariance matrix is not the one given in model.config.channels, and ended up with a covariance matrix that actually looks realistic. The procedure is clearly not yet "correct" (Nick's caveat still applies, plus some more issues regarding the correlations), but numerically it is no longer completely bonkers either.

[image: comparison]

#!/usr/bin/env python3

import cabinetry
import numpy as np
import pyhf

pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))
jsonfile = "example.json"

def download():
    import os, subprocess
    if os.path.exists(jsonfile):
        return
    from pyhf.contrib.utils import download
    download(
        "https://www.hepdata.net/record/resource/1406212?view=true",
        "SUSY-2018-04_likelihoods/",
    )
    cmd = "jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/patch.DS_440_80_Staus.json"
    cmd = "jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/test.json"
    cmd += f" > {jsonfile}"
    subprocess.getoutput(cmd)

download()

ws = cabinetry.workspace.load(jsonfile)
model, data = cabinetry.model_utils.model_and_data(ws)
channels = model.config.channels

# allow negative signal strengths; the fit is only used to extract the covariance
muSigIndex = model.config.parameters.index("mu_SIG")
suggestedBounds = model.config.suggested_bounds()
suggestedBounds[muSigIndex] = (-10.0, 10.0)

result, result_obj = pyhf.infer.mle.fit(
    data, model, return_uncertainties=True, return_result_obj=True,
    par_bounds=suggestedBounds,
)

sampled_parameters = np.random.multivariate_normal(
    result_obj.minuit.values, result_obj.minuit.covariance, size=50000
)
model_predictions = [
    model.expected_data(p, include_auxdata=False) for p in sampled_parameters
]

# flag parameters that ended up at one of their bounds
for i, name in enumerate(model.config.parameters):
    fit = result_obj.minuit.values[i]
    bound = model.config.suggested_bounds()[i]
    if abs(fit - bound[0]) < 1e-5 or abs(fit - bound[1]) < 1e-5:
        print(f"Fitted value {fit} of {name} hit bound {bound}")

yields = np.mean(model_predictions, axis=0)
yield_unc = np.std(model_predictions, axis=0)
print("model predictions:\n")
for i, channel in enumerate(channels):
    print(f" -- {channel}: {yields[i]:.2f}+/-{yield_unc[i]:.2f}")

np.set_printoptions(precision=3)
ncov = np.cov(model_predictions, rowvar=False)
print(f"covariance:\n{ncov}")
print(f"correlation:\n{np.corrcoef(model_predictions, rowvar=False)}")

indices = np.array([2, 3])
print(f"covariance of SRs (2,3):\n{ncov[indices[:, None], indices]}")

indices = np.array([1, 3])  # indices of signal regions
scov = ncov[indices[:, None], indices]
print(f"covariance of SRs (1,3):\n{scov}")
# subtract the Poissonian variance of the counts from the diagonal
for i in indices.tolist():
    ncov[i][i] = ncov[i][i] - yields[i]
scov = ncov[indices[:, None], indices]
print(f"covariance of SRs w/o Poissonian (1,3):\n{scov}")

import IPython
IPython.embed(using=False)

WolfgangWaltenberger commented 3 years ago

Now the first open issue I would like to tackle is the question of the ordering of the channels. Why does that change? Next, I would like to fit without also fitting the signal regions.

alexander-held commented 3 years ago

> assumed that the order of the yields and covariance matrix is not the one given in model.config.channels

The order in model.config.channels is "correct" in the sense that it is used consistently within pyhf. This is the order in which things like expected_data return the yields. The order in the workspace itself (ws["channels"]) does not matter; pyhf sorts channels alphabetically when creating a Workspace instance:

https://github.com/scikit-hep/pyhf/blob/03b914b1aa9006d8aa8ca5ecbbd2e6b6e6bdd1ce/src/pyhf/mixins.py#L41
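So the safe pattern is to always index yields and covariance rows via pyhf's own ordering, e.g. (a trivial illustration):

for i, channel in enumerate(model.config.channels):
    print(i, channel)  # row/column i of the covariance corresponds to this channel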

WolfgangWaltenberger commented 3 years ago

Hm, then I don't know what's wrong. I do take the names from model.config.channels.

Wolfgang


WolfgangWaltenberger commented 3 years ago

[image: comparison]

It's a bit puzzling: I know I am not yet doing it right, but the results actually already look quite good. A rare problem.