Open kratsg opened 6 years ago
After having some discussions at the US LUA meeting I think that we might want to talk with Josh Bendavid (@bendavid) about this.
Since this is still an open issue, let me mention that on 25.6 there will be a meeting including Josh and others to talk about the implementations of binned/template pdfs in order to move the community in this niche closer together.
Tagging @mattbellis and @benkrikler here, given the email conversation that Matt started (thanks Matt!) RE: what would need to be done to extend the HistFactory JSON v1.0.0 schema to allow translation of CMS Combine cards. @mattbellis provided a toy Combine card that we can start with. @benkrikler had started some discussion on this front at CHEP 2019, so his thoughts and input are very welcome here too.
I'll assign myself on this for now, since I have a small side project that is looking into Combine (part of my SUSY role in ATLAS) and want to explore some code for this.
So it seems that CMS has added some rather complete tutorials that describe the Combine model (HT @kpedro88):
Together with #1188, it should be much more straightforward to build a Combine-like model.
page @alexander-held
I was curious about the possibility of converting datacards into pyhf workspaces and wrote a small utility https://github.com/alexander-held/datacard-to-pyhf. I do not know much about CMS Combine and the datacard format, so the implementation likely has a range of issues. The most glaring one is that it only supports single-bin channels (and no shape systematics) at the moment. It runs fine with the toy example from above, resulting in a best-fit of `r = 0.9040 -0.2753 +0.3202`. The paper reports 0.93 +0.26 −0.23 (stat.) +0.13 −0.09 (syst.) in the abstract. With another simple example, I do not see perfect agreement between the fit with Combine and pyhf (via MINUIT) either, so there are probably other differences to be understood.
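To make the single-bin case concrete, here is a minimal sketch of what a converted workspace could look like in the HistFactory JSON layout. It is not taken from the datacard-to-pyhf utility; the function name `single_bin_workspace`, the sample name `signal`, the measurement name `meas`, and all yields are hypothetical and chosen only for illustration.

```python
def single_bin_workspace(channel, observed, signal, backgrounds, poi="r"):
    """Build a HistFactory JSON workspace dict for one single-bin channel.

    backgrounds: mapping of process name -> expected yield (hypothetical inputs).
    """
    samples = [
        {
            "name": "signal",
            "data": [float(signal)],
            # Unconstrained signal-strength scaling, named "r" as in the fit above.
            "modifiers": [{"name": poi, "type": "normfactor", "data": None}],
        }
    ]
    for name, rate in backgrounds.items():
        # No systematics attached here; constrained modifiers would go in this list.
        samples.append({"name": name, "data": [float(rate)], "modifiers": []})

    return {
        "channels": [{"name": channel, "samples": samples}],
        "observations": [{"name": channel, "data": [float(observed)]}],
        "measurements": [
            {"name": "meas", "config": {"poi": poi, "parameters": []}}
        ],
        "version": "1.0.0",
    }


# Illustrative numbers only, not from any real datacard.
spec = single_bin_workspace("bin1", observed=12, signal=3.0, backgrounds={"bkg": 10.0})
```

If pyhf is installed, a dict like this can be handed to `pyhf.Workspace(spec)` to validate it against the schema before fitting.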
awesome that's a great start.. taking on the simplest example and successively adding features was also pyhf's approach in general. tagging also @clelange
In case it's helpful, @andrzejnovak put together a conda recipe for Combine in https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/pull/648, though it is still Python 2 :(
Are you able to run the "standalone" (works on a CernVM) version of Combine [1]? That might help better compare the expected result from Combine vs pyhf (since the tutorial cards, even the more advanced ones, are not identical to the real thing in the papers).
[1] http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#standalone-version-of-combine
Could a standalone Docker image also be possible? Having no CVMFS dependence at all would be useful to allow running validations anywhere.
@alexander-held check the PR Nick linked
Having the conda version available is great! I view a ready-to-use Docker image as complementary to that (I guess with conda there is compilation involved?).
Sure, just wanted to point out that with the conda env you can build the image on the fly as well without having to access cvmfs when compiling stuff.
Any way to run standalone is fine. I'm not sure how well synched the version with conda env is with the main branch (the 102x vs 112x), but for this I don't think it really matters too much. Just wasn't sure whether the comparison by @alexander-held was a direct comparison of a combine run or not.
@nucleosynthesis Yes, for the comparison with `pyhf` I was running `Combine` on lxplus. I saw small differences for these example datacards when comparing a result obtained with `Combine` to a result obtained by converting the model to `pyhf` and then minimizing the resulting HistFactory version of the model. The differences were small enough for me to be confident that the conversion is generally working, but slightly larger than what I would expect purely from slight differences in minimization. They probably come down to things like interpolation algorithm differences.
While writing this comment I noticed one discrepancy: my `lnN` treatment for a value `1.2` would use `0.8` for -1σ, but I should use `1/1.2`. I will give this a try.
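The fix described here can be sketched as follows. A log-normal uncertainty κ scales the yield by κ^θ, so +1σ gives κ and -1σ gives 1/κ rather than 2 - κ. The helper name `lnn_to_normsys` and the modifier name `lumi` are hypothetical; mapping onto a `normsys` modifier is an assumption about how the conversion is done, and this sketch does not address whether pyhf's interpolation between the two points matches Combine's.

```python
def lnn_to_normsys(name, kappa):
    """Translate a symmetric lnN value into a normsys modifier dict."""
    if kappa <= 0:
        raise ValueError("lnN value must be positive")
    return {
        "name": name,
        "type": "normsys",
        # +1 sigma scales by kappa, -1 sigma by 1/kappa (not 2 - kappa).
        "data": {"hi": kappa, "lo": 1.0 / kappa},
    }


mod = lnn_to_normsys("lumi", 1.2)
naive_lo = 2.0 - 1.2          # the mirrored value used before the fix
correct_lo = mod["data"]["lo"]  # 1/1.2
```

For κ = 1.2 the two choices differ by about 0.03 in the -1σ scale factor, which is the kind of small shift that could contribute to the fit differences mentioned above.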
Would you recommend the `CombineHarvester` Python API for datacard parsing? I remember looking at it last month but did not know how complete and up-to-date it is. I think the biggest challenge in creating the corresponding `pyhf` model for a given datacard is figuring out how to correctly parse the datacard format.
Is there a good place to ask technical questions about Combine model building details (as a non-CMS member)?
CombineHarvester is probably overkill, though it should be up to date. A while ago we made a Python dumping option in the datacard parser, `--dump-datacard`, which prints to stdout an equivalent Python script that can be run to do the same thing as running `text2workspace.py` over the datacard. That might be helpful to see what's being mapped to what (see here: http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/settinguptheanalysis/#automatic-production-of-datacards-and-workspaces )
For discussions / Q's to the combine team, probably the easiest thing is to submit an issue here and add the label "question" for now.
I had not noticed `--dump-datacard` before, but this looks super useful. Thanks a lot!
Just dumping here that @ajgilbert gave a session on Combine (:+1:) at the first hands-on workshop on publication of statistical models where the last 4 slides are relevant to pyhf and Combine interop and probability model preservation.
There's a CMS Top workshop taking place this week where Combine will be discussed. It is at a time that I can't attend, but I'm going to try to reach out to the speaker(s) to see if there's any interest in understanding how / if we can create examples comparing and contrasting combine and pyhf.
Building some hypersimple comparison examples is on a very long to-do list of mine. :)
Hi! I don't know if this can be useful, but a while ago I started working on @alexander-held's repo with the intention of adding support for shape-based analysis datacards. You can find it here and the output can be tested e.g. with this datacard.
A few huge disclaimers:
`shape` with entries `up` and `down` containing the bin values for each modifier - clearly incompatible with a pyhf analysis flow)
Question
CMS uses a tool called Combine which is built on top of RooStats/RooFit.
It seems very possible, as it appears that CMS' workspace is defined as a plaintext file called a `datacard`, to be able to provide a `datacard2json` tool to translate the datacard into something usable by `pyhf`.
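As a rough illustration of what the first step of such a tool might involve, here is a minimal sketch of a parser for a toy counting-experiment datacard. It assumes the simple layout with `bin`, `observation`, `process` and `rate` rows, with the process-name row coming before the process-index row. It ignores shapes and anything beyond plain systematic rows, and the function name `parse_counting_datacard` and the toy card are hypothetical.

```python
TOY_CARD = """
imax 1
jmax 1
kmax 2
---------------
bin bin1
observation 12
---------------
bin      bin1  bin1
process  sig   bkg
process  0     1
rate     3.0   10.0
---------------
lumi     lnN   1.2   1.2
bkg_norm lnN   -     1.1
"""


def parse_counting_datacard(text):
    """Parse a toy counting datacard into a plain dict (no shape support)."""
    bin_rows, process_rows = [], []
    observed, rates, systematics = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("---"):
            continue
        key, *values = line.split()
        if key in ("imax", "jmax", "kmax"):
            continue  # counts of bins, processes, nuisances; not needed here
        if key == "bin":
            bin_rows.append(values)
        elif key == "observation":
            observed = [float(v) for v in values]
        elif key == "process":
            process_rows.append(values)
        elif key == "rate":
            rates = [float(v) for v in values]
        else:
            # Anything else is treated as a systematic: name, type, per-column effects.
            kind, *effects = values
            systematics.append(
                {
                    "name": key,
                    "type": kind,
                    # "-" means the systematic does not affect that column.
                    "effects": [None if e == "-" else e for e in effects],
                }
            )
    return {
        # First "bin" row pairs with the observations.
        "observations": dict(zip(bin_rows[0], observed)),
        # Second "bin" row and first "process" row label the rate columns.
        "processes": [
            {"bin": b, "name": p, "rate": r}
            for b, p, r in zip(bin_rows[1], process_rows[0], rates)
        ],
        "systematics": systematics,
    }


card = parse_counting_datacard(TOY_CARD)
```

Effects are kept as strings on purpose, since a real datacard can contain entries that are not plain floats; interpreting them would be a separate step on the way to a pyhf workspace.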