scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

Consider arviz for storing results of ToyMC #1163

Open cranmer opened 4 years ago

cranmer commented 4 years ago

Description

While looking into pymc3 I see that they use arviz:

ArviZ is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, data storage, sample diagnostics, model checking, and comparison.

The goal is to provide backend-agnostic tools for diagnostics and visualizations of Bayesian inference in Python, by first converting inference data into xarray objects. See here for more on xarray and ArviZ usage and here for more on InferenceData structure and specification.

The ToyMC approach isn't Bayesian and it doesn't produce a Markov chain, but there are a lot of similarities: you have a bunch of samples, and each sample has best-fit values for the nuisance parameters and the parameters of interest, as well as the log likelihood, etc.

https://arviz-devs.github.io/arviz/schema/schema.html#schema

Note the package is used by several tools: emcee, PyMC3, Pyro and NumPyro, PyStan, CmdStan and CmdStanPy, and TensorFlow Probability.

It also has some handy plotting utilities https://arviz-devs.github.io/arviz/getting_started/Introduction.html#get-started-with-plotting
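
For concreteness, here is a rough sketch (not an existing pyhf API) of how per-toy best-fit results could be packed into an `arviz.InferenceData` via `az.from_dict`; the parameter names and toy arrays below are made up for illustration.

```python
import numpy as np
import arviz as az

n_toys = 500
# hypothetical per-toy fit results: one POI ("mu") and one nuisance parameter
toy_results = {
    "mu": np.random.normal(1.0, 0.3, size=n_toys),
    "gamma_bin_0": np.random.normal(1.0, 0.1, size=n_toys),
}

# az.from_dict expects arrays shaped (chain, draw, ...); treat the toys as a single "chain".
# By default they land in the "posterior" group, which is exactly the naming mismatch
# this issue is about.
idata = az.from_dict(posterior={k: v[np.newaxis, :] for k, v in toy_results.items()})

print(az.summary(idata))  # summary statistics per parameter
az.plot_pair(idata)       # one of the plotting utilities linked above
```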

mathematicalmichael commented 4 years ago

the word "likelihood" shows up a lot in this repository but I'm having a hard time understanding if Bayesian methods are at play anywhere here. I searched for several terms in the issues and this was the only one that came up. @cranmer What statistical framework is being leveraged in this package for the analysis of hypotheses? Is it that which is defined in "Asymptotic formulae for likelihood-based tests of new physics”?

cranmer commented 4 years ago

I'm not sure if "this repository" refers to pyhf or ArviZ; I assume pyhf. The core of pyhf is describing the statistical model p(X|theta), where X is a random variable representing the observed data and theta are the parameters of the model. With observed data x, that gives the likelihood L(theta) = p(X=x|theta). The actual probability model is detailed in various notes, but it is basically a product of Poisson distributions, where the means of the Poissons are parametrized by theta.
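
A minimal sketch of what that looks like through the pyhf API, using the `uncorrelated_background` simple-model helper (available in recent pyhf releases; older versions expose the same model as `hepdata_like`) and made-up numbers:

```python
import pyhf

# one-bin toy model: signal, background, and background uncertainty (illustrative numbers)
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)

observed_counts = [53.0]
# the full data vector is the observed bin counts plus the auxiliary data
# that encodes the constraint terms (discussed later in this thread)
data = observed_counts + model.config.auxdata

theta = model.config.suggested_init()  # one particular value of the parameters
print(model.logpdf(theta, data))       # log L(theta) = log p(X=x | theta)
```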

From that likelihood one could do either Bayesian or frequentist analysis. This package does not include any downstream Bayesian statistical analysis, but there is nothing stopping someone from using MCMC or HMC to calculate a Bayesian posterior based on the likelihood defined and exposed by the library.
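
As an illustration of that point (this is not a pyhf feature, just ordinary use of the exposed logpdf), one could wrap the likelihood in an off-the-shelf sampler such as emcee with flat priors; a rough sketch:

```python
import numpy as np
import emcee
import pyhf

# same toy model and data as in the previous sketch
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
data = [53.0] + model.config.auxdata

def log_prob(theta):
    # flat priors inside pyhf's suggested parameter bounds, -inf outside
    for val, (lo, hi) in zip(theta, model.config.suggested_bounds()):
        if not lo <= val <= hi:
            return -np.inf
    return float(model.logpdf(theta, data)[0])

init = np.array(model.config.suggested_init())
ndim = len(init)
nwalkers = 4 * ndim
p0 = init + 1e-3 * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)
posterior_samples = sampler.get_chain(discard=500, flat=True)
```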

The package does have frequentist statistical analysis tools that follow the "Asymptotic formulae for likelihood-based tests of new physics" paper. Did that answer your question, @mathematicalmichael?
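
For example, the observed CLs for a signal-strength hypothesis can be computed with `pyhf.infer.hypotest`, which uses the asymptotic calculator by default (same toy model and made-up numbers as above):

```python
import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
data = [53.0] + model.config.auxdata

# observed CLs for the hypothesis mu = 1, via the default asymptotic calculator
cls_obs = pyhf.infer.hypotest(1.0, data, model)
print(cls_obs)
```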

mathematicalmichael commented 3 years ago

@cranmer yes! perfectly. that was a wonderful explanation. I really appreciate your prompt reply. I apologize that this is the wrong issue for this discussion, but I do have some clarifying questions. Am I correct, then, to infer that what pyhf enables is the parameterization of the product of those Poisson distributions? I provide a value for theta, and it can evaluate L(theta) (perhaps in the interest of maximizing it)? Then, if so:

(a) does that mean that observations from actual experiments are provided as part of the aforementioned parameterization of L? Or, is it that (b) I (as a user) have to come equipped with those as well, supplying theta and x together in order to evaluate L?

cranmer commented 3 years ago

Sorry I didn't see this follow-up until now. Roughly, the idea is that the statistical model is P(n_1, ..., n_M, a_1, ..., a_C | theta) = \prod_i Pois(n_i | nu_i(theta)) * \prod_j N(a_j | theta_j, 1)

where:
- n_i are the observed event counts in the M bins,
- nu_i(theta) are the expected (Poisson mean) counts in bin i as a function of the parameters theta,
- a_j are auxiliary measurements that constrain some of the parameters theta_j (described next).

In addition, there are likelihood terms that summarize other measurements. In a Bayesian setting they would be priors on the parameters, but we model them with a simple statistical model N(a_j | theta_j, 1). They are like adding a penalty to the likelihood, but usually there are actual measurements behind them, so this is just a convenient way to summarize those. So a_j are estimators for the parameters theta_j, and by convention we rescale / reparameterize so that the specific value of a_j from our previous measurements is 0 and the standard error is 1.

Basically, we have predictions from our simulator for several parameter settings theta_k. For each of these parameter settings we run the simulator and calculate nu_i(theta_k). Then pyhf interpolates based on a specific interpolation strategy. There are some semantics for describing common situations (correlated effects, uncorrelated effects, etc.). The precise way that nu_i depends on theta is described in this note: https://twiki.cern.ch/twiki/pub/RooStats/WebHome/HistFactoryLikelihood.pdf
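
As a small illustration of the nu_i(theta) evaluation described above (again with a made-up two-bin model): `expected_actualdata` returns the interpolated Poisson means per bin, and `expected_data` additionally appends the auxiliary data a_j.

```python
import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0, 10.0], bkg=[50.0, 60.0], bkg_uncertainty=[7.0, 8.0]
)
theta = model.config.suggested_init()

print(model.expected_actualdata(theta))  # nu_i(theta): expected counts per bin
print(model.expected_data(theta))        # expected bin counts followed by the auxiliary data a_j
```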