kratsg opened this issue 6 years ago
For the evaluation of a single CLs point at mu_SIG = 0.0, I tried to see how slow it gets as we keep adding channels. This evaluates
CLsOnePoint(0.0, data, pdf, init_pars, par_bounds)
and the results so far are:
[1 channel] 1 loop, best of 3: 148 ms per loop
[2 channel] 1 loop, best of 3: 521 ms per loop
[3 channel] 1 loop, best of 3: 1.32 s per loop
[4 channel] 1 loop, best of 3: 3.3 s per loop
[5 channel] 1 loop, best of 3: 5.45 s per loop
[10 channels] 1 loop, best of 3: 23.5 s per loop
[34 channels] 1 loop, best of 3: 6min 19s per loop
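These are %timeit-style numbers; as a self-contained illustration of the measurement itself (the cls_one_point body here is a dummy stand-in, since the actual model setup appears further down):

import timeit

def cls_one_point():
    # dummy stand-in for CLsOnePoint(0.0, data, pdf, init_pars, par_bounds)
    return sum(i * i for i in range(10**5))

# "1 loop, best of 3": run once per repeat, take the best of three repeats
best = min(timeit.repeat(cls_one_point, number=1, repeat=3))
print('1 loop, best of 3: {:.3g} s per loop'.format(best))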
I would have expected this to be faster; the scaling behaviour of adding channels seems different from that of adding bins to a channel. I changed the title of this issue accordingly. It would be good to do a study similar to the one @matthewfeickert did for scaling bins, but this time scaling the number of channels.
In the end, ROOT is the baseline for judging how slow is too slow (we should also make sure we use the same number of cores; I remember ROOT using workers to parallelize parts of the computation).
I have this simple code:
spec = {
    'channels': [
        {
            'name': 'singlechannel',
            'samples': [
                {
                    'name': 'signal',
                    'data': [0.5],
                    'modifiers': [{'name': 'mu', 'type': 'normfactor', 'data': None}]
                },
                {
                    'name': 'background',
                    'data': [40.],
                    'modifiers': [
                        {'name': 'syst1', 'type': 'normsys', 'data': {'lo': 0.5, 'hi': 1.5}},
                    ]
                }
            ]
        }
    ]
}
import copy
import time

import pyhf

pyhf.set_backend(pyhf.tensor.numpy_backend())

deltas = []
for i in range(1, 201):
    thisspec = copy.deepcopy(spec)
    thisspec['channels'] = spec['channels'] * i
    debug = pyhf.hfpdf(thisspec, poiname='mu')
    list(map(float, debug.config.auxdata))  # sanity check: auxdata entries cast to float
    start = time.time()
    p = debug.logpdf(debug.config.suggested_init(), [40.] + debug.config.auxdata)
    delta = time.time() - start
    deltas.append(delta)
which gives linear scaling, so on the pdf evaluation front I think we are fine. It would be good to turn this into a test.
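A sketch of what such a test could look like (the linear fit and the tolerance are my choices, not an existing pyhf test):

import numpy as np

def test_logpdf_scales_linearly():
    # `deltas` would come from the timing loop above; here we fake
    # perfectly linear timings just to show the shape of the check
    deltas = 0.002 * np.arange(1, 201)
    n = np.arange(1, len(deltas) + 1)
    slope, intercept = np.polyfit(n, deltas, 1)
    residuals = deltas - (slope * n + intercept)
    # residuals should stay small compared to the per-channel cost
    assert np.max(np.abs(residuals)) < 5 * slope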
pytorch is similar and also linear; this is the script:
spec = {
    'channels': [
        {
            'name': 'singlechannel',
            'samples': [
                {
                    'name': 'signal',
                    'data': [0.5],
                    'modifiers': [{'name': 'mu', 'type': 'normfactor', 'data': None}]
                },
                {
                    'name': 'background',
                    'data': [40.],
                    'modifiers': [
                        {'name': 'syst1', 'type': 'normsys', 'data': {'lo': 0.5, 'hi': 1.5}},
                    ]
                }
            ]
        }
    ]
}
import copy
import time

import pyhf

pyhf.set_backend(pyhf.tensor.numpy_backend())

deltas = []
for i in range(1, 201):
    print('.', i)
    thisspec = copy.deepcopy(spec)
    thisspec['channels'] = spec['channels'] * i
    debug = pyhf.hfpdf(thisspec, poiname='mu')
    list(map(float, debug.config.auxdata))  # sanity check: auxdata entries cast to float
    data = [40.] + debug.config.auxdata
    start = time.time()
    pyhf.runOnePoint(1.0, data, debug,
                     debug.config.suggested_init(), debug.config.suggested_bounds())
    delta = time.time() - start
    deltas.append(delta)
So far I'm still confused about how we hit this issue; from these initial tests it seems like 34 channels shouldn't really be a problem.
This is getting closer to the problem (i.e. each background sample in each channel adds a new syst):
spec = {
    'channels': [
        {
            'name': 'singlechannel',
            'samples': [
                {
                    'name': 'signal',
                    'data': [0.5],
                    'modifiers': [{'name': 'mu', 'type': 'normfactor', 'data': None}]
                },
                {
                    'name': 'background',
                    'data': [40.],
                    'modifiers': [
                        {'name': 'syst1', 'type': 'normsys', 'data': {'lo': 0.5, 'hi': 1.5}},
                    ]
                }
            ]
        }
    ]
}
import copy
import time

import pyhf

pyhf.set_backend(pyhf.tensor.numpy_backend())

deltas = []
for i in range(1, 51):
    print('.', i)
    thisspec = copy.deepcopy(spec)
    thisspec['channels'] = [copy.deepcopy(spec['channels'][0]) for nc in range(i)]
    for m, c in enumerate(thisspec['channels']):
        # give each channel its own name and its own (uncorrelated) systematic
        newname = 'syst1_{}'.format(m)
        thisspec['channels'][m]['name'] = 'chan_{}'.format(m)
        thisspec['channels'][m]['samples'][1]['modifiers'][0]['name'] = newname
    debug = pyhf.hfpdf(thisspec, poiname='mu')
    data = [40.] + debug.config.auxdata
    print(data)
    start = time.time()
    pyhf.runOnePoint(1.0, data, debug,
                     debug.config.suggested_init(), debug.config.suggested_bounds())
    delta = time.time() - start
    deltas.append(delta)
Non-linear scaling because of the additional systematics on each background?
Yes, the complexity of the fit increases, so you need to minimize in spaces of ever higher dimension; this is somewhat expected and also present in ROOT (see @matthewfeickert's report). But the linear scaling in the number of channels is a bit annoying. One could probably make this smarter by collecting similar terms at p.d.f. construction time and evaluating them in a vectorized way (e.g. all gaussians at once), as sketched below.
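To illustrate the "all gaussians at once" idea (a sketch of the concept only, not pyhf's actual constraint code):

import numpy as np
from scipy.stats import norm

alphas = np.zeros(50)    # nuisance parameter values, one per systematic
aux_data = np.zeros(50)  # corresponding auxiliary measurements

# one constraint term at a time, in a Python loop
loop_total = sum(norm.logpdf(a, loc=x) for a, x in zip(alphas, aux_data))

# all constraint terms in a single vectorized call
vec_total = np.sum(norm.logpdf(alphas, loc=aux_data))

assert np.isclose(loop_total, vec_total)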
As for the comparison with the ROOT baseline: the channel calculations are parallelizable, and I think ROOT makes use of that:
[#1] INFO:Fitting -- RooAbsTestStatistic::initSimMode: created 34 slave calculators.
RooAbsTestStatistic::initSimMode: creating slave calculator #0 for state CR_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #1 for state VR_0l_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #2 for state VR_1l_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #3 for state SR_0l_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #4 for state SR_1l_Hnj_Hmeff_cuts (1 dataset entries)
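A sketch of the same per-channel parallelism in Python (the channel_nll pieces here are hypothetical stand-ins, not how pyhf structures its computation):

import numpy as np
from multiprocessing import Pool

def channel_nll(args):
    # hypothetical Poisson negative log-likelihood piece for one channel
    observed, expected = args
    return float(np.sum(expected - observed * np.log(expected)))

if __name__ == '__main__':
    # 34 channels, mirroring the 34 slave calculators above
    channels = [(np.array([40.0]), np.array([41.2]))] * 34
    with Pool() as pool:
        total_nll = sum(pool.map(channel_nll, channels))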
The core of the slowness seems to be that the numpy ops in expected_sample just aren't that fast. I broke it down into a minimal example and ran it through cProfile:
import numpy as np

def expected_sample():
    factors = []
    for i in range(10):
        factors.append(np.ones(5))
    for i in range(3):
        factors.append(2)
    r = np.product(np.stack(np.broadcast_arrays(*factors)), axis=0)
    return r

def logpdf():
    r = [expected_sample() for i in range(210)]
This dummy version of logpdf clocks in at 0.08 s cumulative time per call. For a large model like the MBJ analysis we saw ~10k function calls in the minimization (scipy.optimize; similar to the number of times MINUIT calls the function in ROOT), so this simple dummy alone already accounts for 10k × 0.08 s ≈ 13 minutes, and the real logpdf does more work on top of that.
The solution would be to calculate multiple samples in one go, such that there is only a single call to np.product(np.stack(np.broadcast_arrays(...))).
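A sketch of that batching, assuming the per-sample factors can be padded into a single (n_samples, n_factors, n_bins) array (hypothetical shapes, not the actual pyhf internals):

import numpy as np

def expected_all_samples(factors):
    # factors: shape (n_samples, n_factors, n_bins); one product over the
    # factor axis replaces 210 separate product/stack/broadcast calls
    return np.product(factors, axis=1)

# e.g. 210 samples with 13 factors over 5 bins each
factors = np.ones((210, 13, 5))
rates = expected_all_samples(factors)  # shape (210, 5)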
Also, the broadcast itself seems to be slow; manual broadcasting is a factor of 5 faster for me:
import numpy as np

def expected_sample():
    # manual broadcasting: every factor is pre-expanded to shape (5,)
    factors = []
    for i in range(10):
        factors.append(np.ones(5))
    for i in range(10):
        factors.append(np.ones(5) * 2)
    r = np.product(np.stack(factors), axis=0)
    return r
vs
def expected_sample():
    # np.broadcast_arrays expands the length-1 factors at call time
    factors = []
    for i in range(10):
        factors.append(np.ones(5))
    for i in range(10):
        factors.append([2])
    r = np.product(np.stack(np.broadcast_arrays(*factors)), axis=0)
    return r
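To check the factor of 5, the two variants can be timed head to head; a sketch assuming they are renamed expected_sample_manual and expected_sample_broadcast to avoid the name clash:

import timeit

# assumes the two definitions above were renamed as described
for name in ('expected_sample_manual', 'expected_sample_broadcast'):
    t = timeit.timeit('{}()'.format(name), globals=globals(), number=10000)
    print('{}: {:.2f} s per 10k calls'.format(name, t))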
Description
Here I describe the different steps (as part of #3) for the Multi-b-jet analysis, which I'm setting up with 34 channels. It seems to take forever to evaluate the CLs at a single mu_SIG value, so I link a few visualizations generated from the profiler.log call stacks.
This is the profiler output for 34 channels when calling:
pdf.pdf(init_pars, data)
pyhf.optimizer.unconstrained_bestfit(pyhf.loglambdav,data,pdf,init_pars,par_bounds)
It seems that some optimization could be done to improve the times here. It's certainly not unusably slow, but some of the repeated function calls could probably be cached.
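As one example of such caching (a hypothetical helper, not an actual pyhf internal), functools.lru_cache on a pure function would avoid recomputation across the repeated calls seen in the profile:

from functools import lru_cache

@lru_cache(maxsize=None)
def interp_coefficients(lo, hi):
    # hypothetical pure helper: repeated calls with the same (lo, hi)
    # during minimization hit the cache instead of recomputing
    return (hi - lo) / 2.0, (hi + lo) / 2.0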
To make these, I have a Python file with something like this. With the resulting profiler.log, I run snakeviz profiler.log (pip install snakeviz) and then save the HTML of that page. Follow the instructions here to convert that into a usable HTML that can be saved in a gist; a URL with the right content-type headers can then be generated with https://rawgit.com/.
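The profiling snippet itself isn't reproduced above; as a sketch, a standard cProfile dump that snakeviz can read looks like this (with a dummy workload standing in for the pdf/fit calls):

import cProfile

def workload():
    # stand-in for pdf.pdf(init_pars, data) / unconstrained_bestfit(...)
    return sum(i * i for i in range(10**6))

# write stats in the binary format that snakeviz reads
cProfile.run('workload()', 'profiler.log')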