scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

Benchmark CLs computation performance as function of number of channels #143

Open kratsg opened 6 years ago

kratsg commented 6 years ago

Description

Here I describe the different steps, as part of #3, for the Multi-b-jet analysis I'm setting up with 34 channels. It seems to take forever to evaluate CLs at a single mu_SIG value, so I link a few of the profiler.log visualizations generated from the call stacks.

This is the profiler output for 34 channels when calling the unconstrained best fit shown in the snippet below.

There seems to be room for optimization here. It's not unbearably slow, but some of the repeated function calls could probably be cached.


To make these, I have a Python file with something like this:

import cProfile

import pyhf

# data, pdf, init_pars, par_bounds are built elsewhere from the workspace
pr = cProfile.Profile()
pr.enable()
# pdf.pdf(init_pars, data)
pyhf.optimizer.unconstrained_bestfit(pyhf.loglambdav, data, pdf, init_pars, par_bounds)
pr.disable()
pr.dump_stats('profiler.log')

With this profiler.log, I run snakeviz profiler.log (pip install snakeviz) and then save the HTML of that page. Follow the instructions here to convert it into a standalone HTML file that can be saved in a gist; a URL with the right Content-Type headers can then be generated with https://rawgit.com/.
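
For a quick text-only look at the same dump, the standard-library pstats module also works (an aside, not part of the workflow above):

import pstats

# print the 20 entries with the largest cumulative time from the dump above
stats = pstats.Stats('profiler.log')
stats.sort_stats('cumulative').print_stats(20)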

kratsg commented 6 years ago

For the evaluation of a single CLs point at mu_SIG = 0.0, I tried to see how slow it gets as we keep adding channels. This evaluates

CLsOnePoint(0.0, data, pdf, init_pars, par_bounds)

and the results so far are:

[1 channel]   1 loop, best of 3: 148 ms per loop
[2 channels]  1 loop, best of 3: 521 ms per loop
[3 channels]  1 loop, best of 3: 1.32 s per loop
[4 channels]  1 loop, best of 3: 3.3 s per loop
[5 channels]  1 loop, best of 3: 5.45 s per loop
[10 channels] 1 loop, best of 3: 23.5 s per loop
[34 channels] 1 loop, best of 3: 6min 19s per loop
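
A quick log-log fit of these timings (a back-of-the-envelope check, not a careful study) puts the scaling exponent well above 1:

import numpy as np

channels = np.array([1, 2, 3, 4, 5, 10, 34])
seconds = np.array([0.148, 0.521, 1.32, 3.3, 5.45, 23.5, 379.0])
# least-squares slope in log-log space estimates k in t ~ n**k
k, _ = np.polyfit(np.log(channels), np.log(seconds), 1)
print('scaling exponent ~ {:.1f}'.format(k))  # ~2.3, clearly super-linear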

lukasheinrich commented 6 years ago

I would have expected this to be faster, but the scaling behaviour of adding bins to a channel versus adding channels seems different. I changed the title of this issue; it would be good to do a study similar to the one @matthewfeickert did for scaling bins, but this time scaling channels.

In the end the baseline for judging how slow is too slow is ROOT (we should also make sure we use the same number of cores; I remember ROOT having workers to parallelize parts of the computation).

lukasheinrich commented 6 years ago

I have this simple code

spec = {
    'channels': [
        {
            'name': 'singlechannel',
            'samples': [
                {
                    'name': 'signal',
                    'data': [0.5],
                    'modifiers': [{'name': 'mu', 'type': 'normfactor', 'data': None}]
                },
                {
                    'name': 'background',
                    'data': [40.],
                    'modifiers': [
                        {'name': 'syst1','type': 'normsys','data': {'lo': 0.5, 'hi': 1.5}},
                    ]
                }
            ]
        }
    ]
}

import copy
import time

import pyhf

pyhf.set_backend(pyhf.tensor.numpy_backend())

deltas = []
for i in range(1, 201):
    thisspec = copy.deepcopy(spec)
    thisspec['channels'] = spec['channels'] * i
    debug = pyhf.hfpdf(thisspec, poiname='mu')
    list(map(float, debug.config.auxdata))  # evaluate auxdata once before timing
    start = time.time()
    p = debug.logpdf(debug.config.suggested_init(), [40.] + debug.config.auxdata)
    deltas.append(time.time() - start)

which gives linear scaling, so on the pdf-evaluation front I think we are fine. It would be good to make this into a test; a sketch follows the plot below.

[plot: logpdf evaluation time vs. number of channels (numpy backend), linear]
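
A rough sketch of such a test (the helper name is hypothetical; it consumes the deltas list collected above):

import numpy as np

def assert_linear_scaling(deltas, tol=0.3):
    # fit t ~ n**k in log-log space and fail if k drifts away from 1
    n = np.arange(1, len(deltas) + 1)
    exponent, _ = np.polyfit(np.log(n), np.log(deltas), 1)
    assert abs(exponent - 1.0) < tol, 'scaling exponent {:.2f}'.format(exponent)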

lukasheinrich commented 6 years ago

pytorch is similar and also linear:

[plot: logpdf evaluation time vs. number of channels (pytorch backend), linear]

lukasheinrich commented 6 years ago

This is the script:

spec = {
    'channels': [
        {
            'name': 'singlechannel',
            'samples': [
                {
                    'name': 'signal',
                    'data': [0.5],
                    'modifiers': [{'name': 'mu', 'type': 'normfactor', 'data': None}]
                },
                {
                    'name': 'background',
                    'data': [40.],
                    'modifiers': [
                        {'name': 'syst1','type': 'normsys','data': {'lo': 0.5, 'hi': 1.5}},
                    ]
                }
            ]
        }
    ]
}

import pyhf

pyhf.set_backend(pyhf.tensor.numpy_backend())

import copy
import time
deltas = []
for i in range(1, 201):
    print('.', i)
    thisspec = copy.deepcopy(spec)
    thisspec['channels'] = spec['channels'] * i
    debug = pyhf.hfpdf(thisspec, poiname='mu')
    list(map(float, debug.config.auxdata))  # evaluate auxdata once before timing
    data = [40.] + debug.config.auxdata
    start = time.time()
    pyhf.runOnePoint(1.0, data, debug, debug.config.suggested_init(), debug.config.suggested_bounds())
    deltas.append(time.time() - start)

So far I'm still confused about how we hit this issue; these initial tests suggest 34 channels shouldn't really be a problem.

[plot: runOnePoint time vs. number of channels (shared systematic)]
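
For reference, these plots can be reproduced from the deltas list with something like the following (my guess; the plotting code itself wasn't posted):

import matplotlib.pyplot as plt

plt.plot(range(1, len(deltas) + 1), deltas, 'o-')
plt.xlabel('number of channels')
plt.ylabel('time per runOnePoint call [s]')
plt.show()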

lukasheinrich commented 6 years ago

This is getting closer to the problem (i.e. each background sample in each channel now adds its own systematic):

spec = {
    'channels': [
        {
            'name': 'singlechannel',
            'samples': [
                {
                    'name': 'signal',
                    'data': [0.5],
                    'modifiers': [{'name': 'mu', 'type': 'normfactor', 'data': None}]
                },
                {
                    'name': 'background',
                    'data': [40.],
                    'modifiers': [
                        {'name': 'syst1','type': 'normsys','data': {'lo': 0.5, 'hi': 1.5}},
                    ]
                }
            ]
        }
    ]
}

import pyhf
pyhf.set_backend(pyhf.tensor.numpy_backend())

import copy
import time

deltas = []
for i in range(1, 51):
    print('.', i)
    thisspec = copy.deepcopy(spec)
    thisspec['channels'] = [copy.deepcopy(spec['channels'][0]) for nc in range(i)]
    # give each channel a unique name and its own independent systematic
    for m, channel in enumerate(thisspec['channels']):
        channel['name'] = 'chan_{}'.format(m)
        channel['samples'][1]['modifiers'][0]['name'] = 'syst1_{}'.format(m)
    debug = pyhf.hfpdf(thisspec, poiname='mu')
    data = [40.] + debug.config.auxdata
    print(data)
    start = time.time()
    pyhf.runOnePoint(1.0, data, debug, debug.config.suggested_init(), debug.config.suggested_bounds())
    deltas.append(time.time() - start)

[plot: runOnePoint time vs. number of channels (one systematic per channel), non-linear]

kratsg commented 6 years ago

Non-linear scaling because of the additional systematics on each background?

lukasheinrich commented 6 years ago

Yes: the complexity of the fit increases, so you need to minimize in increasingly high-dimensional spaces. This is somewhat expected and also present in ROOT; see @matthewfeickert's report. But the linear scaling in the number of channels is a bit annoying; one could probably make this smarter by collecting similar terms at p.d.f. construction and evaluating them in a vectorized way (e.g. all Gaussians at once), as sketched after the link below.

https://github.com/matthewfeickert/DIANA-Proposal-Feickert/blob/master/summary_report/summary_report.pdf
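
To illustrate that idea (a toy sketch only, not pyhf internals): rather than evaluating one Gaussian constraint term per systematic, stack all constraint parameters and make a single vectorized call.

import numpy as np
from scipy.stats import norm

def constraint_loop(alphas, aux):
    # one logpdf call per systematic: what a naive construction does
    return sum(norm.logpdf(a, loc=x, scale=1.0) for a, x in zip(aux, alphas))

def constraint_vectorized(alphas, aux):
    # all Gaussian constraint terms in a single vectorized call
    return norm.logpdf(np.asarray(aux), loc=np.asarray(alphas), scale=1.0).sum()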

lukasheinrich commented 6 years ago

As for the comparison with the ROOT baseline: the per-channel calculations are parallelizable, and I think ROOT makes use of that:

[#1] INFO:Fitting -- RooAbsTestStatistic::initSimMode: created 34 slave calculators.
RooAbsTestStatistic::initSimMode: creating slave calculator #0 for state CR_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #1 for state VR_0l_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #2 for state VR_1l_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #3 for state SR_0l_Hnj_Hmeff_cuts (1 dataset entries)
RooAbsTestStatistic::initSimMode: creating slave calculator #4 for state SR_1l_Hnj_Hmeff_cuts (1 dataset entries)
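
The Python analogue would look roughly like this toy sketch (purely illustrative; not something pyhf does):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def channel_nll(args):
    # Poisson negative log-likelihood for one channel, up to constants
    data, expected = args
    return float(np.sum(expected - data * np.log(expected)))

def total_nll(channels):
    # evaluate per-channel terms in parallel, like ROOT's slave calculators
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(channel_nll, channels))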

lukasheinrich commented 6 years ago

The core of the slowness seems to be that the numpy ops in expected_sample just aren't that fast. I broke it down into a minimal example and ran it through cProfile:

import numpy as np

def expected_sample():
    factors = []
    for i in range(10):
        factors.append(np.ones(5))  # per-bin multiplicative factors
    for i in range(3):
        factors.append(2)  # scalar factors that still need broadcasting

    r = np.product(np.stack(np.broadcast_arrays(*factors)), axis=0)

def logpdf():
    # one expected_sample call per (channel, sample) combination
    r = [expected_sample() for i in range(210)]

This dummy version of logpdf clocks in at 0.08 s cumulative time per call. For a large model like the MBJ analysis we saw ~10k function calls in the minimization (scipy.optimize; similar to the number of times MINUIT calls the function in ROOT), so this simple code alone already accounts for roughly 800 s of runtime.

The solution would be to calculate multiple samples in one go, so that there is only a single call to np.product(np.stack(np.broadcast_arrays(...))); see the sketch after the attachments below.

[screenshot: profiler breakdown of the dummy logpdf]

[attachment: out.prof.txt]
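
A sketch of that batched evaluation (shapes assumed to match the dummy above: 210 samples, each with 13 factors over 5 bins):

import numpy as np

n_samples, n_factors, n_bins = 210, 13, 5
# stack every sample's factors into one tensor up front ...
factors = np.ones((n_samples, n_factors, n_bins))
factors[:, 10:, :] = 2.0  # the scalar factors, pre-broadcast
# ... and reduce with one product call instead of 210 separate ones
expected = np.product(factors, axis=1)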

Also, the broadcast itself seems to be slow; manual broadcasting is a factor of 5 faster for me:

# manual broadcasting: expand the scalar factors by hand (fast)
def expected_sample():
    factors = []
    for i in range(10):
        factors.append(np.ones(5))
    for i in range(10):
        factors.append(np.ones(5) * 2)
    r = np.product(np.stack(factors), axis=0)
    return r

vs

# np.broadcast_arrays version (about 5x slower)
def expected_sample():
    factors = []
    for i in range(10):
        factors.append(np.ones(5))
    for i in range(10):
        factors.append([2])
    r = np.product(np.stack(np.broadcast_arrays(*factors)), axis=0)
    return r
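
The factor of 5 can be checked with timeit; the names here are hypothetical, assuming the two variants above are bound to manual and broadcast:

import timeit

# bind the first variant to `manual` and the second
# (np.broadcast_arrays) variant to `broadcast` before running
t_manual = timeit.timeit('manual()', globals=globals(), number=10000)
t_broadcast = timeit.timeit('broadcast()', globals=globals(), number=10000)
print(t_broadcast / t_manual)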