Open kratsg opened 4 years ago
I believe dask would be a nice option for a variety of reasons, primarily scaling up to larger data volumes in the future, but before forming an opinion I wanted to hear about any personal experiences with limitations of dask relative to joblib.
@kratsg For those who don't know, like me, what's the advantage of using concurrent.futures, as is currently done in the draft of PR #1158, over just using joblib (beyond concurrent.futures being built into the language)?
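For reference, a minimal sketch of the concurrent.futures pattern under discussion, with a stand-in compute_teststat function in place of the real per-toy fit (the actual call signature in pyhf differs):

```python
# Sketch of per-toy parallelism with concurrent.futures.
# compute_teststat is a hypothetical stand-in for the real fit.
from concurrent.futures import ProcessPoolExecutor


def compute_teststat(sample):
    # Stand-in for the per-toy fit; the real function runs a statistical fit.
    return sample ** 2


def run_toys(samples, max_workers=None):
    # executor.map preserves input order, like a plain for-loop would
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(compute_teststat, samples))


if __name__ == "__main__":
    print(run_toys([1, 2, 3], max_workers=2))
```

One practical difference worth noting: ProcessPoolExecutor requires picklable callables and arguments, whereas joblib's default "loky" backend uses cloudpickle and is more forgiving about what it can ship to workers.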
So replacing it with something like
from joblib import Parallel, delayed
import tqdm
...
# n_jobs is set as kwarg
signal_teststat = Parallel(n_jobs=n_jobs)(
    delayed(teststat_func)(
        poi_test,
        sample,
        self.pdf,
        self.init_pars,
        self.par_bounds,
        self.fixed_params,
    )
    for sample in tqdm.tqdm(signal_sample, **tqdm_options, desc='Signal-like')
)
(and corresponding code for bkg_teststat), with the default "loky" backend I was seeing rates of over 500 toys/second on branches that have PR #1610 implemented.
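For comparison, here is the same kind of loop sketched with dask.delayed (assuming dask is installed; compute_teststat is again a hypothetical stand-in for the real per-toy fit, not pyhf's API):

```python
# Sketch of the per-toy loop using dask.delayed instead of joblib.
import dask


@dask.delayed
def compute_teststat(sample):
    # Stand-in for the per-toy fit
    return sample + 1


tasks = [compute_teststat(s) for s in [1, 2, 3]]
# The "threads" scheduler keeps this sketch simple; a real CPU-bound
# workload would use the "processes" scheduler or a distributed cluster,
# which is where dask's scaling advantage over joblib comes in.
results = dask.compute(*tasks, scheduler="threads")
```

The appeal of dask here is that the same task graph can later be pointed at a distributed.Client on a cluster without changing the loop itself.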
Description
There are starting to be locations in pyhf where one can parallelize certain calculations on behalf of the user (rather than the user explicitly parallelizing). For example, one that will come up is the toy calculation added in #790, where we need to do a for-loop and calculate the test statistic for each toy. This cannot be batched or vectorized simply, because a statistical fit is performed for each toy (and the number of iterations is not necessarily the same for each toy). There may be other good examples in the code-base in the future where we will want this parallelism.
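The pattern in question can be sketched as a plain per-toy loop. Here fit_one_toy is a hypothetical stand-in whose iteration count varies from toy to toy, which is exactly what blocks simple batching or vectorization:

```python
# Sketch of the sequential per-toy loop that parallelism would replace.
# fit_one_toy is a hypothetical stand-in for the real statistical fit.
def fit_one_toy(toy):
    # Pretend each toy needs a different number of iterations to converge,
    # so the toys cannot share one fixed-shape batched computation.
    n_iterations = 1 + (toy % 3)
    value = toy
    for _ in range(n_iterations):
        value = value / 2
    return value


teststats = [fit_one_toy(toy) for toy in range(5)]
```

Because each loop iteration is independent, the loop body is embarrassingly parallel even though it cannot be vectorized.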
Is your feature request related to a problem? Please describe.
No.
Describe the solution you'd like
Perhaps something like pip install pyhf[toytools] or pyhf[toys-joblib] or pyhf[toys-dask].
Describe alternatives you've considered
Dunno. I didn't think hard enough yet.
Relevant Issues and Pull Requests
Additional context
Nope.