sergpolly opened 10 months ago
Had issues myself using `start_method="fork"` on the outermost `WorkerPool` - it was crashing shamelessly ...
Here is an example of how to use `start_method="spawn"` or `"forkserver"` instead (it is also important to add `use_dill=True`, otherwise pickling starts failing ...):
```python
def _job(data_pack, sample):
    # unpack shared data ...
    clr_dict, view_df = data_pack
    # define cooler to work on ...
    _clr = clr_dict[sample]
    from cooltools.sandbox.obs_over_exp_cooler import expected_full
    # calculate full expected (cis + trans)
    _exp_full = expected_full(
        _clr,
        view_df=view_df,
        smooth_cis=False,
        aggregate_trans=True,
        expected_column_name="expected",
        nproc=8,
    )
    return (sample, _exp_full)

# have to use daemon=False, because _job is multiprocessing-based already ...
with WorkerPool(
    n_jobs=16,
    shared_objects=(clrs_samples_dict, hg38_arms),
    daemon=False,
    start_method="forkserver",  # or "spawn" ...
    use_dill=True,
) as wpool:
    results = wpool.map(
        _job,
        list(clrs_samples_dict.keys()),
        progress_bar=True,
    )

# sort out the results ...
exps_full_dict = {sample: _exp for sample, _exp in results}
```
The key here is that `spawn` isn't "as aware" of the global variables as `fork`, so one has to pass variables explicitly and do imports inside each task ...
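A minimal stdlib sketch of that point (hypothetical sample names, and the standard `multiprocessing` module standing in for mpire - the start-method semantics are the same): with `"spawn"`, each worker starts from a fresh interpreter, so every input has to be passed as an argument and imports the task needs must happen inside the task function.

```python
from multiprocessing import get_context

def _toy_job(args):
    # do imports inside the task: a spawned worker starts with a clean slate
    # and sees none of the parent's globals
    import math
    sample, value = args
    return sample, math.sqrt(value)

def run_demo():
    # hypothetical per-sample data, passed explicitly instead of relying
    # on globals inherited via fork
    data = {"sampleA": 4.0, "sampleB": 9.0}
    ctx = get_context("spawn")  # fresh interpreter per worker
    with ctx.Pool(processes=2) as pool:
        results = pool.map(_toy_job, list(data.items()))
    return dict(results)

if __name__ == "__main__":
    print(run_demo())
```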
Since nested multiprocessing is experimental, it is better to keep track of related issues on the mpire side: https://github.com/sybrenjansen/mpire/issues/105
Consider demonstrating an example of parallel execution of some of the cooltools API functions for multiple samples - i.e. when an API function itself uses multiprocessing and we want to do it in the notebook ...
If one has a big multicore system (16 real cores and more), it is easy to run several CLI tasks in parallel for multiple samples, where each task itself uses several cores - i.e. is itself multiprocessed. Very often such multiprocessed operations do not scale well beyond 8-12 processes, so it is indeed more "economical" to process multiple samples at once with fewer cores each.
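For the CLI route, the pattern is just a shell loop with background jobs. A sketch - `process_sample` is a hypothetical placeholder for the real per-sample command (e.g. a cooltools CLI call that itself uses a few cores):

```shell
# stand-in for a real multiprocessed CLI task, e.g. a cooltools
# subcommand run with several cores via its nproc option
process_sample() {
    echo "processing $1 with 4 cores" > "out_$1.txt"
}

# launch one job per sample in the background ...
for sample in sampleA sampleB sampleC; do
    process_sample "$sample" &
done

# ... then block until every background job has finished
wait
echo "all samples done"
```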
Now - what if we want to achieve the same in the notebook? It is not trivial, because multiprocessing does not allow nesting (the way we typically use it, out of the box). It can, however, easily be done with MPIRE https://github.com/sybrenjansen/mpire , which allows running multiple multiprocessed tasks in parallel, and its API is very similar to multiprocessing itself ... Check it out:
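To see concretely why nesting fails "out of the box", here is a small stdlib sketch: workers of a plain `multiprocessing.Pool` are daemonic, and daemonic processes are not allowed to spawn children, so creating an inner pool inside a worker fails (this is exactly what mpire's `daemon=False` works around).

```python
import multiprocessing as mp

def _inner(x):
    return x * x

def _try_nested(n):
    # this runs inside a worker of the outer pool; that worker is daemonic,
    # so starting the inner pool's processes raises
    # "daemonic processes are not allowed to have children"
    try:
        with mp.Pool(2) as inner_pool:
            return inner_pool.map(_inner, range(n))
    except (AssertionError, RuntimeError) as err:
        return str(err)

def run_demo():
    with mp.Pool(2) as outer_pool:
        return outer_pool.map(_try_nested, [3, 4])

if __name__ == "__main__":
    print(run_demo())
```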
mpire test: (screenshot)
one-by-one using a ton of cores per task: (screenshot)
This has limited application - it requires a project with many samples and a big workstation - but when both criteria are met, the speedup is very much appreciated.