py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

speed up refutation by parallelization #410

Open adg-ci opened 2 years ago

adg-ci commented 2 years ago

Hi, could we try to speed up refutation by running the loops in parallel, both within a refutation method (when looping over the number of simulations) and when looping through a list of refutation methods?

Thanks.

amit-sharma commented 2 years ago

That is a good idea @adg-ci. Any thoughts on how we can achieve that in the cleanest manner (i.e., without having to parallelize each refuter separately)? If you have any suggestions for libraries to use for this, that will be useful too.

amit-sharma commented 2 years ago

It may be a good idea to start with parallelization when looping through a list of refutation methods. This can be done without changing any internal code: we simply need to add an example in the docs on how to achieve this.

astoeffelbauer commented 2 years ago

Hi @amit-sharma,

I'm thinking of using this as my very first open source contribution.

I'd say the multiprocessing or the joblib package would be a good option here: multiprocessing is part of the standard library and joblib is widely used in the scientific Python ecosystem. I'm not sure if one has an intrinsic advantage over the other, so I tried them both.

Please let me know what you think of my two parallel implementations below, or if you had any other solution/library in mind.

Running the three refutation methods in the Getting Started notebook sequentially on my laptop took a noticeable amount of time, so I think parallelization would definitely be useful.

from multiprocessing import Pool
from joblib import Parallel, delayed
from functools import partial

# create dictionaries with method-specific kwargs
kwargs_common = dict(
    method_name="random_common_cause")

kwargs_placebo = dict(
    method_name="placebo_treatment_refuter",
    placebo_type="permute")

kwargs_subset = dict(
    method_name="data_subset_refuter",
    subset_fraction=0.9,
    random_seed=1)

# put into a list
kwargs_list = [kwargs_common, kwargs_placebo, kwargs_subset]

# ---------- using multiprocessing ----------

with Pool(processes=4) as pool:

    # partial function to fix some arguments
    refute = partial(model.refute_estimate, identified_estimand, estimate)

    # run refutation methods asynchronously in parallel
    results = [pool.apply_async(refute, kwds=kwds) for kwds in kwargs_list]

    # collect and print results while the pool is still open
    for res in results:
        print(res.get())

# -------------- using joblib --------------

# partial function to fix some arguments
refute = partial(model.refute_estimate, identified_estimand, estimate)

# run refutation methods in parallel
results = Parallel(n_jobs=4)(delayed(refute)(**kwds) for kwds in kwargs_list)

# print results
for res in results:
    print(res)

amit-sharma commented 2 years ago

Thanks for the code example, @astoeffelbauer. Let's go with joblib since it is the faster option and EconML also uses it, so that will lead to some consistency benefit.

I'd suggest starting a PR with the joblib version added as a faster alternative at the end of the Getting Started notebook, and adding joblib to the requirements.txt file.

In addition to applying joblib across the different refutation methods, can we also apply it inside each method? Each refutation method involves a loop over multiple independent simulations. See, for example, the loop in placebo_treatment_refuter.py: each iteration produces one element of the sample_estimates array, which is later averaged to give the refuter's result. Other refuters employ a similar loop. It would be great if we could apply joblib inside the refuter methods too. What do you think?
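
To illustrate the pattern, here is a minimal sketch; run_one_simulation, num_simulations, and the toy estimate are placeholders, not DoWhy's actual internals. The idea is that the sequential loop filling sample_estimates becomes a single joblib call over independent simulation indices, and the result is averaged as before.

import numpy as np
from joblib import Parallel, delayed

def run_one_simulation(seed):
    # placeholder for one refuter simulation, e.g. re-estimating the effect
    # after permuting the treatment; returns a single point estimate
    rng = np.random.default_rng(seed)
    return rng.normal()

num_simulations = 100

# sequential pattern currently used inside the refuters
sample_estimates = np.array(
    [run_one_simulation(i) for i in range(num_simulations)])

# parallel version: simulations are independent, so joblib can fan them out
sample_estimates = np.array(
    Parallel(n_jobs=-1)(
        delayed(run_one_simulation)(i) for i in range(num_simulations)))

# the refuter result is the average over simulations, as before
refutation_estimate = sample_estimates.mean()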

astoeffelbauer commented 2 years ago

@amit-sharma yep, agreed, I think that would be great. Happy to work on this too.

rahulbshrestha commented 4 months ago

Hey! I noticed most of the refuters are already parallelized after this PR. The one remaining is DummyOutcomeRefuter. Does it make sense to add parallelization for this one too? I can work on it.

amit-sharma commented 4 months ago

Yes, that will be useful, @rahulbshrestha.