Open adg-ci opened 2 years ago
That is a good idea @adg-ci. Any thoughts on how we can achieve that in the cleanest manner (i.e., without having to parallelize each refuter separately)? If you have any suggestions for libraries to use for this, that will be useful too.
It may be a good idea to start with parallelization when looping through a list of refutation methods. This can be done without changing any internal code: we simply need to add an example in the docs on how to achieve this.
Hi @amit-sharma,
I'm thinking of using this as my very first open source contribution.
I'd say the `multiprocessing` or the `joblib` package would be a good option here: `multiprocessing` is in the standard library, and `joblib` is a widely used third-party package. Not sure if one has an intrinsic advantage over the other, so I tried them both.
Please let me know what you think of my two parallel implementations below, or if you had any other solution/library in mind.
Running the three refutation methods in the Getting Started notebook locally on my laptop took quite a while, so I think parallelization would definitely be useful.
```python
from functools import partial
from multiprocessing import Pool

from joblib import Parallel, delayed

# create dictionaries with method-specific kwargs
kwargs_common = dict(method_name="random_common_cause")
kwargs_placebo = dict(method_name="placebo_treatment_refuter",
                      placebo_type="permute")
kwargs_subset = dict(method_name="data_subset_refuter",
                     subset_fraction=0.9,
                     random_seed=1)

# put them into a list
kwargs_list = [kwargs_common, kwargs_placebo, kwargs_subset]

# partial function to fix the positional arguments
refute = partial(model.refute_estimate, identified_estimand, estimate)

# ---------- using multiprocessing ----------
with Pool(processes=4) as pool:
    # submit the refutation methods asynchronously
    async_results = [pool.apply_async(refute, kwds=kwds) for kwds in kwargs_list]
    # collect and print the results
    for res in async_results:
        print(res.get())

# -------------- using joblib --------------
# run the refutation methods in parallel
results = Parallel(n_jobs=4)(delayed(refute)(**kwds) for kwds in kwargs_list)
for res in results:
    print(res)
```
Thanks for the code example, @astoeffelbauer. Let's go with joblib since it is the fastest and EconML also uses it, so that will give us some consistency benefit.
I'd suggest starting a PR with the joblib version added as a faster alternative at the end of the Getting Started notebook, and adding `joblib` to the requirements.txt file.
In addition to applying joblib across different refute methods, can we also apply it inside each method? Each refutation method involves a loop over multiple independent simulations. See, for example, the loop in placebo_treatment_refuter.py. Each iteration produces one element of the sample_estimates
array, which is later averaged to give the refuter's result. Other refuters use a similar loop. It would be great if we could apply joblib inside the refuter methods too. What do you think?
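To make the idea concrete, here is a minimal sketch of what parallelizing the per-simulation loop with joblib could look like. Note that `run_one_simulation` and `num_simulations` are placeholders I've made up for illustration, not DoWhy's actual internals; in the real refuter, the function body would be the existing loop body (e.g., permute the treatment and re-estimate the effect in `placebo_treatment_refuter.py`).

```python
import random
from joblib import Parallel, delayed

def run_one_simulation(seed):
    # placeholder for one refutation simulation; the real refuter would
    # permute the treatment and re-run the estimator here
    rng = random.Random(seed)
    return rng.gauss(0.0, 1.0)

num_simulations = 100  # hypothetical; the refuter's num_simulations parameter

# each iteration is independent, so they can run in parallel;
# passing the seed explicitly keeps results reproducible across workers
sample_estimates = Parallel(n_jobs=2)(
    delayed(run_one_simulation)(seed) for seed in range(num_simulations)
)

# aggregate as the refuter does: average the simulated estimates
refute_result = sum(sample_estimates) / num_simulations
```

One detail worth handling in a real PR: each worker should get its own explicit seed (as above), since relying on a shared global random state across processes can silently produce duplicated simulations.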
@amit-sharma yep, agreed. I think that would be great. Happy to work on this too.
Hey! I noticed most of the refuters are already parallelized after this PR. The one remaining is `DummyOutcomeRefuter`. Does it make sense to add it for this too? I can work on it.
Yes, that will be useful, @rahulbshrestha
Hi, could we try to speed up the refutation by running the loops in parallel, both within a method while looping over number of simulations and also while looping through a list of refutation methods?
Thanks.