mdhaber / scipy

Scipy library main repository
http://scipy.org/scipylib/
BSD 3-Clause "New" or "Revised" License

ENH: scipy._lib._util: add Parallelizer #93

Closed: mdhaber closed this 1 year ago

mdhaber commented 1 year ago

@tupui this is what I had in mind for multiprocessing: instead of adding workers to every function that accepts a callable (e.g. the resampling methods, sobol_indices), we would have the user wrap their callable for multiprocessing, since the callable is usually the only part expensive enough to make multiprocessing worthwhile.

import numpy as np
from scipy import stats
from scipy._lib._util import Parallelizer

rng = np.random.default_rng()
x = rng.random(size=(12, 13, 14))
# Reference result, computed without multiprocessing.
ref = stats.moment(x, axis=1, moment=[2, 3, 4])  # shape (3, 12, 14)

if __name__ == '__main__':
    # Wrap the vectorized callable once; the input is split among workers
    # along `split_axis` and the results rejoined along `concatenate_axis`.
    parallel_moment = Parallelizer(stats.moment, workers=2,
                                   split_axis=-1, concatenate_axis=-1)
    res = parallel_moment(x, axis=1, moment=[2, 3, 4])
    np.testing.assert_equal(res, ref)

    # The settings can also be overridden per call.
    res = parallel_moment(x, axis=1, moment=[2, 3, 4],
                          workers=3, split_axis=0, concatenate_axis=1)
    np.testing.assert_equal(res, ref)

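For readers following along: Parallelizer is the class proposed in this PR, so it is not importable from released SciPy. A minimal, serial sketch of the split/concatenate bookkeeping it would perform might look like the following; the `parallelize` helper is hypothetical, and a real implementation would dispatch the chunks to a process pool (e.g. multiprocessing.Pool.map) instead of the plain `map` used here.

```python
import numpy as np
from scipy import stats


def parallelize(func, workers=2, split_axis=-1, concatenate_axis=-1):
    """Hypothetical sketch of Parallelizer's core logic (serial version)."""
    def wrapped(x, *args, **kwargs):
        # Split the input into `workers` chunks along `split_axis`...
        chunks = np.array_split(x, workers, axis=split_axis)
        # ...evaluate the vectorized callable on each chunk (a real
        # implementation would map over a process pool here)...
        results = map(lambda chunk: func(chunk, *args, **kwargs), chunks)
        # ...and reassemble the pieces along `concatenate_axis`.
        return np.concatenate(list(results), axis=concatenate_axis)
    return wrapped


rng = np.random.default_rng(1234)
x = rng.random(size=(12, 13, 14))
ref = stats.moment(x, axis=1, moment=[2, 3, 4])

parallel_moment = parallelize(stats.moment, workers=2,
                              split_axis=-1, concatenate_axis=-1)
res = parallel_moment(x, axis=1, moment=[2, 3, 4])
np.testing.assert_allclose(res, ref)
```

This only gives an exact result because each output element depends on a single slice along the reduced axis, so splitting along a different axis partitions the work cleanly.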
tupui commented 1 year ago

Interesting idea. This would solve the issue of parallelism on the stats function itself. To solve the problem for sobol_indices, we would call this from within the function, correct? With func being the user function. I am not sure this simplifies things much compared to calling our mapper, though. Or could a user apply this to avoid vectorising over the number of outputs, relying on it instead?

mdhaber commented 1 year ago

To solve the problem for sobol_indices, we would call this from within the function, correct?

This was not the intent. The idea is that the user would use it to parallelize their function before passing it into sobol_indices. The code inside sobol_indices is so simple and vectorized that multiprocessing can't help it much at all; the only place multiprocessing can potentially pay off is in evaluating the user's function.

We could have sobol_indices accept a workers parameter and then apply Parallelizer to the function internally, but it would not be much harder for the user to do it themselves. Since multiprocessing introduces pain points (e.g. the need for an `if __name__ == '__main__'` guard, and functions that must be picklable), I think users need to take some responsibility for it rather than us hiding everything inside the SciPy functions. And if users have trouble, I'd rather get bug reports about Parallelizer than function-specific bug reports that boil down to multiprocessing issues.
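To make the suggested workflow concrete: the user would parallelize their own model, then pass the wrapped callable to sobol_indices. The sketch below is illustrative, not SciPy API: `ishigami` is an example model and `parallel_func` is a hypothetical helper. A thread pool is used so the sketch stays self-contained and avoids the pickling and `__main__`-guard issues mentioned above; the real Parallelizer would use worker processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def ishigami(x):
    # Vectorized example model: x has shape (d, n); returns shape (n,).
    return np.sin(x[0]) + 7 * np.sin(x[1])**2 + 0.1 * x[2]**4 * np.sin(x[0])


def parallel_func(func, workers=2):
    """Hypothetical helper: evaluate a vectorized model in parallel."""
    def wrapped(x):
        # Split the (d, n) sample block along the sample axis n,
        # evaluate the chunks concurrently, then reassemble.
        chunks = np.array_split(x, workers, axis=-1)
        with ThreadPoolExecutor(max_workers=workers) as executor:
            parts = list(executor.map(func, chunks))
        return np.concatenate(parts, axis=-1)
    return wrapped


rng = np.random.default_rng(5)
x = rng.uniform(-np.pi, np.pi, size=(3, 1024))
# The wrapped callable behaves exactly like the original...
np.testing.assert_allclose(parallel_func(ishigami)(x), ishigami(x))
# ...so the user would then pass `parallel_func(ishigami)` as the
# `func` argument of sobol_indices instead of `ishigami` itself.
```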

With func being the user function. I am not sure this simplifies things much compared to calling our mapper, though.

Do you mean MapWrapper? This simplifies things compared to using MapWrapper directly. It takes one line rather than many.
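For comparison, here is roughly what direct use of MapWrapper looks like: the caller must handle the chunking, the argument binding, and the reassembly by hand, which is the boilerplate Parallelizer would absorb into one line. MapWrapper is SciPy's existing (private) pool abstraction; `pool=1` keeps this sketch serial so no subprocesses are spawned.

```python
from functools import partial

import numpy as np
from scipy import stats
from scipy._lib._util import MapWrapper

x = np.random.default_rng(0).random(size=(12, 13, 14))
ref = stats.moment(x, axis=1, moment=[2, 3, 4])

# Direct MapWrapper use: split, bind keyword arguments, map, reassemble.
chunks = np.array_split(x, 2, axis=-1)
f = partial(stats.moment, axis=1, moment=[2, 3, 4])
with MapWrapper(pool=1) as mapper:
    parts = list(mapper(f, chunks))
res = np.concatenate(parts, axis=-1)

np.testing.assert_allclose(res, ref)
```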

Or could a user apply this to avoid vectorising over the number of outputs, relying on it instead?

No, that is not the intent. This is intended for use precisely when func is already vectorized.