sphinx-gallery / sphinx-gallery

Sphinx extension for automatic generation of an example gallery
https://sphinx-gallery.github.io
BSD 3-Clause "New" or "Revised" License

Multiprocessor support? #25

Closed Titan-C closed 4 months ago

Titan-C commented 9 years ago

I was wondering if one could generate the examples in parallel processes. I know Python has a feature for sending processes to individual CPUs. It can't be used directly right now, because we currently scan the example folder recursively instead of first generating a list of example files to process. It would also need to keep track of the back references.
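
Something like this minimal sketch is what I have in mind (not existing sphinx-gallery code; build_example is a hypothetical per-file helper):

```python
from multiprocessing import Pool
from pathlib import Path


def build_example(example_path):
    """Hypothetical per-file helper: execute one example and write its output to disk."""
    ...


if __name__ == "__main__":
    # Build the full list of example files first ...
    example_files = sorted(Path("examples").rglob("plot_*.py"))
    # ... then hand them out to worker processes instead of walking the tree serially.
    with Pool() as pool:
        pool.map(build_example, example_files)
```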

GaelVaroquaux commented 9 years ago

Hi,

I think that this is not high priority. I would much rather focus on getting IPython notebook output.

lesteve commented 9 years ago

Agreed. FWIW I have some hacky lines of bash using GNU parallel to run the examples in parallel and make sure that they all work. For some reason I haven't fully investigated, the speed-up I got running the nilearn examples was 2x (i.e. roughly 10 minutes instead of 20 minutes) even on 4 cores.

All I am trying to say is that such a speed-up is not going to change your life that much.

GaelVaroquaux commented 9 years ago

the speed-up I got running the nilearn examples was 2x (i.e. roughly 10 minutes instead of 20 minutes).

I/O?

All I am trying to say is that such a speed-up is not going to change your life that much.

Well, for the wider ecosystem anyhow IPython notebook support would be a much bigger benefit.

larsoner commented 8 years ago

FYI you can now run your own examples using make html_dev-pattern, so a few lines of bash can be used to customize it for each repo. It's not automated, but it works, e.g. on CircleCI.

ogrisel commented 8 years ago

Now that notebook exports have been implemented, I think we should bump up the priority of this one. Building the scikit-learn documentation takes ages, and lazy reviewers tend not to check the rendering and integration of the generated example figures because of that. I have 12 cores on my workstation and only one is working at the moment...

ogrisel commented 8 years ago

@lesteve this could probably be implemented with joblib.Parallel quite easily instead of using GNU parallel. One just needs to make sure that the example-building function does not return large Python objects back to the main process, but instead writes the output of the execution (e.g. images and joblib-cached data) directly to disk.
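
Roughly like this (a sketch only; execute_example and the paths are hypothetical, not sphinx-gallery's API):

```python
from joblib import Parallel, delayed


def execute_example(script_path, output_dir):
    """Hypothetical worker: run one example and save its figures under output_dir.

    Only a small summary dict is returned, so no large Python objects are
    pickled back to the main process.
    """
    # ... execute the script and write images / joblib cache files to disk ...
    return {"script": script_path, "status": "ok"}


example_scripts = ["examples/plot_a.py", "examples/plot_b.py"]  # hypothetical paths
results = Parallel(n_jobs=4)(
    delayed(execute_example)(path, "gallery_output") for path in example_scripts
)
```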

lesteve commented 8 years ago

Now that notebook exports have been implemented I think we should bump up the priority of this one.

I wasn't aware that generating notebooks would take a lot of time... are you confident that's the culprit here?

I guess using joblib.Parallel would alleviate the problem seen in #57 (seaborn style set in one example was kept for all the other examples) since you could run each example in a separate process as mentioned in https://github.com/sphinx-gallery/sphinx-gallery/pull/140#issuecomment-243461269.

NelleV commented 7 years ago

That would be super useful for Matplotlib!

agramfort commented 7 years ago

I think this is now timely

NelleV commented 7 years ago

I've worked on this, and I have a working proof of concept using joblib. There are still a few things I need to figure out, such as how to get the number of jobs the user provided to Sphinx (which isn't documented at all…). How do you guys feel about adding joblib as a dependency? Should I work only from the stdlib?
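
One possible way to expose the setting (a sketch only; the n_jobs key is hypothetical, not an existing option) would be to piggyback on the sphinx_gallery_conf dict in conf.py and read it back with a serial default:

```python
# conf.py -- hypothetical "n_jobs" entry alongside the usual gallery options
sphinx_gallery_conf = {
    "examples_dirs": "../examples",
    "gallery_dirs": "auto_examples",
    "n_jobs": 4,  # hypothetical: number of worker processes
}


# inside the extension -- read the value back, defaulting to serial execution
def get_n_jobs(gallery_conf):
    return gallery_conf.get("n_jobs", 1)
```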

choldgraf commented 7 years ago

I dunno - I hear those joblib folks are a buncha jerks ;-)

(in seriousness, I'm +1 on a joblib dependency if it means avoiding a lot of multiprocessing complexity here...)

larsoner commented 7 years ago

I imagine it could pretty easily be made an optional dependency only used/imported if the default is changed from 1 to something else, so +1 from me.

lesteve commented 7 years ago

@NelleV out of interest, what kind of speed-up do you get with multiprocessing?

GaelVaroquaux commented 7 years ago

I imagine it could pretty easily be made an optional dependency only used/imported if the default is changed from 1 to something else, so +1 from me.

+1 for optional. +0 for non optional.

agramfort commented 7 years ago

let's make it optional.

here is how we handle it in mne

https://github.com/mne-tools/mne-python/blob/master/mne/parallel.py#L22

NelleV commented 7 years ago

I'll make it optional.

choldgraf commented 7 years ago

(in particular here's the joblib checking code in MNE: https://github.com/mne-tools/mne-python/blob/master/mne/parallel.py#L77)
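
For reference, the general pattern (not MNE's exact code) is just an optional import with a serial fallback:

```python
import warnings


def parallel_map(func, items, n_jobs=1):
    """Apply func to items, in parallel only if joblib is installed and n_jobs > 1."""
    if n_jobs == 1:
        return [func(item) for item in items]
    try:
        from joblib import Parallel, delayed
    except ImportError:
        warnings.warn("joblib is required for n_jobs > 1; falling back to serial execution.")
        return [func(item) for item in items]
    return Parallel(n_jobs=n_jobs)(delayed(func)(item) for item in items)
```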

jschueller commented 3 years ago

hi @NelleV, do you still have that code somewhere?

NelleV commented 3 years ago

Not easily available: I have changed computers since, and I don't have the branch on GitHub… Also, the code has changed so much since then that I'm pretty sure mine would be useless these days.

jschueller commented 3 years ago

I tried a simple approach with ProcessPoolExecutor to parallelize the loop in generate_dir_rst, which iterates over the files within the same directory, but with no luck. Maybe someone would want to check #877. The sketch below shows roughly the shape of the change.
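
Roughly what I attempted (a sketch, not the actual PR; build_one_file stands in for the real per-file build step and its signature):

```python
from concurrent.futures import ProcessPoolExecutor


def build_one_file(fname, src_dir, target_dir, gallery_conf):
    """Stand-in for the per-file build step; writes its outputs to disk."""
    ...


def generate_dir_rst_parallel(sorted_listing, src_dir, target_dir, gallery_conf,
                              max_workers=4):
    # Submit one task per example file instead of looping over them serially.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(build_one_file, fname, src_dir, target_dir, gallery_conf)
            for fname in sorted_listing
        ]
        # Re-raise any exception that occurred in a worker.
        for future in futures:
            future.result()
```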

larsoner commented 3 years ago

Thinking about this a bit more, I'd expect this not to work for (at least) the matplotlib, mayavi, and pyvista scrapers, because these all rely on global state. And then there will be tricky interactions with reset_modules, which by default also manipulates global state, at least for matplotlib. So I'm not sure this will ever work (easily), at least for the majority of our users :(

NelleV commented 3 years ago

It depends on how it is implemented. Using multiprocessing (or joblib), each example runs in its own process with its own interpreter state, so there should be no problem with shared global state.

larsoner commented 3 years ago

Ahh right, I hadn't thought about that!

jschueller commented 3 years ago

ProcessPoolExecutor uses multiprocessing, right? https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/process.py