Hi,
I think that this is not high priority. I would much more like to focus on getting IPython notebook output.
Agreed. FWIW, I have some hacky lines of bash using GNU parallel to run the examples in parallel and make sure that they all work. For some reason I haven't fully investigated, the speed-up I got running the nilearn examples was only 2x (i.e. roughly 10 minutes instead of 20 minutes), even on 4 cores.
All I am trying to say is that such a speed-up is not going to change your life that much.
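For reference, here is a rough Python equivalent of that GNU parallel trick — a minimal sketch, assuming the examples live under `examples/` and follow the `plot_*.py` naming convention (both assumptions, not sphinx-gallery internals):

```python
# Minimal sketch: run each example script in its own interpreter process,
# mirroring the GNU parallel approach described above.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import subprocess
import sys

def run_example(path):
    # A fresh interpreter per script, so no state leaks between examples.
    result = subprocess.run([sys.executable, str(path)], capture_output=True)
    return path, result.returncode

if __name__ == "__main__":
    scripts = sorted(Path("examples").rglob("plot_*.py"))
    with ProcessPoolExecutor(max_workers=4) as pool:
        for path, code in pool.map(run_example, scripts):
            print(f"{path}: {'OK' if code == 0 else 'FAILED'}")
```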
> the speed-up I got running the nilearn examples was 2x (i.e. roughly 10 minutes instead of 20 minutes).
I/O?
> All I am trying to say is that such a speed-up is not going to change your life that much.
Well, for the wider ecosystem anyhow IPython notebook support would be a much bigger benefit.
FYI you can now run your own examples using `make html_dev-pattern`, so some lines of bash can be used to customize it for each repo. It's not automated, but it works e.g. for CircleCI.
Now that notebook exports have been implemented, I think we should bump up the priority of this one. Building the scikit-learn documentation takes ages, and lazy reviewers tend not to check the rendering and integration of example-generated figures because of that. I have 12 cores on my workstation and only one is working at the moment...
@lesteve this could probably be implemented with joblib.Parallel quite easily instead of using GNU parallel. One just needs to make sure that the example-building function does not return large Python objects to the main process, but instead writes the output of the execution (e.g. images and joblib-cached results) directly to disk.
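Something like this minimal sketch, perhaps — `build_example` here is a hypothetical stand-in for sphinx-gallery's per-example execution, not actual API:

```python
# Sketch of the suggestion above: workers write images and cached results
# to disk and return only small status tuples, never large Python objects.
from joblib import Parallel, delayed

def build_example(script_path, out_dir):
    # ... execute the script, saving figures and joblib cache under out_dir ...
    return script_path, "ok"  # lightweight metadata only

def build_all(script_paths, out_dir, n_jobs=4):
    return Parallel(n_jobs=n_jobs)(
        delayed(build_example)(path, out_dir) for path in script_paths
    )
```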
> Now that notebook exports have been implemented I think we should bump up the priority of this one.
I wasn't aware that generating notebooks would take a lot of time… are you confident that's the culprit here?
I guess using joblib.Parallel would alleviate the problem seen in #57 (seaborn style set in one example was kept for all the other examples) since you could run each example in a separate process as mentioned in https://github.com/sphinx-gallery/sphinx-gallery/pull/140#issuecomment-243461269.
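A tiny demonstration of that isolation (assumed illustrative code, not sphinx-gallery's): style state set in a joblib worker process does not leak back into the parent:

```python
# rcParams set in a worker process stay in that process.
import matplotlib
from joblib import Parallel, delayed

def set_style_and_report():
    matplotlib.rcParams["figure.figsize"] = [1.0, 1.0]
    return list(matplotlib.rcParams["figure.figsize"])

if __name__ == "__main__":
    before = list(matplotlib.rcParams["figure.figsize"])
    # With a process-based backend the change happens only in the children.
    print(Parallel(n_jobs=2)(delayed(set_style_and_report)() for _ in range(2)))
    print(list(matplotlib.rcParams["figure.figsize"]) == before)  # True
```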
That would be super useful for Matplotlib!
I think this is now timely.
I've worked on this, and I have a working proof of concept using joblib. There are still a bunch of things I need to figure out, such as how to get the number of jobs the user provided to Sphinx (which isn't documented at all…). How do you guys feel about adding joblib as a dependency? Should I work only from the stdlib?
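For what it's worth, the `-j` value Sphinx was started with seems to be exposed as `app.parallel`; that attribute appears to be undocumented, so treat this sketch as an assumption rather than a guaranteed API:

```python
# Hedged sketch: reading the user's -j/--jobs value from inside an extension.
def builder_inited(app):
    n_jobs = getattr(app, "parallel", 1) or 1  # 0/missing -> serial fallback
    print(f"would build examples with n_jobs={n_jobs}")

def setup(app):
    app.connect("builder-inited", builder_inited)
    return {"version": "0.1", "parallel_read_safe": True}
```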
I dunno - I hear those joblib folks are a buncha jerks ;-)
(in seriousness, I'm +1 on a joblib dependency if it means avoiding a lot of multiprocessing complexity here...)
I imagine it could pretty easily be made an optional dependency only used/imported if the default is changed from 1 to something else, so +1 from me.
@NelleV out of interest, what kind of speed-up do you get with multiprocessing?
> I imagine it could pretty easily be made an optional dependency only used/imported if the default is changed from 1 to something else, so +1 from me.
+1 for optional. +0 for non optional.
Let's make it optional.
Here is how we handle it in MNE:
https://github.com/mne-tools/mne-python/blob/master/mne/parallel.py#L22
I'll make it optional.
(in particular here's the joblib checking code in MNE: https://github.com/mne-tools/mne-python/blob/master/mne/parallel.py#L77)
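In the same spirit as the MNE code linked above, the optional dependency could look roughly like this (a sketch, not MNE's actual implementation):

```python
# Optional joblib dependency: fall back to a serial loop when joblib is
# missing or when n_jobs == 1.
def get_map(n_jobs):
    if n_jobs != 1:
        try:
            from joblib import Parallel, delayed
        except ImportError:
            print("joblib is not installed; falling back to n_jobs=1")
        else:
            return lambda func, items: Parallel(n_jobs=n_jobs)(
                delayed(func)(item) for item in items)
    return lambda func, items: [func(item) for item in items]
```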
Hi @NelleV, do you still have that code somewhere?
Not easily available: I've changed computers since then, and I don't have the branch on GitHub… Also, the codebase has changed so much since then that I'm pretty sure my code would be useless these days.
I tried a simple approach with ProcessPoolExecutor to parallelize the loop in generate_dir_rst, which iterates over the files within the same directory, but had no luck; maybe someone would want to check #877.
Thinking about this a bit more, I'd expect this not to work for (at least) the matplotlib, mayavi, and pyvista scrapers, because these are all based on global state. And then there will be tricky interactions with reset_modules, which by default also manipulates global state, at least for matplotlib. So I'm not sure this will ever work (easily) for the majority of our users :(
It depends how it is implemented. With multiprocessing (or joblib), each example runs in its own process with its own module-level state, so there should be no problem.
Ahh right, I hadn't thought about that!
ProcessPoolExecutor uses multiprocessing, right? https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/process.py
I was wondering if one could generate the examples in parallel processes? I know Python has a feature for sending processes to individual CPUs. It cannot be used directly right now, because we currently scan the example folder recursively instead of first generating a list of example files to process. It would also need to keep track of the back references.