sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

Restructuring multiprocessing #11448

Open danwos opened 1 year ago

danwos commented 1 year ago

Is your feature request related to a problem? Please describe. The current implementation of the multiprocessing architecture has some problems, especially with bigger projects (500 - 25,000 pages).

I would like to work on this, but I need some feedback about concepts and ideas, as the needed changes will affect parts of the architecture.

From some comments and older issues, it looks like the multiprocessing support was mainly created to deal with IO waiting times and make the writing more efficient; parallelizing the computation itself does not seem to have been one of the main goals.

The reasons for the above problems are:

1. Processes are created too often

Sphinx calculates chunks based on the number of documents and the number of cores (-j X).

Each chunk gets its own process, which is started once a prior process is done and terminated (simply put). The number of chunks is calculated in a way that, for bigger projects, you get many more chunks than you have configured via -j. This also means that several new processes are "forked" over time and each gets a copy of the growing environment of the main process. Process creation costs time.

Solution idea: The number of chunks should match -j X, so that long-running processes are created only once, as sketched below. (Drawback: the current log collection needs an update, as collecting the logs only when a process is done leads to no output for the user for several minutes.)
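A minimal sketch of that chunking idea (an illustrative helper, not Sphinx's actual implementation): split the document list into exactly -j X chunks of roughly equal size, so each worker process is created only once.

def make_chunks_per_worker(docnames: list[str], nproc: int) -> list[list[str]]:
    # hypothetical helper: at most `nproc` chunks, sizes differ by at most one
    nproc = max(1, min(nproc, len(docnames)))
    size, rest = divmod(len(docnames), nproc)
    chunks, start = [], 0
    for i in range(nproc):
        end = start + size + (1 if i < rest else 0)
        chunks.append(docnames[start:end])
        start = end
    return chunks

# e.g. 10 documents with -j 3 -> chunk sizes [4, 3, 3] instead of many small batches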

2. Processes are "forked", not "spawned"

Each forked process gets a copy of the main process's memory, which contains the complete environment including all document information from already processed docs.

This means that if the main process is using 4 GB of RAM and you are working on an 8-core system (-j 8), Sphinx will create 7 parallel processes and all of them get a copy of the 4 GB => 4 GB + 7 * 4 GB = 32 GB of free RAM needed. The only workaround is to reduce the number of cores, e.g. -j 4, but that may cost you up to 50% of build performance.

Solution idea: Less process creation, and if possible "spawning" instead of "forking". "Spawning" does not make a memory copy of the main process, so data needs to be passed to the child processes via a pipe. This would be a huge conceptual change. I also found a PR where "forking" was added for Mac OS X support.
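For illustration, the difference between the start methods in Python's multiprocessing module (the worker function is a placeholder, not Sphinx code): "spawn" starts a fresh interpreter and only receives what is explicitly (and picklably) passed to it, while "fork" copies the parent's whole address space.

import multiprocessing as mp

def render_doc(docname: str) -> str:
    # placeholder for the per-document work
    return f"rendered {docname}"

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # "fork" would copy the parent's memory instead
    with ctx.Pool(processes=4) as pool:
        results = pool.map(render_doc, ["index", "usage", "api"])
    print(results)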

3. Serialized vs Parallel tasks

Before Sphinx starts the parallel tasks, it performs serial tasks in the main process. The serial calculation for a document often takes longer than the parallel calculation part for that document.

Also, a process gets started only after Sphinx has done all the serial calculations for a number of docs (a chunk). So if you have 7 chunks, the process for the last chunk gets started only when the serial calculation for all the other chunks in the main process is done, which may take 20 minutes in bigger projects. In the end the parallel processes are not started at the same time: some are running, some are still waiting to be started, and some are already done with their tasks.

Solution idea: Recheck whether certain tasks can be moved into parallel execution to keep the serialized build time as small as possible (see the sketch below). For a given -j X, Sphinx should create X parallel processes. Currently it is X-1, to save one core for the main process, but the main process has nothing to do after starting the last process, so one core is left unused.
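A rough sketch of that idea (function names are placeholders, not Sphinx APIs): hand each chunk to a long-lived worker pool as soon as its serial preparation finishes, so workers are already running while the main process is still preparing later chunks.

from concurrent.futures import ProcessPoolExecutor, as_completed

def prepare_serial(chunk):       # runs in the main process
    return [doc.upper() for doc in chunk]

def process_parallel(prepared):  # runs in a worker process
    return len(prepared)

def build(chunks, nproc):
    with ProcessPoolExecutor(max_workers=nproc) as pool:
        # a worker starts on chunk 0 while the main process is still
        # preparing chunks 1, 2, ...
        futures = [pool.submit(process_parallel, prepare_serial(chunk)) for chunk in chunks]
        for future in as_completed(futures):
            print("chunk done:", future.result())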

Additional context I have already tested a solution for problem 1 by creating only as many chunks as configured by -j X. This alone reduces memory consumption by ~20-30%, as the processes are forked at the beginning, when the environment is still quite small and so is the copied memory.

My background I'm a maintainer of some Sphinx extensions (like Sphinx-Needs) and provide support for some bigger company-internal Sphinx projects in the automotive industry. A lot of API docs get created and different extensions are used, so project size and complexity are often really huge. Scalability of Sphinx is a big topic for keeping build times short for local builds on developer machines. Therefore I have already done a lot of analysis and enhancement for Sphinx extensions and also created Sphinx-Performance for easier technical memory and runtime comparisons of different Sphinx setups.

Question Does a core developer think working on multiprocessing makes sense? I would like to get some thoughts before working on a PR that may not get merged later :)

And sorry for the long text, it is not an easy topic :)

picnixz commented 1 year ago

Before addressing multiprocessing as a whole, it would also make sense to know where most of the workload is located and possibly optimize those parts first.

As pointed out by another issue (#11282), resolving references is usually a bottleneck in the overall computation. In addition, concerning memory complexity, I don't know whether #11337 (addressed by #11338) can improve this or not. Anyway, I think that finding the culprits in large projects would also give us better insight into how to parallelize things (and what we actually need to communicate to the different processes). Also, I think it makes sense to have more than one PR addressing this issue, or multiple people working on it. In any case, I'd be happy if we could improve the complexity of the build process in general.

danwos commented 1 year ago

Thanks for the feedback, and I totally agree: there seem to be some functions which consume a lot of the runtime and can (hopefully) be optimized.

For me, getting the multiprocessing right would bring the biggest performance boost (as a single task), so personally I would concentrate on this for the moment.

However, I did some Sphinx analysis a few months ago and created a document with some numbers: Google Docs sheet: Sphinx analysis. It mostly concentrates on showing the impact of parallel builds and of selecting the right theme.

It was created with the help of Sphinx-Performance, which creates all the needed bigger projects in seconds and automates the measurement. So it could be a great support tool, especially as analysis tools like memray and pyinstrument are integrated.

An interesting result (chart attached as an image): I would have expected that the build time per file stays nearly the same, no matter whether 10 docs or 100 docs get generated. But that's not the case, and the same file may need 50 ms if built in a small project or 150 ms as part of a bigger project. So Sphinx does not really scale well...

And yes, we should totally split the different tasks into separate PRs or even issues. But maybe this issue can be used to discuss the overall picture and collect a list of findings/ideas for multiprocessing before breaking it down.

picnixz commented 1 year ago

But maybe this issue can be used to discuss the overall picture and collect a list of findings/ideas for multiprocessing before breaking it down.

Yes. It's good to have a "global" issue where we can put some tasks. Unfortunately, I don't think there is a possibility to have "sub-issues" (which would be good).

I would have expected that the build time per file stays nearly the same, no matter whether 10 docs or 100 docs get generated.

The problem is that the more documents you have, the more references you will have as well. Also, it is important to distinguish between the building phase and the writing phase, even more so if you are using autodoc or autosummary. I have worked a lot on autodoc because I want to minimize the amount of documentation I duplicate (I always write the documentation at the level of the code and hardly any documentation in standalone RSTs). Now, it is worth looking at the following flow:

1. event.config-inited(app,config)
2. event.builder-inited(app)
3. event.env-get-outdated(app, env, added, changed, removed)
4. event.env-before-read-docs(app, env, docnames)

for docname in docnames:
   5. event.env-purge-doc(app, env, docname)

   if doc changed and not removed:
      6. source-read(app, docname, source)
      7. run source parsers: text -> docutils.document
         - parsers can be added with the app.add_source_parser() API
      8. apply transforms based on priority: docutils.document -> docutils.document
         - event.doctree-read(app, doctree) is called in the middle of transforms,
           transforms come before/after this event depending on their priority.

9. event.env-merge-info(app, env, docnames, other)
   - if running in parallel mode, this event will be emitted for each process

10. event.env-updated(app, env)
11. event.env-get-updated(app, env)
12. event.env-check-consistency(app, env)

# The updated-docs list can be builder dependent, but generally includes all new/changed documents,
# plus any output from `env-get-updated`, and then all "parent" documents in the ToC tree
# For builders that output a single page, they are first joined into a single doctree before post-transforms
# or the doctree-resolved event is emitted
for docname in updated-docs:
   13. apply post-transforms (by priority): docutils.document -> docutils.document
   14. event.doctree-resolved(app, doctree, docname)
       - In the event that any reference nodes fail to resolve, the following may emit:
       - event.missing-reference(env, node, contnode)
       - event.warn-missing-reference(domain, node)

15. Generate output files
16. event.build-finished(app, exception)

The main phases are: initialization, reading, consistency checks, resolving, and writing.

In your graph above, it would be more relevant to know which phase is the bottleneck. Also, I didn't check which files are being built, but having dependencies across files or using autodoc may also increase the building phase (and the resolution phase). Concerning the theme to choose, the difference is essentially due to the underlying implementation (IIRC only the alabaster theme is maintained by Sphinx).

For now, I suggest focusing on getting the right timings and memory consumption for each phase.
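A possible starting point, as a small local extension in conf.py that timestamps the phase boundaries via existing events (the event names are real Sphinx events; the phase mapping is only approximate and purely illustrative):

import time

TIMES = {}

def _start(app):
    TIMES["start"] = time.perf_counter()

def _reading_done(app, env):
    TIMES["reading_done"] = time.perf_counter()
    return ()  # env-updated handlers may return docnames to re-write; none here

def _report(app, exception):
    end = time.perf_counter()
    start = TIMES.get("start", end)
    reading = TIMES.get("reading_done", end)
    print(f"reading phase       ~ {reading - start:.1f}s")
    print(f"resolving + writing ~ {end - reading:.1f}s")

def setup(app):
    app.connect("builder-inited", _start)
    app.connect("env-updated", _reading_done)
    app.connect("build-finished", _report)
    return {"parallel_read_safe": True, "parallel_write_safe": True}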

danwos commented 1 year ago

Thanks for the explanation. I have already fought a lot with Sphinx's event system during extension development.

The multiprocessing part is a generic implementation, which is used in the read and write phases. So fixing it already addresses two phases and makes them more efficient.

Maybe we can also put some tasks of the other phases into parallel execution. For sure not resolving references, but I'm sure we can find something (e.g. image generation in write_serial).

Also, threading could be used for the IO tasks in the writing phase; for better IO performance we do not need processes.
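As a sketch of that (the writer function is a stand-in, not the builder's real API), IO-bound output writing could go through a thread pool, which shares memory and needs no pickling:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def write_file(path: Path, content: str) -> None:
    path.write_text(content, encoding="utf-8")

def write_all(pages: dict[str, str], outdir: Path, max_threads: int = 8) -> None:
    # pages maps docname -> rendered output; purely illustrative
    outdir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for name, html in pages.items():
            pool.submit(write_file, outdir / f"{name}.html", html)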

The biggest problem I currently have with multiprocessing is that everything needs to be picklable. This works for a doctree object in the reading phase, so that a long-running process can get it via a pipe.

However, during write_serial some class instances (like a reporter) are added to the doctree object, so that it can't be sent to a process anymore; only a completely new process has access to it. Python does not throw any error that these embedded elements can't be pickled; only Sphinx complains in later steps that e.g. reporter is None. If we could keep the doctree object picklable for all possible parallel phases, we would have a lot more possibilities to design a clean multiprocessing setup, maybe without "forking" and any memory copy.
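To illustrate the problem (this is just a sketch using the docutils document.reporter attribute, not the proposed fix): one could check picklability explicitly and detach per-process objects before handing the doctree to a worker.

import pickle

def is_picklable(obj) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

def send_to_worker(doctree, pipe):
    # hypothetical helper: detach the unpicklable reporter, send the doctree,
    # and let the worker attach a fresh reporter on its side
    reporter, doctree.reporter = doctree.reporter, None
    try:
        assert is_picklable(doctree), "doctree still holds unpicklable objects"
        pipe.send(doctree)
    finally:
        doctree.reporter = reporter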

twodrops commented 1 year ago

Thanks @danwos for the detailed issue description.

Current Problem

Yes, multi-processing is a huge issue for us, as it doesn't seem to be implemented with scalability in mind. Here is a related issue https://github.com/swyddfa/esbonio/issues/502, and the root cause is Sphinx multi-processing.

Background

We have more than 65,000 reST files which we process with Sphinx. We also have machines with up to 72 virtual cores, as mentioned in the issue above.

What we have done until now

For now, we have a monkeypatch created by @Rubyfi, which made Sphinx 300% faster. We would like to contribute these fixes as well. @Rubyfi created a first issue for this some time back: https://github.com/sphinx-doc/sphinx/issues/10967

Next Steps

Even these fixes have reached their limit because of some conceptual problems in Sphinx multiprocessing. Together with @danwos we would like to fix and contribute these.

@danwos Another issue we noticed with Sphinx multi-processing is that Sphinx (or the multiprocessing library Sphinx uses) allocates a batch of processes and waits for all of them to be done before starting the next allocation. This means, for example, that if we have 8 processes running on 8 cores and one of them has a large file, the other 7 cores will be idle until this single process is complete. We currently work around this by tuning the batch size so that the large file gets allocated as part of an optimal chunk. This is however highly non-deterministic and can change with changes in file order. Is this also part of your findings?
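To illustrate the difference (placeholder work function, not Sphinx's scheduler): with static chunks one slow file blocks its whole batch, whereas a dynamically scheduled pool hands each free worker the next file immediately.

from multiprocessing import Pool

def build_doc(docname: str) -> str:
    return docname  # stand-in for the real per-document work

if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(1000)]
    with Pool(processes=8) as pool:
        # chunksize=1 approximates dynamic scheduling: no worker sits idle
        # behind a large chunk that happens to contain the slow file
        for result in pool.imap_unordered(build_doc, docs, chunksize=1):
            pass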

Rubyfi commented 1 year ago

Weighing in here as well: Would this be an opportunity to get parallel processing working on NT systems?

If possible, the rework could be done using MPIRE. That way, the implementation would be (as far as I can tell) OS agnostic and still performant. Maybe @mb-emag can share his experience from using it for doxysphinx.

AA-Turner commented 12 months ago

@danwos -- thank you for writing this up.

Multiprocessing is a deficient area in Sphinx, and one I would like to see improve significantly. If you (or anyone else) are able to help improve the parallel capabilities, I would be very grateful. I intend to have a look myself at some point, but unfortunately I can't promise anything.

Thanks, Adam

mb-emag commented 12 months ago

Weighing in here as well: Would this be an opportunity to get parallel processing working on NT systems?

If possible, the rework could be done using MPIRE. That way, the implementation would be (as far as I can tell) OS agnostic and still performant. Maybe @mb-emag can share his experience from using it for doxysphinx.

I use MPIRE in Doxysphinx and for me it works like a drop-in replacement for Python multiprocessing (but I only use it as a kind of "parallel for loop" and there is not much data shared between processes). I'm not sure how that would work with Sphinx's environment, especially if it gets large because of a huge number of files (and maybe needs to be synced between processes?). When I chose MPIRE back then, Ray was also in the selection: also fast, even multi-machine capable, but maybe like using a sledgehammer to crack a nut.
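For reference, the "parallel for loop" style of MPIRE usage described above looks roughly like this (the work function is a placeholder; sharing a large Sphinx environment between workers is the open question):

from mpire import WorkerPool

def process(docname: str) -> str:
    return docname.upper()  # stand-in for the real work

if __name__ == "__main__":
    with WorkerPool(n_jobs=8) as pool:
        results = pool.map(process, [f"doc{i}" for i in range(100)])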

twodrops commented 12 months ago

@AA-Turner Thanks for your feedback. We have projects with more than 70k reST files to process. We internally optimized the Sphinx multi-process read such that 30k files can be read in under 1 minute and the necessary doctrees created. The parallel write, as it is implemented now, however does not scale. We plan to reimplement this through @danwos and make the necessary contributions. It would be great if you could review/test our contributions once they are available.

danieleades commented 6 months ago

are there mechanisms for testing parallel builds using pytest?

import pytest
from sphinx.application import Sphinx

# requires Sphinx's pytest plugin, e.g. pytest_plugins = ["sphinx.testing.fixtures"]
@pytest.mark.sphinx(testroot="my_test_project")
def test_parallel_build(app: Sphinx) -> None:
    app.warningiserror = True
    app.build()  # how to test this works when distributed across multiple processes?
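If I'm not mistaken, the test application accepts a parallel option that the marker forwards, so something like the following might exercise the parallel code paths (untested sketch):

# assumption: `parallel=2` is forwarded to the test app and enables the
# parallel read/write paths; whether warnings from worker processes surface
# correctly would still need to be verified
@pytest.mark.sphinx("html", testroot="my_test_project", parallel=2)
def test_parallel_build_two_processes(app: Sphinx) -> None:
    app.warningiserror = True
    app.build()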