Rubyfi opened this issue 1 year ago
Oh, what a coincidence. I likely face the very same issue with newly converted GCC documentation. On a machine with 160 cores I get the following cProfile output (after ~5 minutes):
ncalls tottime percall cumtime percall filename:lineno(function)
497/1 0.001 0.000 311.019 311.019 {built-in method builtins.exec}
1 0.000 0.000 311.019 311.019 sphinx-build:1(<module>)
1 0.000 0.000 310.746 310.746 build.py:306(main)
1 0.004 0.004 310.745 310.745 build.py:268(build_main)
1 0.000 0.000 310.399 310.399 application.py:339(build)
1 0.000 0.000 310.399 310.399 __init__.py:301(build_update)
1 0.000 0.000 310.398 310.398 __init__.py:314(build)
1 0.000 0.000 310.395 310.395 __init__.py:384(read)
1 0.009 0.009 310.386 310.386 __init__.py:456(_read_parallel)
324 0.044 0.000 310.023 0.957 parallel.py:120(_join_one)
293 0.008 0.000 307.161 1.048 __init__.py:476(merge)
293 0.050 0.000 273.810 0.935 __init__.py:351(merge_info_from)
293 0.009 0.000 273.521 0.934 cpp.py:7917(merge_domaindata)
101038/293 1.031 0.000 273.512 0.933 cpp.py:4807(merge_with)
618010 59.922 0.000 261.630 0.000 cpp.py:4412(_find_named_symbols)
131399834 61.641 0.000 184.644 0.000 cpp.py:4440(matches)
1 0.005 0.005 167.875 167.875 parallel.py:102(join)
311 0.239 0.001 142.469 0.458 parallel.py:88(add_task)
131399833 96.795 0.000 123.003 0.000 cfamily.py:84(__eq__)
586 33.457 0.057 33.643 0.057 {built-in method _pickle.loads}
131500914 16.998 0.000 16.998 0.000 cpp.py:4462(candidates)
131408156/131408152 14.050 0.000 14.050 0.000 {built-in method builtins.getattr}
131412043 12.171 0.000 12.171 0.000 {method 'items' of 'dict' objects}
201490 0.156 0.000 10.658 0.000 cpp.py:4050(get_newest_id)
OK, so my issue is a different one; I have reported it in a separate issue.
I have just been experimenting with different batch sizes when building the Linux kernel docs. My results suggest that the best performance comes from a minimum batch size of 200 for reads, because smaller batches carry too much overhead when merging back into the main process. I also experimented with a minimum threshold of 500 before splitting into batches at all, i.e. if there are fewer than 500 changed docs, just process them serially.
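A minimal sketch of that strategy, written as a drop-in replacement for make_chunks. The constants and the function name are just my experiment values; nothing like this exists in Sphinx today:

```python
# Sketch only: experimental chunking with a minimum batch size and a serial
# threshold. The constants are the values that worked best for the kernel
# docs in my tests; they are not tuned for other projects.
MIN_CHUNK_SIZE = 200     # merging smaller batches costs more than it saves
SERIAL_THRESHOLD = 500   # below this, parallelism is not worth the overhead

def make_chunks_experimental(arguments, nproc):
    nargs = len(arguments)
    if nargs < SERIAL_THRESHOLD:
        # A single chunk effectively means serial processing: no fork,
        # pickle, or merge overhead at all.
        return [list(arguments)]
    chunksize = max(nargs // nproc, MIN_CHUNK_SIZE)
    return [list(arguments[i:i + chunksize])
            for i in range(0, nargs, chunksize)]
```

With the 114 changed docs from the incremental build below, this falls back to one serial chunk, which matches the -j1 timing.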
With the existing make_chunks behaviour, a small number of changed docs gives the worst-case behaviour of 1 doc per chunk. Merging single docs back into a main process that already holds a ~3.5k-doc environment destroys any benefit from the parallel processing. E.g. running make htmldocs SPHINXOPTS=-j12:
Running Sphinx v7.2.6
[...]
building [html]: targets for 3445 source files that are out of date
updating environment: [new config] 3445 added, 0 changed, 0 removed
[...]
real 7m46.198s
user 14m18.597s
sys 0m54.925s
for a full build of 3445 files vs an incremental build of just 114 files:
Running Sphinx v7.2.6
[...]
building [html]: targets for 114 source files that are out of date
updating environment: 0 added, 114 changed, 0 removed
real 5m50.746s
user 6m33.199s
sys 0m13.034s
When I run the incremental build serially with make htmldocs SPHINXOPTS=-j1, it is much faster:
building [html]: targets for 114 source files that are out of date
updating environment: 0 added, 114 changed, 0 removed
real 1m5.034s
user 1m3.183s
sys 0m1.616s
Is your feature request related to a problem? Please describe.
I'm currently working on a project with a large documentation set (~22k files). We noticed that reading in parallel is particularly slow.
I managed to track this down to the calculation of the chunk size for parallel processing: https://github.com/sphinx-doc/sphinx/blob/cc314f13e8a98393ab018d83d8957a724a6f338a/sphinx/util/parallel.py#L137-L150. When raising maxbatch from 10 to 1000, the read performance improves significantly.

Describe the solution you'd like
It would be ideal if Sphinx provided a means to set this value manually, e.g. as an argument for sphinx-build.
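For context, the linked make_chunks computes the batch size roughly like this (a paraphrase of the permalink above, so check the real source for the exact code):

```python
from math import sqrt

def make_chunks(arguments, nproc, maxbatch=10):
    # Rough paraphrase of sphinx/util/parallel.py; the hard-coded
    # maxbatch=10 default is the value in question.
    nargs = len(arguments)
    chunksize = nargs // nproc
    if chunksize >= maxbatch:
        # trade batch size off against the number of batches
        chunksize = int(sqrt(nargs / nproc * maxbatch))
    if chunksize == 0:
        chunksize = 1
    nchunks, rest = divmod(nargs, chunksize)
    if rest:
        nchunks += 1
    return [arguments[i * chunksize:(i + 1) * chunksize]
            for i in range(nchunks)]
```

As illustrative arithmetic: 22k files at, say, -j160 gives int(sqrt(22000 / 160 * 10)) ≈ 37 docs per chunk, i.e. roughly 600 merges back into the main process, while maxbatch=1000 raises that to ≈ 370 docs per chunk and cuts the number of merges to roughly 60.

Until such an option exists, a conf.py monkeypatch can serve as a stopgap. This is only a sketch, under the assumption that the reading code imports make_chunks by name from sphinx.util.parallel; it is not an official Sphinx API:

```python
# conf.py -- workaround sketch: force a larger maxbatch default by wrapping
# sphinx.util.parallel.make_chunks. Because callers may already have
# imported the function by name, the copy in sphinx.builders is patched too.
import functools

import sphinx.builders
import sphinx.util.parallel

_orig_make_chunks = sphinx.util.parallel.make_chunks

@functools.wraps(_orig_make_chunks)
def _make_chunks(arguments, nproc, maxbatch=1000):
    return _orig_make_chunks(arguments, nproc, maxbatch)

sphinx.util.parallel.make_chunks = _make_chunks
if getattr(sphinx.builders, 'make_chunks', None) is _orig_make_chunks:
    sphinx.builders.make_chunks = _make_chunks
```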