Allow setting maxbatch to adjust chunk size for parallel read/write

Rubyfi commented 1 year ago

Is your feature request related to a problem? Please describe. I'm currently working on a project with a large documentation (~22k files). We noticed that reading in parallel is particularly slow:

sphinx-build init        0:00:47.712593
sphinx-build read        0:09:39.427063
sphinx-build write       0:03:55.736320
sphinx-build copy, dump  0:04:10.119199

I managed to track this issue down to the calculation of the chunk size for parallel processing: https://github.com/sphinx-doc/sphinx/blob/cc314f13e8a98393ab018d83d8957a724a6f338a/sphinx/util/parallel.py#L137-L150 When setting maxbatch from 10 to 1000 the read performance improves significantly:

sphinx-build init        0:00:45.914173
sphinx-build read        0:00:50.814347
sphinx-build write       0:03:36.089072
sphinx-build copy, dump  0:04:06.186373

Describe the solution you'd like It would be ideal if Sphinx provided a means to set this value manually, e.g. as an argument for sphinx-build.

marxin commented 1 year ago

Oh, what a coincidence. I likely face the very same issue for newly converted GCC documentation. On a machine with 160 cores I get the following cProfile (after ~5 minutes):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    497/1    0.001    0.000  311.019  311.019 {built-in method builtins.exec}
        1    0.000    0.000  311.019  311.019 sphinx-build:1(<module>)
        1    0.000    0.000  310.746  310.746 build.py:306(main)
        1    0.004    0.004  310.745  310.745 build.py:268(build_main)
        1    0.000    0.000  310.399  310.399 application.py:339(build)
        1    0.000    0.000  310.399  310.399 __init__.py:301(build_update)
        1    0.000    0.000  310.398  310.398 __init__.py:314(build)
        1    0.000    0.000  310.395  310.395 __init__.py:384(read)
        1    0.009    0.009  310.386  310.386 __init__.py:456(_read_parallel)
      324    0.044    0.000  310.023    0.957 parallel.py:120(_join_one)
      293    0.008    0.000  307.161    1.048 __init__.py:476(merge)
      293    0.050    0.000  273.810    0.935 __init__.py:351(merge_info_from)
      293    0.009    0.000  273.521    0.934 cpp.py:7917(merge_domaindata)
101038/293    1.031    0.000  273.512    0.933 cpp.py:4807(merge_with)
   618010   59.922    0.000  261.630    0.000 cpp.py:4412(_find_named_symbols)
131399834   61.641    0.000  184.644    0.000 cpp.py:4440(matches)
        1    0.005    0.005  167.875  167.875 parallel.py:102(join)
      311    0.239    0.001  142.469    0.458 parallel.py:88(add_task)
131399833   96.795    0.000  123.003    0.000 cfamily.py:84(__eq__)
      586   33.457    0.057   33.643    0.057 {built-in method _pickle.loads}
131500914   16.998    0.000   16.998    0.000 cpp.py:4462(candidates)
131408156/131408152   14.050    0.000   14.050    0.000 {built-in method builtins.getattr}
131412043   12.171    0.000   12.171    0.000 {method 'items' of 'dict' objects}
   201490    0.156    0.000   10.658    0.000 cpp.py:4050(get_newest_id)

marxin commented 1 year ago

Ok, so my issue is a different and reported that in a separate issue.

donaldh commented 7 months ago

I have just been experimenting with different batch sizes when building the Linux kernel docs. My results suggest that the best performance is achieved by using a minimum batch size of 200 for reads because batches smaller than that have too high a merge overhead back into the main process. I also experimented with a minimum threshold of 500 before even splitting into batches, i.e. if there are less than 500 changed docs then just process them serially.

With the existing make_chunks behaviour, a small number of changed docs gives worst case behaviour of 1 doc per chunk. Merging single docs back into a ~3.5k main process destroys any benefit from the parallel processing. E.g. running make htmldocs SPHINXOPTS=-j12

Running Sphinx v7.2.6
[...]
building [html]: targets for 3445 source files that are out of date
updating environment: [new config] 3445 added, 0 changed, 0 removed
[...]
real    7m46.198s
user    14m18.597s
sys 0m54.925s

for a full build of 3445 files vs an incremental build of just 114 files:

Running Sphinx v7.2.6
[...]
building [html]: targets for 114 source files that are out of date
updating environment: 0 added, 114 changed, 0 removed
real    5m50.746s
user    6m33.199s
sys 0m13.034s

When I run the incremental build serially with make htmldocs SPHINXOPTS=-j1 then it is much faster:

building [html]: targets for 114 source files that are out of date
updating environment: 0 added, 114 changed, 0 removed
real    1m5.034s
user    1m3.183s
sys 0m1.616s

sphinx-doc / sphinx

Allow setting maxbatch to adjust chunk size for parallel read/write #10967