spatialaudio / nbsphinx

Sphinx source parser for Jupyter notebooks
https://nbsphinx.readthedocs.io/
MIT License

Building notebooks in parallel does not work (on Windows) #801

Closed SergejKr closed 1 month ago

SergejKr commented 3 months ago

Hello,

I have noticed that parallel execution of Jupyter notebooks does not work (for me). See below or use the zip (source.zip) for a minimal example of the problem. Running sphinx-build source build -j4 -b html shows no performance improvement over sphinx-build source build -j1 -b html: both take about 38 seconds on my machine. Each notebook waits for 10 seconds, so the notebooks are evidently not executed in parallel. In my main project, .rst files are built in parallel as expected, but the notebooks always slow down the build process.

From the documentation https://nbsphinx.readthedocs.io/en/0.9.3/usage.html#Running-Sphinx I would expect that the parallelization works without much set up. Is this a bug or the expected behaviour?
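
The comparison between -j1 and -j4 could be scripted roughly as follows. This is only a sketch: the helper names sphinx_command and time_build are made up for illustration, and each build goes into a fresh temporary output directory so that cached results from an earlier run do not distort the timing.

```python
import subprocess
import sys
import tempfile
import time


def sphinx_command(source_dir, output_dir, jobs):
    """Return the argument list for an HTML build with the given -j value."""
    return [sys.executable, "-m", "sphinx", source_dir, output_dir,
            f"-j{jobs}", "-b", "html"]


def time_build(source_dir, jobs):
    """Run one build into a fresh temporary directory; return the seconds taken."""
    with tempfile.TemporaryDirectory() as output_dir:
        start = time.perf_counter()
        subprocess.run(sphinx_command(source_dir, output_dir, jobs), check=True)
        return time.perf_counter() - start
```

Calling time_build('source', 1) and time_build('source', 4) from the directory that contains the minimal example should then give two comparable numbers.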

I am using:

Sphinx = "7.3.7"
nbsphinx = "0.9.4"
notebook = "7.2.1"

This is the complete list in the newly set up virtual env after installing the above packages:

alabaster                     0.7.16
anyio                         4.4.0
argon2-cffi                   23.1.0
argon2-cffi-bindings          21.2.0
arrow                         1.3.0
asttokens                     2.4.1
async-lru                     2.0.4
attrs                         23.2.0
Babel                         2.15.0
beautifulsoup4                4.12.3
bleach                        6.1.0
certifi                       2024.6.2
cffi                          1.16.0
charset-normalizer            3.3.2
colorama                      0.4.6
comm                          0.2.2
debugpy                       1.8.1
decorator                     5.1.1
defusedxml                    0.7.1
docutils                      0.21.2
exceptiongroup                1.2.1
executing                     2.0.1
fastjsonschema                2.20.0
fqdn                          1.5.1
h11                           0.14.0
httpcore                      1.0.5
httpx                         0.27.0
idna                          3.7
imagesize                     1.4.1
importlib_metadata            7.1.0
ipykernel                     6.29.4
ipython                       8.18.1
isoduration                   20.11.0
jedi                          0.19.1
Jinja2                        3.1.4
json5                         0.9.25
jsonpointer                   3.0.0
jsonschema                    4.22.0
jsonschema-specifications     2023.12.1
jupyter_client                8.6.2
jupyter_core                  5.7.2
jupyter-events                0.10.0
jupyter-lsp                   2.2.5
jupyter_server                2.14.1
jupyter_server_terminals      0.5.3
jupyterlab                    4.2.2
jupyterlab_pygments           0.3.0
jupyterlab_server             2.27.2
MarkupSafe                    2.1.5
matplotlib-inline             0.1.7
mistune                       3.0.2
nbclient                      0.10.0
nbconvert                     7.16.4
nbformat                      5.10.4
nbsphinx                      0.9.4
nest-asyncio                  1.6.0
notebook                      7.2.1
notebook_shim                 0.2.4
overrides                     7.7.0
packaging                     24.1
pandocfilters                 1.5.1
parso                         0.8.4
pip                           23.2.1
platformdirs                  4.2.2
prometheus_client             0.20.0
prompt_toolkit                3.0.47
psutil                        5.9.8
pure-eval                     0.2.2
pycparser                     2.22
Pygments                      2.18.0
python-dateutil               2.9.0.post0
python-json-logger            2.0.7
pywin32                       306
pywinpty                      2.0.13
PyYAML                        6.0.1
pyzmq                         26.0.3
referencing                   0.35.1
requests                      2.32.3
rfc3339-validator             0.1.4
rfc3986-validator             0.1.1
rpds-py                       0.18.1
Send2Trash                    1.8.3
setuptools                    68.2.0
six                           1.16.0
sniffio                       1.3.1
snowballstemmer               2.2.0
soupsieve                     2.5
Sphinx                        7.3.7
sphinxcontrib-applehelp       1.0.8
sphinxcontrib-devhelp         1.0.6
sphinxcontrib-htmlhelp        2.0.5
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.7
sphinxcontrib-serializinghtml 1.1.10
stack-data                    0.6.3
terminado                     0.18.1
tinycss2                      1.3.0
tomli                         2.0.1
tornado                       6.4.1
traitlets                     5.14.3
types-python-dateutil         2.9.0.20240316
typing_extensions             4.12.2
uri-template                  1.3.0
urllib3                       2.2.2
wcwidth                       0.2.13
webcolors                     24.6.0
webencodings                  0.5.1
websocket-client              1.8.0
wheel                         0.41.2
zipp                          3.19.2

Setup of the minimal example

source/
    conf.py
    index.rst
    test1.ipynb
    test2.ipynb
    test3.ipynb

The conf.py file:

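# Enable nbsphinx so that Sphinx can parse Jupyter notebooks
extensions = [
    'nbsphinx'
]

# Pygments style used for syntax highlighting
pygments_style = 'sphinx'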

The index.rst file:

************************************
Minimal example project
************************************

.. toctree::
    :maxdepth: 2

    test1.ipynb
    test2.ipynb
    test3.ipynb

The test1.ipynb, etc. files have two cells, one markdown, one python cell:

    # %% [markdown]
    # # Title1

    # %%
    import time
    time.sleep(10)

mgeier commented 3 months ago

> I would expect that the parallelization works without much set up. Is this a bug or the expected behaviour?

This is a bug.

I have no idea why it doesn't work, though. I can reproduce the problem with your example, but I also tried a different example (https://github.com/AudioSceneDescriptionFormat/splines) where parallelization worked fine!

That's really strange ...

SergejKr commented 2 months ago

I tried to build the docs of the splines package in a fresh virtual environment. I got an error in the compilation of the documentation at about 70%. I guess the versions of the packages I used do not work with it. During the compilation I did not have the impression that it was running in parallel.

alabaster                     0.7.16
asttokens                     2.4.1
attrs                         23.2.0
Babel                         2.15.0
beautifulsoup4                4.12.3
bleach                        6.1.0
certifi                       2024.7.4
charset-normalizer            3.3.2
colorama                      0.4.6
comm                          0.2.2
contourpy                     1.2.1
cycler                        0.12.1
debugpy                       1.8.2
decorator                     5.1.1
defusedxml                    0.7.1
docutils                      0.21.2
exceptiongroup                1.2.2
executing                     2.0.1
fastjsonschema                2.20.0
fonttools                     4.53.1
idna                          3.7
imagesize                     1.4.1
importlib_metadata            8.2.0
importlib_resources           6.4.0
insipid-sphinx-theme          0.4.2
ipykernel                     6.29.5
ipython                       8.18.1
jedi                          0.19.1
Jinja2                        3.1.4
jsonschema                    4.23.0
jsonschema-specifications     2023.12.1
jupyter_client                8.6.2
jupyter_core                  5.7.2
jupyterlab_pygments           0.3.0
kiwisolver                    1.4.5
latexcodec                    3.0.0
MarkupSafe                    2.1.5
matplotlib                    3.9.1
matplotlib-inline             0.1.7
mistune                       3.0.2
mpmath                        1.3.0
nbclient                      0.10.0
nbconvert                     7.16.4
nbformat                      5.10.4
nbsphinx                      0.9.4
nest-asyncio                  1.6.0
numpy                         2.0.1
packaging                     24.1
pandocfilters                 1.5.1
parso                         0.8.4
pillow                        10.4.0
pip                           24.0
platformdirs                  4.2.2
prompt_toolkit                3.0.47
psutil                        6.0.0
pure_eval                     0.2.3
pybtex                        0.24.0
pybtex-docutils               1.0.3
Pygments                      2.18.0
pyparsing                     3.1.2
python-dateutil               2.9.0.post0
pywin32                       306
PyYAML                        6.0.1
pyzmq                         26.0.3
referencing                   0.35.1
requests                      2.32.3
rpds-py                       0.19.1
scipy                         1.13.1
setuptools                    70.0.0
six                           1.16.0
snowballstemmer               2.2.0
soupsieve                     2.5
Sphinx                        7.4.7
sphinx-codeautolink           0.15.2
sphinx-last-updated-by-git    0.3.7
sphinxcontrib-applehelp       1.0.8
sphinxcontrib-bibtex          2.6.2
sphinxcontrib-devhelp         1.0.6
sphinxcontrib-htmlhelp        2.0.6
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.8
sphinxcontrib-serializinghtml 1.1.10
splines                       0.3.2
stack-data                    0.6.3
sympy                         1.13.1
tinycss2                      1.3.0
tomli                         2.0.1
tornado                       6.4.1
traitlets                     5.14.3
typing_extensions             4.12.2
urllib3                       2.2.2
wcwidth                       0.2.13
webencodings                  0.5.1
wheel                         0.43.0
zipp                          3.19.2

mgeier commented 1 month ago

> I got an error in the compilation of the documentation at about 70%.

This should be enough to see whether it is reading (and executing) the notebooks in parallel.

When I run it with -j4 I can clearly see that 4 cores are maxing out.

There is also a change in the terminal output:

$ python -m sphinx doc _build
[...]
reading sources... [  4%] euclidean/bezier-de-casteljau
$ python -m sphinx doc _build -j4
[...]
reading sources... [ 20%] euclidean/bezier .. euclidean/end-conditions-natural

Note that when reading in parallel, a range of notebooks is shown instead of a single one.

If you want to try a project with fewer dependencies, you can try this: https://github.com/AudioSceneDescriptionFormat/asdf

$ python -m sphinx doc _build -j4
[...]
reading sources... [ 80%] seq-par .. splines
Sphinx parallel build error:
nbsphinx.NotebookError: CellExecutionError in quaternions.ipynb:
------------------
Quaternion.rotate_point((0, 1, 0), q_z.subs(alpha, sp.pi / 2))
[...]

Conveniently, this currently raises an error, which even mentions that it is doing a parallel build!

SergejKr commented 1 month ago

Hi, I tried it again.

  1. I do not see the change in the terminal output, so something does not work as intended.
  2. For asdf, I do not see any improvement in the runtime. The notebook execution error you observed does not occur for me.

Could you please send me your package versions, so that I can try it with these. Maybe there is some difference in the OS. Do you use Windows or Linux?

mgeier commented 1 month ago

I also tried it again and I found out why your minimal example didn't work in parallel for me: apparently Sphinx only does parallel processing if there are more than 5 source files: https://github.com/sphinx-doc/sphinx/blob/d56cf30ecb2d68651c75b454f0aeae74304285dd/sphinx/builders/__init__.py#L431.

I have reported this surprising behavior in https://github.com/sphinx-doc/sphinx/pull/12796.

I guess your example still doesn't run in parallel when you add two more notebooks?

For me, it took 17 seconds.
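
To try the example with more than five source files, the extra notebooks could be generated by a script along these lines. This is a sketch only: make_notebook and write_notebooks are hypothetical helper names, the notebook is written as plain nbformat 4 JSON without a kernelspec (assuming the default Python kernel gets picked up), and the new files still have to be added to the toctree in index.rst by hand.

```python
import json
from pathlib import Path


def make_notebook(title, seconds=10):
    """Build a minimal nbformat 4 notebook: one title cell and one sleep cell."""
    return {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": [f"# {title}"]},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": ["import time\n", f"time.sleep({seconds})"],
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def write_notebooks(directory, count):
    """Write test1.ipynb ... test<count>.ipynb and return their paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(1, count + 1):
        path = directory / f"test{i}.ipynb"
        text = json.dumps(make_notebook(f"Title{i}"), indent=1)
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, write_notebooks('source', 5) would create test1.ipynb to test5.ipynb, which together with index.rst gives six source files.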

I would like to check if this really is related to nbsphinx ... did you try your example without nbsphinx?

I tried it with this example setup:

conf.py:

import time

def source_read(app, docname, content):
    time.sleep(10)

def setup(app):
    app.connect('source-read', source_read)

index.rst:

Test
====

.. toctree::

    test1
    test2
    test3
    test4
    test5

test1.rst to test5.rst:

Test Page
=========

When running this, I get:

$ time python -m sphinx . _build -j6
Sphinx v8.1.0+/f1078bdfa [...]
[...]
real    0m10,993s
user    0m1,660s
sys     0m0,192s

Does that work for you?
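
For anyone who wants to reproduce the setup above without creating the files by hand, a script like the following could write them. It is a sketch: write_project is a made-up helper name, and the file contents are copied from the conf.py, index.rst and test page shown in this comment.

```python
from pathlib import Path

# conf.py from the comment above: every source read sleeps for 10 seconds.
CONF_PY = '''\
import time

def source_read(app, docname, content):
    time.sleep(10)

def setup(app):
    app.connect('source-read', source_read)
'''


def write_project(directory, pages=5):
    """Write conf.py, index.rst and test1.rst ... test<pages>.rst."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "conf.py").write_text(CONF_PY, encoding="utf-8")
    # One toctree entry per test page, indented like in the original index.rst.
    entries = "".join(f"    test{i}\n" for i in range(1, pages + 1))
    (directory / "index.rst").write_text(
        "Test\n====\n\n.. toctree::\n\n" + entries, encoding="utf-8")
    for i in range(1, pages + 1):
        (directory / f"test{i}.rst").write_text(
            "Test Page\n=========\n", encoding="utf-8")
    return sorted(p.name for p in directory.iterdir())
```

With the default of five pages this yields six source files including index.rst, which is above the threshold of five mentioned earlier.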

mgeier commented 1 month ago

> Could you please send me your package versions, so that I can try it with these.

I was using the latest Git versions of Sphinx and SymPy, I guess.

> Do you use Windows or Linux?

Linux

SergejKr commented 1 month ago

Hi, I tested your example without nbsphinx, and in fact no multiprocessing is active. I looked through GitHub and found an old issue, https://github.com/sphinx-doc/sphinx/issues/8296, stating that Sphinx does not run in parallel on Windows. Further searching revealed that parallel execution in Sphinx still only works on systems that allow "fork". This can be checked in the source code of Sphinx under "sphinx/sphinx/util/parallel.py" (https://github.com/sphinx-doc/sphinx/tree/v8.0.2/sphinx/util):

# our parallel functionality only works for the forking Process
parallel_available = multiprocessing and os.name == 'posix'
precv, psend = multiprocessing.Pipe(False)
context: Any = multiprocessing.get_context('fork')

I already feared that this would be OS-dependent. Interestingly, there is no remark in the documentation of Sphinx for that. This issue can be closed because it is not related to nbsphinx. Thanks for your time.
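
The two conditions that came up in this thread (a platform that supports "fork", and more than five source files) can be summarized in a rough check like the one below. This is a sketch and not part of Sphinx: fork_available and parallel_read_expected are hypothetical helpers that only approximate the behavior described here, and the threshold of five is taken from the comment above rather than from any documented setting.

```python
import multiprocessing
import os


def fork_available():
    """True on POSIX systems whose multiprocessing offers the 'fork' start method."""
    return os.name == "posix" and "fork" in multiprocessing.get_all_start_methods()


def parallel_read_expected(num_source_files, jobs, can_fork=None):
    """Estimate whether a -j<jobs> build would read sources in parallel.

    Assumptions from this thread: forking must be possible, there must be
    more than 5 source files, and more than one job must be requested.
    """
    if can_fork is None:
        can_fork = fork_available()
    return can_fork and num_source_files > 5 and jobs > 1
```

For the original minimal example (index.rst plus three notebooks, so four source files) this returns False even on Linux, and on Windows it returns False regardless of the number of files.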

mgeier commented 1 month ago

Thanks for tracking this down!

> Interestingly, there is no remark in the documentation of Sphinx for that.

Would you like to create a PR at https://github.com/sphinx-doc/sphinx/pulls for this? I think this would be helpful.