python / cpython

The Python programming language
https://www.python.org

Doc tests action takes too long #118891

Closed Privat33r-dev closed 2 weeks ago

Privat33r-dev commented 5 months ago

Issue

Usually the doctest action's "Run documentation doctest"^1 job takes around 12 minutes^2 to execute. Most of this time is spent in the internal framework doctest.py^3. The cause of the issue is a GC regression in Python 3.13.

Potential Solution

Fix #124567

Additional Improvements

Improve doctest performance

Since the tests are not interconnected (across different doc files), they can easily be parallelized. I suggest using ProcessPoolExecutor, which sidesteps the GIL (assuming the tasks are CPU-bound). GitHub runners have 4 cores available^4, so this could potentially speed up execution by up to 4x.
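
A minimal sketch of the idea (not the actual CI code; the file set, worker count, and use of doctest.testfile are illustrative assumptions):

```py
# Rough sketch: run each doc file's doctests in its own process, since the
# files don't share state and the work is CPU-bound.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import doctest


def run_file(path: str) -> tuple[str, int, int]:
    # doctest.testfile returns TestResults(failed, attempted)
    results = doctest.testfile(path, module_relative=False)
    return path, results.failed, results.attempted


if __name__ == "__main__":
    files = [str(p) for p in Path("Doc").rglob("*.rst")]  # illustrative file set
    with ProcessPoolExecutor(max_workers=4) as pool:  # GitHub runners expose 4 cores
        for path, failed, attempted in pool.map(run_file, files):
            print(f"{path}: {failed} failed / {attempted} attempted")
```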

@AlexWaygood @erlend-aasland

erlend-aasland commented 5 months ago

Sounds good; do you have a PR ready? :)

Privat33r-dev commented 5 months ago

I was waiting for feedback because I might've missed something :) My schedule is a bit tight lately, but I will try to find time to implement it (hopefully in a week or two).

Privat33r-dev commented 4 months ago

Sorry, it's going to be delayed. I'm currently swamped with work, but hopefully I can start on the PR soon.

hugovk commented 4 months ago

No rush, thanks for the update :)

willingc commented 2 months ago

@Privat33r-dev Do you still wish to work on a PR for this? If you are busy, would you mind someone else picking up the work?

Privat33r-dev commented 2 months ago

I will try to do it within a week; if I can't find time for it, then it would be good for someone else to take over. Sorry for the unexpected delay.

willingc commented 2 months ago

No worries about the delay @Privat33r-dev. Seems like it would be a nice addition since the doctests are slow 😉

Privat33r-dev commented 2 months ago

After further investigation, I understand the following. The command that runs the doctests and takes around 13-16 minutes is `xvfb-run make -C Doc/ PYTHON=../python SPHINXERRORHANDLING="-W --keep-going" doctest`. `xvfb-run` is needed to simulate a graphical environment (my guess is that it's needed to test turtle and a few other modules), and `make -C Doc/ doctest` launches Doc/Makefile with the doctest builder. The other flags are less relevant; in short, they point to the Python installation and tell Sphinx how to handle errors.

Inside the Makefile, sphinx-build is executed with the doctest builder.

So, basically, the Sphinx wrapper for the doctests is used (Source).

The parallelization should likely be applied there. I will continue, and I might update this comment or add a new one with further findings.


However, what's actually interesting is that the doctest speed depends on the Python version: the Python 3.13 tests run about 3 times slower than 3.12. Example #1 (Python 3.13): a 14m 57s test that ran yesterday. Example #2 (Python 3.12): a 5m 18s test that also ran yesterday.

Search request: https://github.com/python/cpython/pulls?q=is%3Apr+label%3Adocs+%5B3.12%5D+


Also, caching does not work efficiently, which might contribute to the problem, but that's a different topic.

Privat33r-dev commented 2 months ago

I would be glad if someone could join in resolving this issue, since I am, unfortunately, still occupied with other work. I will try to continue this weekend.

General conclusion from earlier findings:

  1. The multiprocessing change needs to be made in Sphinx rather than in Python's standard library. However, given the speed degradation compared to the earlier Python version, I think it would be highly beneficial to investigate that first, instead of just masking the issue by applying multiprocessing.
  2. There is a major speed regression in the Python 3.13 doctest run on the GitHub runner compared to Python 3.12. I have a few theories (all require further testing to be confirmed or rejected):
    1. Due to some changes, 3.13 is slower at certain operations. I just hope it's not PEP 703-related. How to check: profile doctest.py performance tests on both versions.
    2. There were changes in the doctest library itself that cause the slowdown. How to check: diff doctest.py between the two versions.

I can't think of another reason offhand, but it might also be caused by some Python API changes, build flags, or something else. However, if we can fix this and return to near-3.12 speed, that would already be a major win. Applying multiprocessing on top of that would make things even faster.

The Plan

  1. Investigate the performance issue
  2. Apply multiprocessing around https://github.com/sphinx-doc/sphinx/blob/master/sphinx/ext/doctest.py#L363C1-L368C1

hugovk commented 3 weeks ago

@AA-Turner Do you have any ideas?

The doctest job on CPython CI is much slower on 3.13 and 3.14 (main) than on 3.12. It's not doctest specifically: the main Sphinx build is already slow before the doctests even start.

For example:

With the example builds, when the full 3.12 Sphinx build + doctest had finished, 3.13 was only at `reading sources... [ 73%] library/stdtypes .. library/trace`.

AA-Turner commented 3 weeks ago

Didn't we previously reason that this is because pre-releases don't use an optimised build of Python? What happens if we run CI with the released rc2?

A

hugovk commented 3 weeks ago

For both of these, we use non-optimised Python built from the branch -- because we're testing the branch's Python.

https://github.com/python/cpython/blob/f4997bb3ac961d6aaf07ce650cd074e28ce6ccd0/.github/workflows/reusable-docs.yml#L95_L118

willingc commented 3 weeks ago

Hmm... I'm wondering if we could use pytest to run these doctests instead of the Sphinx build plus doctest. We could then parallelize the doctests if needed. @hugovk @AA-Turner

AA-Turner commented 3 weeks ago

For both of these, we use non-optimised Python built from the branch -- because we're testing the branch's Python.

Right, I was just wondering if the slowdown was solely due to this (which we have no real choice over) or if other factors in Python 3.13+ have caused Sphinx to slow down (which we might be able to resolve).

I'm wondering if we could use pytest to run these doctests instead of the Sphinx build plus doctest.

I don't think pytest supports the test{setup,cleanup,code} directives, which we make heavy use of.

A

Privat33r-dev commented 3 weeks ago

@AA-Turner

Maybe we can parallelize the tests in Sphinx itself? If you think that will work, I can find time this week to implement it.

https://github.com/sphinx-doc/sphinx/blob/master/sphinx/ext/doctest.py#L363C1-L368C1 -> multiprocessing

AlexWaygood commented 3 weeks ago

I wrote the following script, with which I can reproduce locally a significant performance regression over the course of Python 3.13 development when using Sphinx to build a single documentation file:

Script:

```py
import contextlib
import shutil
import subprocess
import time
import venv
from pathlib import Path


def run(args):
    try:
        subprocess.run(args, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(e.stdout)
        print(e.stderr)
        raise


with contextlib.chdir("Doc"):
    try:
        for path in Path(".").iterdir():
            if path.is_dir() and not str(path).startswith("."):
                for doc_path in path.rglob("*.rst"):
                    if doc_path != Path("library/typing.rst"):
                        doc_path.write_text("foo")
        venv.create(".venv", with_pip=True)
        run([
            ".venv/bin/python",
            "-m",
            "pip",
            "install",
            "-r",
            "requirements.txt",
            "--no-binary=':all:'",
        ])
        start = time.perf_counter()
        run([
            ".venv/bin/python",
            "-m",
            "sphinx",
            "-b",
            "html",
            ".",
            "build/html",
            "library/typing.rst",
        ])
        print(time.perf_counter() - start)
        shutil.rmtree(".venv")
        shutil.rmtree("build")
    finally:
        subprocess.run(["git", "restore", "."], check=True, capture_output=True)
```

If I save the script as `bisect_sphinx.py` in the root of my CPython clone and then run `./configure --with-pydebug && make -j10 && ./python.exe bisect_sphinx.py`, the script reports that building library/typing.rst takes roughly 8 seconds on main, but only around 3 seconds with the v3.13.0a1 tag checked out.

I'll now try to use this script to see if I can bisect the performance regression to a specific commit.

AA-Turner commented 3 weeks ago

Thanks Alex. You prompted a memory of https://github.com/sphinx-doc/sphinx/issues/12174, which might be relevant.

A

AlexWaygood commented 3 weeks ago

Thanks Alex. You prompted a memory of https://github.com/sphinx-doc/sphinx/issues/12174, which might be relevant.

Indeed! The results of the bisection are in:

15309329b65a285cb7b3071f0f08ac964b61411b is the first bad commit
commit 15309329b65a285cb7b3071f0f08ac964b61411b
Author: Mark Shannon <mark@hotpy.org>
Date:   Wed Mar 20 08:54:42 2024 +0000

    GH-108362: Incremental Cycle GC (GH-116206)

 Doc/whatsnew/3.13.rst                              |  30 +
 Include/internal/pycore_gc.h                       |  41 +-
 Include/internal/pycore_object.h                   |  18 +-
 Include/internal/pycore_runtime_init.h             |   8 +-
 Lib/test/test_gc.py                                |  72 +-
 .../2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst |  12 +
 Modules/gcmodule.c                                 |  25 +-
 Objects/object.c                                   |  21 +
 Objects/structseq.c                                |   5 +-
 Python/gc.c                                        | 806 +++++++++++++--------
 Python/gc_free_threading.c                         |  23 +-
 Python/import.c                                    |   2 +-
 Python/optimizer.c                                 |   2 +-
 Tools/gdb/libpython.py                             |   7 +-
 14 files changed, 684 insertions(+), 388 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst

I did another bisect to find out what caused the performance improvement between 15309329b65a285cb7b3071f0f08ac964b61411b and 909c6f718913e713c990d69e6d8a74c05f81e2c2, and the result was e28477f214276db941e715eebc8cdfb96c1207d9 (git bisect reports it as the "first bad commit", but in this case it's actually the first good commit):

e28477f214276db941e715eebc8cdfb96c1207d9 is the first bad commit
commit e28477f214276db941e715eebc8cdfb96c1207d9
Author: Mark Shannon <mark@hotpy.org>
Date:   Fri Mar 22 18:43:25 2024 +0000

    GH-117108: Change the size of the GC increment to about 1% of the total heap size. (GH-117120)

 Include/internal/pycore_gc.h                       |  3 +-
 Lib/test/test_gc.py                                | 35 +++++++++++++++-------
 .../2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst |  3 ++
 Modules/gcmodule.c                                 |  2 +-
 Python/gc.c                                        | 30 +++++++++----------
 Python/gc_free_threading.c                         |  2 +-
 6 files changed, 47 insertions(+), 28 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst

So, we now know which CPython commits are the root cause of our Doctest CI job taking so much longer on the 3.13 branch than on the 3.12 branch! I'm now going to see if the same performance regression can be reproduced with a non-debug build. (So far I've only been testing with a debug build, as this is what we use for our Doctest CI job, so it's what was relevant for understanding why that job might be taking so much longer on 3.13.)

nascheme commented 3 weeks ago

Here is what linux-perf shows: 47% of the time is spent in the cyclic GC. I don't see any other obvious hot spots to optimize.

[linux-perf profile screenshot]

I did a second run with the GC completely disabled and without --with-pydebug. Again no obvious hot spots.

[linux-perf profile screenshot]

AlexWaygood commented 3 weeks ago

The good news is that the bisection script is hugely faster on a PGO-optimized, non-debug build. The bad news is that it's still a fair amount slower than on 3.13:

That's around a 48% performance regression on my bisection script with a PGO-optimized build, which is still quite bad!

nascheme commented 3 weeks ago

I tried running the tests with Python 3.12 and with the gc.DEBUG_STATS flag turned on. The output suggests that this workload creates a lot of cyclic garbage and that the gen 0 (youngest) collection of the cyclic GC is dealing with it. The "unreachable" number counts the cyclic garbage objects found by the GC.

I don't know enough about the "incremental" collector to comment on how it might be better optimized for this workload. I do suspect that there are many more similar workloads out there, and the GC that ships with the 3.13 release will hopefully not be much slower for them.


gc: collecting generation 0...
gc: objects in each generation: 1175 3804 1663292
gc: objects in permanent generation: 375
gc: done, 520 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 976 4408 1663292
gc: objects in permanent generation: 375
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 982 5336 1663289
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 959 5866 1663289
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 948 6397 1663288
gc: objects in permanent generation: 375
gc: done, 116 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 813 7178 1663288
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 1...
gc: objects in each generation: 951 7537 1663273
gc: objects in permanent generation: 375
gc: done, 5219 unreachable, 0 uncollectable, 0.0010s elapsed
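
For anyone wanting to reproduce this kind of output locally, here is a minimal sketch of enabling the same debug flag on a toy, cycle-heavy workload (the toy workload is an illustration, not the Sphinx build itself):

```py
# Minimal sketch: turn on the same GC debug statistics and generate lots of
# small reference cycles, which only the cyclic GC can reclaim.
import gc

gc.set_debug(gc.DEBUG_STATS)  # prints "gc: ..." lines to stderr on each collection


class Node:
    def __init__(self):
        self.partner = None


def make_cycles(n):
    for _ in range(n):
        a, b = Node(), Node()
        a.partner, b.partner = b, a  # two-object reference cycle


make_cycles(100_000)
gc.collect()  # force a final full collection
```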

AlexWaygood commented 3 weeks ago

For colour, I created a variant of my script that measures the execution time of parsing and building all of CPython's docs with a PGO-optimized build. (The script I used for bisecting the perf regression only parsed and built library/typing.rst.) Here are the results on v3.13.0a1 and https://github.com/python/cpython/commit/ede1504c4b37b98636bf935679746ba2932f5677:

That's a 38% regression; pretty similar to what we saw with the smaller sample (and again, very bad!).

Variant of my script that parses and builds all of CPython's docs:

```py
import contextlib
import shutil
import subprocess
import time
import venv
from pathlib import Path


def run(args):
    subprocess.run(args, check=True, text=True)


with contextlib.chdir("Doc"):
    venv.create(".venv", with_pip=True)
    run([
        ".venv/bin/python",
        "-m",
        "pip",
        "install",
        "-r",
        "requirements.txt",
        "--no-binary=':all:'",
    ])
    start = time.perf_counter()
    run([
        ".venv/bin/python",
        "-m",
        "sphinx",
        "-b",
        "html",
        ".",
        "build/html",
    ])
    print(time.perf_counter() - start)
    shutil.rmtree(".venv")
    shutil.rmtree("build")
```

AlexWaygood commented 3 weeks ago

I created a new issue summarising our findings here. It seems clear that the new incremental GC causes significant performance regressions for workloads like Sphinx that create lots of reference cycles; that seems like something that's worth investigating more.

Privat33r-dev commented 3 weeks ago

@AlexWaygood amazing findings, thanks for your investigation :) I felt that the issue was likely with Python itself (since the GH Actions setup and many other things are the same), but I couldn't do a proper investigation due to time constraints.

Thanks to everyone involved; I am glad you found the root cause.

Privat33r-dev commented 2 weeks ago

The issue was mostly resolved by #124770.

However, there is still room for improvement, mostly on the Sphinx side. I don't see any obvious improvements to be made in this repository, so the issue is resolved.

I am glad the performance regression was prevented in 3.13 as a result of teamwork. Thanks to all the participants.

AlexWaygood commented 2 weeks ago

@Privat33r-dev, thanks so much for opening the issue and for spotting how much slower the job was on 3.13 than 3.12. You helped prevent a new version of Python being released with a serious performance regression that could have affected lots of workloads (not just Sphinx!) — it's great that we identified it before the release went out! I'm only sorry it took us a while to investigate.