python / cpython

The Python programming language
https://www.python.org

Incremental GC causes a significant slowdown for Sphinx #124567

Open AlexWaygood opened 1 month ago

AlexWaygood commented 1 month ago

Bug report

A significant performance regression in Sphinx caused by changes in CPython 3.13

Here is a script that does the following things:

  1. Replaces the contents of all CPython documentation files except Doc/library/typing.rst with simply "foo"
  2. Creates a virtual environment
  3. Installs our doc dependencies into the environment (making sure that we use pure-Python versions for all doc dependencies rather than built wheels that might include C extensions)
  4. Times how long it takes to build the docs using that environment
  5. Restores all the modified docs files and deletes the virtual environment again
The script:

```py
import contextlib
import shutil
import subprocess
import time
import venv
from pathlib import Path

def run(args):
    try:
        subprocess.run(args, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(e.stdout)
        print(e.stderr)
        raise

with contextlib.chdir("Doc"):
    try:
        for path in Path(".").iterdir():
            if path.is_dir() and not str(path).startswith("."):
                for doc_path in path.rglob("*.rst"):
                    if doc_path != Path("library/typing.rst"):
                        doc_path.write_text("foo")
        venv.create(".venv", with_pip=True)
        run([
            ".venv/bin/python", "-m", "pip", "install",
            "-r", "requirements.txt", "--no-binary=':all:'",
        ])
        start = time.perf_counter()
        run([
            ".venv/bin/python", "-m", "sphinx",
            "-b", "html", ".", "build/html", "library/typing.rst",
        ])
        print(time.perf_counter() - start)
        shutil.rmtree(".venv")
        shutil.rmtree("build")
    finally:
        subprocess.run(["git", "restore", "."], check=True, capture_output=True)
```

Using a PGO-optimized build with LTO enabled, the script reports that there is a significant performance regression in Sphinx's parsing and building of library/typing.rst between v3.13.0a1 and https://github.com/python/cpython/commit/909c6f718913e713c990d69e6d8a74c05f81e2c2:

A similar regression is reported in this (much slower) variation of the script that builds the entire set of CPython's documentation rather than just library/typing.rst.

More comprehensive variation of the script:

```py
import contextlib
import shutil
import subprocess
import time
import venv

def run(args):
    subprocess.run(args, check=True, text=True)

with contextlib.chdir("Doc"):
    venv.create(".venv", with_pip=True)
    run([
        ".venv/bin/python", "-m", "pip", "install",
        "-r", "requirements.txt", "--no-binary=':all:'",
    ])
    start = time.perf_counter()
    run([
        ".venv/bin/python", "-m", "sphinx",
        "-b", "html", ".", "build/html",
    ])
    print(time.perf_counter() - start)
    shutil.rmtree(".venv")
    shutil.rmtree("build")
```

The PGO-optimized timings for building the entire CPython documentation are as follows:

This indicates a 38% performance regression for building the entire set of CPython's documentation.

Cause of the performance regression

This performance regression was initially discovered in #118891: our Doctest CI workflow uses a fresh build of CPython (since otherwise we wouldn't be testing the tip of the main branch), and it was observed that the CI job was taking significantly longer on the 3.13 branch than on the 3.12 branch. In the context of our CI, the regression is even worse: the Doctest workflow uses a debug build rather than a PGO-optimized build, and the regression is much more pronounced in debug builds.

Using a debug build, I used the first script posted above to bisect the performance regression to commit 15309329b65a285cb7b3071f0f08ac964b61411b (below), which appeared to cause a slowdown of around 300%.
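For reference, this kind of bisection can be automated with `git bisect run`. Below is a hedged sketch (not the script actually used here) of a timing predicate; `THRESHOLD_SECONDS` is an assumed cutoff you would pick between the known-good and known-bad timings, and the workload command is a placeholder.

```python
import subprocess
import sys
import time

# Assumed cutoff between "good" and "bad" build times; pick a value roughly
# midway between the timings observed on the known-good and known-bad commits.
THRESHOLD_SECONDS = 60.0

def timed_run(cmd):
    """Run *cmd* and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# `git bisect run` treats exit status 0 as "good" and 1-124 as "bad", so a
# wrapper script would end with:
#     sys.exit(0 if elapsed < THRESHOLD_SECONDS else 1)
elapsed = timed_run([sys.executable, "-c", "pass"])  # placeholder workload
print(f"{elapsed:.2f}s")
```

In practice the placeholder workload would be replaced by a rebuild of CPython plus the doc-build script above.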

15309329b65a285cb7b3071f0f08ac964b61411b is the first bad commit
commit 15309329b65a285cb7b3071f0f08ac964b61411b
Author: Mark Shannon <mark@hotpy.org>
Date:   Wed Mar 20 08:54:42 2024 +0000

    GH-108362: Incremental Cycle GC (GH-116206)

 Doc/whatsnew/3.13.rst                              |  30 +
 Include/internal/pycore_gc.h                       |  41 +-
 Include/internal/pycore_object.h                   |  18 +-
 Include/internal/pycore_runtime_init.h             |   8 +-
 Lib/test/test_gc.py                                |  72 +-
 .../2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst |  12 +
 Modules/gcmodule.c                                 |  25 +-
 Objects/object.c                                   |  21 +
 Objects/structseq.c                                |   5 +-
 Python/gc.c                                        | 806 +++++++++++++--------
 Python/gc_free_threading.c                         |  23 +-
 Python/import.c                                    |   2 +-
 Python/optimizer.c                                 |   2 +-
 Tools/gdb/libpython.py                             |   7 +-
 14 files changed, 684 insertions(+), 388 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst

Performance was then significantly improved by commit e28477f214276db941e715eebc8cdfb96c1207d9 (below), but it's unfortunately still the case that Sphinx is far slower on Python 3.13 than on Python 3.12:

commit e28477f214276db941e715eebc8cdfb96c1207d9
Author: Mark Shannon <mark@hotpy.org>
Date:   Fri Mar 22 18:43:25 2024 +0000

    GH-117108: Change the size of the GC increment to about 1% of the total heap size. (GH-117120)

 Include/internal/pycore_gc.h                       |  3 +-
 Lib/test/test_gc.py                                | 35 +++++++++++++++-------
 .../2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst |  3 ++
 Modules/gcmodule.c                                 |  2 +-
 Python/gc.c                                        | 30 +++++++++----------
 Python/gc_free_threading.c                         |  2 +-
 6 files changed, 47 insertions(+), 28 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst

See https://github.com/python/cpython/issues/118891#issuecomment-2375334948 for more details on the bisection results.

Profiling by @nascheme in https://github.com/python/cpython/issues/118891#issuecomment-2375362608 and https://github.com/python/cpython/issues/118891#issuecomment-2375446761 also confirms that Sphinx spends a significant amount of time in the GC, so it seems very likely that the introduction of the incremental GC in Python 3.13 is the cause of this performance regression.
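As a rough way to confirm such a hypothesis locally, the time a workload spends inside GC passes can be measured with the stdlib `gc.callbacks` hook. This is a minimal sketch under that approach; the `GCTimer` helper is hypothetical, not taken from the linked profiling.

```python
import gc
import time

class GCTimer:
    """Accumulate wall-clock time spent in GC passes via gc.callbacks."""

    def __init__(self):
        self.total = 0.0
        self._start = None

    def __call__(self, phase, info):
        # gc invokes callbacks with phase "start" before a collection
        # and "stop" after it finishes.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.total += time.perf_counter() - self._start
            self._start = None

timer = GCTimer()
gc.callbacks.append(timer)
try:
    # Placeholder workload that creates reference cycles and forces collections.
    for _ in range(100):
        cycle = []
        cycle.append(cycle)
        gc.collect()
finally:
    gc.callbacks.remove(timer)

print(f"time spent in GC: {timer.total:.6f}s")
```

Comparing the accumulated GC time against total runtime on 3.12 vs 3.13 builds would give a quick sanity check on where the regression lives.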

Cc. @markshannon for expertise on the new incremental GC, and cc. @hugovk / @AA-Turner for Sphinx expertise.

CPython versions tested on:

3.12, 3.13, CPython main branch

Operating systems tested on:

macOS


picnixz commented 1 month ago

I think I wanted to investigate it in https://github.com/sphinx-doc/sphinx/issues/12181 but then I didn't have the motivation nor the time... (and we wanted to avoid keeping opened issues for too long). The root cause there was a large increase in docutils' memory usage, which might also be why we are seeing these slowdowns.

AlexWaygood commented 1 month ago

I think I wanted to investigate it in sphinx-doc/sphinx#12181 but then I didn't have the motivation nor the time... (and we wanted to avoid keeping opened issues for too long).

Yes, @AA-Turner mentioned as much in https://github.com/python/cpython/issues/118891#issuecomment-2375279122. This seems like the kind of issue that could cause slowdowns for other workloads as well as Sphinx, however. (But I'll defer to our performance and GC experts on if there are obvious ways to mitigate it.)

AlexWaygood commented 1 month ago

Prompted by @ned-deily, I reran my PGO-optimized timings using --with-lto=full --enable-optimizations builds rather than simply --with-lto --enable-optimizations builds. (Since I'm benchmarking using a Mac, this could have been relevant.) The timings on v3.13.0a1 and main using this build configuration were essentially identical to the ones I reported above, however.

willingc commented 1 month ago

@AlexWaygood Let's rerun after https://github.com/python/cpython/issues/124538 lands.

Privat33r-dev commented 1 month ago

I suggest adding a timing test for the doctests so that such performance degradations are noticed automatically in the future. Preferably with a high margin, so that new additions to the doctests do not trigger it.
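A minimal sketch of such a guard, assuming a generous budget several times the normal runtime; the helper name, the budget value, and the placeholder command are all hypothetical.

```python
import subprocess
import sys
import time

# Assumed margin: several times the normal doctest runtime, so organic growth
# of the doc set does not trip the check.
TIME_BUDGET_SECONDS = 600

def check_runtime(cmd, budget=TIME_BUDGET_SECONDS):
    """Run *cmd* and assert that it finishes within *budget* seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    assert elapsed < budget, f"run took {elapsed:.0f}s, budget is {budget}s"
    return elapsed

# Placeholder command; in CI this would be the actual doctest invocation.
elapsed = check_runtime([sys.executable, "-c", "pass"])
print(f"completed in {elapsed:.2f}s")
```

A wall-clock budget is noisy on shared CI runners, which is one reason a high margin (or a pyperformance-style benchmark, as suggested below) is preferable to a tight threshold.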

hauntsaninja commented 1 month ago

Could be cool to make the sphinx build a pyperformance benchmark (I think there might be a docutils pyperf benchmark, wonder if that saw a regression)

JelleZijlstra commented 1 month ago

@willingc I think that's unlikely to make a difference; that issue is related to capsules and probably doesn't interact with the regression we're seeing here.

@hauntsaninja there is a docutils benchmark, https://github.com/python/pyperformance/tree/main/pyperformance/data-files/benchmarks/bm_docutils. Looks like it builds a bunch of RST files, though not CPython's own docs. According to Michael that benchmark didn't show a regression, so it would be interesting to figure out what's different about the CPython docs relative to that benchmark.

picnixz commented 1 month ago

According to Michael that benchmark didn't show a regression

There is a bit of a regression actually, at least in memory usage, as reported by https://github.com/faster-cpython/ideas/issues/668 (roughly a 20% increase in memory). I don't think that alone is enough to explain why Sphinx is slower, though.

AA-Turner commented 1 month ago

Thanks Alex for the excellent write-up. A note for reproduction: in my local tests the slowdown could be reproduced in the parsing step alone, so you can use 'dummy' instead of 'html' in the reproduction script to save time (as no HTML files are produced).

library/typing.rst

Interestingly I found that when building just that document, timings were broadly the same. However when building everything, 3.13.0a5 was consistently 1.6x faster than 3.13.0a6 (using the built optimised releases from https://www.python.org/downloads/ for Windows, 64 bit).

Could be cool to make the sphinx build a pyperformance benchmark

When I added the Docutils benchmark, Michael Droettboom requested that I/O (reading & writing files) be removed, as it adds too much noise. Unfortunately, unlike with Docutils, there isn't a good way to avoid I/O in Sphinx. I would still be keen to add Sphinx to the benchmark suite.

Looks like it builds a bunch of RST files

It builds Docutils' own documentation; we can't use Python's, as the Python documentation uses Sphinx-only extensions.

A

mdboom commented 1 month ago

According to Michael that benchmark didn't show a regression, so it would be interesting to figure out what's different about the CPython docs relative to that benchmark.

Here's the timing benchmark of 3.13.0rc2 vs 3.12.0. docutils is actually slightly faster (0.8%) but that's within the noise.

There is a 33% increase in memory usage however.

Yhg1s commented 1 month ago

This is still an issue that needs more investigation for 3.14, but for 3.13 the incremental GC has been reverted and that solves the problem. I'll edit the issue title accordingly (and remove the release-blocker label).

Privat33r-dev commented 1 month ago

This is still an issue that needs more investigation for 3.14, but for 3.13 the incremental GC has been reverted and that solves the problem. I'll edit the issue title accordingly (and remove the release-blocker label).

Wouldn't reverting it resolve the issue for 3.13 and hence the title should be that it's "slower in 3.14 than in 3.13", instead of "than in 3.12"?

AlexWaygood commented 1 month ago

Wouldn't reverting it resolve the issue for 3.13 and hence the title should be that it's "slower in 3.14 than in 3.13", instead of "than in 3.12"?

If we're being pedantic, strictly speaking the modified title was still accurate ;) but I have modified the title again so that it doesn't reference any specific version of Python

Privat33r-dev commented 1 month ago

If we're being pedantic, strictly speaking the modified title was still accurate ;) but I have modified the title again so that it doesn't reference any specific version of Python

I concur. The previous title left a grey zone for 3.13, and my primary intent was to disambiguate it :)

ned-deily commented 1 month ago

For what it's worth, another performance data point: a couple of microbenchmarks whose performance was somewhat negatively impacted by the incremental GC revert in 3.13.0rc3.

Background: performance regressions in two microbenchmarks were reported when using the python.org macOS binary installer for 3.13.0rc1 (#122580 and #122832) and resolved with the 3.13.0rc2 installer build (by changing --with-lto=thin to --with-lto=full), so I have been using these two microbenchmarks as a performance "smoke test" for macOS installers. With the 3.13.0rc3 release installer, I noticed some performance differences.

For the gh-122580 microbenchmark:

 3.13.0rc2 ran  1.17 ± 0.02 times faster than 3.13.0rc3 on Apple Silicon M3 macBook Pro
           ran ~1.13 ± 0.02 times faster than 3.13.0rc3 on Intel macBook Pro

For the gh-122832 microbenchmark:

 3.13.0rc2 ran  1.17 ± 0.02 times faster than 3.13.0rc3 on Apple Silicon M3 macBook Pro
           ran ~1.04 ± 0.02 times faster than 3.13.0rc3 on Intel macBook Pro

(FTR, the M3 MacBook Pro used was the same hardware as in the earlier issues, but the earlier issues used a faster and bigger Intel iMac rather than the slower Intel MacBook Pro used here.)

I have subsequently run some bisecting between the v3.13.0rc2 and v3.13.0rc3 tags using a simple build configuration (on macOS with optimizations and --with-lto=full) and it appears that the above degradations from rc2 are entirely attributable to the revert of the incremental GC in e0eb44ad49926dd131dc639f5506c6769e45b4eb from PR #124770.

mdboom commented 1 month ago

I finally have complete benchmarking results for this, and they are mostly neutral or slightly better for reverting this change. The outlier is macOS, where I am seeing a 5% performance regression similar to what @ned-deily reported. Note that these are our own builds on our own benchmarking machine (not the official python.org release). I'm not sure what to attribute the difference to -- maybe larger page sizes on macOS?

 platform          change
 darwin arm64      1.05x ↓
 linux x86_64      1.00x ↑
 linux aarch64     1.00x ↑
 windows x86       1.01x ↑
 windows x86_64    1.01x ↑

pablogsal commented 1 month ago

I measured the benchmark from https://github.com/python/cpython/issues/122580 on my M1 Mac after moving it to pyperf, and I cannot see any major difference above the noise:

 RC3

~/github/python/3.13 tags/v3.13.0rc3* 38s
❯ ./python.exe benchmark.py
.....................
benchmark: Mean +- std dev: 634 us +- 14 us

~/github/python/3.13 tags/v3.13.0rc3* 16s
❯ ./python.exe benchmark.py
.....................
benchmark: Mean +- std dev: 632 us +- 5 us

 RC2

~/github/python/3.13 tags/v3.13.0rc2* 17s
❯ ./python.exe benchmark.py
.....................
benchmark: Mean +- std dev: 636 us +- 16 us

~/github/python/3.13 tags/v3.13.0rc2* 16s
❯ ./python.exe benchmark.py
.....................
benchmark: Mean +- std dev: 636 us +- 18 us