Closed by Privat33r-dev 2 weeks ago
Sounds good; do you have a PR ready? :)
I was waiting for feedback because I might've missed something :) My schedule is a bit tight lately, but I will try to find time to implement it (hopefully in a week or two).
Sorry, it's going to be delayed. I'm currently loaded with work, but hopefully I can start on the PR soon.
No rush, thanks for the update :)
@Privat33r-dev Do you still wish to work on a PR for this? If you are busy, would you mind someone picking up the work to do the PR?
I will try to do it within a week, if I can't manage to find time for this, then it will be a good idea for another person to do it. Sorry for the unexpected delay.
No worries on a delay @Privat33r-dev. Seems like it would be a nice addition since the doctests are slow 😉
After further investigation, I understood the following.
The command to run the doctests, which takes around 13-16 minutes, is:
xvfb-run make -C Doc/ PYTHON=../python SPHINXERRORHANDLING="-W --keep-going" doctest
xvfb-run is needed to simulate a graphical environment (my guess is that it's needed to test turtle and some other modules).
make -C Doc/ doctest launches Doc/Makefile with the doctest builder. The other flags are less relevant; in short, they point to the Python installation and tell Sphinx how to handle errors.
Inside the Makefile, sphinx-build is executed with the doctest builder. So, basically, the Sphinx wrapper for the doctests is used. Source
Likely the parallelization should be applied there; I will continue, and I might update this comment or add a new comment with further findings.
However, what's actually interesting is that the speed of the doctests depends on the Python version: the Python 3.13 tests run about 3 times slower than 3.12.
Example #1 (Python 3.13): 14m 57s, test that ran yesterday
Example #2 (Python 3.12): 5m 18s, test that also ran yesterday
Search request: https://github.com/python/cpython/pulls?q=is%3Apr+label%3Adocs+%5B3.12%5D+
Also, caching does not work efficiently, which might contribute to the problem, but that's a different topic.
I would be glad if someone could join in resolving this issue, since I am, unfortunately, still occupied with other business. This weekend I will try to continue.
General conclusion from earlier findings:
I can't quickly think of another reason, but it might just as well be caused by some Python API changes, build flags, or something else. However, if we can fix it and return to near-3.12 speed, that would be a major win already. Applying multiprocessing on top of that would just skyrocket the speed.
@AA-Turner Do you have any ideas?
The doctest job on CPython CI is much slower on 3.13 and 3.14 (main) than with 3.12. It's not specifically doctest: the main Sphinx build is slow before the doctests even start.
For example:
With the example builds, when the full 3.12 Sphinx build + doctest had finished, 3.13 was only at reading sources... [ 73%] library/stdtypes .. library/trace.
Didn't we previously reason that this is because pre-releases don't use an optimised build of Python? What happens if we run CI with the released rc2?
For both of these, we use non-optimised Python built from the branch -- because we're testing the branch's Python.
Hmm... I'm wondering if we could use pytest to run these doctests instead of Sphinx's build plus doctest. We could then parallelize the doctests if needed. @hugovk @AA-Turner
For both of these, we use non-optimised Python built from the branch -- because we're testing the branch's Python.
Right, I was just wondering whether the slowdown was solely due to this (which we have no real choice over) or whether other factors in Python 3.13+ have caused Sphinx to slow down (which we might be able to resolve).
I'm wondering if we could use pytest to run these doctests instead of Sphinx's build plus doctest.
I don't think pytest supports the test{setup,cleanup,code} directives, which we make heavy use of.
@AA-Turner
Maybe we can parallelize the tests in Sphinx itself? If you think that will work, I can find time this week to implement it.
https://github.com/sphinx-doc/sphinx/blob/master/sphinx/ext/doctest.py#L363C1-L368C1 -> multiprocessing
I wrote the following script, with which I can reproduce locally a significant performance regression over the course of Python 3.13 development when using Sphinx to build a single documentation file:
If I save the script as bisect_sphinx.py in the root of my CPython clone, and then run ./configure --with-pydebug && make -j10 && ./python.exe bisect_sphinx.py, the script reports that building library/typing.rst takes roughly eight seconds on main, but only around three seconds with the v3.13.0a1 tag checked out.
I'll now try to use this script to see if I can bisect the performance regression to a specific commit.
Thanks Alex. You prompted a memory of https://github.com/sphinx-doc/sphinx/issues/12174, which might be relevant.
Thanks Alex. You prompted a memory of https://github.com/sphinx-doc/sphinx/issues/12174, which might be relevant.
Indeed! The results of the bisection are in:
15309329b65a285cb7b3071f0f08ac964b61411b is the first bad commit
commit 15309329b65a285cb7b3071f0f08ac964b61411b
Author: Mark Shannon <mark@hotpy.org>
Date: Wed Mar 20 08:54:42 2024 +0000
GH-108362: Incremental Cycle GC (GH-116206)
Doc/whatsnew/3.13.rst | 30 +
Include/internal/pycore_gc.h | 41 +-
Include/internal/pycore_object.h | 18 +-
Include/internal/pycore_runtime_init.h | 8 +-
Lib/test/test_gc.py | 72 +-
.../2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst | 12 +
Modules/gcmodule.c | 25 +-
Objects/object.c | 21 +
Objects/structseq.c | 5 +-
Python/gc.c | 806 +++++++++++++--------
Python/gc_free_threading.c | 23 +-
Python/import.c | 2 +-
Python/optimizer.c | 2 +-
Tools/gdb/libpython.py | 7 +-
14 files changed, 684 insertions(+), 388 deletions(-)
create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-01-07-04-22-51.gh-issue-108362.oB9Gcf.rst
library/typing.rst took 4.47s.
I did another bisect to find out what caused the performance improvement between 15309329b65a285cb7b3071f0f08ac964b61411b and 909c6f718913e713c990d69e6d8a74c05f81e2c2, and the result was e28477f214276db941e715eebc8cdfb96c1207d9 (git bisect reports it as the "first bad commit", but in this case it's actually the first good commit):
e28477f214276db941e715eebc8cdfb96c1207d9 is the first bad commit
commit e28477f214276db941e715eebc8cdfb96c1207d9
Author: Mark Shannon <mark@hotpy.org>
Date: Fri Mar 22 18:43:25 2024 +0000
GH-117108: Change the size of the GC increment to about 1% of the total heap size. (GH-117120)
Include/internal/pycore_gc.h | 3 +-
Lib/test/test_gc.py | 35 +++++++++++++++-------
.../2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst | 3 ++
Modules/gcmodule.c | 2 +-
Python/gc.c | 30 +++++++++----------
Python/gc_free_threading.c | 2 +-
6 files changed, 47 insertions(+), 28 deletions(-)
create mode 100644 Misc/NEWS.d/next/Core and Builtins/2024-03-21-12-10-11.gh-issue-117108._6jIrB.rst
So, we now know which CPython commits are the root cause of our Doctest CI job taking so much longer on the 3.13 branch than on the 3.12 branch! I'm now going to see if the same performance regression can be reproduced with a non-debug build. (So far, I've only been testing with a debug build, as that is what we use for our Doctest CI job, and so that's what was relevant for understanding why the job might be taking so much longer on the 3.13 branch.)
Here is what linux-perf shows: 47% of the time is spent in the cyclic GC. I don't see any other obvious hot spots to optimize.
I did a second run with the GC completely disabled and without --with-pydebug. Again, no obvious hot spots.
The good news is that the bisection script is hugely faster on a PGO-optimized, non-debug build. The bad news is that main is still a fair amount slower than v3.13.0a1:
That's around a 48% performance regression on my bisection script with a PGO-optimized build, which is still quite bad!
I tried running the tests with Python 3.12 and with the gc.DEBUG_STATS flag turned on. The output suggests that this workload is creating a lot of cyclic garbage and that the gen 0 (youngest) collection of the cyclic GC is dealing with it. The "unreachable" number counts the cyclic garbage objects found by the GC.
I don't know enough about the "incremental" collector to comment on how it might be better optimized for this workload. I do suspect that there are many more similar workloads in the world, and hopefully the GC that ships with the 3.13 release will not be much slower for them.
gc: collecting generation 0...
gc: objects in each generation: 1175 3804 1663292
gc: objects in permanent generation: 375
gc: done, 520 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 976 4408 1663292
gc: objects in permanent generation: 375
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 982 5336 1663289
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 959 5866 1663289
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 948 6397 1663288
gc: objects in permanent generation: 375
gc: done, 116 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 813 7178 1663288
gc: objects in permanent generation: 375
gc: done, 404 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 1...
gc: objects in each generation: 951 7537 1663273
gc: objects in permanent generation: 375
gc: done, 5219 unreachable, 0 uncollectable, 0.0010s elapsed
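For reference, statistics like the ones above can be reproduced with gc.set_debug. A small sketch that creates cyclic garbage (as the Sphinx build apparently does) and then triggers a collection:

```python
import gc

# Print per-collection statistics (the "gc: ..." lines above) to stderr.
gc.set_debug(gc.DEBUG_STATS)
gc.disable()  # no automatic collections, so the manual one sees every cycle


def make_cycle() -> None:
    # Two lists referencing each other: unreachable once the function
    # returns, but only the cyclic GC can reclaim them.
    a: list = []
    b: list = [a]
    a.append(b)


for _ in range(1000):
    make_cycle()

# Force a full collection; the return value counts unreachable objects
# found, including the 2000 lists (two per cycle) created above.
unreachable = gc.collect()
gc.set_debug(0)
gc.enable()
print(unreachable)
```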
For colour, I created a variant of my script that measures the execution time of parsing and building all of CPython's docs with a PGO-optimized build. (The script I used for bisecting the perf regression only parsed and built library/typing.rst.) Here are the results on v3.13.0a1 and https://github.com/python/cpython/commit/ede1504c4b37b98636bf935679746ba2932f5677:
v3.13.0a1: 45.5s
main: 62.7s
That's a 38% regression; pretty similar to what we saw with the smaller sample (and again, very bad!).
I created a new issue summarising our findings here. It seems clear that the new incremental GC causes significant performance regressions for workloads like Sphinx that create lots of reference cycles; that seems like something that's worth investigating more:
@AlexWaygood amazing findings, thanks for your investigation :) I felt that the issue was likely with Python itself (because the GH Actions runners are the same, and many other things are the same as well); however, I couldn't do a proper investigation due to time constraints.
Thanks to everyone involved; I am glad that you found the root cause.
The issue was mostly resolved by #124770.
However, there is still room for improvement, mostly on Sphinx's side. I don't see any obvious improvements that can be made in this repository, so the issue is resolved.
I am glad that a performance regression in 3.13 was prevented as a result of teamwork. Thanks to all the participants.
@Privat33r-dev, thanks so much for opening the issue and for spotting how much slower the job was on 3.13 than 3.12. You helped prevent a new version of Python being released with a serious performance regression that could have affected lots of workloads (not just Sphinx!) — it's great that we identified it before the release went out! I'm only sorry it took us a while to investigate.
Issue
Usually the doctest action's "Run documentation doctest"^1 job takes around 12 minutes^2 to execute. Most of this time is taken by the internal framework doctest.py^3. The cause of the issue is a GC regression in Python 3.13.
Potential Solution
Fix #124567
Additional Improvements
Improve doctest performance
Since the tests are not interconnected (between different doc files), they can be easily parallelized. I suggest using ProcessPoolExecutor, which is suitable for "bypassing" the GIL (assuming the tasks are CPU-bound). GitHub runners have 4 cores available^4, so, potentially, this could speed up execution by up to 4 times.
@AlexWaygood @erlend-aasland