nedbat opened 2 years ago
@pablogsal @markshannon
Related issue where a user reported that code with cProfile slowed down 1.6x, and the only thing I could pinpoint was that tracing itself slowed down, not cProfile: https://github.com/python/cpython/issues/93381.
One reproducer script with only coverage.py as a dependency:
```python
from time import perf_counter

import coverage

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

t0 = perf_counter()
cov = coverage.Coverage()
cov.start()
fib(35)
cov.stop()
cov.save()
t1 = perf_counter()
print(t1 - t0)
```
Here's some output of Visual Studio performance profiler:
On the current 3.10 branch (took 13.0 seconds): [profiler screenshot omitted]
On the current main branch (took 126.9 seconds): [profiler screenshot omitted]
The first difference I notice is that there's a lot of overhead in some call to _PyObject_GenericGetAttrWithDict.
On main, if I add `printf("lookup %s\n", (const char *)PyUnicode_DATA(key));` to the top of `unicodekeys_lookup_unicode` and run that fib script, I get the following hot loop repeated over and over:
I now realize this is probably due to using the Python tracer rather than the C tracer, will try again.
Okay, now with Code coverage for Python, version 6.4.1 with C extension.
This seems to suggest to me that this could be made much faster via a dedicated C API function: https://github.com/nedbat/coveragepy/blob/master/coverage/ctracer/util.h#L41

```c
#define MyCode_GetCode(co) (PyObject_GetAttrString((PyObject *)(co), "co_code"))
```
Indeed, on main, if I add `printf("lookup %s\n", (const char *)PyUnicode_DATA(key));` to the top of `unicodekeys_lookup_unicode` and run that fib script, I get the following hot loop repeated over and over:
```
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
...
```
Each of these calls to `PyObject_GetAttrString(..., "co_code")` requires `PyUnicode_FromString` -> `unicode_decode_utf8` -> `PyUnicode_New` -> `_PyObject_Malloc`, followed by `PyObject_GetAttr` -> `_PyObject_GenericGetAttrWithDict` -> `_PyType_Lookup` -> `find_name_in_mro` -> (`unicode_hash` + `_Py_Dict_Lookup`) -> `unicodekeys_lookup_unicode`, where it should be just a C API function call.
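For a rough sense of what that chain costs, the lookup-versus-cache difference can be sketched from pure Python (illustrative only; absolute numbers are machine-dependent, and the real savings come from skipping the C-level chain above):

```python
import timeit

def fib(n):
    return n if n <= 1 else fib(n - 1) + fib(n - 2)

code = fib.__code__

# What the tracer does per event: a string-keyed attribute lookup.
lookup = timeit.timeit(lambda: getattr(code, "co_code"), number=100_000)

# What caching the result once (as a dedicated accessor enables) looks like.
cached = code.co_code
no_lookup = timeit.timeit(lambda: cached, number=100_000)

print(f"per-event lookup: {lookup:.4f}s  cached: {no_lookup:.4f}s")
```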
Ah! `PyCode_GetCode` exists since https://github.com/python/cpython/pull/92168
Some performance degradation under tracing is expected, but not as much as reported. This is a deliberate tradeoff: faster execution when not tracing, for slower execution when tracing.
From the profiles that @sweeneyde gathered, it looks like we are seeing slowdowns in:
The new line number table gives better error messages without taking up too much extra space, but it is more complex and thus slower to parse. We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.
Once https://github.com/nedbat/coveragepy/pull/1394 is merged, we can look at the profile again to see if much has changed.
I wonder: would there be any way to retain some specializations during tracing? Some specialized opcodes are certainly incorrect to use during tracing, e.g. `STORE_FAST__LOAD_FAST`. However, couldn't others be retained, e.g. `LOAD_GLOBAL_MODULE`, or any specialized opcode that covers only one unspecialized opcode? (EDIT: maybe not, if it uses `NOTRACE_DISPATCH()`)
We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.
Well, two times slower than 3.10, as some of @nedbat's benchmarks show, is not acceptable in my opinion. We should try to get this below 20% at the very least.
I marked this as a release blocker and this will block any future releases.
The problem with saying that something is unacceptable is that it implies there is an acceptable alternative. What is that alternative?
What does it mean making this a release blocker? Are you not going to release 3.11 if coverage is >20% slower?
I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number. I expect that will be better than a 20% slowdown, but I'm not making any pronouncements.
What is that alternative?
I don't know, but there is always the possibility to revert the changes that made coverage slower (worst case scenario).
Are you not going to release 3.11 if coverage is >20% slower?
I will check with the SC, but I can already tell you that I would not be comfortable releasing 3.11 if it is 2 times slower in coverage, unless the SC mandates otherwise.
I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number.
The problem here is that this tradeoff was not discussed anywhere or advertised, and no one had the chance to weigh in on the tradeoffs or otherwise, and therefore we cannot just "decide" out of the blue what is and what is not acceptable, because the community didn't have the chance to object.
The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657. The proximate cause is my PR to reduce the memory consumed of PEP 657. Would you revert PEP 657, or just accept a large increase in memory use?
I think it would be a lot more helpful if we fixed the issue, rather than making dire pronouncements.
the community didn't have the chance to object.
Everything we do is public. I even opened https://github.com/nedbat/coveragepy/issues/1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py
The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657.
That's not true, and I feel it is misleading. When line numbers and column offsets were separated, PEP 657 had zero overhead on computing line numbers, so the root cause is certainly not PEP 657.
The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.
Everything we do is public. I even opened https://github.com/nedbat/coveragepy/issues/1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py
That's not enough. If that were true we would not need to do PEPs, because "what we do is already public". The point of having discussions on mailing lists is precisely so everyone can be aware of the tradeoffs and give us their opinions.
In any case, there is no point to discuss in these terms because I am not the one making decisions here. As RM my only authority is to consider the current slowdown a blocker and ask for a bigger discussion to be taken if we cannot fix the issue.
Let's see what we can do and then let's discuss how to proceed once we understand our options.
The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.
No, that was the proximate cause. It wouldn't have been necessary if not for PEP 657. I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.
Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms of memory use.
I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.
Mark, I understand that and we are on the same page. I know why you made that change, and IIRC I even reviewed it. The problem here is that that change (or other changes) had unexpected consequences on timing, and we need to make decisions: either trying to fix it together if we can, or potentially reverting it if we cannot. That's all I am saying.
Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms on memory use.
The claim was that PEP 657 had zero overhead on the time spent computing line numbers, not on memory. My sentence was:
PEP 657 had 0 overhead on computing line numbers, so the root cause is certainly not PEP 657.
Marking this as a release or deferred blocker is fine and entirely what I'd expect any RM to want to do. That just means we collect more information and make a decision before we exit beta stages.
I don't personally have a problem with coverage being slower, even by 2x. It's more about understanding what all is impacted by this and how, so we can explain our measured rationale to users. I.e., look beyond just coverage, which should only be a developer/CI-time tool rather than something slowing people's "production" release tool/application/notebook code.
Q1: What else uses tracing?
Q1b: Which of those use it other than at development time?
These questions are informational and do not block performance work on reducing the impact to tracing.
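For reference, all of these tools hang off the same `sys.settrace` hook; a minimal tracer shows the "call"/"line"/"return" event stream they receive:

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record each event; returning the tracer keeps line-level tracing active.
    events.append((event, frame.f_lineno))
    return tracer

def add(a, b):
    total = a + b
    return total

sys.settrace(tracer)
add(1, 2)
sys.settrace(None)

print([e for e, _ in events])  # ['call', 'line', 'line', 'return']
```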
How bad would it be to store an extra field in code objects that is the "decompressed" line number table, lazily allocated only for those code objects which are actually traced? Could it occupy `co_extra` or similar?
How bad would it be to store an extra field in code objects that is the "decompressed" line number table, lazily allocated only for those code objects which are actually traced? Could it occupy `co_extra` or similar?
In https://github.com/python/cpython/pull/93383#issuecomment-1145729820 I benchmarked adding an additional field to the code object and there was no slowdown in pyperformance. I do not know how memory consumption is affected.
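As a rough way to frame the memory side (illustrative only; sizes vary by CPython version and platform), one extra pointer per code object can be compared against the current size of a code object:

```python
import sys

def sample():
    return 42

code_size = sys.getsizeof(sample.__code__)
extra_pointer = 8  # hypothetical extra field on a typical 64-bit build

print(f"code object: {code_size} bytes; "
      f"one extra pointer adds {100 * extra_pointer / code_size:.1f}%")
```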
I'm not sure `co_extra` is the right place for that info, considering it's for consumers of the PEP 523 API. Maybe I can merge your idea into a simple lazily created `_PyCodeTracingInfo *` C struct?
```c
PyCodeObject {
    ...
    void *co_tracing_info;
};

struct _PyCodeTracingInfo {
    // cached code bytes object
    PyObject *co_code;
    // decompressed line number table
    ...
};
```
How bad would it be to store an extra field in code objects that is the "decompressed" line number table, lazily allocated only for those code objects which are actually traced? Could it occupy `co_extra` or similar?
An early design for the compressed line number table did just this. My concern, then and now, is that the uncompressed table is a lot bigger, and never goes away.
It isn't just tracing that needs line numbers, tracebacks need them too. Which means that the uncompressed table would be created every time a code object ends up in a traceback.
I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.
@gpshead
Q1: What else uses tracing?
Debuggers and profiling tools, like cProfile.
Q1b: Which of those use it other than at development time?
Hopefully, no one. Tracing has always imposed a considerable overhead. Anyone using a profiler in production should be using a statistical profiler.
I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.
Maybe we could leave line numbers uncompressed and compress the rest?
Alternatively, could we leave just the line numbers uncompressed if and only if we detect a tracing or profiling function? This way the extra memory is only paid if someone is profiling or tracing.
Maybe we could leave line numbers uncompressed and compress the rest?
That's going to be complicated. Too complicated for a beta release, IMO.
Alternatively, could we leave just the line numbers uncompressed if and only if we detect a tracing or profiling function? This way the extra memory is only paid if someone is profiling or tracing.
If we add a small/large flag, then the extra memory will be 2 bytes for instructions for files < 64k lines (basically all code), and 4 bytes per instruction for giant files. I'll try it out.
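That small/large scheme can be sketched in a few lines (hypothetical helper, not CPython code; the thresholds follow the comment above):

```python
def flat_line_table_bytes(n_code_units: int, max_line: int) -> int:
    """Cost of an uncompressed line table with a small/large flag:
    2-byte entries while lines fit in 16 bits, 4-byte entries otherwise."""
    entry_size = 2 if max_line < (1 << 16) else 4
    return n_code_units * entry_size

# A typical module easily fits in the small form:
print(flat_line_table_bytes(5_000, 1_200))    # 10000
# Only giant files with >= 64k lines pay the 4-byte-per-instruction rate:
print(flat_line_table_bytes(5_000, 100_000))  # 20000
```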
I've been experimenting with streamlining the `advance` and `maybe_call_line_trace` functions. I'm getting roughly a 10% speedup, but a flat table lookup will be way faster.
Actually, if we store the line delta from `co_firstlineno`, many functions will only need 1 byte per code unit, which should be fine.
I'm very interested in helping to test people's proposed fixes, but am losing track of what is being changed where. We have #93493, which I have tried. Are there other pull requests in flight to try?
I would like to see how much we gain back by reverting 944fffee8916cb94321fa33cd3a43f4108717746 (not that we are planning to do it, I just want to know how much of the problem is due to that).
I can prepare a PR for you to test later today, but in the interim you can try reverting yourself (it doesn't cleanly revert, but the conflicts are mainly due to automatically generated files that you can regenerate by running `make regen-all`).
I would like to see how much we gain back by reverting 944fffee8916cb94321fa33cd3a43f4108717746 (not that we are planning to do it, I just want to know how much of the problem is due to that).
IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.
https://github.com/python/pyperformance/pull/213#issuecomment-1147699232
IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.
~That's in pyperformance, no? I meant to see how much it performs in coverage~ Sorry, I checked the link in more detail. Seems that this is what I wanted to check 😓 Sorry for the confusion :)
Looking at the output of perf (using an optimized, but no-PGO/LTO, build) and the code, I see the following reasons for the slowdown:

1. `_PyEval_EvalFrameDefault` determining whether the event is "call", a possible "line" event, or no event.
2. Dispatch in `_PyEval_EvalFrameDefault`, as we no longer specialize when tracing.
3. `maybe_call_line_trace` finding the first traceable instruction.
4. Computing line numbers.

All but 2 can be fixed without any negative impact on performance when not tracing. Fixing 4 will increase memory use when tracing, but will only need a tiny bit of extra memory when not tracing. Specializing during tracing increases the overhead of dispatching when not tracing, resulting in a ~1% slowdown.
FYI, I have added a coverage benchmark to pyperformance to ease benchmarking coverage/tracing performance.
https://github.com/python/cpython/pull/93769 is merged. https://github.com/python/cpython/pull/93818 is the backport to 3.11
Ok I merged the backport. @nedbat can you do another measurement with the 3.11 branch, please? I plan to mark this as deferred and release beta 4 at least.
@pablogsal can I trouble you to regen the ABI for #93493 please? It should be the last PR for this issue and it's a roughly 10% improvement in coverage.py's benchmark.
@pablogsal can I trouble you to regen the ABI for #93493, please? It should be the last PR for this issue and it's a roughly 10% improvement in coverage.py's benchmark.
Can you try this section I just wrote about how to regenerate to see if you run into any problems?
Ooh thanks for writing that new section in the devguide! I'll try and report back (hopefully within the next day or two).
I plan to try to release beta 4 tomorrow, so it would be great if we can get a decision with https://github.com/python/cpython/pull/93493
I just stumbled upon this, so I ran my profilings from pydevd -- in the pure-python tracer (I'm still working on the cython and frame eval modes), the numbers I get are that 3.11 is 50-60% slower than 3.10 in general. I must say I wasn't expecting a speedup when tracing is enabled in 3.11, but I also wasn't expecting such a big slowdown...
The raw numbers for the benchmarks I have are below:
```
Python 3.11 beta 3
method_calls_with_breakpoint: 0.383s (/ 0.250 = 1.53: slower)
method_calls_without_breakpoint: 0.381s (/ 0.247 = 1.54: slower)
method_calls_with_step_over: 0.378s (/ 0.236 = 1.60: slower)
method_calls_with_exception_breakpoint: 0.375s (/ 0.249 = 1.50: slower)
global_scope_1_with_breakpoint: 0.854s (/ 0.557 = 1.53: slower)
global_scope_2_with_breakpoint: 0.364s (/ 0.238 = 1.52: slower)

Python 3.11 (repository: 41e4b426ce)
method_calls_with_breakpoint: 0.380s
method_calls_without_breakpoint: 0.368s
method_calls_with_step_over: 0.357s
method_calls_with_exception_breakpoint: 0.380s
global_scope_1_with_breakpoint: 0.920s
global_scope_2_with_breakpoint: 0.337s

Python 3.10.0
method_calls_with_breakpoint: 0.250s
method_calls_without_breakpoint: 0.247s
method_calls_with_step_over: 0.236s
method_calls_with_exception_breakpoint: 0.249s
global_scope_1_with_breakpoint: 0.557s
global_scope_2_with_breakpoint: 0.238s
```
I just stumbled upon this, so I ran my profilings from pydevd -- in the pure-python tracer (I'm still working on the cython and frame eval modes), the numbers I get are that 3.11 is 50-60% slower than 3.10 in general.
@fabioz are you using the latest released beta or the tip of the 3.11 branch?
@fabioz are you using the latest released beta or the tip of the 3.11 branch?
@pablogsal I've tried with both (the benchmarks I posted are for Python 3.11 beta 3 and the Python 3.11 tip, which I compiled locally -- note that I did not compile with PGO, just release, i.e. `PCbuild\build.bat -e -p x64`)... The differences from the tip to the release version weren't significant (some were a bit slower, some were a bit faster, some almost the same, so I'd say it's in the noise range).
Better formatted table for pydevd benchmarks:

| Benchmark | Python 3.10 (s) | Python 3.11 beta 3 (s) | Slower than 3.10 | Python 3.11 tip (41e4b426ce) (s) | Slower than 3.10 |
|---|---|---|---|---|---|
| method_calls_with_breakpoint | 0.250 | 0.383 | 53.20% | 0.380 | 52.00% |
| method_calls_without_breakpoint | 0.247 | 0.381 | 54.25% | 0.368 | 48.99% |
| method_calls_with_step_over | 0.236 | 0.378 | 60.17% | 0.357 | 51.27% |
| method_calls_with_exception_breakpoint | 0.249 | 0.375 | 50.60% | 0.380 | 52.61% |
| global_scope_1_with_breakpoint | 0.557 | 0.854 | 53.32% | 0.920 | 65.17% |
| global_scope_2_with_breakpoint | 0.238 | 0.364 | 52.94% | 0.337 | 41.60% |
As a random Python developer, I would be unhappy with a 20% slowdown in the speed of coverage, under the assumption that that would make all of my continuous integration runs take 20% longer. :/
I did a benchmark, and the latest tip of 3.11 (commit c966e0816e) is looking better, though still slower than 3.10:

| cov | proj | v3.10.5 | v3.11.0b3 | 3.11 tip | 3.11b3 vs 3.10 | 3.11 tip vs 3.10 |
|---|---|---|---|---|---|---|
| 6.4.1 | bug1339.py | 0.399 s | 0.832 s | 0.469 s | 208.41% | 117.42% |
| 6.4.1 | bm_sudoku.py | 26.273 s | 55.805 s | 34.877 s | 212.40% | 132.75% |
| 6.4.1 | bm_spectral_norm.py | 32.660 s | 71.043 s | 44.546 s | 217.52% | 136.39% |
Quite frankly, those 3.11 tip numbers look well within reason to accept.
Tracing is a relatively rare operation in terms of all CPU cycles consumed by CPython processes on all computers around the world. It is worth sacrificing some tracing performance there, if deemed necessary, in order to gain performance where it is actually statistically significant: in deployed Python programs.
As a top line conclusion, I agree that with these improvements this no longer needs to block the release.
That said, I think it's important to note that coverage performance has a substantial impact on developer experience. For many projects tests-with-coverage is a blocking element of their CI pipeline, so this performance ultimately dictates their iteration cycle time. It cannot happen for 3.11, but I think on a longer timeline it's worth exploring what can be done to improve coverage's performance (which perhaps is another external tool like https://github.com/plasma-umass/slipcover) -- particularly if it's going to be in tension with other optimizations we want to land for the general case!
Bug report
Coverage.py is seeing a significant increase in overhead for tracing code in 3.11 compared to 3.10: https://github.com/nedbat/coveragepy/issues/1287
As an example:
(This is the output of lab/benchmark.py.)