nedbat opened 2 years ago
@pablogsal @markshannon
Related issue where a user reported that code with cProfile slowed down 1.6x, and the only thing I could pinpoint was that tracing itself slowed down, not cProfile: https://github.com/python/cpython/issues/93381.
One reproducer script with only coverage.py as a dependency:
```python
from time import perf_counter

import coverage

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

t0 = perf_counter()
cov = coverage.Coverage()
cov.start()
fib(35)
cov.stop()
cov.save()
t1 = perf_counter()
print(t1 - t0)
```
Here's some output of Visual Studio performance profiler:
On the current 3.10 branch (took 13.0 seconds): [profiler screenshot omitted]
On the current main branch (took 126.9 seconds): [profiler screenshot omitted]
The first difference I notice is that there's a lot of overhead in some call to _PyObject_GenericGetAttrWithDict.
On main, if I add `printf("lookup %s\n", (const char *)PyUnicode_DATA(key));` to the top of `unicodekeys_lookup_unicode` and run that fib script, I get the following hot loop repeated over and over:
I now realize this is probably due to using the Python tracer rather than the C tracer, will try again.
Okay, now with Code coverage for Python, version 6.4.1 with C extension.
This seems to suggest to me that this could be made much faster via a dedicated C API function: https://github.com/nedbat/coveragepy/blob/master/coverage/ctracer/util.h#L41

```c
#define MyCode_GetCode(co) (PyObject_GetAttrString((PyObject *)(co), "co_code"))
```
Indeed, on main, if I add `printf("lookup %s\n", (const char *)PyUnicode_DATA(key));` to the top of `unicodekeys_lookup_unicode` and run that fib script, I get the following hot loop repeated over and over:
```
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
...
```
Each of these calls to `PyObject_GetAttrString(..., "co_code")` requires `PyUnicode_FromString` -> `unicode_decode_utf8` -> `PyUnicode_New` -> `_PyObject_Malloc`, followed by `PyObject_GetAttr` -> `_PyObject_GenericGetAttrWithDict` -> `_PyType_Lookup` -> `find_name_in_mro` -> (`unicode_hash` + `_Py_Dict_Lookup`) -> `unicodekeys_lookup_unicode`, where it should be just a C API function call.
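For a rough sense of what that chain costs, the lookup-versus-cache difference can be sketched from pure Python (illustrative only; absolute numbers are machine-dependent, and the real savings come from skipping the C-level chain above):

```python
import timeit

def fib(n):
    return n if n <= 1 else fib(n - 1) + fib(n - 2)

code = fib.__code__

# What the tracer does per event: a string-keyed attribute lookup.
lookup = timeit.timeit(lambda: getattr(code, "co_code"), number=100_000)

# What caching the result once (as a dedicated accessor enables) looks like.
cached = code.co_code
no_lookup = timeit.timeit(lambda: cached, number=100_000)

print(f"per-event lookup: {lookup:.4f}s  cached: {no_lookup:.4f}s")
```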
Ah! `PyCode_GetCode` exists since https://github.com/python/cpython/pull/92168
Some performance degradation under tracing is expected, but not as much as reported. This is a deliberate tradeoff: faster execution when not tracing, for slower execution when tracing.
From the profiles that @sweeneyde gathered, it looks like we are seeing slowdowns in:
The new line number table gives better error messages without taking up too much extra space, but it is more complex and thus slower to parse. We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.
Once https://github.com/nedbat/coveragepy/pull/1394 is merged, we can look at the profile again to see if much has changed.
I wonder: would there be any way to retain some specializations during tracing? Some specialized opcodes are certainly incorrect to use during tracing, e.g. `STORE_FAST__LOAD_FAST`. However, couldn't others be retained, e.g. `LOAD_GLOBAL_MODULE`, or any specialized opcode that covers only one unspecialized opcode? (EDIT: maybe not, if it uses `NOTRACE_DISPATCH()`)
We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.
Well, two times slower than 3.10, as some of @nedbat's benchmarks show, is not acceptable in my opinion. We should try to get this below 20% at the very least.
I marked this as a release blocker and this will block any future releases.
The problem with saying that something is unacceptable is that it implies there is an acceptable alternative. What is that alternative?
What does it mean making this a release blocker? Are you not going to release 3.11 if coverage is >20% slower?
I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number. I expect that will be better than a 20% slowdown, but I'm not making any pronouncements.
What is that alternative?
I don't know, but there is always the possibility to revert the changes that made coverage slower (worst case scenario).
Are you not going to release 3.11 if coverage is >20% slower?
I will check with the SC, but I can already tell you that I would not be comfortable releasing 3.11 if it is 2 times slower in coverage, unless the SC mandates otherwise.
I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number.
The problem here is that this tradeoff was not discussed anywhere or advertised, and no one had the chance to weigh in on the tradeoffs or otherwise, and therefore we cannot just "decide" out of the blue what is and what is not acceptable, because the community didn't have the chance to object.
The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657. The proximate cause is my PR to reduce the memory consumed of PEP 657. Would you revert PEP 657, or just accept a large increase in memory use?
I think it would be a lot more helpful if we fixed the issue, rather than making dire pronouncements.
the community didn't have the chance to object.
Everything we do is public. I even opened https://github.com/nedbat/coveragepy/issues/1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py
The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657.
That's not true, and I feel it is misleading. When line numbers and column offsets were separated, PEP 657 had zero overhead on computing line numbers, so the root cause is certainly not PEP 657.
The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.
Everything we do is public. I even opened https://github.com/nedbat/coveragepy/issues/1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py
That's not enough. If that were true we would not need to do PEPs, because "what we do is already public". The point of having discussions on mailing lists is precisely so everyone can be aware of the tradeoffs and give us their opinions.
In any case, there is no point to discuss in these terms because I am not the one making decisions here. As RM my only authority is to consider the current slowdown a blocker and ask for a bigger discussion to be taken if we cannot fix the issue.
Let's see what we can do and then let's discuss how to proceed once we understand our options.
The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.
No, that was the proximate cause. It wouldn't have been necessary if not for PEP 657. I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.
Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms of memory use.
I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.
Mark, I understand that and we are on the same page. I know why you made that change, and IIRC I even reviewed it. The problem here is that that change (or other changes) had unexpected consequences on timing, and we need to make decisions: either trying to fix it together if we can, or potentially reverting it if we cannot. That's all I am saying.
Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms on memory use.
The claim was that PEP 657 had zero overhead on the time spent computing line numbers, not on memory. My sentence was:
PEP 657 had 0 overhead on computing line numbers, so the root cause is certainly not PEP 657.
Marking this as a release or deferred blocker is fine and entirely what I'd expect any RM to want to do. That just means we collect more information and make a decision before we exit beta stages.
I don't personally have a problem with coverage being slower, even by 2x. It's more about understanding what all is impacted by this and how, so we can explain our measured rationale to users. I.e., look beyond just coverage, which should only be a developer/CI-time tool rather than something slowing people's "production" release tool/application/notebook code.
Q1: What else uses tracing?
Q1b: Which of those use it other than at development time?
These questions are informational and do not block performance work on reducing the impact to tracing.
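For reference, all of these tools hang off the same `sys.settrace` hook; a minimal tracer shows the "call"/"line"/"return" event stream they receive:

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record each event; returning the tracer keeps line-level tracing active.
    events.append((event, frame.f_lineno))
    return tracer

def add(a, b):
    total = a + b
    return total

sys.settrace(tracer)
add(1, 2)
sys.settrace(None)

print([e for e, _ in events])  # ['call', 'line', 'line', 'return']
```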
How bad would it be to store an extra field in code objects that is the "decompressed" line number table, lazily allocated only for those code objects which are actually traced? Could it occupy `co_extra` or similar?
How bad would it be to store an extra field in code objects that is the "decompressed" line number table, lazily allocated only for those code objects which are actually traced? Could it occupy `co_extra` or similar?
In https://github.com/python/cpython/pull/93383#issuecomment-1145729820 I benchmarked adding an additional field to the code object and there was no slowdown in pyperformance. I do not know how memory consumption is affected.
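As a rough way to frame the memory side (illustrative only; sizes vary by CPython version and platform), one extra pointer per code object can be compared against the current size of a code object:

```python
import sys

def sample():
    return 42

code_size = sys.getsizeof(sample.__code__)
extra_pointer = 8  # hypothetical extra field on a typical 64-bit build

print(f"code object: {code_size} bytes; "
      f"one extra pointer adds {100 * extra_pointer / code_size:.1f}%")
```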
I'm not sure `co_extra` is the right place for that info, considering it's for consumers of the PEP 523 API. Maybe I can merge your idea into a simple lazily created `_PyCodeTracingInfo *` C struct?
```c
PyCodeObject {
    ...
    void *co_tracing_info;
};

struct _PyCodeTracingInfo {
    // cached code bytes object
    PyObject *co_code;
    // decompressed line number table
    ...
};
```
How bad would it be to store an extra field in code objects that is the "decompressed" line number table, lazily allocated only for those code objects which are actually traced? Could it occupy `co_extra` or similar?
An early design for the compressed line number table did just this. My concern, then and now, is that the uncompressed table is a lot bigger, and never goes away.
It isn't just tracing that needs line numbers, tracebacks need them too. Which means that the uncompressed table would be created every time a code object ends up in a traceback.
I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.
@gpshead
Q1: What else uses tracing?
Debuggers and profiling tools, like cProfile.
Q1b: Which of those use it other than at development time?
Hopefully, no one. Tracing has always imposed a considerable overhead. Anyone using a profiler in production should be using a statistical profiler.
I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.
Maybe we could leave line numbers uncompressed and compress the rest?
Alternatively, could we leave just the line numbers uncompressed if and only if we detect a tracing or profiling function? This way the extra memory is only paid if someone is profiling or tracing.
Maybe we could leave line numbers uncompressed and compress the rest?
That's going to be complicated. Too complicated for a beta release, IMO.
Alternatively, could we leave just the line numbers uncompressed if and only if we detect a tracing or profiling function? This way the extra memory is only paid if someone is profiling or tracing.
If we add a small/large flag, then the extra memory will be 2 bytes for instructions for files < 64k lines (basically all code), and 4 bytes per instruction for giant files. I'll try it out.
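That small/large scheme can be sketched in a few lines (hypothetical helper, not CPython code; the thresholds follow the comment above):

```python
def flat_line_table_bytes(n_code_units: int, max_line: int) -> int:
    """Cost of an uncompressed line table with a small/large flag:
    2-byte entries while lines fit in 16 bits, 4-byte entries otherwise."""
    entry_size = 2 if max_line < (1 << 16) else 4
    return n_code_units * entry_size

# A typical module easily fits in the small form:
print(flat_line_table_bytes(5_000, 1_200))    # 10000
# Only giant files with >= 64k lines pay the 4-byte-per-instruction rate:
print(flat_line_table_bytes(5_000, 100_000))  # 20000
```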
I've been experimenting with streamlining the `advance` and `maybe_call_line_trace` functions. I'm getting roughly a 10% speedup, but a flat table lookup will be way faster.
Actually, if we store the line delta from `co_firstlineno`, many functions will only need 1 byte per code unit, which should be fine.
I'm very interested in helping to test people's proposed fixes, but am losing track of what is being changed where. We have #93493, which I have tried. Are there other pull requests in flight to try?
I would like to see how much we gain back by reverting 944fffee8916cb94321fa33cd3a43f4108717746 (not that we are planning to do it, I just want to know how much of the problem is due to that).
I can prepare a PR for you to test later today, but in the interim you can try reverting yourself (it doesn't cleanly revert, but the conflicts are mainly due to automatically generated files that you can regenerate by running `make regen-all`).
I would like to see how much we gain back by reverting 944fffee8916cb94321fa33cd3a43f4108717746 (not that we are planning to do it, I just want to know how much of the problem is due to that).
IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.
https://github.com/python/pyperformance/pull/213#issuecomment-1147699232
IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.
~That's in pyperformance, no? I meant to see how much it performs in coverage~ Sorry, I checked the link in more detail. Seems that this is what I wanted to check 😓 Sorry for the confusion :)
Looking at the output of perf (using an optimized, but no-PGO/LTO, build) and the code, I see the following reasons for the slowdown:

1. `_PyEval_EvalFrameDefault` determining whether the event is "call", a possible "line" event, or no event.
2. Dispatch in `_PyEval_EvalFrameDefault`, as we no longer specialize when tracing.
3. `maybe_call_line_trace` finding the first traceable instruction.
4. Computing line numbers.

All but 2 can be fixed without any negative impact on performance when not tracing. Fixing 4 will increase memory use when tracing, but will only need a tiny bit of extra memory when not tracing. Specializing during tracing increases the overhead of dispatching when not tracing, resulting in a ~1% slowdown.
FYI, I have added a coverage benchmark to pyperformance to ease benchmarking coverage/tracing performance.
https://github.com/python/cpython/pull/93769 is merged. https://github.com/python/cpython/pull/93818 is the backport to 3.11
Ok I merged the backport. @nedbat can you do another measurement with the 3.11 branch, please? I plan to mark this as deferred and release beta 4 at least.
@pablogsal can I trouble you to regen the ABI for #93493 please? It should be the last PR for this issue and it's a roughly 10% improvement in coverage.py's benchmark.
@pablogsal can I trouble you to regen the ABI for #93493, please? It should be the last PR for this issue and it's a roughly 10% improvement in coverage.py's benchmark.
Can you try this section I just wrote about how to regenerate to see if you run into any problems?
Ooh thanks for writing that new section in the devguide! I'll try and report back (hopefully within the next day or two).
I plan to try to release beta 4 tomorrow, so it would be great if we can get a decision with https://github.com/python/cpython/pull/93493
I just stumbled upon this, so I ran my profilings from pydevd -- in the pure-python tracer (I'm still working on the cython and frame eval modes), the numbers I get are that 3.11 is 50-60% slower than 3.10 in general. I must say I wasn't expecting a speedup when tracing is enabled in 3.11, but I also wasn't expecting such a big slowdown...
The raw numbers for the benchmarks I have are below:
```
Python 3.11 beta 3
method_calls_with_breakpoint: 0.383s (/ 0.250 = 1.53: slower)
method_calls_without_breakpoint: 0.381s (/ 0.247 = 1.54: slower)
method_calls_with_step_over: 0.378s (/ 0.236 = 1.60: slower)
method_calls_with_exception_breakpoint: 0.375s (/ 0.249 = 1.50: slower)
global_scope_1_with_breakpoint: 0.854s (/ 0.557 = 1.53: slower)
global_scope_2_with_breakpoint: 0.364s (/ 0.238 = 1.52: slower)

Python 3.11 (repository: 41e4b426ce)
method_calls_with_breakpoint: 0.380s
method_calls_without_breakpoint: 0.368s
method_calls_with_step_over: 0.357s
method_calls_with_exception_breakpoint: 0.380s
global_scope_1_with_breakpoint: 0.920s
global_scope_2_with_breakpoint: 0.337s

Python 3.10.0
method_calls_with_breakpoint: 0.250s
method_calls_without_breakpoint: 0.247s
method_calls_with_step_over: 0.236s
method_calls_with_exception_breakpoint: 0.249s
global_scope_1_with_breakpoint: 0.557s
global_scope_2_with_breakpoint: 0.238s
```
I just stumbled upon this, so I ran my profilings from pydevd -- in the pure-python tracer (I'm still working on the cython and frame eval modes), the numbers I get are that 3.11 is 50-60% slower than 3.10 in general.
@fabioz are you using the latest released beta or the tip of the 3.11 branch?
@fabioz are you using the latest released beta or the tip of the 3.11 branch?
@pablogsal I've tried with both (the benchmarks I posted are for Python 3.11 beta 3 and the Python 3.11 tip, which I compiled locally -- note that I did not compile with PGO, just release, i.e. `PCbuild\build.bat -e -p x64`)... The differences from the tip to the release version weren't significant (some were a bit slower, some were a bit faster, some almost the same, so I'd say it's in the noise range).
Better formatted table for pydevd benchmarks:

| Benchmark | Python 3.10 (s) | Python 3.11 beta 3 (s) | Slower than 3.10 | Python 3.11 tip (41e4b426ce) (s) | Slower than 3.10 |
|---|---|---|---|---|---|
| method_calls_with_breakpoint | 0.250 | 0.383 | 53.20% | 0.380 | 52.00% |
| method_calls_without_breakpoint | 0.247 | 0.381 | 54.25% | 0.368 | 48.99% |
| method_calls_with_step_over | 0.236 | 0.378 | 60.17% | 0.357 | 51.27% |
| method_calls_with_exception_breakpoint | 0.249 | 0.375 | 50.60% | 0.380 | 52.61% |
| global_scope_1_with_breakpoint | 0.557 | 0.854 | 53.32% | 0.920 | 65.17% |
| global_scope_2_with_breakpoint | 0.238 | 0.364 | 52.94% | 0.337 | 41.60% |
As a random Python developer, I would be unhappy with a 20% slowdown in the speed of coverage, under the assumption that that would make all of my continuous integration runs take 20% longer. :/
I did a benchmark, and the latest tip of 3.11 (commit c966e0816e) is looking better, though still slower than 3.10:

| cov | proj | v3.10.5 | v3.11.0b3 | 3.11 tip | 3.11b3 vs 3.10 | 3.11 tip vs 3.10 |
|---|---|---|---|---|---|---|
| 6.4.1 | bug1339.py | 0.399 s | 0.832 s | 0.469 s | 208.41% | 117.42% |
| 6.4.1 | bm_sudoku.py | 26.273 s | 55.805 s | 34.877 s | 212.40% | 132.75% |
| 6.4.1 | bm_spectral_norm.py | 32.660 s | 71.043 s | 44.546 s | 217.52% | 136.39% |
Quite frankly, those 3.11 tip numbers look well within reason to accept.
Tracing is a relatively rare operation in terms of all CPU cycles consumed by CPython processes on all computers around the world. It is worth sacrificing some tracing performance there, if deemed necessary, in order to gain performance where it is actually statistically significant: in deployed Python programs.
As a top line conclusion, I agree that with these improvements this no longer needs to block the release.
That said, I think it's important to note that coverage performance has a substantial impact on developer experience. For many projects tests-with-coverage is a blocking element of their CI pipeline, so this performance ultimately dictates their iteration cycle time. It cannot happen for 3.11, but I think on a longer timeline it's worth exploring what can be done to improve coverage's performance (which perhaps is another external tool like https://github.com/plasma-umass/slipcover) -- particularly if it's going to be in tension with other optimizations we want to land for the general case!
Bug report
Coverage.py is seeing a significant increase in overhead for tracing code in 3.11 compared to 3.10: https://github.com/nedbat/coveragepy/issues/1287
As an example:
(This is the output of lab/benchmark.py.)