python / cpython

The Python programming language
https://www.python.org
Other
62.51k stars 30.01k forks source link

Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279

Closed 23c8a93b-de57-46a3-b68d-6c6a493f0c9f closed 2 years ago

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago
BPO 45116
Nosy @malemburg, @gvanrossum, @rhettinger, @pfmoore, @vstinner, @tjguk, @markshannon, @zware, @zooba, @animalize, @pablogsal, @brandtbucher, @neonene, @erlend-aasland, @Fidget-Spinner
PRs
  • python/cpython#28390
  • python/cpython#28419
  • python/cpython#28427
  • python/cpython#28475
  • python/cpython#28630
  • python/cpython#28631
  • python/cpython#31436
  • python/cpython#31459
  • python/cpython#32387
  • Files
  • 310rc1_confirm_overhead.patch
  • ceval_310rc1_patched.c
  • b98e-no-inline-in-all.diff
  • b98e-no-inline-in-eval.diff
  • b98e-no-inline-in-the-others.diff
  • pyproject_inlinestat.patch
  • x64_28d2.log
  • x64_b98e.log
  • 310rc2_benchmarks.txt
  • 310a7_vs_310rc2_bench.txt
  • PR28475_inline.log
  • PR28475_vs_310rc2_vs_310a7.txt
  • PR28475_skip1test_bench.txt
  • 310rc2patched_vs_310rc2notrace.txt
  • switch-case_unarranged_bench.txt
  • ceval_PR29565_split_func.c
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['interpreter-core', '3.10', 'performance', 'expert-C-API', '3.11', 'OS-windows'] title = 'Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)' updated_at = user = 'https://github.com/neonene' ``` bugs.python.org fields: ```python activity = actor = 'steve.dower' assignee = 'none' closed = False closed_date = None closer = None components = ['Interpreter Core', 'Windows', 'C API'] creation = creator = 'neonene' dependencies = [] files = ['50263', '50264', '50271', '50272', '50273', '50274', '50275', '50276', '50280', '50286', '50291', '50293', '50296', '50315', '50363', '50452'] hgrepos = [] issue_num = 45116 keywords = ['patch'] message_count = 82.0 messages = ['401143', '401152', '401154', '401182', '401183', '401319', '401329', '401346', '401364', '401623', '401624', '401628', '401743', '401964', '401970', '401972', '402025', '402040', '402043', '402044', '402063', '402064', '402065', '402067', '402068', '402071', '402090', '402091', '402092', '402098', '402099', '402117', '402135', '402143', '402189', '402190', '402217', '402229', '402230', '402287', '402289', '402296', '402307', '402308', '402320', '402480', '402856', '402857', '402858', '402864', '402867', '402871', '402878', '402886', '402891', '402893', '402928', '402930', '402954', '403403', '403409', '403430', '403432', '403464', '403559', '403587', '403609', '404089', '406354', '406386', '406407', '406416', '406471', '406474', '406479', '406487', '406613', '407188', '415378', '416911', '416950', '416977'] nosy_count = 15.0 nosy_names = ['lemburg', 'gvanrossum', 'rhettinger', 'paul.moore', 'vstinner', 'tim.golden', 'Mark.Shannon', 'zach.ware', 'steve.dower', 'malin', 'pablogsal', 'brandtbucher', 'neonene', 'erlendaasland', 'kj'] pr_nums = ['28390', '28419', '28427', '28475', '28630', '28631', '31436', '31459', '32387'] priority = None resolution = None stage = 'patch review' status = 'open' superseder = None type = 'performance' url = 'https://bugs.python.org/issue45116' versions = ['Python 3.10', 'Python 3.11'] ```

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    pyperformance on Windows shows some gap between 3.10a7 and 3.10b1. The following are the ratios compared with 3.10a7 (the higher the slower).

    ------------------------------------------------- Windows x64 | PGO release official-binary ----------------+-------------------------------- 20210405 | 3.10a7 | 1.00 1.24 1.00 (PGO?) 20210408-07:58 | b98eba5 | 0.98 20210408-10:22 |

    Since PR25244 (28d28e053db6b69d91c2dfd579207cd8ccbc39e7), _PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10). At least the functions below have become un-inlined there at all.

    (1) _Py_DECREF() (from Py_DECREF,Py_CLEAR,Py_SETREF) (2) _Py_XDECREF() (from Py_XDECREF,SETLOCAL) (3) _Py_IS_TYPE() (from PyXXX_CheckExact) (4) _Py_atomic_load_32bit_impl() (from CHECK_EVAL_BREAKER)

    I tried in vain other linker options like thread-safe-profiling, agressive-code-generation, /OPT:NOREF. 3.10a7 can inline them in the eval-loop even if profiling only test_array.py.

    I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

    ----------------------------------------------------------------- Windows x64 | PGO patched overhead in eval-loop ----------------+------------------------------------------------ 3.10a7 | 1.00 20210802 | 3.10rc1 | 1.09 1.05 4% (slow 43, fast 5, same 10) 20210831-20:42 | 863154c | 0.95 0.90 5% (slow 48, fast 3, same 7) (3.11a0+) | ----------------------------------------------------------------- Windows x86 | PGO patched overhead in eval-loop ----------------+------------------------------------------------ 3.10a7 | 1.00 20210802 | 3.10rc1 | 1.15 1.13 2% (slow 29, fast 14, same 15) 20210831-20:42 | 863154c | 1.05 1.02 3% (slow 44, fast 7, same 7) (3.11a0+) |

    vstinner commented 3 years ago

    Rather than defining again functions as macro, you should consider using __forceinline function attribute: see bpo-45094.

    rhettinger commented 3 years ago

    Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

    vstinner commented 3 years ago

    Raymond:

    Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

    These functions were not converted recently to static inline function. For example, Py_INCREF() was already a static inline function in Python 3.9. I don't think that any decision should be taken before performances have been analyzed in depth.

    I'm not convinced that there is really a performance regression. I'm not sure how benchmarks are run on Windows.

    neonene:

    I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

    I don't understand how to read the table.

    Usually, I expect a comparison between a reference build and a patch build, but here you seem to use 3.10a7 as the reference to compare results. I'm not sure that geometric means can be compared this way.

    vstinner commented 3 years ago

    Since PR25244 (28d28e053db6b69d91c2dfd579207cd8ccbc39e7), _PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10). At least the functions below have become un-inlined there at all.

    I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

    The compiler is free to inline or not depending on the pressure on registers and stack memory. I'm not sure that inlining always make the code faster, since inlining have many side effects.

    I don't know well MSC compiler.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    @vstinner: __forceinline suggestion

    Since PR25244 (mentioned above), it seems link.exe has got to get stuck on python310.dll. Before the PR, it took 10x~ longer to link than without __forceinline function. I can confirm with _Py_DECREF() and _PyXDECREF() and one training-job (the more fucntions forced/jobs used, the slower to link). Have you tried \_forceinline on PGO ?

    I don't understand how to read the table.

    Overhead field is the output of pyperf command, not subtraction (the answers are the same just luckily).

    ex) 3.10rc1x86 PGO: PGO : pyperf compare_to 3.10a7 left patched : pyperf compare_to 3.10a7 right overhead : pyperf compare_to right left are 1.15x slower (slower 52, faster 4, not significant 2) 1.13x slower (slower 50, faster 4, not significant 4) 1.02x slower (slower 29, faster 14, not significant 15)

    I'm not sure if PGO builds are reproducible,

    MSVC does not produce the same code. Inlining (all or nothing) might be a quite special case in the hottest section. I suspect the profiler doesn't work well only for _PyEval_EvalFrameDefault(), including branch/align optimization. So my posted macro or inlining is just for a mesureing, not the solution.

    vstinner commented 3 years ago

    I don't understand if adding __forceinline on the mentioned static inline functions has an impact on the performance of a PGO build on Windows.

    zooba commented 3 years ago

    I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

    I'm 99% sure it's a tracing PGO rather than sampling, so provided everything runs in the same order and follows the same codepaths, wall-clock timings shouldn't have a significant effect.

    Without looking at the generated code, it may be more effective to try and force the functions called from the macro to never be inlined. The optimiser may be missing that the calls are uncommon, and trying to inline them first, then deciding that the whole merged function is too big to inline in places where it would be useful.

    There's also no particular reason we need to use these tests as the profile, it's just that nobody has taken the time to come up with anything better. I'd rather see us invest time in generating a good profile rather than trying to hand-optimise inlining. (Though the test suite is good in many ways, because it makes sure all the extension modules are covered. We definitely want as close to full code coverage as we can get, at least for happy paths, but may need more eval loop in the profile if that's what needs help.)

    b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

    This article briefly introduces the inlining decisions in MSVC. https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    Thanks for all suggestions. I focused on my bisected commit and the previous.

    I run pyperformance with 4 functions never inlined in the sections below.

       _Py_DECREF()
       _Py_XDECREF()
       _Py_IS_TYPE()
       _Py_atomic_load_32bit_impl()

    are

    (1) never inlined in _PyEval_EvalFrameDefault(). (2) never inlined in the other funcitons. (3) never inlined in all functions.

    slow downs [4-funcs never inlined section] -------------------------------------------------------------- Windows x64 PGO (44job) (*) (1) (2) (3) rebuild none eval others all -------------------------------------------------------------- b98eba5 (4 funcs inlined in eval) 1.00 1.05 1.09 1.14 PR25244 ( not inlined in eval) 1.06 1.07 1.18 1.17

    pyperf compare_to upper lower: (*) 1.06x slower (slower 45, faster 4, not not significant 9) (1) 1.02x slower (slower 33, faster 13, not not significant 12) (2) 1.08x slower (slower 48, faster 6, not not significant 4) (3) 1.03x slower (slower 39, faster 6, not not significant 13)

    -------------------------------------------------------------- Windows x86 PGO (44job) (*) (1) (2) (3) rebuild none eval others all -------------------------------------------------------------- b98eba5 (4 funcs inlined in eval) 1.00 1.03 1.06 1.15 PR25244 ( not inlined in eval) 1.13 1.13 1.22 1.24

    pyperf compare_to upper lower: (*) 1.13x slower (slower 54, faster 2, not not significant 2) (1) 1.10x slower (slower 47, faster 3, not not significant 8) (2) 1.14x slower (slower 54, faster 1, not not significant 3) (3) 1.08x slower (slower 43, faster 3, not not significant 12)

    In both x64 and x86, it looks column (2) and (*) has similar gaps. So, I would like to simply focus on the eval-loop.

    I built PGO with "/d2inlinestats" and "/d2inlinelogfull:_PyEval_EvalFrameDefault" according to the blog.

    I posted logs. As for PR25244, the logsize is 3x smaller than the previous and pgo rejects the 4 funcs above. I will look into it later.

    Collecting:

    Before the PR, it took 10x~ longer to link than without __forceinline function.

    Current build is 10x~ shorter than before to link. Before the PR, __forceinline had no impact to me.

    b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

    MSVC 2019 has a /Ob3 option: https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion

    From the experience of another project, I conjecture /Ob3 increase the "global budget" mentioned in the blog. I used /Ob3 for the 3.10 branch, and there seems to be no significant performance change. If you have time, neonene welcome to verify this.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    According to: https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-160

    PGO seems to override /Ob3. Around this issue, I posted benchmark on bpo-44381. On python building, /Ob3 works for only non-pgo-supported dlls like,

    _ctypes_test _freeze_importlib _msi _testbuffer _testcapi _testconsole _testembed _testimportmultiple _testinternalcapi _testmultiphase _uuid liblzma pylauncher pyshellext pywlauncher sqlite3 venvlauncher venvwlauncher winsound

    I use this option in _msvccompiler.py for my pyd. I will try and report when PGO with /Ob3 makes difference in the log.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    With msvc 16.10.3 and 16.11.2 (latest), PR25244 told me the amount of code in _PyEval_EvalFrameDefault() is over the limit of PGO. In the old version of _PyEval_EvalFrameDefault (b98eba5), the same issue can be caused adding any-code anywhere with more than 20 expressions/statements. For example, at the top/middle/end of the function, repeating "if (0) {}" 10times, or "if (0) {19 statements}". As for python3.9.7, more than 800 expressions/statements.

    Here is just a workaround for 3.10rc2 on windows. \==================================================

    --- Python/ceval.c
    +++ Python/ceval.c
    @@ -1306,9 +1306 @@
    -#define DISPATCH() \
    -    { \
    -        if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
    -            goto tracing_dispatch; \
    -        } \
    -        f->f_lasti = INSTR_OFFSET(); \
    -        NEXTOPARG(); \
    -        DISPATCH_GOTO(); \
    -    }
    +#define DISPATCH() goto tracing_dispatch
    @@ -1782,4 +1774,9 @@
         tracing_dispatch:
         {
    +        if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
    +            f->f_lasti = INSTR_OFFSET();
    +            NEXTOPARG();
    +            DISPATCH_GOTO();
    +        }
             int instr_prev = f->f_lasti;
             f->f_lasti = INSTR_OFFSET();

    ==================================================

    This patch becomes ineffective just adding one expression to DISPATCH macro as below

       #define DISPATCH() {if (1) goto tracing_dispatch;}

    And this approach is not sufficient for 3.11 with bigger eval-func. I don't know a cl/link option to lift such restriction of function size.

    3.10rc2 x86 pgo : 1.00 patched : 1.09x faster (slower 5, faster 48, not significant 5)

    3.10rc2 x64 pgo : 1.00 (roughly the same speed as official bin) patched : 1.07x faster (slower 5, faster 47, not significant 6) patched(/Ob3) : 1.07x faster (slower 7, faster 45, not significant 6)

    x64 results are posted.

    Fixing inlining rejection also made __forceinline buildable with normal processing time and memory usage.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    I reported this issue to developercommunity of microsoft.

    https://developercommunity.visualstudio.com/t/1531987

    rhettinger commented 3 years ago

    These should be changed back to macros where inlining is guaranteed.

    zooba commented 3 years ago

    I agree with Raymond. Let's stop throwing more code at this until we've figured out what's going on and revert the change for now.

    Fidget-Spinner commented 3 years ago

    @neonene Thanks for the truly excellent investigation!

    @Raymond and @Steve, If I understood OP (neonene) properly, changing Py_DECREF to a macro won't get back the entire 7% lost performance in pyperformance. neonene's investigations suggest that the entire eval function is now too big for PGO on MSVC. Fixing PyDECREF may get us a few %, but all the other hot functions in eval are likely not being inlined as well. Their suggested patch in https://bugs.python.org/msg401743 works, but IMO, _may slow down DISPATCH() _slightly_ since it seems to introduce another jump.

    I suggest using their patch only for MSVC until we think of something better in 3.11. The additional jump may not matter after PGO (and it surely isn't a 7% slowdown :). WDYT?

    vstinner commented 3 years ago

    the entire eval function is now too big for PGO on MSVC

    I don't think that the issue is specific to MSVC. If a function becomes too big, it becomes less efficient for CPU caches.

    One idea would be to move the least common opcodes into a slow-path, in a separated function, and make sure that this function is *not* inlined (use Py_NO_INLINE macro).

    @Mark: What do you think?

    Maybe we can keep current targets in the big switch, and call the function there. Something like:

    TARGET(DUP_TOP): TARGET(DUP_TOP_TWO): (...) ceval_slow_path(); break;

    _PyEval_EvalFrameDefault() takes around 3500 lines of C code.

    vstinner commented 3 years ago

    New changeset 6b413551284a94cfe31377c9c607ff890aa06c26 by Victor Stinner in branch 'main': bpo-45116: Add the Py_ALWAYS_INLINE macro (GH-28390) https://github.com/python/cpython/commit/6b413551284a94cfe31377c9c607ff890aa06c26

    vstinner commented 3 years ago

    I added Py_ALWAYS_INLINE to run benchmarks more easily. Even if Py_INCREF() is converted back to a macro, there are now multiple static inline functions which are short and performance critical.

    Using Py_ALWAYS_INLINE *may* speed up the Python debug builds and the PGO build on Windows if I understood correctly.

    Right now, I'm not sure. The heuristic to decide if a function is inlined or not seems to depend a lot on the compiler and the compiler options.

    rhettinger commented 3 years ago

    Right now, I'm not sure. The heuristic to decide if a function is inlined or not seems to depend a lot on the compiler and the compiler options.

    That is exactly correct. And it is why we should use the macro form which is certain to be inlined.

    Please do as Steve asked and revert back to the previous stable, reliable code.

    rhettinger commented 3 years ago

    Pablo, should this be a release blocker?

    malemburg commented 3 years ago

    FWIW: Back in the days of Python 1.5.2, the ceval loop was too big for CPU caches as well and one of the things I experimented with at the time was rearranging the opcodes based on how often they were used and splitting the whole switch statement we had back then in two parts. This results in a 10-20% speedup.

    CPU caches have since gotten much larger, but the size of the loop still is something to keep in mind and optimize for, as more and more logic gets added to the inner loop of Python.

    IMO, we should definitely keep forced inlines / macros where they are used inside hot loops, perhaps even in all of the CPython code, since the conversion to inline functions is mostly for hiding internals from extensions, not to hide them from CPython itself.

    @neonene: Could you provide more details about the CPU you're using to run the tests ?

    BTW: Perhaps the PSF could get a few sponsors to add more hosts to speed.python.org, to provide a better overview. It looks as if the system is only compiling on Ubuntu 14.04 and running on an 11 year old system (https://speed.python.org/about/). If that's the case, the system uses a server CPU with 12MB cache (https://www.intel.com/content/www/us/en/products/sku/47916/intel-xeon-processor-x5680-12m-cache-3-33-ghz-6-40-gts-intel-qpi/specifications.html).

    pablogsal commented 3 years ago

    Pablo, should this be a release blocker?

    How severe is the regression? If is severe enough we can mark it as a release blocker, but a conclusion needs to be reached ASAP because I don't want to change a fundamental macro a few days before the release

    Fidget-Spinner commented 3 years ago

    How severe is the regression?

    OP provided pyperformance of current 3.10 vs their patched version at https://bugs.python.org/file50280/310rc2_benchmarks.txt. The patch is at https://bugs.python.org/msg401743.

    vstinner commented 3 years ago

    Raymond: "Please do as Steve asked and revert back to the previous stable, reliable code."

    Is this issue really about Py_INCREF() being a static inline macro? Or is it more about the increased size of the _PyEval_EvalFrameDefault() function?

    neonene's analyzis seems to show that the the PGO optimizer of the MSC compiler has thresholds depending on the function size, and _PyEval_EvalFrameDefault() crossed these thresholds. Py_INCREF() static inline function only seems to be the top of the iceberg, it's more a "side effect" than the root issue, no?

    neonene showed in msg401743 that even adding *dead code* changes Python performance. So for me, it sounds weird to decide to change Py_INCREF() implementation only based on this analysis. Or maybe I missed something.

    neonene's analyzis starts with the commit 28d28e053db6b69d91c2dfd579207cd8ccbc39e7 of PR 25244. Are you suggesting to revert this change? Mark Shannon is pushing many changes in ceval.c and the frame object. Multiple changes have been pushed on top of it since this commit, I don't think that this commit can be easily reverted.

    What I understood is that adding __forceinline on some static inline functions can make the performance regression of the commit 28d28e053db6b69d91c2dfd579207cd8ccbc39e7 less bad. But I would prefer to validate that, since neonene's comparisons are not what I'm looking for: compare main to main+Py_ALWAYS_INLINE.

    pablogsal commented 3 years ago

    If https://bugs.python.org/file50280/310rc2_benchmarks.txt is correct, this means that we have a 7% slowdown in Windows PGO builds for 3.10, which I don't think is acceptable.

    I am marking this as a release blocker until there is some agreement. I remind you that if an agreement cannot be reached, you may reach the Steering Council for help reaching some conclusion.

    vstinner commented 3 years ago

    New changeset e4044e9f893350b4623677c048d33414a77edf55 by Victor Stinner in branch 'main': bpo-45116: Py_DEBUG ignores Py_ALWAYS_INLINE (GH-28419) https://github.com/python/cpython/commit/e4044e9f893350b4623677c048d33414a77edf55

    vstinner commented 3 years ago

    If https://bugs.python.org/file50280/310rc2_benchmarks.txt is correct, this means that we have a 7% slowdown in Windows PGO builds for 3.10, which I don't think is acceptable.

    What I understood is that the https://bugs.python.org/msg401743 patch makes 1.07x faster.

    But https://bugs.python.org/issue45116#msg401143 compares Python 3.10b1 to Python 3.10a7: Python 3.10b1 is slower (32-bit: "1.07", 64-bit: "1.14": "higher the slower" wrote neonene).

    vstinner commented 3 years ago

    I created a draft PR to mark functions like Py_INCREF() and Py_ISTYPE() with \_forceinline: PR 28427.

    vstinner commented 3 years ago

    Can someone compare the main branch (commit e4044e9f893350b4623677c048d33414a77edf55) to the main branch + PR 28427 patch?

    You can download the patch from: https://patch-diff.githubusercontent.com/raw/python/cpython/pull/28427.patch

    How can I build Python with PGO on Windows? Visual Studio has two profiles: PGinstrument and PGupdate.

    I cannot find "PGO", "PGinstrument" or "PGupdate" in the https://devguide.python.org/

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    (32-bit: "1.07", 64-bit: "1.14": "higher the slower" wrote neonene)

    32-bit and 64-bit are in reverse. I compared b1 and a7 because this can be confirmed by anyone with official binary. If 7% of my patch has little to do with the gap, then I will be happy that 3.10 can be far faster.

    How can I build Python with PGO on Windows?

    Try the following,

    PCbuild\build.bat -p x64 --no-tkinter --pgo

    Before building, your object.h needs to replace static inline int Py_ALWAYS_INLINE with static Py_ALWAYS_INLINE int

    In my case, pgo got stuck on linking with the object.h.

    I'm waiting the reply from developercommunity.

    Fidget-Spinner commented 3 years ago

    @Pablo,

    If \<benchmark_link> is correct ...

    For some verification, I benched pyperformance on Win10 AMD64, with the Python 3.10a7 and 3.10rc2 x64 binaries downloaded directly from python.org website release pages. The results corroborate with neonene's (please see the attached file). In short, there was a 1.09x slowdown on average.

    FYI, pyperf system tune doesn't work on Windows. So I manually disabled turbo boost and intel speedstep, but I didn't have time to research setting core affinity and the other stabilizations. Nonetheless, most of the benches were stable.

    I am marking this as a release blocker until there is some agreement. Got it. Setting as advised.

    b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

    In my case, pgo got stuck on linking with the object.h.

    Me too. Since commit 28d28e0 (the first commit to slow down the PGO build), if add __forceinline attribute to _Py_DECREF() function in object.h, the PGO build hangs (>50 minutes).

    So PR 28427 may not be a short-term solution.

    b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

    Like OP's benchmark, if convert the inline functions to macros in object.h, the 3.10 branch is 1.03x faster, but still 1.07x slower than 28d28e0\~1. @vstinner could you prepare such a PR as a candidate fix.

    There seem to be two ways to solve it in short-term. 1, Split the giant function. 2, Contact MSVC team to see if there is a quick solution, such as undocumented options.

    But the release time is too close. The worst result is to release with the performance regression, and note in the download page that there is a performance regression, if you care about performance please use 3.9.

    rhettinger commented 3 years ago

    I concur with Ma Lin.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    release with the performance regression

    I'm OK with the option. The limitation of PGO seems to me a bit weird and it might be unexpected for MSVC team.

    markshannon commented 3 years ago

    If we are hitting a size limit for PGO, then we need to reduce the size of _PyEval_EvalFrameDefault, to let the compiler do its job. Force inlining stuff is not going to help.

    Reverting https://github.com/python/cpython/pull/25244 for 3.10 seems to be the cleanest way to do this.

    Would someone with a reliable way to test performance on Windows test the effect of https://github.com/python/cpython/pull/28475, please?

    Longer term we need get PGO in MSVC working on larger functions, but I doubt that will be possible for 3.10.

    pablogsal commented 3 years ago

    Can someone repeat the benchmarks with https://github.com/python/cpython/pull/28475 ?

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    I built 3.10rc2 PGO with PR28475 applied, and posted the inliner's log. In the log, the 4-callees mentioned above are now inlined, which were "hard reject"ed before.

    As for the performance, a few reporters may be needed, but it's not necessary for them to care about noises in the apparent gap.

    310rc2 x64 PGO : 1.00 + PR28475 build1 bench1 : 1.05x faster (slower 7, faster 43, nochange 8) bench2 : 1.05x faster (slower 2, faster 43, nochange 13) build2 : 1.05x faster (slower 4, faster 45, nochange 9)

    310rc2 x64 release : 1.00 + PR28475 : 1.01x faster (slower 14, faster 25, nochange 19)

    Is Windows involved in the faster-cpython project? If so, the project should be provided with Windows machines for validation.

    b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

    PR28475: 64-bit build is 1.03x slower than 28d28e0\~1 32-bit build is 1.04x slower than 28d28e0\~1

    28d28e0\~1 is the last good commit.

    Fidget-Spinner commented 3 years ago

    Like what Ma Lin and neonene have mentioned above, PR28475 recovered half of the lost performance. It's unfortunately still 4% slower than 3.10a7.

    pyperf compare_to 310a7.json 310rc2.json 310rc2patched.json

    Geometric mean (versus 3.10a7) \==============

    310rc2: 1.09x slower 310rc2patched: 1.04x slower

    Attached pyperf benchmark comparisons file.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    To be fair, the slowdowns between PR25244 and b1 seems to be an accumulation of "1.00x slower" of every commit. I don't know after b1.

    markshannon commented 3 years ago

    The only other change of any obvious significance to _PyEval_EvalFrameDefault since 3.10a7 are the changes to MATCH_MAPPING and MATCH_SEQUENCE and those make _PyEval_EvalFrameDefault smaller.

    We may need to look elsewhere for the remaining \~4% performance.

    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    PR28475 PGO is 2% slower than the patch I pasted on msg401743. The function sizes are almost the same (+1:goto,+1:label), and there is no performance gap between release builds.

    I suspect the following.

    1. PGO is too sensitive to a function size at near the limit.
    2. PR28475 is not fully covered by 44 tests. (msg401346)
    23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

    3.10rc2 Python/ceval.c 1306: #define DISPATCH() \ 1307: { \ 1308: if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \ 1309: goto tracing_dispatch; \

    Among the 44 pgo-tests, only test_patma.TestTracing hits the condition above. On Windows, it seems that skipping it tightens the profile of PR28475 a bit. Additional tests such as test_threading(.ThreadTests.test_frame_tstate_tracing) might also cause some amount of variation or vice versa.

    3.10rc2 x64 PGO : 1.00 + PR28475 with TestTracing : 1.05x faster (slow 3, fast 46, same 9) without : 1.06x faster (slow 5, fast 52, same 1)

      with TestTracing : 1.00
      without          : 1.01x faster (slow 19, fast 27, same 12)

    (Details: PR28475_skip1test_bench.txt)

    Does test_patma.TestTracing need training for match-case performance?

    pablogsal commented 2 years ago

    If someone wants this issue to be solved in 3.10.0 it must be resolved ASAP. I am going to start freezing the release branch in one day or two to start testing the final candidate as much as possible so this issue has 24h at max to be merged into 3.10 and cherry picked into the release branch

    pablogsal commented 2 years ago

    I'm landing PR 28475 for now as it improves the situation from 7% to 4% slowdown and is contained enough.

    pablogsal commented 2 years ago

    This means that if anyone wants to pursue the 4% that is left the fix must be committed within 24 hours.

    Fidget-Spinner commented 2 years ago

    If someone wants this issue to be solved in 3.10.0 it must be resolved ASAP.

    neonene suggested that the tracing tests for pattern matching (added in 3.10b4/rc1) caused PGO to wrongly optimize the more uncommon tracing paths in ceval. I will verify their one-line fix to disable that test for PGO and report back within the next 48 hours.

    Related issue: https://bugs.python.org/issue44600