Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

BPO	45116
Nosy	@malemburg, @gvanrossum, @rhettinger, @pfmoore, @vstinner, @tjguk, @markshannon, @zware, @zooba, @animalize, @pablogsal, @brandtbucher, @neonene, @erlend-aasland, @Fidget-Spinner
PRs	python/cpython#28390 python/cpython#28419 python/cpython#28427 python/cpython#28475 python/cpython#28630 python/cpython#28631 python/cpython#31436 python/cpython#31459 python/cpython#32387
Files	310rc1_confirm_overhead.patch ceval_310rc1_patched.c b98e-no-inline-in-all.diff b98e-no-inline-in-eval.diff b98e-no-inline-in-the-others.diff pyproject_inlinestat.patch x64_28d2.log x64_b98e.log 310rc2_benchmarks.txt 310a7_vs_310rc2_bench.txt PR28475_inline.log PR28475_vs_310rc2_vs_310a7.txt PR28475_skip1test_bench.txt 310rc2patched_vs_310rc2notrace.txt switch-case_unarranged_bench.txt ceval_PR29565_split_func.c

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['interpreter-core', '3.10', 'performance', 'expert-C-API', '3.11', 'OS-windows']
title = 'Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)'
updated_at =
user = 'https://github.com/neonene'
```

bugs.python.org fields:

```python
activity =
actor = 'steve.dower'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core', 'Windows', 'C API']
creation =
creator = 'neonene'
dependencies = []
files = ['50263', '50264', '50271', '50272', '50273', '50274', '50275', '50276', '50280', '50286', '50291', '50293', '50296', '50315', '50363', '50452']
hgrepos = []
issue_num = 45116
keywords = ['patch']
message_count = 82.0
messages = ['401143', '401152', '401154', '401182', '401183', '401319', '401329', '401346', '401364', '401623', '401624', '401628', '401743', '401964', '401970', '401972', '402025', '402040', '402043', '402044', '402063', '402064', '402065', '402067', '402068', '402071', '402090', '402091', '402092', '402098', '402099', '402117', '402135', '402143', '402189', '402190', '402217', '402229', '402230', '402287', '402289', '402296', '402307', '402308', '402320', '402480', '402856', '402857', '402858', '402864', '402867', '402871', '402878', '402886', '402891', '402893', '402928', '402930', '402954', '403403', '403409', '403430', '403432', '403464', '403559', '403587', '403609', '404089', '406354', '406386', '406407', '406416', '406471', '406474', '406479', '406487', '406613', '407188', '415378', '416911', '416950', '416977']
nosy_count = 15.0
nosy_names = ['lemburg', 'gvanrossum', 'rhettinger', 'paul.moore', 'vstinner', 'tim.golden', 'Mark.Shannon', 'zach.ware', 'steve.dower', 'malin', 'pablogsal', 'brandtbucher', 'neonene', 'erlendaasland', 'kj']
pr_nums = ['28390', '28419', '28427', '28475', '28630', '28631', '31436', '31459', '32387']
priority = None
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue45116'
versions = ['Python 3.10', 'Python 3.11']
```

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

I have another fix.

pablogsal commented 3 years ago

> I have another fix.

If you have another fix, please create a PR ASAP and get it reviewed and merged by a core dev in the next 24 hours; otherwise it will need to wait until 3.10.1.

Fidget-Spinner commented 3 years ago

Sadly, I can't reproduce the speedups OP reported from disabling test_patma.TestTracing. It's not any faster than what we have with PR28475. (See attached pyperformance).

I'm looking forward to their other fix :). Even if it comes in 3.10.1 that's still a huge win. I don't think everyone immediately upgrades when a new Python version arrives.

IMO, we should note in What's New that only for Windows, 3.10.0 has a slight slowdown. Some use cases are slower (by >10%!), while some people won't feel a thing. (Then again, maybe this is offset by the LOAD_ATTR opcache in 3.10 and we get a net zero effect?). I'll submit a PR soon if the full fix misses 3.10.0.

pablogsal commented 3 years ago

> IMO, we should note in What's New that only for Windows, 3.10.0 has a slight slowdown.

I disagree. This is a regression/bug, and we don't advertise "known bugs" in What's New; the same goes for any other bugfix that has been delayed until 3.10.1.

> Some use cases are slower (by >10%!)

Can you still reproduce this with PR 28475?

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

I submitted 2 drafts in a hurry. Sorry for the short explanations. I'll add more reports.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

@pablogsal I'm OK with more effective fixes in 3.10.1 and later.

Thanks all, and thanks to kj and malin for all the help.

b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

I think this is a bug in MSVC2019, not really a regression in CPython. So changing the CPython code is just a workaround; maybe the right direction is to prompt MSVC to fix the bug, otherwise there will be more trouble when 3.11 is released a year later.

Seeing MSVC's reply, it seems they didn't realize that it was a bug, but suggested to adjust the training samples and use __forceinline. They don't know __forceinline hangs the build process since 28d28e0.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

_PyEval_EvalFrameDefault() may also need to be divided.

Fidget-Spinner commented 3 years ago

@Pablo

> I disagree. This is a regression/bug and we don't advertise "known bugs" in the what's new, the same for any other bugfix that has been delayed until 3.10.1

Alright, in hindsight 3.10 What's New was a bad suggestion on my part. I wonder if there's a better location for such news though.

> Some use cases are slower (by >10%!)
>
> Can you still reproduce this with PR 28475?

Yes that number is *with* PR28475. Without that PR it was worse. The second pyperformance comparison in this file is 3.10a7 vs PR28475 https://bugs.python.org/file50293/PR28475_vs_310rc2_vs_310a7.txt. Omitting python_startup (unstable on Windows) and unpack_sequence (microbenchmark):

logging_silent: 250 ns +- 7 ns -> 291 ns +- 10 ns: 1.16x slower
hexiom: 14.0 ms +- 0.3 ms -> 15.7 ms +- 3.0 ms: 1.12x slower
logging_simple: 16.1 us +- 0.2 us -> 18.0 us +- 0.5 us: 1.12x slower
nbody: 215 ms +- 7 ms -> 235 ms +- 4 ms: 1.09x slower
logging_format: 17.8 us +- 0.3 us -> 19.4 us +- 0.5 us: 1.09x slower
richards: 104 ms +- 6 ms -> 112 ms +- 3 ms: 1.08x slower
xml_etree_parse: 218 ms +- 3 ms -> 235 ms +- 3 ms: 1.08x slower
sqlalchemy_imperative: 34.5 ms +- 0.9 ms -> 37.1 ms +- 1.1 ms: 1.08x slower
xml_etree_iterparse: 158 ms +- 2 ms -> 168 ms +- 2 ms: 1.06x slower
pathlib: 255 ms +- 6 ms -> 271 ms +- 3 ms: 1.06x slower
pyflate: 963 ms +- 10 ms -> 1.02 sec +- 0.02 sec: 1.06x slower
unpickle_pure_python: 446 us +- 11 us -> 471 us +- 9 us: 1.06x slower
---- anything <= 1.05x slower is snipped since it could be noise -----

At this point I don't know if we have any quick fixes left. So maybe we should open another issue for 3.11 and consider factoring out uncommon opcodes into functions like Victor and Mark suggested. We could make use of the opcode stats the faster-cpython folks have collected https://github.com/faster-cpython/tools.
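As a rough illustration of "factoring out uncommon opcodes into functions", here is a minimal sketch on a toy stack machine. The opcodes, the `TOY_NOINLINE` macro and all `toy_*` names are hypothetical and are not CPython's; the sketch also skips stack bounds checks. It only shows the shape of the idea: the rare case calls a non-inlined helper so the body of the hot loop stays small.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine; not CPython's. */
enum { OP_PUSH1, OP_ADD, OP_RARE_SQUARE, OP_HALT };

/* Hypothetical portability macro for "do not inline this". */
#if defined(_MSC_VER)
#  define TOY_NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define TOY_NOINLINE __attribute__((noinline))
#else
#  define TOY_NOINLINE
#endif

/* Cold handler kept out of line so the eval loop stays small. */
static TOY_NOINLINE void
toy_rare_square(long *stack, size_t sp)
{
    long v = stack[sp - 1];
    stack[sp - 1] = v * v;
}

/* Runs `code` until OP_HALT and returns the top of the stack.
   Returns -1 on an unknown opcode. No bounds checks (sketch only). */
static long
toy_eval(const unsigned char *code)
{
    long stack[16];
    size_t sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH1:                 /* common: handled inline */
            stack[sp++] = 1;
            break;
        case OP_ADD:                   /* common: handled inline */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_RARE_SQUARE:           /* uncommon: call out of line */
            toy_rare_square(stack, sp);
            break;
        case OP_HALT:
            return stack[sp - 1];
        default:
            return -1;
        }
    }
}
```

Which opcodes count as "uncommon" would have to come from real data such as the opcode stats mentioned above; the split here is arbitrary.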

markshannon commented 3 years ago

Sadly the MSVC team are claiming that this isn't a bug in their compiler. Not sure how we convince them that it is. The website rejects any attempt to reopen the issue.

How feasible would it be to use Clang or GCC on Windows?

vstinner commented 3 years ago

> How feasible would it be to use Clang or GCC on Windows?

clang seems to have good Windows support and tries to be ABI compatible with MSC, which is a must-have to keep wheel package support (especially for the stable ABI, used by PyQt on Windows for example).

Moreover, there are ways to cross-build Python from another platform to Windows which can be convenient ;-)

I don't know the Windows ecosystem. Do people want to get VS debugger for example? Is clang compatible with the VS debugger?

See the discussion of 2014: "Status of C compilers for Python on Windows" https://mail.python.org/archives/list/python-dev@python.org/thread/SYWDJ23AQDPWQN7HD6M6YCSGXERCHWA2/

zooba commented 3 years ago

I would very much appreciate any new compiler be compatible with the standard Windows debuggers (windbg primarily, but I imagine most contributors would like it to keep working from VS).

Last I heard, clang is fine as a compiler for debugging if you use the MSVC linker to generate debug info, though it still isn't as complete as MSVC (ultimately by definition, since MSVC is the standard-by-implementation for this stuff). And I've got no idea how/whether link-time optimisation works when you mix tools, but I'd have to assume it doesn't.

Switching compiler may prevent me from being able to analyse crash reports (and by me, I mean the automated internal tools that do it for me), and certainly parts of the Windows build rely on MSVC-specific functionality right now (not in the main DLL) so we'd end up needing both for a full build.

Also, just to put it out there, I'm not volunteering to rewrite the build system :) If the steering council signs off on switching, I won't block it, but I have more interesting things to work on.

zooba commented 3 years ago

If we know which parts of the function are critical, perhaps we should be designing a PGO profile that actually hits them all? The current profile is very arbitrary, basically just waiting for someone motivated enough to figure out a better one.

b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 3 years ago

Today I tested with msvc2022-preview; the __forceinline attribute does not hang the build.

64-bit PGO builds:

28d28e0~1,vc2022   : baseline
28d28e0~1+F,vc2022 : 1.02x slower  <1>
28d28e0,vc2022     : 1.03x slower  <2>
28d28e0+F,vc2022   : 1.03x slower
3.10 final,vc2022  : 1.03x slower
3.10 final+F,vc2022: 1.03x slower
28d28e0~1,vc2019   : 1.00x slower  <3>

28d28e0~1 is the last fast commit, 28d28e0 is the first slow commit. +F means adding the __forceinline attribute to all inline functions in object.h. vc2019 and vc2022 are the latest versions.

<1> Forcing inline is slower.
<2> 28d28e0 is still slow, but not that much.
<3> Normally, msvc2019 and msvc2022 have the same performance.

Is it possible to write a PGO profile for 28d28e0? https://github.com/python/cpython/commit/28d28e053db6b69d91c2dfd579207cd8ccbc39e7

msvc2022 will be released in November this year, and maybe subsequent versions can be built with msvc2022.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

PR28475 is not in the official source archive. https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tar.xz

I'll check later whether official binary has the fix.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

3.10.0 official binary is as slow as rc2.

Many files are not updated in the source archive or b494f5935c92951e75597bfe1c8b1f3112fec270, so I'm not sure if the delay is intentional or not.

We have no choice except waiting for 3.10.1.

gvanrossum commented 3 years ago

Someone whose name I don't recognize (MagzoB Mall) just changed the issue type to "crash" without explaining why. That user was created today, and has no triage permissions. Mind if I change it back? It feels like vandalism.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 3 years ago

msg402954

https://github.com/faster-cpython/tools

According to the suggested stats and pgomgr.exe, I experimentally moved LOAD_FAST and LOAD_CONST cases out of switch as below.

        if (opcode == LOAD_FAST) {
            ...
            DISPATCH();
        }

        if (opcode == LOAD_CONST) {
            ...
            DISPATCH();
        }

        switch (opcode) {

x64 performance results after patched (msvc2019)

Good inliner ver.

3.10.0+   : 1.03x faster than before
28d28e0~1 : 1.04x faster
3.8.12    : 1.03x faster

Bad inliner ver. (too big evalfunc. Has msvc2022 increased the capacity?)

3.10.0/rc2 : 1.00x faster
3.11a1+    : 1.02x faster

It seems to me that, for quite a while now, the optimizer has been stopping at some point after successful inlining. So performance may be sensitive to code changes, and it might be possible to detect where the optimization is aborted.

(Benchmarks: switch-case_unarranged_bench.txt)
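The experiment described above (testing the hottest opcodes before entering the `switch`) can be sketched on a toy interpreter. This is only an illustration under assumptions: the `T_*` opcodes and `hot_first_eval` are hypothetical, `continue` stands in for `DISPATCH()`, and nothing here says how MSVC's optimizer would treat the real `_PyEval_EvalFrameDefault()`.

```c
#include <stddef.h>

/* Hypothetical opcodes; T_LOAD plays the role of a hot opcode. */
enum { T_LOAD, T_ADD, T_NEG, T_HALT };

/* Evaluates `code`, pushing `operand` for every T_LOAD.
   Returns the top of the stack at T_HALT, or -1 on an unknown
   opcode. No stack bounds checks (sketch only). */
static long
hot_first_eval(const unsigned char *code, long operand)
{
    long stack[16];
    size_t sp = 0;
    for (;;) {
        unsigned char opcode = *code++;

        /* Hot opcode handled before the switch. */
        if (opcode == T_LOAD) {
            stack[sp++] = operand;
            continue;               /* stands in for DISPATCH() */
        }

        switch (opcode) {
        case T_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case T_NEG:
            stack[sp - 1] = -stack[sp - 1];
            break;
        case T_HALT:
            return stack[sp - 1];
        default:
            return -1;
        }
    }
}
```

The behavior is identical to putting `T_LOAD` inside the `switch`; only the code layout the compiler sees changes, which is the variable the benchmarks above were probing.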

brandtbucher commented 2 years ago

The total size of the main interpreter loop was recently reduced somewhat by an unrelated change:

https://github.com/python/cpython/commit/9178f533ff5ea7462a2ca22cfa67afd78dad433b

I wonder if this issue still exists?

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 2 years ago

I still have the issue in current main and PR29565 with msvc2022 (v142 or v143 toolset).

brandtbucher commented 2 years ago

Hm. If removing 26 opcodes didn't fix this, then maybe the size of _PyEval_EvalFrameDefault isn't really the issue?

gvanrossum commented 2 years ago

I'd like to know how to reproduce this. @neonene can you write down the steps I should do to get the results you get? I have VS 2019, if I need VS 2022 I can install that.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 2 years ago

Here are the 3 steps to reproduce with minimal pgo training. (vs2019)

Download the source archive of PR29565 and extract. https://github.com/python/cpython/archive/6a84d61c55f2e543cf5fa84522d8781a795bba33.zip
Apply the following patch.

==============================

--- PCbuild/build.bat
+++ PCbuild/build.bat
@@ -66 +66 @@
-set pgo_job=-m test --pgo
+set pgo_job=-c"pass"
--- PCbuild/pyproject.props
+++ PCbuild/pyproject.props
@@ -47,2 +47,3 @@
       <AdditionalOptions>/utf-8 %(AdditionalOptions)</AdditionalOptions>
+      <AdditionalOptions Condition="$(SupportPGO) and $(Configuration) == 'PGUpdate'">/d2inlinelogfull:_PyEval_EvalFrameDefault %(AdditionalOptions)</AdditionalOptions>
     </ClCompile>

==============================

Build [Rebuild]

PCbuild\build --no-tkinter --pgo > build.log [-r]

According to the inlining section in the log, any function that has one or more conditional expressions is rejected by the inliner.

> Inlinee for function _PyEval_EvalFrameDefault
> -_Py_EnsureFuncTstateNotNULL (pgo hard reject)
> ...
> _Py_INCREF (pgu decision)
> _Py_INCREF (pgu decision)
> -_Py_XDECREF (pgo hard reject)
> -_Py_XDECREF (pgo hard reject)
> -_Py_DECREF (pgo hard reject)
> -_Py_DECREF (pgo hard reject)
> ...

Profiling scores can be shown on VS2019 Command Prompt.

pgomgr PCbuild\amd64\python311.pgd /summary [/detail] > largefile.txt

pgomgr.exe (or profile itself) has an issue. https://developercommunity.visualstudio.com/t/1560909

Unused opcodes in this training

ROT_THREE, DUP_TOP_TWO, UNARY_POSITIVE, UNARY_NEGATIVE, BINARY_OP_ADD_FLOAT, UNARY_INVERT, BINARY_OP_MULTIPLY_INT, BINARY_OP_MULTIPLY_FLOAT, GET_LEN, MATCH_MAPPING, MATCH_SEQUENCE, MATCH_KEYS, LOAD_ATTR_SLOT, LOAD_METHOD_CLASS, GET_AITER, GET_ANEXT, BEFORE_ASYNC_WITH, END_ASYNC_FOR, STORE_ATTR_SLOT, STORE_ATTR_WITH_HINT, GET_YIELD_FROM_ITER, PRINT_EXPR, YIELD_FROM, GET_AWAITABLE, LOAD_ASSERTION_ERROR, SETUP_ANNOTATIONS, UNPACK_EX, DELETE_ATTR, DELETE_GLOBAL, ROT_N, COPY, DELETE_DEREF, LOAD_CLASSDEREF, MATCH_CLASS, SET_UPDATE, DO_TRACING

I managed to activate the inliner experimentally by removing the 36 op-cases from the switch and merging/removing many macros.

Static instruction counts of _PyEval_EvalFrameDefault()

PR29565 : 6882 (down to 4400 with above change)

PR29482 : 7035
PR29482~1 : 7742
3.10.0+ : 3980 (well inlined sharing DISPATCH macro)
3.10.0 : 5559
3.10b1 : 5680
3.10a7 : 4117 (well inlined)

zooba commented 2 years ago

> -set pgo_job=-m test --pgo
> +set pgo_job=-c"pass"

This essentially disables PGO. You won't get anything valid or useful from analysing its results if you don't give it a somewhat reasonable profile (preferably one that exercises the interpreter loop, which "pass" does not).

gvanrossum commented 2 years ago

@neonene what's the importance of PR29565?

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 2 years ago

> This essentially disables PGO.

Thank you for the suggestion. I'll take another experimental approach to reduce the size of the 3.11 evalfunc for stronger validation.

> @neonene what's the importance of PR29565?

While we are talking about function size, I would like to stay around PR29565 for consistent reporting. I think any commit is okay to reproduce the issue.

And please ignore the patch to build.bat.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 2 years ago

In the eval-loop of PR29565, inlining seems to be enabled within about 70 op-branches, trained with 44 tests.

log & source: ceval_PR29565_split_func.c (not for performance)

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 2 years ago

I requested the MSVC team to reconsider the inlining issues, including __forceinline. https://developercommunity.visualstudio.com/t/1595341

The hang at link time caused by __forceinline can be avoided by completing the _Py_DECREF optimization outside _PyEval_EvalFrameDefault:

static inline void         // no __forceinline
_Py_DECREF_impl(...) {
    ...
}
static __forceinline void
_Py_DECREF(...) {          // no conditional branch in the function
    _Py_DECREF_impl(...);    
}

In _PyEval_EvalFrameDefault, wrapping the callees as above seems better for performance than just specifying __forceinline under the current MSVC.
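For readers who want to try the wrapper pattern outside CPython, here is a self-contained sketch with a toy reference-counted object. Everything named `toy_*` or `TOY_*` is hypothetical; the `TOY_FORCEINLINE` macro maps to `__forceinline` on MSVC and to `always_inline` on GCC/Clang as an assumed rough equivalent. It shows only the source-level structure (a branch-free force-inlined wrapper around an ordinary inline function that holds the branch), not the effect on the MSVC linker.

```c
#include <stddef.h>

/* Hypothetical portability macro for forced inlining. */
#if defined(_MSC_VER)
#  define TOY_FORCEINLINE __forceinline
#elif defined(__GNUC__)
#  define TOY_FORCEINLINE inline __attribute__((always_inline))
#else
#  define TOY_FORCEINLINE inline
#endif

/* Toy object with a reference count and a deallocator. */
typedef struct toy_obj {
    long refcnt;
    void (*dealloc)(struct toy_obj *);
} toy_obj;

/* Counter and deallocator used only to observe the behavior. */
static int toy_dealloc_calls = 0;

static void
toy_count_dealloc(toy_obj *op)
{
    (void)op;
    toy_dealloc_calls++;
}

/* Ordinary inline function: contains the conditional branch. */
static inline void
toy_decref_impl(toy_obj *op)
{
    if (--op->refcnt == 0) {
        op->dealloc(op);
    }
}

/* Force-inlined wrapper: no conditional branch of its own. */
static TOY_FORCEINLINE void
toy_decref(toy_obj *op)
{
    toy_decref_impl(op);
}
```

A caller would use `toy_decref(&obj)` everywhere; whether `toy_decref_impl()` is then inlined is left to the compiler's own heuristics.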

gvanrossum commented 2 years ago

I can't yet confirm a regression in 3.11 (the main branch, currently) compared to 3.10. See my adventures in https://github.com/faster-cpython/ideas/discussions/315.

gvanrossum commented 2 years ago

> -_Py_DECREF (pgo hard reject)

What exactly does "pgo hard reject" mean? I Googled it and found no hits besides this very issue.

I am trying to redefine the top three from this error log as macros, but since I still don't have stable benchmark results it's hard to know if this has any effect.

23c8a93b-de57-46a3-b68d-6c6a493f0c9f commented 2 years ago

> What exactly does "pgo hard reject" mean?

As I understand it, "pgo hard reject" is based on the PGOptimizer's heuristic, while "reject" is related to the probe count (hot/cold).

https://developercommunity.visualstudio.com/t/1531987#T-N1535774

And there was a reply from MSVC team, closing the issue. MSVC won't be fixed in the near future.

https://developercommunity.visualstudio.com/t/1595341#T-N1695626

From the reply and my investigation, 3.11 would need the following:

Some call sites, such as calls through tp_* pointers, should not have their fast paths inlined in the eval switch-case, because they often conflict. Each pointer needs to be wrapped in a function, or maybe _PyEval_EvalFrameDefault needs to be enclosed in an "inline_depth(0)" pragma.
__assume(0) should be replaced with other function, inside the eval switch-case or in the inlined paths of callees. This is critical with PGO.
For inlining, use __forceinline / macro / const function pointer.

MSVC's stuck can be avoided in many ways, when force-inlining in the evalloop a ton of Py_DECREF()s, unless tp_dealloc does not create a inlined callsite:

     void
     _Py_Dealloc(PyObject *op)
     {
      ...
     #pragma inline_depth(0) // effects from here, PGO accepts only 0.
         (*dealloc)(op);     // conflicts when inlined.
     }
     #pragma inline_depth()  // can be reset only outside the func.

Virtual Call Speculation: https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-170#optimizations-performed-by-pgo
The profiler runs under /GENPROFILE:PATH option, but at the big ceval-func, the optimizer merges the profiles into one like /GENPROFILE:NOPATH mode. https://docs.microsoft.com/en-us/cpp/build/reference/genprofile-fastgenprofile-generate-profiling-instrumented-build?view=msvc-170#arguments
__assume(0) (Py_UNREACHABLE): https://devblogs.microsoft.com/cppblog/visual-studio-2017-throughput-improvements-and-advice/#remove-usages-of-__assume

zooba commented 2 years ago

> __assume(0) should be replaced with other function, inside the eval switch-case or in the inlined paths of callees. This is critical with PGO.

Out of interest, have you done other experiments confirming this? The reference linked is talking about compiler throughput (i.e. how long it takes to compile), and while it hints that using __assume(0) may interfere with other optimisations, that isn't supported with any detail or analysis in the post.

neonene commented 2 years ago

> have you done other experiments confirming this?

My benchmark results are left in https://github.com/faster-cpython/ideas/issues/321#issuecomment-1094129130.

__assume(0) is problematic only where the substitute function is inlined.

Correction of my previous post:

> MSVC's stuck can be avoided in many ways, ... unless tp_dealloc does not create a inlined callsite

-unless
+if

gvanrossum commented 2 years ago

For Py_UNREACHABLE(), then, maybe we should just remove these two lines from pymacro.h?

#elif defined(_MSC_VER)
#  define Py_UNREACHABLE() __assume(0)

Then the code will fall back to

#else
#  define Py_UNREACHABLE() \
    Py_FatalError("Unreachable C code path reached")
#endif
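To experiment with the two definitions side by side without touching pymacro.h, a standalone sketch like the following could be used. It is not CPython code: `toy_fatal()` stands in for `Py_FatalError()`, `TOY_UNREACHABLE()` for `Py_UNREACHABLE()`, and `TOY_USE_ASSUME` is a hypothetical build switch invented for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for Py_FatalError(); hypothetical name. */
static void
toy_fatal(const char *msg)
{
    fprintf(stderr, "Fatal error: %s\n", msg);
    abort();
}

/* TOY_USE_ASSUME is a hypothetical switch for this sketch only.
   Without it, every compiler takes the fatal-error fallback. */
#if defined(TOY_USE_ASSUME) && defined(_MSC_VER)
#  define TOY_UNREACHABLE() __assume(0)
#else
#  define TOY_UNREACHABLE() toy_fatal("Unreachable C code path reached")
#endif

enum toy_op { TOY_LT, TOY_EQ, TOY_GT };

/* Returns 1 if `a op b` holds, else 0. The default case is the
   kind of place where an unreachable marker would appear. */
static int
toy_richcompare(long a, long b, enum toy_op op)
{
    switch (op) {
    case TOY_LT: return a < b;
    case TOY_EQ: return a == b;
    case TOY_GT: return a > b;
    default:
        TOY_UNREACHABLE();
    }
    return 0;  /* keeps compilers quiet on the fallback path */
}
```

Building once with `TOY_USE_ASSUME` defined and once without gives two variants of the same function to compare, in the spirit of the change proposed above.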

gvanrossum commented 2 years ago

> __assume(0) is problematic only where the substitute function is inlined.

Can you elaborate? What is the "substitute function"? The macro definition is

#  define Py_UNREACHABLE() __assume(0)

so there is no inlined function. Are you referring to the code containing the call to Py_UNREACHABLE()? That wouldn't affect the ceval.c main loop in _PyEval_EvalFrameDefault because that function is definitely too large to be inlined. :-)

What am I missing?

neonene commented 2 years ago

Sorry for the lack of explanation.

I encountered a measurable slowdown several months ago when the Py_RETURN_RICHCOMPARE macro was inlined in the eval-loop. However, that may be x86 only.

If I understand correctly, x86 official binaries are non-PGO builds. Then, a Py_FatalError() only for TARGET(CACHE) branch would be enough for now.

When I change the current version as below:

Substitute void Py_UNREACHABLE(void) {} or Py_FatalError() for __assume(0) in pymacro.h

Make PyObject_RichCompare() called through a function pointer, adding this above _PyEval_EvalFrameDefault().

static const richcmpfunc PyObject_RichCompare_PTR = PyObject_RichCompare;
#define PyObject_RichCompare PyObject_RichCompare_PTR

Then, PGO decides to inline PyObject_RichCompare(), based on its profile. This seems to affect "Function Layout optimization" even if it is not inlined. (under verification)

    PyObject_RichCompare (pgu decision)
      _PyThreadState_GET (pgu decision)
        _PyRuntimeState_GetThreadState (pgu decision)
      _PyErr_Occurred (pgu decision)
      -_PyErr_BadInternalCall (pgo hard reject)
      _Py_EnterRecursiveCall (pgu decision)
        _Py_MakeRecCheck (pgu decision)
        -_Py_CheckRecursiveCall (pgo hard reject)
      do_richcompare (pgu decision)
        PyType_IsSubtype (pgu decision)
          -type_is_subtype_base_chain (pgo hard reject)
        Py_DECREF (pgu decision)
          _Py_Dealloc (pgu decision)
        long_richcompare (pgu decision)
          ...
>>>>>     Py_UNREACHABLE (pgu decision)  // or _Py_FatalErrorFunc (pgo hard reject)
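The function-pointer redirection used for PyObject_RichCompare() above can be reproduced in isolation. The sketch below uses hypothetical `toy_*` names and shows only the source-level mechanics: after the `#define`, later calls written as `toy_compare(...)` go through a `const` function pointer. Whether a given optimizer then chooses to inline the target is compiler-specific and is not demonstrated here.

```c
/* Function type for the toy comparison; hypothetical. */
typedef int (*toy_cmpfunc)(int, int);

/* Three-way compare: returns -1, 0 or 1. */
static int
toy_compare(int a, int b)
{
    return (a > b) - (a < b);
}

/* Route later calls through a const function pointer. */
static const toy_cmpfunc toy_compare_PTR = toy_compare;
#define toy_compare toy_compare_PTR

/* From here on, toy_compare(...) expands to a call through
   toy_compare_PTR. */
static int
toy_max(int a, int b)
{
    return toy_compare(a, b) >= 0 ? a : b;
}
```

The pointer must be initialized before the `#define`, since afterwards the name `toy_compare` no longer refers to the function directly.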

As for other places (_Py_FatalErrorFunc() never gets inlined anywhere):

    dict_get (pgu decision)
      -_PyArg_CheckPositional (pgo hard reject)
      dict_get_impl (pgu decision)
        unicode_get_hash (pgu decision)
        PyObject_Hash (pgu decision)
          _Py_HashPointer (pgu decision)
            _Py_HashPointerRaw (pgu decision)
          -PyType_Ready (pgo hard reject)
          -PyObject_HashNotImplemented (pgo hard reject)
        _Py_dict_lookup (pgu decision)
          -unicodekeys_lookup_unicode (pgo hard reject)
          -unicodekeys_lookup_generic (pgo hard reject)
          dictkeys_generic_lookup (pgu decision)
            dictkeys_get_index (pgu decision)
            Py_INCREF (pgu decision)
            -PyObject_RichCompareBool (pgo hard reject)
            Py_DECREF (pgu decision)
              _Py_Dealloc (pgu decision)
>>>>>       -Py_UNREACHABLE (pgo hard reject)  // (no harm)

    Py_DECREF (force inline)
      -_Py_Dealloc (initial scan: soft depth exceeded)
    -_Py_Specialize_BinaryOp (pgo hard reject)
>>>  Py_UNREACHABLE (pgu decision)            // @TARGET(CACHE) inlined
    _PyFrame_SetStackPointer (pgu decision)
    -trace_function_entry (pgo hard reject)
    _PyFrame_GetStackPointer (pgu decision)
    PyDTrace_FUNCTION_ENTRY_ENABLED (pgu decision)
    -dtrace_function_entry (pgo hard reject)
    PyDTrace_LINE_ENABLED (pgu decision)
    -maybe_dtrace_line (pgo hard reject)
    _PyFrame_SetStackPointer (pgu decision)
    -maybe_call_line_trace (pgo hard reject)
    _PyFrame_GetStackPointer (pgu decision)
    -_PyInterpreterFrame_GetLine (pgo hard reject)
    -fprintf (initial scan: parameter mismatch, varargs, not eligible)
    -_PyErr_SetString (pgo hard reject)
>>> -Py_UNREACHABLE (pgo hard reject)         // out of switch (no harm)

Benchmark after removal of TARGET(CACHE) branch:

Py_UNREACHABLE at long_richcompare()	x64 PGO	x86 PGO
__assume(0)	1.00	1.00
Py_FatalError	1.02x *slower*	1.03x~ faster
void foo(void) {}	1.02x *slower*	1.04x~ faster

__assume(0) works well in the hot section on x64.

EDIT: The gap on x86 can be increased depending on the amount of optimization.

``` >pyperf compare_to assume64 fatalerr64 Slower (25): - dulwich_log: 195 ms +- 3 ms -> 231 ms +- 50 ms: 1.19x slower - xml_etree_process: 98.3 ms +- 1.7 ms -> 109 ms +- 1 ms: 1.11x slower - pickle_pure_python: 540 us +- 7 us -> 592 us +- 41 us: 1.10x slower - logging_silent: 163 ns +- 1 ns -> 177 ns +- 3 ns: 1.09x slower - xml_etree_generate: 138 ms +- 4 ms -> 148 ms +- 3 ms: 1.07x slower - unpickle_pure_python: 348 us +- 9 us -> 371 us +- 5 us: 1.07x slower - meteor_contest: 152 ms +- 1 ms -> 162 ms +- 2 ms: 1.06x slower - hexiom: 10.8 ms +- 0.1 ms -> 11.4 ms +- 0.2 ms: 1.06x slower - deltablue: 6.78 ms +- 0.23 ms -> 7.14 ms +- 0.09 ms: 1.05x slower - scimark_fft: 491 ms +- 13 ms -> 517 ms +- 12 ms: 1.05x slower - fannkuch: 627 ms +- 14 ms -> 660 ms +- 20 ms: 1.05x slower - crypto_pyaes: 127 ms +- 3 ms -> 134 ms +- 4 ms: 1.05x slower - sqlalchemy_imperative: 44.8 ms +- 2.3 ms -> 47.1 ms +- 5.3 ms: 1.05x slower - nqueens: 164 ms +- 1 ms -> 172 ms +- 3 ms: 1.05x slower - float: 138 ms +- 2 ms -> 144 ms +- 3 ms: 1.05x slower - logging_format: 20.4 us +- 0.4 us -> 21.3 us +- 0.9 us: 1.04x slower - nbody: 179 ms +- 6 ms -> 185 ms +- 7 ms: 1.04x slower - django_template: 85.4 ms +- 1.1 ms -> 88.6 ms +- 2.3 ms: 1.04x slower - go: 240 ms +- 4 ms -> 249 ms +- 7 ms: 1.03x slower - regex_compile: 231 ms +- 2 ms -> 238 ms +- 3 ms: 1.03x slower - json_dumps: 21.1 ms +- 0.2 ms -> 21.6 ms +- 0.2 ms: 1.02x slower - logging_simple: 18.9 us +- 0.3 us -> 19.3 us +- 0.4 us: 1.02x slower - sqlalchemy_declarative: 277 ms +- 8 ms -> 283 ms +- 8 ms: 1.02x slower - scimark_monte_carlo: 104 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x slower - pathlib: 183 ms +- 12 ms -> 187 ms +- 7 ms: 1.02x slower Faster (6): - pickle_dict: 44.7 us +- 0.5 us -> 41.9 us +- 0.9 us: 1.07x faster - scimark_sor: 215 ms +- 4 ms -> 203 ms +- 3 ms: 1.06x faster - telco: 11.3 ms +- 0.1 ms -> 10.8 ms +- 0.1 ms: 1.04x faster - unpickle: 23.0 us +- 0.3 us -> 22.2 us +- 0.2 us: 1.04x faster - pickle_list: 6.75 us +- 0.17 us 
-> 6.60 us +- 0.19 us: 1.02x faster - python_startup: 13.8 ms +- 1.2 ms -> 13.5 ms +- 0.1 ms: 1.02x faster Benchmark hidden because not significant (27): 2to3, chameleon, chaos, json_loads, mako, pickle, pid igits, pyflate, python_startup_no_site, raytrace, regex_dna, regex_effbot, regex_v8, richards, scima rk_lu, scimark_sparse_mat_mult, spectral_norm, sqlite_synth, sympy_expand, sympy_integrate, sympy_su m, sympy_str, tornado_http, unpack_sequence, unpickle_list, xml_etree_parse, xml_etree_iterparse Geometric mean: 1.02x slower ``` ``` >pyperf compare_to assume64 emptyfunc64 Slower (23): - nbody: 179 ms +- 6 ms -> 191 ms +- 4 ms: 1.07x slower - xml_etree_process: 98.3 ms +- 1.7 ms -> 105 ms +- 1 ms: 1.07x slower - meteor_contest: 152 ms +- 1 ms -> 163 ms +- 7 ms: 1.07x slower - scimark_fft: 491 ms +- 13 ms -> 523 ms +- 11 ms: 1.07x slower - xml_etree_generate: 138 ms +- 4 ms -> 146 ms +- 3 ms: 1.06x slower - fannkuch: 627 ms +- 14 ms -> 662 ms +- 15 ms: 1.06x slower - spectral_norm: 179 ms +- 2 ms -> 189 ms +- 3 ms: 1.05x slower - unpickle_pure_python: 348 us +- 9 us -> 366 us +- 5 us: 1.05x slower - telco: 11.3 ms +- 0.1 ms -> 11.8 ms +- 0.2 ms: 1.05x slower - scimark_lu: 153 ms +- 3 ms -> 160 ms +- 2 ms: 1.04x slower - django_template: 85.4 ms +- 1.1 ms -> 89.0 ms +- 1.2 ms: 1.04x slower - python_startup: 13.8 ms +- 1.2 ms -> 14.3 ms +- 1.8 ms: 1.04x slower - nqueens: 164 ms +- 1 ms -> 170 ms +- 4 ms: 1.04x slower - json_dumps: 21.1 ms +- 0.2 ms -> 21.9 ms +- 0.2 ms: 1.04x slower - deltablue: 6.78 ms +- 0.23 ms -> 7.01 ms +- 0.10 ms: 1.03x slower - crypto_pyaes: 127 ms +- 3 ms -> 131 ms +- 3 ms: 1.03x slower - unpack_sequence: 69.1 ns +- 3.0 ns -> 71.3 ns +- 6.4 ns: 1.03x slower - hexiom: 10.8 ms +- 0.1 ms -> 11.1 ms +- 0.2 ms: 1.03x slower - sympy_sum: 299 ms +- 4 ms -> 308 ms +- 6 ms: 1.03x slower - chameleon: 14.7 ms +- 0.2 ms -> 15.1 ms +- 0.8 ms: 1.03x slower - sqlalchemy_declarative: 277 ms +- 8 ms -> 284 ms +- 12 ms: 1.02x slower - raytrace: 543 ms +- 6 
ms -> 556 ms +- 8 ms: 1.02x slower
- scimark_sparse_mat_mult: 6.86 ms +- 0.35 ms -> 7.00 ms +- 0.30 ms: 1.02x slower

Faster (4):
- scimark_sor: 215 ms +- 4 ms -> 202 ms +- 2 ms: 1.06x faster
- sqlite_synth: 5.44 us +- 0.13 us -> 5.21 us +- 0.02 us: 1.05x faster
- pickle_list: 6.75 us +- 0.17 us -> 6.53 us +- 0.05 us: 1.03x faster
- pickle_dict: 44.7 us +- 0.5 us -> 43.7 us +- 1.5 us: 1.02x faster

Benchmark hidden because not significant (31): 2to3, chaos, dulwich_log, float, go, json_loads, logging_format, logging_silent, logging_simple, mako, pathlib, pickle, pickle_pure_python, pidigits, pyflate, python_startup_no_site, regex_compile, regex_dna, regex_effbot, regex_v8, richards, scimark_monte_carlo, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.02x slower
```

```
>pyperf compare_to assume86 fatalerr86
Slower (6):
- sqlalchemy_imperative: 40.2 ms +- 2.2 ms -> 42.6 ms +- 2.9 ms: 1.06x slower
- unpickle_list: 8.10 us +- 0.10 us -> 8.47 us +- 0.18 us: 1.05x slower
- pickle: 21.1 us +- 0.6 us -> 21.8 us +- 0.2 us: 1.04x slower
- scimark_monte_carlo: 148 ms +- 5 ms -> 153 ms +- 5 ms: 1.03x slower
- sqlalchemy_declarative: 261 ms +- 5 ms -> 269 ms +- 14 ms: 1.03x slower
- sqlite_synth: 5.57 us +- 0.03 us -> 5.71 us +- 0.03 us: 1.03x slower

Faster (28):
- unpack_sequence: 126 ns +- 11 ns -> 95.3 ns +- 5.7 ns: 1.32x faster
- logging_silent: 225 ns +- 5 ns -> 191 ns +- 10 ns: 1.18x faster
- spectral_norm: 279 ms +- 6 ms -> 244 ms +- 4 ms: 1.14x faster
- hexiom: 14.3 ms +- 0.1 ms -> 12.6 ms +- 0.1 ms: 1.14x faster
- scimark_lu: 230 ms +- 6 ms -> 205 ms +- 5 ms: 1.12x faster
- richards: 101 ms +- 1 ms -> 90.4 ms +- 4.0 ms: 1.12x faster
- scimark_sparse_mat_mult: 10.7 ms +- 0.3 ms -> 9.64 ms +- 0.27 ms: 1.11x faster
- deltablue: 8.37 ms +- 0.10 ms -> 7.65 ms +- 0.31 ms: 1.09x faster
- xml_etree_process: 129 ms +- 1 ms -> 120 ms +- 2 ms: 1.07x faster
- scimark_sor: 275 ms +- 6 ms -> 257 ms +- 11 ms: 1.07x faster
- float: 175 ms +- 2 ms -> 164 ms +- 2 ms: 1.07x faster
- go: 308 ms +- 7 ms -> 288 ms +- 9 ms: 1.07x faster
- xml_etree_generate: 176 ms +- 2 ms -> 165 ms +- 4 ms: 1.06x faster
- pickle_pure_python: 652 us +- 11 us -> 621 us +- 16 us: 1.05x faster
- raytrace: 695 ms +- 5 ms -> 663 ms +- 6 ms: 1.05x faster
- pathlib: 189 ms +- 6 ms -> 180 ms +- 5 ms: 1.04x faster
- pyflate: 963 ms +- 21 ms -> 923 ms +- 27 ms: 1.04x faster
- nbody: 247 ms +- 8 ms -> 237 ms +- 7 ms: 1.04x faster
- xml_etree_parse: 272 ms +- 4 ms -> 264 ms +- 7 ms: 1.03x faster
- regex_compile: 272 ms +- 3 ms -> 263 ms +- 2 ms: 1.03x faster
- mako: 22.0 ms +- 0.3 ms -> 21.3 ms +- 0.1 ms: 1.03x faster
- chameleon: 19.9 ms +- 0.2 ms -> 19.4 ms +- 0.4 ms: 1.03x faster
- chaos: 174 ms +- 2 ms -> 170 ms +- 2 ms: 1.03x faster
- xml_etree_iterparse: 175 ms +- 3 ms -> 170 ms +- 4 ms: 1.03x faster
- nqueens: 203 ms +- 1 ms -> 198 ms +- 2 ms: 1.03x faster
- unpickle_pure_python: 456 us +- 6 us -> 445 us +- 16 us: 1.02x faster
- unpickle: 25.9 us +- 0.4 us -> 25.3 us +- 0.4 us: 1.02x faster
- json_loads: 47.1 us +- 0.4 us -> 46.2 us +- 0.5 us: 1.02x faster

Benchmark hidden because not significant (24): 2to3, crypto_pyaes, django_template, dulwich_log, fannkuch, json_dumps, logging_format, logging_simple, meteor_contest, pickle_dict, pickle_list, pidigits, python_startup, python_startup_no_site, regex_dna, regex_effbot, regex_v8, scimark_fft, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http

Geometric mean: 1.03x faster
```

```
>pyperf compare_to assume86 emptyfunc86
Slower (3):
- pickle: 21.1 us +- 0.6 us -> 22.2 us +- 0.5 us: 1.05x slower
- python_startup: 13.2 ms +- 0.2 ms -> 13.8 ms +- 2.5 ms: 1.04x slower
- regex_dna: 243 ms +- 1 ms -> 250 ms +- 6 ms: 1.03x slower

Faster (38):
- unpack_sequence: 126 ns +- 11 ns -> 94.1 ns +- 5.3 ns: 1.34x faster
- hexiom: 14.3 ms +- 0.1 ms -> 12.5 ms +- 0.2 ms: 1.15x faster
- deltablue: 8.37 ms +- 0.10 ms -> 7.43 ms +- 0.08 ms: 1.13x faster
- spectral_norm: 279 ms +- 6 ms -> 249 ms +- 2 ms: 1.12x faster
- logging_silent: 225 ns +- 5 ns -> 202 ns +- 7 ns: 1.11x faster
- xml_etree_process: 129 ms +- 1 ms -> 118 ms +- 1 ms: 1.09x faster
- unpickle_pure_python: 456 us +- 6 us -> 418 us +- 12 us: 1.09x faster
- nqueens: 203 ms +- 1 ms -> 187 ms +- 2 ms: 1.09x faster
- scimark_sparse_mat_mult: 10.7 ms +- 0.3 ms -> 9.85 ms +- 0.33 ms: 1.09x faster
- scimark_lu: 230 ms +- 6 ms -> 213 ms +- 4 ms: 1.08x faster
- xml_etree_generate: 176 ms +- 2 ms -> 163 ms +- 3 ms: 1.08x faster
- richards: 101 ms +- 1 ms -> 93.7 ms +- 3.9 ms: 1.08x faster
- float: 175 ms +- 2 ms -> 163 ms +- 3 ms: 1.08x faster
- scimark_sor: 275 ms +- 6 ms -> 257 ms +- 6 ms: 1.07x faster
- unpickle_list: 8.10 us +- 0.10 us -> 7.60 us +- 0.23 us: 1.07x faster
- scimark_monte_carlo: 148 ms +- 5 ms -> 139 ms +- 5 ms: 1.06x faster
- go: 308 ms +- 7 ms -> 290 ms +- 8 ms: 1.06x faster
- raytrace: 695 ms +- 5 ms -> 661 ms +- 7 ms: 1.05x faster
- crypto_pyaes: 175 ms +- 1 ms -> 166 ms +- 2 ms: 1.05x faster
- nbody: 247 ms +- 8 ms -> 235 ms +- 8 ms: 1.05x faster
- pyflate: 963 ms +- 21 ms -> 917 ms +- 15 ms: 1.05x faster
- xml_etree_iterparse: 175 ms +- 3 ms -> 167 ms +- 6 ms: 1.04x faster
- meteor_contest: 178 ms +- 2 ms -> 171 ms +- 5 ms: 1.04x faster
- sympy_sum: 298 ms +- 11 ms -> 287 ms +- 2 ms: 1.04x faster
- pickle_pure_python: 652 us +- 11 us -> 628 us +- 16 us: 1.04x faster
- scimark_fft: 724 ms +- 10 ms -> 697 ms +- 15 ms: 1.04x faster
- regex_compile: 272 ms +- 3 ms -> 263 ms +- 7 ms: 1.03x faster
- unpickle: 25.9 us +- 0.4 us -> 25.1 us +- 0.3 us: 1.03x faster
- chameleon: 19.9 ms +- 0.2 ms -> 19.3 ms +- 0.4 ms: 1.03x faster
- sympy_expand: 912 ms +- 21 ms -> 885 ms +- 7 ms: 1.03x faster
- mako: 22.0 ms +- 0.3 ms -> 21.4 ms +- 0.2 ms: 1.03x faster
- chaos: 174 ms +- 2 ms -> 169 ms +- 2 ms: 1.03x faster
- django_template: 93.7 ms +- 1.6 ms -> 91.1 ms +- 1.4 ms: 1.03x faster
- fannkuch: 907 ms +- 19 ms -> 885 ms +- 12 ms: 1.02x faster
- sympy_str: 542 ms +- 4 ms -> 529 ms +- 10 ms: 1.02x faster
- xml_etree_parse: 272 ms +- 4 ms -> 266 ms +- 9 ms: 1.02x faster
- json_dumps: 23.8 ms +- 0.2 ms -> 23.3 ms +- 0.3 ms: 1.02x faster
- regex_v8: 38.7 ms +- 0.4 ms -> 38.0 ms +- 0.4 ms: 1.02x faster

Benchmark hidden because not significant (17): 2to3, dulwich_log, json_loads, logging_format, logging_simple, pathlib, pickle_dict, pickle_list, pidigits, python_startup_no_site, regex_effbot, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_integrate, telco, tornado_http

Geometric mean: 1.04x faster
```
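
For readers less familiar with these reports, here is a minimal sketch of how a per-benchmark "Nx faster/slower" figure and a geometric mean relate to the raw means. It uses four rounded means copied from the `assume86` vs `emptyfunc86` comparison above. The helper names (`ratio`, `describe`, `geometric_mean`) are illustrative and not part of pyperf, and because the inputs are rounded and cover only a subset of the suite, the result is not expected to reproduce the reported overall figure.

```python
import math

# (benchmark, baseline mean, changed mean), same unit within each row.
# Rounded values copied from the assume86 -> emptyfunc86 comparison.
SAMPLES = [
    ("unpack_sequence", 126.0, 94.1),  # ns
    ("deltablue", 8.37, 7.43),         # ms
    ("pickle", 21.1, 22.2),            # us
    ("regex_dna", 243.0, 250.0),       # ms
]


def ratio(baseline, changed):
    """Return changed/baseline; below 1.0 means the changed build is faster."""
    return changed / baseline


def describe(r):
    """Format a ratio in the report's style, e.g. '1.34x faster'."""
    if r < 1.0:
        return f"{1.0 / r:.2f}x faster"
    return f"{r:.2f}x slower"


def geometric_mean(ratios):
    """Geometric mean, computed as the exponential of the mean logarithm."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


ratios = [ratio(base, changed) for _, base, changed in SAMPLES]
for (name, _, _), r in zip(SAMPLES, ratios):
    print(f"{name}: {describe(r)}")
print("Geometric mean of this subset:", describe(geometric_mean(ratios)))
```

A geometric mean is used rather than an arithmetic one because the inputs are ratios: a 2x speedup and a 2x slowdown then cancel out to 1.0.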

neonene commented 2 years ago

Are you referring to the code containing the call to Py_UNREACHABLE()? That wouldn't affect the ceval.c main loop in _PyEval_EvalFrameDefault because that function is definitely too large to be inlined. :-)

Here is MSVC's inlining decision on current Python3.10: https://bugs.python.org/file50291/PR28475_inline.log

Weird, but faster than when only tiny functions are inlined. In the log, Py_DECREF(static) is expanded until Py_Dealloc(extern) stops its recursion. That looks too expansive to me.

gvanrossum commented 2 years ago

It looks like we may be looking at different builds? I'm only looking at x64 builds for 3.11.

If I understand correctly, x86 official binaries are non-PGO builds.

@zooba Is that so?

zooba commented 2 years ago

If I understand correctly, x86 official binaries are non-PGO builds.

Yeah, this is correct. We're more likely to deprecate and drop the 32-bit binaries before we make any major effort to optimise them - they run under an emulation layer in the OS (practically all supported OS installs are 64-bit native), so aren't really going to be recommended for people who care about performance anyway.

neonene commented 2 years ago

I think this issue can be closed. (I can't after migration)

Most of my experiences are invalid after Guido's #91718 corrected the quirks of MSVC. Another reasonable fix would be a good test which makes specialized sections hotter.

Thanks.

Fidget-Spinner commented 2 years ago

Closing as requested by OP. Thanks for your investigations @neonene ! Thanks to Guido too for the fix.

gvanrossum commented 2 years ago

Thank you @neonene for your gentle pushes and encouragement and help to get this fixed!

vstinner commented 2 years ago

@neonene:

Most of my experiences are invalid after Guido's https://github.com/python/cpython/pull/91718 corrected the quirks of MSVC.

Do you mean that this merged change https://github.com/python/cpython/commit/2f233fceae9a0c5e66e439bc0169b36547ba47c3 is now useless?

gvanrossum commented 2 years ago

No, they are complementary.

neonene commented 2 years ago

Do you mean that this merged change 2f233fc is now useless?

No. What I said is about the optimization, not the (force) inlining. And what I suggested before has already been fixed by f8dc618 (and 2f233fc):

tp_* or cfunc pointer in the eval-loop can inline multiple callees without conflict.

Moving LOAD_FAST out of switch according to the scores below has no advantage now.

TOP3 entries with current 44 tests
case 124  132522464  // LOAD_FAST
case 100   48956231  // LOAD_CONST
case  45   48318813  // LOAD_FAST__LOAD_FAST
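
As a rough illustration of how a ranking like the one above can be read, the sketch below sorts a table of per-opcode execution counters and prints each entry's share. Only the three counts listed above are used, so the shares are relative to those three entries and not to the full profile. The `counts` dict and the `top_entries` helper are hypothetical names for this example.

```python
# Execution counts copied from the TOP3 list above, keyed by opcode name.
counts = {
    "LOAD_CONST": 48956231,
    "LOAD_FAST__LOAD_FAST": 48318813,
    "LOAD_FAST": 132522464,
}


def top_entries(counters, n=3):
    """Return the n most-executed (name, count) pairs, hottest first."""
    ranked = sorted(counters.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


total = sum(counts.values())
for name, count in top_entries(counts):
    share = count / total
    print(f"{name:<22} {count:>11}  {share:.1%} of the listed entries")
```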

vstinner commented 2 years ago

What I understand is that the PGO build of Python 3.11 on Windows will be faster thanks to these changes, and that the Windows python.org binaries only use PGO for 64-bit, not for 32-bit.

neonene commented 2 years ago

You could read a few more of the posts and links, since you have changed this thread's title several times.

vstinner commented 2 years ago

Can someone please try to write a summary of this long and complex issue? It seems like different but related topics have been discussed and it's hard to get an overview. I'm confused because sometimes someone said that a change fixed the issue and then wrote that no, it doesn't really fix it.

gvanrossum commented 2 years ago

Let me give it a quick try.

Originally, @neonene observed a Windows-specific performance regression in 3.10 between the a7 and b1 releases. This was eventually shown to be caused by the function _PyEval_EvalFrameDefault getting so long that the MSVC LTO gave up on inlining many things there. IIUC in 3.10 this was eventually fixed by making the function a bit smaller (https://github.com/python/cpython/pull/28475).
Of course, the same issue was then observed in the main branch (3.11). We then went back and forth trying various approaches to fix it. This wasn't easy because (a) the code kept changing (because the "Faster CPython" team was very active -- mostly growing the function), and (b) we had no good hardware or strategy to run reliable benchmarks. The latter problem spawned https://github.com/faster-cpython/ideas/issues/321.
Eventually I settled on a fix which consisted of turning a few inline functions back into macros, but only in ceval.c, and not in debug mode (and one only for MSVC). This was PR https://github.com/python/cpython/issues/89279, commit https://github.com/python/cpython/commit/2f233fceae9a0c5e66e439bc0169b36547ba47c3.
Somewhat relatedly, I also figured out how to get MSVC to generate slightly faster switch code: if you switch on a one-byte value and all 256 cases exist, it skips a memory load. This was PR https://github.com/python/cpython/issues/91719, commit https://github.com/python/cpython/commit/f8dc6186d1857a19edd182277a9d78e6d6cc3787.
Finally, I figured out how to get stable benchmark numbers (see https://github.com/faster-cpython/ideas/issues/321#issuecomment-1107072776 and following comments) and showed that the macrofied inline functions gave us 10% performance back and the improved switch code gave 3%.

That's it.

vstinner commented 2 years ago

Thanks for the summary. I would add that marking performance-critical functions with __forceinline (Py_ALWAYS_INLINE) was tested, but it didn't work.
