python / cpython

The Python programming language
https://www.python.org
Other
63.71k stars 30.52k forks source link

decimal: Use FASTCALL and/or Argument Clinic #73487

Closed vstinner closed 6 years ago

vstinner commented 7 years ago
BPO 29301
Nosy @rhettinger, @vstinner, @skrah, @serhiy-storchaka, @animalize
Files
  • decimal.patch
  • decimal-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/skrah' closed_at = created_at = labels = ['3.7', 'performance'] title = 'decimal: Use FASTCALL and/or Argument Clinic' updated_at = user = 'https://github.com/vstinner' ``` bugs.python.org fields: ```python activity = actor = 'vstinner' assignee = 'skrah' closed = True closed_date = closer = 'vstinner' components = [] creation = creator = 'vstinner' dependencies = [] files = ['46320', '46435'] hgrepos = [] issue_num = 29301 keywords = ['patch'] message_count = 23.0 messages = ['285665', '285670', '285671', '285673', '285680', '286359', '286368', '286369', '286371', '286372', '286374', '286375', '286448', '287131', '287133', '287149', '297135', '297153', '318123', '338430', '338462', '338513', '338516'] nosy_count = 5.0 nosy_names = ['rhettinger', 'vstinner', 'skrah', 'serhiy.storchaka', 'malin'] pr_nums = [] priority = 'normal' resolution = 'out of date' stage = 'resolved' status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue29301' versions = ['Python 3.7'] ```

    vstinner commented 7 years ago

    I'm currently working on the isuse bpo-29259: "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects". I used bm_telco of the performance benchmark suite to check which functions still require to create a temporary tuple for positional arguments and a temporary dict for keyword arguments. I found 3 remaining functions which have an impact on the result of the benchmark:

    I would like to know if Stephan would be ok to modify the _decimal module to use FASTCALL. I know that recently he reverted changes to keep the same code base on Python 3.5, 3.6 and 3.7.

    With 4 changes (tp_fastcall bpo-29259, print bpo-29296, unpack bpo-29300 and this issue), bm_telco becomes 22% faster which is not negligible!

    20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms

    Attached decimal.patch patch is the strict minimum change to optimize bm_telco, but I would prefer to change all _decimal functions and methods using METH_VARARGS and METH_VARARGS|METH_KEYWORDS to convert them to METH_FASTCALL.

    The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different.

    vstinner commented 7 years ago

    Oh wait, I'm not sure that attached patch has a significant impact on performances. It seems like the speedup mostly comes from the print patch: http://bugs.python.org/issue29296#msg285668

    But well, it is still interesting to use METH_FASTCALL in decimal ;-)

    serhiy-storchaka commented 7 years ago

    See msg207652 in bpo-20177.

    rhettinger commented 7 years ago

    This should not go in without Stephan Krah's approval. This code is finely tuned and carefully arranged.

    vstinner commented 7 years ago

    Raymond Hettinger: "This should not go in without Stephan Krah's approval. This code is finely tuned and carefully arrang"

    Sure, that's why I opened this issue.

    Serhiy Storchaka: "See msg207652 in bpo-20177."

    Oh, I didn't know that Stefan Krah already gave his opinion.

    Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.

    vstinner commented 7 years ago

    Hello, I'm back! :-)

    I almost abandonned my the tp_fastcall change (bpo-29259).

    print() now uses FASTCALL (bpo-29296), it already made bm_telco faster: https://speed.python.org/timeline/#/?exe=5&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on (see the speedup around January 16 at the right)

    For the unpack module, I'm still working on my patch in the issue bpo-29300, but it doesn't seem to have a significant impact on bm_telco.

    So the remaining question is the usage of FASTCALL or Argument Clinic in the _decimal module.

    Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.

    I ran a new benchmark and FASTCALL makes bm_telco benchmark 1.11x faster:

    Median +- std dev: [ref] 19.4 ms +- 0.7 ms -> [decimal-2.patch] 17.5 ms +- 0.6 ms: 1.11x faster (-10%)

    So I decided to reopen the issue.

    Stefan: what do you think of using Argument Clinic and/or FASTCALL in _decimal? Is "bm_telco 1.11x faster" significant enough for you to justify such change?

    "The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different."

    serhiy-storchaka commented 7 years ago

    I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable. I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.

    vstinner commented 7 years ago

    Oops, I forgot to attach decimal-2.patch, here you have.

    I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable.

    I wrote decimal-2.patch in a few minutes, when I was trying to find all functions still calling _PyStack_AsTuple() in my tp_fast{new,init,call} fork. I just wrote it to identify which parts of CPython can be optimized to make bm_telco faster.

    I agree that Argument Clinic should be preferred over using directly METH_FASTCALL, especially for the _decimal module: Stefan already wrote that he wants to have the same code base on Python 3.5, 3.6 and 3.7.

    I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.

    What is the status of Argument Clinic? Is it something "public" or not?

    The status of _decimal is also unclear to me. Is it part of CPython or not? :-) I know that there is also a "backported" module for Python 2:

    https://pypi.python.org/pypi/cdecimal http://www.bytereef.org/mpdecimal/index.html

    IMHO it's a great tool. Maybe it's time to make it usable outside CPython in Python 3.7? Or maybe we should wait until the remaining missing features are implemented? Issues bpo-20291 and bpo-29299 for example.

    I'm now waiting for Stefan's feedback ;-)

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 7 years ago

    I'll take a look when I have the opportunity.

    AC will not happen: It makes the module too large and unreadable.

    serhiy-storchaka commented 7 years ago

    What is the status of Argument Clinic? Is it something "public" or not?

    No, it is for for internal CPython use only. It lacks some features (support of var-positional and var-keyword parameters, optional parameters without default value), the syntax of positional-only parameters is not officially accepted still, and future optimizations can require incompatible changes.

    Only when all CPython builtins and extensions be converted to Argument Clinic, PEP-457 (or an alternative) be accepted, Argument Clinic issues be resolved, we could say it stable enough. For now even Argument Clinic tests are broken.

    vstinner commented 7 years ago

    Stefan Krah:

    AC will not happen: It makes the module too large and unreadable.

    Ah you dislike the additional [clinic input] sections?

    It's kind of strange when you have to convert existing code, when once the code uses AC, I prefer AC to separated documentation variables. It helps to keep docstrings more up to date, and it helps to enhance the API (ex: allow keywords, rename parameters to better names, etc.). It also helps to make the documentation closer to the code, which is IMHO a good thing :-)

    IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are "unreadable", and I'm happy to be able to hide them!

    FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with _PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This object only decides keyword names once and is more efficient to parse arguments. It explains partially the speedup. Only partially because bm_telco only calls the .quantize() method, and it only uses positional arguments (no keyword arguments) ;-)

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 7 years ago

    STINNER Victor added the comment: > AC will not happen: It makes the module too large and unreadable.

    Ah you dislike the additional [clinic input] sections?

    Yes, they tear apart the code. I stopped reading many C files because of this. Brett asked why several people don't review, this is actually *one* of the reasons for me.

    It's kind of strange when you have to convert existing code, when once the code uses AC, I prefer AC to separated documentation variables. It helps to keep docstrings more up to date, and it helps to enhance the API (ex: allow keywords, rename parameters to better names, etc.). It also helps to make the documentation closer to the code, which is IMHO a good thing :-)

    Apparently it works for several people, but not me.

    IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are "unreadable", and I'm happy to be able to hide them!

    FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with _PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This object only decides keyword names once and is more efficient to parse arguments. It explains partially the speedup. Only partially because bm_telco only calls the .quantize() method, and it only uses positional arguments (no keyword arguments) ;-)

    Okay, thanks!

    rhettinger commented 7 years ago

    > Ah you dislike the additional [clinic input] sections?

    Yes, they tear apart the code. I stopped reading many C files because of this.

    I concur with Stefan. AC is very impactful on modules, greatly affecting their readability and maintainability. The existing PyArg_ParseXXX() calls and their "kwlist" static variable are very easy to work with, easy to modify, and easy to teach (I cover them effortlessly in the Python classes I teach). In contrast, AC is much harder to learn, harder to get right, doesn't fit some existing APIs, harder to change, and it greatly expands the number of lines of code.

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 7 years ago

    If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.

    vstinner commented 7 years ago

    If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.

    I do not understand why Serhiy keeps repeating that the API is going to change, I have no such plan. IMHO the FASTCALL API is now very well defined: (PyObject **args, Py_ssize_t nargs, PyObject *kwnames), and helper functions are now well tested.

    We might optimize Argument Clinic further, but for decimal, it seems like it's a no-no and that direct usage of METH_FASTCALL is preferred.

    So no, I don't plan or expect any API change.

    vstinner commented 7 years ago

    Oh ok, it seems like Serhiy wants to change METH_FASTCALL. I didn't know :-) He just opened the issue bpo-29464: "Specialize FASTCALL for functions with positional-only parameters". Interesting idea.

    vstinner commented 7 years ago

    So Stefan: what do you want to do? Use FASTCALL or not? :-)

    I would prefer to not wait until the beta if you are ok to use FASTCALL.

    Note: it seems like bpo-29464 is going to be approved ;-)

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 7 years ago

    I think I'll wait until bpo-29464 is committed and the API is considered frozen (see msg295176?).

    vstinner commented 6 years ago

    Sorry, I lost interest in this optimization. If someone wants to work on it, please open a new issue.

    b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 5 years ago

    AC will not happen: It makes the module too large and unreadable.

    AC has great performance improvements these days: bpo-35582 and bpo-36127 It's quite worthy of reconsidering this decision.

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 5 years ago

    Thanks, but it is still not going to happen. Look at the increased code size in e.g. blake2s_impl.c.h.

    I want to know what is going on in the code. Also, the performance improvements are in argument parsing, but not when you have numerical code like a * b, where a and b are already decimals.

    vstinner commented 5 years ago

    Also, the performance improvements are in argument parsing, but not when you have numerical code like a * b, where a and b are already decimals.

    If the function call takes around 100 ns, the benefit of FASTCALL and the new more efficient function to parse arguments becomes quite significant. Some Decimal methods are that fast if not faster.

    python3 -m perf timeit \ -s 'import decimal; a=decimal.Decimal(1); b=decimal.Decimal(2); ctx=decimal.getcontext()' \ 'ctx.copy_sign(a, b)' \ --duplicate=1024

    => Mean +- std dev: [ref] 148 ns +- 1 ns -> [fastcall] 98.9 ns +- 4.9 ns: 1.49x faster (-33%)

    ./python -m perf timeit \ -s 'import decimal; a=decimal.Decimal(1); b=decimal.Decimal(2)' \ 'a.copy_sign(b)' \ --duplicate=1024

    => Mean +- std dev: [ref] 123 ns +- 5 ns -> [fastcall] 86.5 ns +- 1.4 ns: 1.42x faster (-29%)

    I wrote a quick & dirty change using directly METH_FASTCALL with _PyArg_ParseStackAndKeywords() (dec_mpd_qcopy_sign) and _PyArg_CheckPositional() (ctx_mpd_qcopy_sign). Using Argument Clinic may be more verbose, but it generates *even more* efficient code for functions accepting keyword arguments (like Decimal.copy_sign) and generate better docstring (at least, the signature for function parameters if no text is given).

    Obviously, if your application only uses large numbers, the benefit will be invisible. But I guess that some people use decimal with "small" numbers ;-)

    Note: Sure, you're right that operators like "a b" wouldn't benefit from FASTCALL since they don't parse explicitly arguments. BINARY_MULTIPLY instruction gets directly 2 values from the Python stack and pass them to PyNumber_Multiply() with is defined to always take exactly 2 PyObject in C.

    vstinner commented 5 years ago

    Hum, after reading again my previous, I'm not sure that my intent was clear. I'm fine with Stefan rejecting the optimization. He is the maintainer of decimal. I just wanted to comment what he said ;-)