Migrate decimal to use PEP 567 context variables

1st1 commented 6 years ago

BPO	32630
Nosy	@gvanrossum, @gpshead, @mdickinson, @vstinner, @ned-deily, @methane, @skrah, @1st1
PRs	python/cpython#5278
Files	xwith.py: small benchmark xwith2.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/skrah' closed_at = created_at = labels = ['3.7', 'type-feature', 'library'] title = 'Migrate decimal to use PEP 567 context variables' updated_at = user = 'https://github.com/1st1' ``` bugs.python.org fields: ```python activity = actor = 'gregory.p.smith' assignee = 'skrah' closed = True closed_date = closer = 'yselivanov' components = ['Library (Lib)'] creation = creator = 'yselivanov' dependencies = [] files = ['47410', '47411'] hgrepos = [] issue_num = 32630 keywords = ['patch'] message_count = 32.0 messages = ['310472', '310482', '310630', '310698', '310699', '310700', '310701', '310704', '310714', '310715', '310716', '310738', '310790', '310791', '310794', '310795', '310796', '310799', '310800', '310803', '310805', '310806', '310809', '310811', '310815', '310817', '310820', '310821', '310864', '310875', '310876', '363091'] nosy_count = 8.0 nosy_names = ['gvanrossum', 'gregory.p.smith', 'mark.dickinson', 'vstinner', 'ned.deily', 'methane', 'skrah', 'yselivanov'] pr_nums = ['5278'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue32630' versions = ['Python 3.7'] ```

1st1 commented 6 years ago

PEP-567 allows decimal to be safely used in async/await code.

I couldn't observe any performance impact by the proposed PR. The PR doesn't modify decimal context behaviour: instead of using a thread-local storage it now uses a context variable.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

I'll take a look.

1st1 commented 6 years ago

Stefan, it would be great to have this committed before 3.7 feature freeze.

The change is pretty straightforward -- we replaced threading.local() with a contextvar, which should be a backwards compatible change.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

I realize that you had to fight massive mailing list distractions during the PEP discussions, but this is very close to the beta ...

Let's start here:

>>> from decimal import *
==18887== Invalid read of size 8
==18887==    at 0x5324E0: contextvar_new (context.c:744)
==18887==    by 0x53141A: PyContextVar_New (context.c:137)
==18887==    by 0xFED052B: PyInit__decimal (_decimal.c:5542)
==18887==    by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
==18887==    by 0x51F29F: _imp_create_dynamic_impl (import.c:2145)
==18887==    by 0x51A4BA: _imp_create_dynamic (import.c.h:289)
==18887==    by 0x43257A: _PyMethodDef_RawFastCallDict (call.c:530)
==18887==    by 0x432710: _PyCFunction_FastCallDict (call.c:582)
==18887==    by 0x432DD6: PyCFunction_Call (call.c:787)
==18887==    by 0x4FAA44: do_call_core (ceval.c:4659)
==18887==    by 0x4F58CC: _PyEval_EvalFrameDefault (ceval.c:3232)
==18887==    by 0x4E7F99: PyEval_EvalFrameEx (ceval.c:545)
==18887==  Address 0xcf589a8 is 8 bytes before a block of size 64 alloc'd
==18887==    at 0x4C2A9A1: malloc (vg_replace_malloc.c:299)
==18887==    by 0x470498: _PyMem_RawMalloc (obmalloc.c:75)
==18887==    by 0x470FFC: PyMem_RawMalloc (obmalloc.c:503)
==18887==    by 0x471DEF: _PyObject_Malloc (obmalloc.c:1560)
==18887==    by 0x471312: PyObject_Malloc (obmalloc.c:616)
==18887==    by 0x4A35D6: PyUnicode_New (unicodeobject.c:1293)
==18887==    by 0x4CA16B: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13423)
==18887==    by 0x4B1843: PyUnicode_DecodeUTF8Stateful (unicodeobject.c:4806)
==18887==    by 0x4A5E67: PyUnicode_FromString (unicodeobject.c:2105)
==18887==    by 0x5313F5: PyContextVar_New (context.c:133)
==18887==    by 0xFED052B: PyInit__decimal (_decimal.c:5542)
==18887==    by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
==18887==

1st1 commented 6 years ago

I realize that you had to fight massive mailing list distractions during the PEP discussions, but this is very close to the beta ...

Oh thanks, but I see no reason for you to be condescending here.

I cannot reproduce this on Mac OS / Linux. Are you sure you've built your Python correctly? Can you run 'make distclean; ./configure --with-pydebug; make -j4'?

1st1 commented 6 years ago

(Just in case I rebased my patch onto the latest master)

1st1 commented 6 years ago

I think I found what cause this, but I have no idea why it has surfaced only now :)

https://github.com/python/cpython/pull/5326/files (see the added PyType_IS_GC(Py_TYPE(name)) check)

I'll merge that PR and rebase the decimal patch again.

1st1 commented 6 years ago

I pushed a fix (already in the master branch) and rebased the patch once again. I expect it to work now :)

1st1 commented 6 years ago

Stefan, I do think that this is a release blocker. We want to get this change as early as possible to ensure that it's well tested.

AFAIK Guido also wants decimal to be updated and well supported in async/await code.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

Sure, and *I* am the one running the extended decimal test suite as we speak, not Guido.

You are playing power games here, and you did that from the start by choosing the nosy list.

1st1 commented 6 years ago

Sure, and *I* am the one running the extended decimal test suite as we speak, not Guido.

Thank you.

You are playing power games here, and you did that from the start by choosing the nosy list.

Please. I thought it was pretty much decided that we will update decimal if there is no significant performance degradation, so there's no need for a conspiracy. I put Guido to the nosy-list not because I want to force something, but just because we've discussed decimal and PEP 567/550 with him numerous times.

gvanrossum commented 6 years ago

You guys both need to calm down.

Stefan, what's your objection against this, assuming the crash is fixed?

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

Tests -----

I ran some of my own tests (not even close to all), they seem fine.

However, I could not find any tests for the added feature (safe use with async) though. We would be adding a new feature without tests.

Performance -----------

I'm getting a large slowdown:

./python Modules/_decimal/tests/bench.py

bench.py

patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199] status-quo: [0.187, 0.184, 0.185, 0.183, 0.184, 0.188, 0.184, 0.183, 0.183, 0.185]

slowdown: > 10%

xwith.py --------

patched: [0.535, 0.541, 0.523] status-quo: [0.412, 0.393, 0.375]

slowdown: > 30%

Given the performance issues I'm -1 for adding the feature at this point.

1st1 commented 6 years ago

However, I could not find any tests for the added feature (safe use with async) though. We would be adding a new feature without tests.

This is no problem, I can add a few async/await tests.

I'm getting a large slowdown: ./python Modules/_decimal/tests/bench.py [..] patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199] status-quo: [0.187, 0.184, 0.185, 0.183, 0.184, 0.188, 0.184, 0.183, 0.183, 0.185]

I'd like you to elaborate a bit more here. First, bench.py produces a completely different output from what you've quoted. How exactly did you compile these results? Are those numbers results of Pi calculation or factorial? Can you upload the actual script you used here (if there's one)?

Second, here's my run of bench.py with contextvars and without: https://gist.github.com/1st1/1187fc58dfdef86e3cad8874e0894938

I don't see any difference, left alone 10% slowdown.

xwith.py --------

patched: [0.535, 0.541, 0.523] status-quo: [0.412, 0.393, 0.375]

This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with them.

I've optimized PEP-567 for contextvar.get() operation, not contextvar.set (it's hard to make hamt.set() as fast as dict.set()). That way, if you have an some decimal code that performs actual calculations with decimal objects, the operation of looking up the current context is cheap.

It's hard to imagine a situation, where a real decimal-related code just creates decimal contexts and does nothing else with them.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

On Fri, Jan 26, 2018 at 09:06:38PM +0000, Yury Selivanov wrote:

This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with the

It is not constructed at all. It was the first thing I wrote down trying to play a bit with speed.

Even the telco benchmark (where there's a lot of other stuff going on) slows down by around 7-8%.

I did not hunt for these benchmarks. They are the first things I tried out. I cannot believe that you never saw a slowdown as claimed in your OP.

gvanrossum commented 6 years ago

Guys. Please stop with the editorializing. "I cannot believe ..." (used essentially by both of you) is not constructive.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

Guido, I have the feeling that the feature -- about which I was actually positive in the first place -- is being pushed aggressively with no respect for the module author.

BTW, prec is changed quite frequently in decimal code, so if people want infix notation they also have to use many with-statements.

gvanrossum commented 6 years ago

Stefan, I don't think a module author should retain veto over everything affecting their code forever. (We've had spectacular process failures with this in the past.) Please take a deep breath and patiently answer Yury's questions. If you two can't agree on this, the status quo wins, but it will be a blemish on your reputation if you just block it unilaterally. At the very least help Yury reproduce your timing results -- at this point the burden is on you since nobody else can reproduce them.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

I have run about 1000 times more decimal benchmarks than both Yury and you. You attempt to hurt my reputation is laughable.

Show me some top-performance code that you have written.

1st1 commented 6 years ago

Sorry Stefan, I never wanted this to look like "I'm pushing this without listening to Stefan". I apologize if it looked that way.

I ran bm_telco on my machine before submitting the PR, and I indeed did not see any performance impact. I'll try again. I also have a idea of a micro-optimization that might make it a tiny bit faster.

gvanrossum commented 6 years ago

Stefan this is unacceptable abuse. Please read the code of conduct.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

Yury, would you be willing to work this out by email? -- I think it was you who I discussed the context-subclassing with and that was quite a pleasant experience.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

Guido, I apologize for the outburst. I had the impression that msg310799 implicitly asserted my incompetence in benchmarking.

308d1678-b02d-445e-978b-e5e183845395 commented 6 years ago

FWIW, I ran bm_telco with pyperformance on a benchmark-tuned system and did not observe the slowdown. Benchmarks were done on a release build (--enable-optimizations)

$ sudo (which python3) -m perf system tune

MASTER:

$ pyperformance run --python=envs/3.7-master-pgo-lto/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-master.json
Python benchmark suite 0.6.1

[1/1] telco... INFO:root:Running /home/elvis/dev/python/performance/venv/cpython3.7-8cfe759dd297/bin/python -u /home/elvis/dev/python/performance/performance/benchmarks/bm_telco.py --rigorous --affinity=2,3 --output /tmp/tmpgszxc792 ......................................... telco: Mean +- std dev: 9.17 ms +- 0.32 ms

MASTER + contextvars patch:

$ pyperformance run --python=envs/3.7-master-pgo+lto+decimal-contextvars/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-contextvars.json
Python benchmark suite 0.6.1

[1/1] telco... INFO:root:Running /home/elvis/dev/python/performance/venv/cpython3.7-8a6fbdee5a5b/bin/python -u /home/elvis/dev/python/performance/performance/benchmarks/bm_telco.py --rigorous --affinity=2,3 --output /tmp/tmp8y4mivnp ......................................... telco: Mean +- std dev: 9.29 ms +- 0.19 ms

COMPARISON:

### telco ### Mean +- std dev: 9.17 ms +- 0.32 ms -> 9.29 ms +- 0.19 ms: 1.01x slower Not significant

308d1678-b02d-445e-978b-e5e183845395 commented 6 years ago

Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00

gvanrossum commented 6 years ago

Apologies accepted. I did not imply that -- I was simply stating that Yury needed your help reproducing your result so he could do something about it. It seems you two are taking this offline so I trust that there will be no more barbs.

vstinner commented 6 years ago

Since the root of the discussion is a performance regression, let me take a look since I also care of not regressing in term of performance. We (CPython core developers, as as team) spent a lot of time on optimizing CPython to make benchmarks like telco faster at each release. The good news is that Python 3.7 *is* faster than Python 3.6 on telco. If I recall correctly, it's not because of recent optimizations in the decimal module, but more general changes like CALL_METHOD optimization!

Python master vs 3.6 (normalized on 3.6):

https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C12%2BL%2B3.6&ben=670&env=1%2C2&hor=false&bas=12%2BL%2B3.6&chart=normal+bars

Graph of telco performance on master since April 2014 to January 2017: https://speed.python.org/timeline/#/?exe=12&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on

20.2 ms => 14.1 ms, well done!

If you are curious of reasons why Python became faster, see my documentation: http://pyperformance.readthedocs.io/cpython_results_2017.html

Or even my talk at Pycon 2017: https://www.youtube.com/watch?v=d65dCD3VH9Q&t=957s

Sorry, I moved off topic. Let's move back to this measuring the performance of this issue...

--

I rewrote xwith.py using my perf module to use CPU pinning (on my isolated CPUs), automatic calibration of the number of loops, ignore the first "warmup" value, spawn 20 processes, compute the average and standard deviation, etc. => see attached xwidth2.py

Results on my laptop with 2 physical cores isolated for best benchmark stability (*):

vstinner@apu$ ./python -m perf compare_to master.json pr5278.json Mean +- std dev: [master] 1.86 us +- 0.03 us -> [pr5278] 2.27 us +- 0.04 us: 1.22x slower (+22%)

Note: master is the commit 29a7df78277447cf6b898dfa0b1b42f8da7abc0c and I rebased PR 5278 on top on this commit.

(*) http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

This is obvious the *worst case: a *micro benchmark using local contexts and modifying this local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modying a context.

The question here is if the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.).

Except for a few small projects, I rarely use decimal, so I'm unable to judge that.

But just to add my 2 cents, I never used "with localcontext()", I don't see the point of this tool in my short applications. I prefer to modify directly the current context (getcontext()), and only modify this context *once*, at startup. For example, set the rounding mode and set the precision, and that's all.

--

The Python benchmark suite does have a benchmark dedicated to the decimal module: http://pyperformance.readthedocs.io/benchmarks.html#telco

I ran this benchmark on PR 5278:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json Benchmark hidden because not significant (1): telco

... not significant. Honestly, I'm not surprised at all:

telco benchmark doesn't modify the context in the hot code, only *outside* the benchmark
telco likely spends most of its runtime in computing numbers (sum += x; d = d.quantize(...), etc.) and converting Decimal to string. I don't know the decimal module, but I guess that it requires to *read* the current context. But getting the context is likely efficient and not significant compared to the cost of other operations.

FYI timings can be seen in verbose mode:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v Mean +- std dev: [telco_master] 10.7 ms +- 0.4 ms -> [telco_pr5278] 10.7 ms +- 0.4 ms: 1.00x faster (-0%) Not significant!

vstinner commented 6 years ago

Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00

Note: it may be interesting to rewrite this benchmark my perf module to be able to easily check if a benchmark result is significant.

http://perf.readthedocs.io/en/latest/cli.html#perf-compare-to

"perf determines whether two samples differ significantly using a Student’s two-sample, two-tailed t-test with alpha equals to 0.95."

=> https://en.wikipedia.org/wiki/Student's_t-test

Usually, I consider that between 5% slower and 5% faster is not significant. But it depends how the benchmark was run, it depends on the type of benchmark, etc. Here I don't know bench.py so I cannot judge.

For example, for an optimization, I'm more interested by an optimization making a benchmark 10% faster ;-)

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

On Fri, Jan 26, 2018 at 11:11:00PM +0000, STINNER Victor wrote:

vstinner@apu$ ./python -m perf compare_to master.json pr5278.json Mean +- std dev: [master] 1.86 us +- 0.03 us -> [pr5278] 2.27 us +- 0.04 us: 1.22x slower (+22%)

Note: master is the commit 29a7df78277447cf6b898dfa0b1b42f8da7abc0c and I rebased PR 5278 on top on this commit.

Thank you and Elvis for running the benchmarks. Yes, the exact version does seem important -- I have been getting some differences based on the checkout.

This is obvious the *worst case: a *micro benchmark using local contexts and modifying this local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modying a context.

The question here is if the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.).

Yes, that's the big question. In the generator discussions people were advised to use "with" whenever possible, so I assume short blocks *will* be used.

I would use the context functions, which would not require PEP-567 at all. This means that I'm somewhat okay with excessive with-statements being a bit slower.

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v Mean +- std dev: [telco_master] 10.7 ms +- 0.4 ms -> [telco_pr5278] 10.7 ms +- 0.4 ms: 1.00x faster (-0%) Not significant!

Okay. In my above reference to telco, I ran the "telco.py full" command from http://www.bytereef.org/mpdecimal/quickstart.html#telco-benchmark .

The numbers I posted weren't cooked, but I have a hard time reproducing them myself now consistently with the latest revisions, so let's declare telco.py and bench.py a tie.

This means that I no longer have any objections, so Yury, please go ahead and merge the PR!

Stefan Krah

1st1 commented 6 years ago

Thank you, Stefan. I've updated the PR with an asyncio+decimal test and run tests in refleak mode to make sure there's no regression there.

If during the beta/rc period we see that contextvars isn't stable enough or something I'll revert this change before 3.7.0 myself, so that decimal users will not be disturbed.

I'll merge the PR once the CI is green.

Yes, that's the big question. In the generator discussions people were advised to use "with" whenever possible, so I assume short blocks *will* be used.

Yes, I used decimal examples all the time to showcase how context is supposed to work with generators. Most of those examples were specifically constructed to illustrate some point, but I don't think that real-world code uses a 'with localcontext()' statement in every function.

Unfortunately there's no way (at least known to me) to make 'ContextVar.set()' faster than it is now. I use HAMT which guarantees that all set operations will have O(log n) performance; the other known approach is to use copy-on-write (as in .NET), but that has an O(n) ContextVar.set() performance. So I guess a slightly slower 'with localcontext()' is the price to pay to make decimal easier to use for async/await code.

1st1 commented 6 years ago

New changeset f13f12d8daa587b5fcc66fe3ed1090a5dadab289 by Yury Selivanov in branch 'master': bpo-32630: Use contextvars in decimal (GH-5278) https://github.com/python/cpython/commit/f13f12d8daa587b5fcc66fe3ed1090a5dadab289

gpshead commented 4 years ago

FYI - this appears to have caused a regression - https://bugs.python.org/issue39776

python / cpython

Migrate decimal to use PEP 567 context variables #76811