python / cpython

The Python programming language
https://www.python.org

Fixing Copy on Writes from reference counting and immortal objects #84436

Closed 476181bd-17c7-4f5b-a731-677932cb4f0c closed 7 months ago

476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago
BPO 40255
Nosy @tim-one, @nascheme, @gpshead, @pitrou, @vstinner, @carljm, @DinoV, @methane, @markshannon, @ericsnowcurrently, @zooba, @corona10, @pablogsal, @eduardo-elizondo, @remilapeyre, @shihai1991
PRs
  • python/cpython#19474
  • python/cpython#24828
  • python/cpython#31488
  • python/cpython#31489
  • python/cpython#31490
  • python/cpython#31491
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields: ```python assignee = None labels = ['interpreter-core', 'type-feature', '3.10'] title = 'Fixing Copy on Writes from reference counting and immortal objects' user = 'https://github.com/eduardo-elizondo' ``` bugs.python.org fields: ```python assignee = 'none' components = ['Interpreter Core'] creator = 'eelizondo' issue_num = 40255 keywords = ['patch'] message_count = 69 nosy_count = 16 pr_nums = ['19474', '24828', '31488', '31489', '31490', '31491'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' type = 'enhancement' url = 'https://bugs.python.org/issue40255' versions = ['Python 3.10'] ```


    zooba commented 4 years ago

    > But micro-benchmarks may tell you things which are not true in the real world (for example an easily-predicted branch)

    This is true, though this branch should be more easily predictable because INCREF/DECREF are inlined. If there were only one function doing the branch it'd be different.

    Wouldn't even surprise me if PGO made different optimisations to this in certain functions. Anything that normally operates on a code object, for example, is almost always going to take the branch when everything is imported early and before immortalisation.

    We also have the real world app that is Instagram. If they say that this has shown an improvement in their app, I'm inclined to believe that it's not reliant on a microbenchmark (other controlled circumstances, sure, but not a microbenchmark ;) ).

    pitrou commented 4 years ago

    > We also have the real world app that is Instagram.

    I don't think Instagram is a single app. What is Python used for at Instagram?

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 4 years ago

    > We also have the real world app that is Instagram.

    > I don't think Instagram is a single app. What is Python used for at Instagram?

    According to their engineering blog (https://instagram-engineering.com/static-analysis-at-scale-an-instagram-story-8f498ab71a0c):

    > Our server app is a monolith, one big codebase of several million lines and a few thousand Django endpoints [1], all loaded up and served together. A few services have been split out of the monolith, but we don’t have any plans to aggressively break it up.

    pitrou commented 4 years ago

    Wow. I see, thank you.

    markshannon commented 4 years ago

    On 20/04/2020 2:33 pm, Steve Dower wrote:


    >> I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write (a guaranteed L1 hit)

    > If that were true then Spectre and Meltdown wouldn't have been so interesting :)

    I really don't follow your logic here.

    > Pipelining processors are going to speculatively execute both paths, and will skip the write much more quickly than by doing it, and meanwhile nobody should have tried to read the value so it hasn't had to block for that path. I'm not aware of any that detect no-op writes and skip synchronising across cores - the dirty bit of the cache line is just set unconditionally.

    Processors only execute the one path, speculatively or otherwise; that's why branch prediction is so important. (And it is important for the Spectre attack: if what you said were true, the Spectre attack wouldn't need to pre-bias the branch predictor.) I'm assuming a modern Intel processor.

    > Benchmarking already showed that the branching version is faster. It's possible that "refcount += (refcount & IMMORTAL) ? 0 : 1" could generate different code (should be mov,test,lea,cmovz rather than mov,and,add,mov or mov,and,jz,add,mov), but it's totally reasonable for a branch to be faster than unconditionally modifying memory.

    It hasn't been shown that it is faster. I'm not claiming that it isn't, just that I haven't seen the evidence.



    methane commented 3 years ago

    I'm a big -1 too, but I am interested in the Instagram usage.

    I have two ideas to reduce CoW:

    In the long term, a multi-process-friendly pyc format could let processes share more memory with each other, even when they are not created by prefork.
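For context, CPython already ships one CoW mitigation in this family: gc.freeze() (added in Python 3.7, motivated by exactly this prefork workload) moves all currently tracked objects into a permanent generation that the collector never scans, so a collection in a forked child does not dirty the shared pages holding those objects. A rough sketch of how a prefork server might use it; the `prefork_worker` helper is hypothetical, and this only addresses CoW traffic from the collector, not from the Py_INCREF/Py_DECREF refcount writes this issue is about:

```python
import gc
import os

def prefork_worker(target):
    """Hypothetical prefork helper: freeze startup objects before forking.

    gc.freeze() moves every currently tracked object into a permanent
    generation the collector never scans, so a later collection in the
    forked child does not touch (and copy) the pages holding them.
    """
    gc.disable()   # no collections while warming up shared state
    gc.freeze()    # exempt all startup objects from future GC passes
    pid = os.fork()
    if pid == 0:                 # child process
        gc.enable()              # objects created after fork are still collected
        target()
        os._exit(0)              # skip interpreter teardown in the worker
    return pid
```

Note that even with freezing, ordinary refcount updates on the frozen objects still dirty their pages; that is the gap immortal objects aim to close.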

    vstinner commented 3 years ago

    Fast shutdown option

    You can use os._exit(0).

    methane commented 3 years ago

    > Fast shutdown option

    You can use os._exit(0).

    Yes. Instagram uses it via atexit.register(os._exit, 0).

    https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

    I think this hack could be supported by the multiprocessing module in some way.
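The hack from the linked blog post can be sketched as below; the `install_fast_shutdown` wrapper name is made up for illustration:

```python
import atexit
import os

def install_fast_shutdown():
    """Skip interpreter finalization entirely at process exit.

    os._exit() terminates the process without running the usual teardown,
    so shutdown never walks (and refcount-writes) the whole object graph,
    which would otherwise dirty copy-on-write pages shared with the parent
    of a prefork worker.  Caveats: buffered I/O is not flushed and __del__
    finalizers do not run, so flush anything important beforehand.
    """
    atexit.register(os._exit, 0)
```

Because atexit handlers run in LIFO order, registering this early means it fires after any handlers registered later, so normal cleanup callbacks still get a chance to run.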

    ericsnowcurrently commented 3 years ago

    FYI, I've just put up a similar PR [1] to Eddie's, with a focus on object immortality rather than immutability. However, if Py_IMMORTAL_CONST_REFCOUNTS is defined then it should have the behavior Eddie is after.

    [1] https://github.com/python/cpython/pull/24828

    ericsnowcurrently commented 3 years ago

    Note that my PR does make the builtin types immortal (at least those initialized in _PyTypes_Init()). However, I did not make the singletons or other candidate objects immortal in the PR. That can be done separately.

    ericsnowcurrently commented 3 years ago

    While the eventual solution may overlap, my interests are divergent enough from those here that I've created a separate issue: bpo-43503.

    gpshead commented 2 years ago

    [data] I finally dug up the old YouTube doc at work with their findings. Mostly just posting this here for future public reference if anyone wants. Nothing surprising.

    When YouTube experimented with a modified 2.7 adding "eternal refcounts" in 2015, they saw a 3-5% CPU performance regression. (Not the 10% I had in mind.)

    Their version of this simply set a high bit on the refcount as the indicator and added the obvious conditional into the Py_INCREF/Py_DECREF macros.

    Unsurprisingly, this is in line with what others have since found. For their preforked server, with their choice of what to mark eternal before forking, it saved them 10% RAM (fewer copy-on-write faults). The RAM savings vs. CPU and maintenance cost tradeoff wound up not being worthwhile for them, though. Their motivation for trying was entirely COW memory savings.

    === CPython 2.7 object.h modification they used:

    +#ifdef GOOGLE_ETERNAL_REFCOUNT_SUPPORT
    +
    +#define PY_ETERNAL_REFCOUNT (PY_SSIZE_T_MAX / 2)
    +
    +#define Py_IS_ETERNAL(op) (                       \
    +  ((PyObject*)(op))->ob_refcnt >= PY_ETERNAL_REFCOUNT)
    +
    +#define Py_SET_ETERNAL(op)                                \
    +  do {                                                    \
    +      ((PyObject*)(op))->ob_refcnt = PY_ETERNAL_REFCOUNT; \
    +      if (PyObject_IS_GC(op)) {                           \
    +        PyObject_GC_UnTrack(op);                          \
    +      }                                                   \
    +  } while (0)
    +
    +#define Py_INCREF(op) (                           \
    +      Py_IS_ETERNAL(op)                           \
    +        ?  PY_ETERNAL_REFCOUNT                    \
    +        :  (_Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA \
    +           ((PyObject*)(op))->ob_refcnt++)        \
    +  )
    +
    +#define Py_DECREF(op)                                   \
    +    do {                                                \
    +        if (Py_IS_ETERNAL(op)) break;                   \
    +        if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    +        --((PyObject*)(op))->ob_refcnt != 0)            \
    +            _Py_CHECK_REFCNT(op)                        \
    +        else                                            \
    +        _Py_Dealloc((PyObject *)(op));                  \
    +    } while (0)
    +
    +#else

    vstinner commented 2 years ago

    I just want to say that GOOGLE_ETERNAL_REFCOUNT_SUPPORT is a cool name :-D I love "eternal refcount"!

    ericsnowcurrently commented 2 years ago

    @Eddie, what can I do to push this forward? FYI, in addition to the python-dev thread a few weeks back, I've brought the matter up with the steering council. [1]

    Also, if we can get back to performance-neutral (currently at about 4% slower) then there would be a lot less controversy. Even at 2% it may be enough.

    [1] https://github.com/python/steering-council/issues/103

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 2 years ago

    @eric.snow great to hear about this update! I'll start looking at some of the techniques that we talked about to improve performance; I'm optimistic that we'll be able to close the gap to 2%.

    methane commented 2 years ago

    I think making more objects immortal by default will reduce the gap, although I am not sure it can reach 2%. (I guess 3%, and I think that is an acceptable gap.)

    To reduce the gap further, we would need to reduce Python stack operations in ceval in some way.

    methane commented 2 years ago

    All of these optimizations should be disabled by default.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 2 years ago

    It seems that we are back on track, with performance back to neutral!

    I've created 4 new PRs. Each with an optimization applied on top of the baseline introduction of instance immortalization.

    The main PR 19474 currently stands at around 4% overhead; after rebasing past Eric's PR 30928, it went down to 3%.

    This is the list of optimizations that I used to get some performance back:

    All the PRs contain the results of the pyperformance benchmarks, and each should stand on its own in case we want to go for a subset of these optimizations rather than all of them. Make sure to look at each PR to read the implementation details.

    For testing, I made sure all the tests were passing in my local environment for every PR, though I do expect some failures in non-Linux environments. I'll be fixing these over the next couple of days.

    ericsnowcurrently commented 2 years ago

    Thanks for the updates, Eddie.

    bhack commented 7 months ago

    Any update here? What is the current status?

    methane commented 7 months ago

    I think we can close this issue because we already have immortal objects.
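For reference, the immortalization that eventually landed (PEP 683, CPython 3.12) is observable from pure Python: immortal objects report a fixed, saturated refcount that reference operations leave untouched. A small heuristic check, assuming CPython 3.12 or later (the helper name is made up, not a public API):

```python
import sys

def is_effectively_immortal(obj, probe=1000):
    """Heuristic: an immortal object's reported refcount never changes,
    no matter how many new references we create (CPython 3.12+, PEP 683).
    This merely observes a consequence of immortality; it is not an API.
    """
    before = sys.getrefcount(obj)
    refs = [obj] * probe          # create many additional references
    after = sys.getrefcount(obj)
    del refs
    return before == after
```

On 3.12+ this returns True for singletons like None (whose refcount reads as a large saturated constant), while a freshly created object still shows ordinary, changing refcounts.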