python / cpython


Fixing Copy on Writes from reference counting and immortal objects #84436

Closed 476181bd-17c7-4f5b-a731-677932cb4f0c closed 7 months ago

476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago
BPO 40255
Nosy @tim-one, @nascheme, @gpshead, @pitrou, @vstinner, @carljm, @DinoV, @methane, @markshannon, @ericsnowcurrently, @zooba, @corona10, @pablogsal, @eduardo-elizondo, @remilapeyre, @shihai1991
PRs
  • python/cpython#19474
  • python/cpython#24828
  • python/cpython#31488
  • python/cpython#31489
  • python/cpython#31490
  • python/cpython#31491
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['interpreter-core', 'type-feature', '3.10'] title = 'Fixing Copy on Writes from reference counting and immortal objects' updated_at = user = 'https://github.com/eduardo-elizondo' ``` bugs.python.org fields: ```python activity = actor = 'eric.snow' assignee = 'none' closed = False closed_date = None closer = None components = ['Interpreter Core'] creation = creator = 'eelizondo' dependencies = [] files = [] hgrepos = [] issue_num = 40255 keywords = ['patch'] message_count = 69.0 messages = ['366216', '366221', '366224', '366325', '366326', '366328', '366329', '366330', '366331', '366341', '366344', '366345', '366387', '366388', '366390', '366402', '366403', '366404', '366406', '366407', '366408', '366412', '366413', '366416', '366422', '366423', '366428', '366456', '366518', '366519', '366520', '366536', '366561', '366565', '366566', '366573', '366577', '366578', '366579', '366593', '366605', '366623', '366624', '366625', '366628', '366629', '366744', '366818', '366825', '366826', '366828', '366834', '366835', '366837', '366852', '379644', '379871', '379917', '388525', '388526', '388760', '410592', '410613', '412956', '413080', '413116', '413693', '413694', '413717'] nosy_count = 16.0 nosy_names = ['tim.peters', 'nascheme', 'gregory.p.smith', 'pitrou', 'vstinner', 'carljm', 'dino.viehland', 'methane', 'Mark.Shannon', 'eric.snow', 'steve.dower', 'corona10', 'pablogsal', 'eelizondo', 'remi.lapeyre', 'shihai1991'] pr_nums = ['19474', '24828', '31488', '31489', '31490', '31491'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue40255' versions = ['Python 3.10'] ```


    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    Copy on writes are a big problem in large Python applications that rely on multiple processes sharing the same memory.

    With the implementation of gc.freeze, we've attenuated the problem by removing the CoW coming from the GC Head. However, reference counting still causes CoW.

    This introduces Immortal Instances, which allow the user to bypass reference counting on specific objects and avoid CoW in forked processes.

    Immortal Instances are especially useful for applications that cache heap objects in shared memory which live throughout the entire execution (e.g. modules, code, bytecode, etc.)
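    As a rough sketch of the mechanism (not the PR's exact code; the macro names and bit choice below are placeholders), refcount operations simply skip objects whose count has an "immortal" bit set, so the pages holding those objects are never dirtied after fork():

    #include <Python.h>

    /* Placeholder: a high bit in ob_refcnt marks an object as immortal. The
     * resulting count is so large that even extensions compiled without this
     * check can never drive it to zero. */
    #define MY_IMMORTAL_BIT     ((Py_ssize_t)1 << 62)
    #define MY_IS_IMMORTAL(op)  (((PyObject *)(op))->ob_refcnt & MY_IMMORTAL_BIT)

    static inline void my_incref(PyObject *op)
    {
        if (!MY_IS_IMMORTAL(op)) {   /* no write: the object's page stays clean */
            op->ob_refcnt++;
        }
    }

    static inline void my_decref(PyObject *op)
    {
        if (!MY_IS_IMMORTAL(op) && --op->ob_refcnt == 0) {
            _Py_Dealloc(op);
        }
    }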

    gpshead commented 4 years ago

    Thanks for sharing! It's good to have a patch implementing this for others who might need it to try out.

    We experimented with an implementation of this on CPython 2.7, which we called Eternal Refcounts, for YouTube many years ago, for the same reason: saving memory in forked workers by not dirtying pages of known-needed objects via refcount changes.

    I don't have that old patch handy right now, but looking at yours it was the same concept: A magic refcount value to branch based on.

    We wound up not deploying it or pushing for it in CPython because the CPU performance implications of adding a branch instruction to Py_INCREF and Py_DECREF were, unsurprisingly, quite high. I'd have to go digging to find numbers, but I believe it was on the order of a 10% slowdown on real world application code. (Microbenchmarks don't tell a good story; the Python performance test suite should.)

    Given that most people's applications don't fork workers, I do not expect to see such an implementation ever become the default. It is a very appropriate hack, but the use case is limited to applications that don't use threads, are on POSIX, and always fork() many workers.

    Also note that this is an ABI change, as those INCREF and DECREF definitions are intentionally public .h file macros or inlines and thus included within any compiled code rather than being an expensive function call elsewhere (vstinner's new alternate higher-level API proposal presumably changes this). If an interpreter with this loaded a binary using the CPython API built against headers without it, your refcounts could get messed up.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    the CPU performance implications of adding a branch instruction to Py_INCREF and Py_DECREF were, unsurprisingly, quite high.

    Yeah, makes sense. I guess it really depends on the specific profile of your application.

    For Instagram this was an overall net positive change and we still have it in prod. The amount of page faults from Copy on Writes was so large that reducing it resulted in a net CPU win (even with the added branching). And of course, a huge reduction in memory usage.

    Microbenchmarks don't tell a good story, the python performance test suite should

    Agreed. I only added the Richards benchmark as a reference. I'm hoping someone can pick it up and have more concrete numbers on an average Python workload.

    Given that most people's applications don't fork workers, I do not expect to see such an implementation ever become the default.

    In any case, I gated this change with ifdefs. In case we don't have it by default, we can always easily enable it with a simple -DPy_IMMORTAL_INSTANCES flag to the compiler.

    Also note that this is an ABI change as those INCREF and DECREF definitions are intentionally public .h file

    This has indeed been a bit of a pain for us as well. Due to how our tooling is set-up, there's a small number of third party libraries that are still causing Copy on Writes. Fortunately, that's the only drawback. Even if your immortalized object goes through an extension that has a different Py_{DECREF,INCREF} implementation, the reference count will still be so large that it will never be deallocated.

    pablogsal commented 4 years ago

    I ran the pyperformance test suite with PGO + LTO + full CPU isolation on the speed.python.org machine and these were the results:

    | Benchmark | master | immortal_refs_patch |
    |---|---|---|
    | pidigits | 289 ms | 290 ms: 1.01x slower (+1%) |
    | pickle | 20.2 us | 20.3 us: 1.01x slower (+1%) |
    | xml_etree_parse | 253 ms | 258 ms: 1.02x slower (+2%) |
    | json_loads | 52.0 us | 53.1 us: 1.02x slower (+2%) |
    | scimark_fft | 708 ms | 723 ms: 1.02x slower (+2%) |
    | python_startup_no_site | 7.83 ms | 8.04 ms: 1.03x slower (+3%) |
    | scimark_sparse_mat_mult | 9.28 ms | 9.57 ms: 1.03x slower (+3%) |
    | unpickle | 28.0 us | 28.9 us: 1.03x slower (+3%) |
    | regex_effbot | 4.70 ms | 4.86 ms: 1.03x slower (+3%) |
    | xml_etree_iterparse | 178 ms | 184 ms: 1.04x slower (+4%) |
    | python_startup | 11.5 ms | 12.0 ms: 1.04x slower (+4%) |
    | scimark_monte_carlo | 212 ms | 221 ms: 1.04x slower (+4%) |
    | pathlib | 38.6 ms | 40.5 ms: 1.05x slower (+5%) |
    | sqlite_synth | 7.85 us | 8.26 us: 1.05x slower (+5%) |
    | sqlalchemy_imperative | 62.9 ms | 66.2 ms: 1.05x slower (+5%) |
    | richards | 138 ms | 145 ms: 1.05x slower (+5%) |
    | crypto_pyaes | 199 ms | 210 ms: 1.06x slower (+6%) |
    | nbody | 249 ms | 265 ms: 1.06x slower (+6%) |
    | raytrace | 937 ms | 995 ms: 1.06x slower (+6%) |
    | deltablue | 13.4 ms | 14.2 ms: 1.06x slower (+6%) |
    | tornado_http | 276 ms | 295 ms: 1.07x slower (+7%) |
    | scimark_lu | 247 ms | 264 ms: 1.07x slower (+7%) |
    | sqlalchemy_declarative | 280 ms | 300 ms: 1.07x slower (+7%) |
    | dulwich_log | 143 ms | 153 ms: 1.07x slower (+7%) |
    | chaos | 210 ms | 226 ms: 1.07x slower (+7%) |
    | 2to3 | 594 ms | 639 ms: 1.08x slower (+8%) |
    | unpickle_list | 6.60 us | 7.10 us: 1.08x slower (+8%) |
    | float | 213 ms | 229 ms: 1.08x slower (+8%) |
    | pyflate | 1.27 sec | 1.37 sec: 1.08x slower (+8%) |
    | go | 481 ms | 522 ms: 1.09x slower (+9%) |
    | pickle_pure_python | 966 us | 1.05 ms: 1.09x slower (+9%) |
    | meteor_contest | 175 ms | 191 ms: 1.09x slower (+9%) |
    | genshi_xml | 148 ms | 161 ms: 1.09x slower (+9%) |
    | chameleon | 20.6 ms | 22.5 ms: 1.09x slower (+9%) |
    | nqueens | 191 ms | 208 ms: 1.09x slower (+9%) |
    | telco | 14.6 ms | 16.0 ms: 1.09x slower (+9%) |
    | logging_format | 22.1 us | 24.2 us: 1.10x slower (+10%) |
    | regex_compile | 322 ms | 355 ms: 1.10x slower (+10%) |
    | hexiom | 16.4 ms | 18.1 ms: 1.10x slower (+10%) |
    | json_dumps | 25.6 ms | 28.3 ms: 1.10x slower (+10%) |
    | sympy_sum | 397 ms | 439 ms: 1.11x slower (+11%) |
    | sympy_integrate | 42.7 ms | 47.2 ms: 1.11x slower (+11%) |
    | django_template | 120 ms | 133 ms: 1.11x slower (+11%) |
    | sympy_str | 633 ms | 702 ms: 1.11x slower (+11%) |
    | xml_etree_generate | 162 ms | 180 ms: 1.11x slower (+11%) |
    | spectral_norm | 244 ms | 272 ms: 1.11x slower (+11%) |
    | xml_etree_process | 134 ms | 149 ms: 1.11x slower (+11%) |
    | logging_simple | 19.3 us | 21.5 us: 1.11x slower (+11%) |
    | mako | 28.2 ms | 31.5 ms: 1.12x slower (+12%) |
    | fannkuch | 853 ms | 954 ms: 1.12x slower (+12%) |
    | sympy_expand | 983 ms | 1.11 sec: 1.13x slower (+13%) |
    | genshi_text | 64.0 ms | 72.9 ms: 1.14x slower (+14%) |
    | unpack_sequence | 123 ns | 142 ns: 1.16x slower (+16%) |
    | logging_silent | 311 ns | 364 ns: 1.17x slower (+17%) |
    | unpickle_pure_python | 581 us | 683 us: 1.17x slower (+17%) |

    Not significant (5): pickle_list; regex_dna; pickle_dict; scimark_sor; regex_v8

    pablogsal commented 4 years ago

    System state
    ============

    CPU: use 12 logical CPUs: 1,3,5,7,9,11,13,15,17,19,21,23
    Perf event: Maximum sample rate: 1 per second
    ASLR: Full randomization
    Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23
    Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23
    CPU Frequency: 0,2,4,6,8,10,12,14,16,18,20,22=min=1600 MHz, max=3333 MHz; 1,3,5,7,9,13,15,17,19,21,23=min=max=3333 MHz
    Turbo Boost (MSR): CPU 1,3,5,7,9,11,13,15,17,19,21,23: disabled
    IRQ affinity: irqbalance service: inactive
    IRQ affinity: Default IRQ affinity: CPU 0,2,4,6,8,10,12,14,16,18,20,22
    IRQ affinity: IRQ 0,2=CPU 0-23; IRQ 1,3-15,17,20,22-23,28-51=CPU 0,2,4,6,8,10,12,14,16,18,20,22

    zooba commented 4 years ago

    If it's easy to set up, it might be interesting to see if there's a difference if (e.g.) all interned strings are marked as immortal, or all small ints. Or type objects. Maybe set a flag during initialization and treat everything created before that as immortal.

    There's definitely going to be overhead, but assuming the branch is there we may as well use it to optimise our own operations.
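    As a rough illustration of that idea (the helpers and the bit below are hypothetical, not part of the patch), startup could walk the obvious singletons and caches and tag them before any user code runs:

    #include <Python.h>

    /* Hypothetical sketch only: set a placeholder immortal bit on an object. */
    static void set_immortal(PyObject *op)
    {
        op->ob_refcnt |= ((Py_ssize_t)1 << 62);
    }

    static void immortalize_well_known_objects(void)
    {
        set_immortal(Py_None);
        set_immortal(Py_True);
        set_immortal(Py_False);
        set_immortal(Py_Ellipsis);
        /* PyLong_FromLong() returns the shared cached object for small ints */
        for (long i = -5; i <= 256; i++) {
            set_immortal(PyLong_FromLong(i));
        }
    }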

    pablogsal commented 4 years ago

    Thanks, Eddie for sharing the patch. I think this is an interesting discussion: for instance, in the core dev sprint at Microsoft I recall that we had a quick discussion about having immortal references.

    After analyzing the patch, and even assuming that we add conditional compilation, here are some of the things I feel uneasy about:

    There is a trend right now to try to remove immortal objects (started by Victor and others), like converting static types to heap types. I am afraid that having the restriction that the interpreter and the GC need to be aware of immortal types can considerably raise the maintenance cost of already complicated pieces of the CPython VM.

    pablogsal commented 4 years ago

    If it's easy to set up, it might be interesting to see if there's a difference if (e.g.) all interned strings are marked as immortal, or all small ints. Or type objects. Maybe set a flag during initialization and treat everything created before that as immortal.

    I can look into it; the main problem is that every run of the pyperformance test suite takes several hours, even more with --full-pgo :( I will try to set up some automation for that when I have some free cycles.

    zooba commented 4 years ago

    There is a trend right now to try to remove immortal objects (started by Victor and others) like static types and transforming them to heap types.

    This isn't actually about removing immortal objects, but removing *mutable* objects that would be shared between subinterpreters. Making some objects immortal, such as interned strings or stateless singletons, would actually benefit this work, as they could then be safely shared between subinterpreters.

    But as you say, the added complexity here is starting to seem like it'll cost more than it's worth...

    vstinner commented 4 years ago

    The idea of immortal objects was proposed to solve the bpo-39511 problem.

    I dislike immortal objects for different reasons. Here are some.

    Gregory:

    We wound up not deploying it or pushing for it in CPython because the CPU performance implications of adding a branch instruction to Py_INCREF and Py_DECREF were, unsurprisingly, quite high.

    Exactly. Copy-on-Write problem affects a minority of users. If PR 19474 is enabled by default, all users will pay the overhead even if they never call fork() without exec().

    1.17x slower on logging_silent or unpickle_pure_python is a very expensive price :-(

    Pablo:

    Anything that is touched by the immortal object will be leaked. This can also happen in obscure ways if reference cycles are created.

    gc.freeze() has a similar issue, no? This function also comes from Instagram specific use case ;-)

    Steve:

    This isn't actually about removing immortal objects, but removing *mutable* objects that would be shared between subinterpreters. Making some objects immortal, such as interned strings or stateless singletons, would actually benefit this work, as they could then be safely shared between subinterpreters.

    From what I understood, we simply cannot share any object (mutable or not) between two subinterpreters. Otherwise, we will need to put a lock on all Py_INCREF/Py_DECREF operations, which would kill the performance of running multiple interpreters in parallel. See the bpo-39511 discussion.

    pablogsal commented 4 years ago

    gc.freeze() has a similar issue, no? This function also comes from Instagram specific use case ;-)

    Not really, because this patch is much more impacting. gc.freeze() moves objects into a permanent generation, making them not participate in the GC, so the only leaks are cycle-based. With this patch, immortal objects will leak every strong reference they hold, even if they don't participate in cycles (but also if they do participate in cycles).

    vstinner commented 4 years ago

    I would be more interested by an experiment to move ob_refcnt outside PyObject to solve the Copy-on-Write issue, especially to measure the impact on performances.

    zooba commented 4 years ago

    This isn't actually about removing immortal objects, but removing *mutable* objects that would be shared between subinterpreters. Making some objects immortal, such as interned strings or stateless singletons, would actually benefit this work, as they could then be safely shared between subinterpreters.

    From what I understood, we simply cannot share any object (mutable or not) between two subinterpreters. Otherwise, we will need to put a lock on all Py_INCREF/Py_DECREF operations, which would kill the performance of running multiple interpreters in parallel. See the bpo-39511 discussion.

    That thread consists of everyone else telling you what I've just told you. Not sure that me telling you as well is going to help ;)

    pitrou commented 4 years ago

    I ran the pyperformance test suite with PGO + LTO + full CPU isolation on the speed.python.org machine and these were the results:

    Be mindful that synthetic benchmarks are probably a gentle case for branch prediction. Real-world performance on irregular workloads might be worse still (or perhaps better, who knows :-)).

    pitrou commented 4 years ago

    As a separate discussion, I would be interested to know whether the original use case (Eddie's) could be satisfied differently. It probably doesn't belong in this issue, though.

    (Eddie, if you want to discuss this, feel free to e-mail me privately. I think Davin Potts - co-maintainer of multiprocessing who recently added some facilities around shared memory - might be interested as well.)

    carljm commented 4 years ago

    Anything that is touched by the immortal object will be leaked. This can also happen in obscure ways if reference cycles are created.

    I think this is simply expected behavior if you choose to create immortal objects, and not really an issue. How could you have an immortal object that doesn't keep its strong references alive?

    this does not fully cover all cases as objects that become tracked by the GC after they are modified (for instance, dicts and tuples that only contain immutable objects). Those objects will still participate in reference counting after they start to be tracked.

    I think the last sentence here is not quite right. An immortalized object will never start participating in reference counting again after it is immortalized.

    There are two cases. If at the time of calling immortalize_heap() you have a non-GC-tracked object that is also not reachable from any GC-tracked container, then it will not be immortalized at all, so will be unaffected. This is a side effect of the PR using the GC to find objects to immortalize.

    If the non-GC-tracked object is reachable from a GC-tracked object (I believe this is by far the more common case), then it will be immortalized. If it later becomes GC-tracked, it will start participating in GC (but the immortal bit causes it to appear to the GC to have a very high reference count, so GC will never collect it or any cycle it is part of), but that will not cause it to start participating in reference counting again.

    if immortal objects are handed to extension modules compiled with the other version of the macros, the reference count can be corrupted

    I think the word "corrupted" makes this sound worse than it is in practice. What happens is just that the object is still effectively immortal (because the immortal bit is a very high bit), but the copy-on-write benefit is lost for the objects touched by old extensions.

    1.17x slower on logging_silent or unpickle_pure_python is a very expensive price

    Agreed. It seems the only way this makes sense is under an ifdef and off by default. CPython does a lot of that for debug features; this might be the first case of doing it for a performance feature?

    I would be more interested by an experiment to move ob_refcnt outside PyObject to solve the Copy-on-Write issue

    It would certainly be interesting to see results of such an experiment. We haven't tried that for refcounts, but in the work that led to gc.freeze() we did try relocating the GC header to a side location. We abandoned that because the memory overhead of adding a single indirection pointer to every PyObject was too large to even consider the option further. I suspect that this memory overhead issue and/or likely cache locality problems will make moving refcounts outside PyObject look much worse for performance than this immortal-instances patch does.
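    For concreteness, the indirection being discussed would look roughly like this (a sketch of the idea only, with invented names; not an actual proposal):

    #include <Python.h>

    /* Sketch: the count lives in a side table and the object only holds a
     * pointer to its slot. Refcount writes then dirty the side pages instead
     * of the pages holding the objects, at the cost of one extra word per
     * object plus an extra memory access (and likely cache miss) on every
     * incref/decref - the overhead discussed above. */
    typedef struct {
        Py_ssize_t *refcnt_slot;   /* points into the side table of counts */
        PyTypeObject *ob_type;
    } IndirectObjectHead;

    static inline void indirect_incref(IndirectObjectHead *op)
    {
        ++*op->refcnt_slot;        /* touches the side page, not the object's page */
    }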

    carljm commented 4 years ago

    An immortalized object will never start participating in reference counting again after it is immortalized.

    Well, "passed to an extension compiled with no-immortal headers" is an exception to this.

    But for the "not GC tracked but later becomes GC tracked" case, it will not re-enter reference counting, only the GC.

    carljm commented 4 years ago

    This may break the garbage collector algorithm that relies on the balance between strong references between objects and its reference count to do the calculation of the isolated cycles.

    I don't think it really breaks anything. What happens is that the immortal object appears to the GC to have a very large reference count, even after adjusting for within-cycle references. So cycles including an immortal object are always kept alive, which is exactly the behavior one should expect from an immortal object.

    DinoV commented 4 years ago

    I think there's other cases of performance related features being hidden under an ifdef. Computed gotos show up that way, although probably more because it's a compiler extension that's not supported everywhere. Pymalloc is also very similar in that it implies an ABI change as well.

    I wonder if it would be worth it to introduce an ABI flag for this as well? On the one hand it is a slightly different contract; on the other hand, extensions that don't support the immortalization actually work just fine.

    pablogsal commented 4 years ago

    I think this is simply expected behavior if you choose to create immortal objects, and not really an issue. How could you have an immortal object that doesn't keep its strong references alive?

    I think I didn't express myself very well: I am saying that I perceive this as a problem because when you immortalize the heap at a given time, you will also effectively immortalize anything those objects acquire strong references to afterwards. The immortal property will be "infecting" other objects as strong references are created.

    I think the last sentence here is not quite right. An immortalized object will never start participating in reference counting again after it is immortalized.

    Yeah, but the objects I mention will not be immortalized. These are objects that were not tracked when the immortalization was done, are not reachable from tracked objects at that moment, and become tracked afterwards (for instance, some dictionary that only had immutable objects when the immortalization was done). And this is always very impactful: a single object that participates in refcounting will copy an entire page of memory. Although I agree this is rare, it is something that can happen.

    I think the word "corrupted" makes this sound worse than it is in practice. What happens is just that the object is still effectively immortal (because the immortal bit is a very high bit), but the copy-on-write benefit is lost for the objects touched by old extensions.

    Yeah, I agree that corrupted may sound a bit more dramatic than it should but I think this is still a problem because when it happens (as mentioned before) it affects entire pages and not the single object that we consider.

    But for the "not GC tracked but later becomes GC tracked" case, it will not re-enter reference counting, only the GC.

    But if said objects (isolated and untracked before and now tracked) acquire strong references to immortal objects, those objects will be visited when the gc starts calculating the isolated cycles and that requires a balanced reference count to work.

    zooba commented 4 years ago

    There hasn't been much said about which kind of objects would be immortal. Could this be a creation-time option? Or a type object (similar to a no-op tp_dealloc)?

    If it's something that is known when the object is created, then some of the "infection" concern can be reduced - but if you wanted to arbitrarily make arbitrary objects immortal dynamically then it's a different matter.

    Another benefit of knowing at creation time is that immortal objects could be allocated to a different page (or potentially even to read-only shared memory, if that was appropriate).

    pablogsal commented 4 years ago

    There hasn't been much said about which kind of objects would be immortal.

    The current patch makes *all objects that are tracked by the gc* immortal, independently of their type (unless I am missing something).

    pablogsal commented 4 years ago

    all objects that are tracked by the gc

    also, all the objects that are reachable from objects tracked by the gc.

    gpshead commented 4 years ago

    Marking everything as immortal/eternal after you've loaded your common code & data but before you do your forking is the norm for this kind of serving system.

    You already presumably have a defined lifetime for the processes so they'll likely be set to die within N hours or days. Memory leaks of too many things persisting is a non-issue in this kind of system.

    The alternative of trying to pick and choose exactly what (and anything they reference) sticks around is a maintenance nightmare exercise in futility.

    pablogsal commented 4 years ago

    But if said objects (isolated and untracked before and now tracked) acquire strong references to immortal objects, those objects will be visited when the gc starts calculating the isolated cycles and that requires a balanced reference count to work.

    I was thinking about this a bit more and because the refcount bit is too high, the GC algorithm will be "fine" in the sense that won't break. It will just treat immortal objects as referenced from outside any cycle isolate (because the refcount hopefully will never reach 0), which is consistent with the fact that they are in the permanent generation.

    carljm commented 4 years ago

    I think the concerns about "perfect" behavior in corner cases are in general irrelevant here.

    In the scenarios where this optimization matters, there is no qualitative change that occurs at 100% coverage. Preventing 99% of CoW is 99% as good as preventing 100% :) So the fact that a few objects here and there in special cases could still trigger CoW just doesn't matter; it's still a massive improvement over the status quo. (That said, I wouldn't _mind_ improving the coverage, e.g. if you can suggest a better way to find all heap objects instead of using the GC.)

    And similarly, gps is right that the concern that immortal objects can keep other objects alive (even via references added after immortalization) is a non-issue in practice. There really is no other behavior one could prefer or expect instead.

    if said objects (isolated and untracked before and now tracked) acquire strong references to immortal objects, those objects will be visited when the gc starts calculating the isolated cycles and that requires a balanced reference count to work.

    I'm not sure what you mean here by "balanced ref count" or by "work" :) What will happen anytime an immortal object gets into the GC, for any reason, is that the GC will "subtract" cyclic references and see that the immortal object still has a large refcount even after that adjustment, and so it will keep the immortal object and any cycle it is part of alive. This behavior is correct and should be fully expected; nothing breaks. It doesn't matter at all to the GC that this large refcount is "fictional," and it doesn't break the GC algorithm, it results only in the desired behavior of maintaining immortality of immortal objects.

    It is perhaps slightly weird that this behavior falls out of the immortal bit being a high bit rather than being more explicit. I did do some experimentation with trying to explicitly prevent immortal instances from ever entering GC, but it turned out to be hard to do that in an efficient way. And motivation to do it is low, because there's nothing wrong with the behavior in the existing PR.
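    To make that concrete, here is a tiny, self-contained sketch of the shape of the collector's reachability pass (simplified; not CPython's actual gcmodule code), showing why a huge "fictional" refcount just makes the object look externally referenced:

    #include <stddef.h>

    typedef struct Obj {
        long refcnt;          /* ob_refcnt, possibly with the immortal bit set */
        long gc_refs;         /* scratch copy used during a collection */
        struct Obj *ref;      /* a single outgoing reference, for brevity */
    } Obj;

    /* Copy the refcounts, subtract every reference that originates inside the
     * generation, and treat anything still positive as reachable from outside.
     * An immortal object's copied count is enormous, so it (and any cycle it
     * belongs to) always comes out "reachable" - exactly the desired behavior. */
    static void mark_reachable(Obj **gen, size_t n, int *reachable)
    {
        for (size_t i = 0; i < n; i++)
            gen[i]->gc_refs = gen[i]->refcnt;
        for (size_t i = 0; i < n; i++)
            if (gen[i]->ref)
                gen[i]->ref->gc_refs -= 1;
        for (size_t i = 0; i < n; i++)
            reachable[i] = (gen[i]->gc_refs > 0);
    }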

    pablogsal commented 4 years ago

    I'm not sure what you mean here by "balanced ref count" or by "work" :) What will happen anytime an immortal object gets into the GC, for any reason, is that the GC will "subtract" cyclic references and see that the immortal object still has a large refcount even after that adjustment, and so it will keep the immortal object and any cycle it is part of alive. This behavior is correct and should be fully expected; nothing breaks. It doesn't matter at all to the GC that this large refcount is "fictional," and it doesn't break the GC algorithm, it results only in the desired behavior of maintaining immortality of immortal objects.

    Yep, that is right. I think there was a race condition between my previous message and yours :)

    I think what was confusing me in this line of reasoning is that I was not taking into account that the immortal bit is a very high one, making the refcount gigantic. I was treating it mentally like a flag without factoring the implications of such big reference count.

    vstinner commented 4 years ago

    This feature seems to be driven by the Instagram use case. Are there other users who would benefit from it? Is it a common use case to load big data and then fork to use the preloaded data?

    PR 19474 has a maintenance cost:

    There is even now a question about using a different ABI flag.

    It's not the common case to build a custom Python manually for a specific use case. Most users use a binary shipped by their operating system.

    I'm not sure that it's worth it for Python to maintain such a special build.

    Maybe bpo-39511 would be a better motivation to support immortal objects.

    But I'm also concerned by the huge impact on performance :-(

    carljm commented 4 years ago

    Is it a common use case to load big data and then fork to use preloaded data?

    A lot of the "big data" in question here is simply lots of Python module/class/code objects resulting from importing lots of Python modules.

    And yes, this "pre-fork" model is extremely common for serving Python web applications; it is the way most Python web application servers work. We already have an example in this thread of another large Python web application (YouTube) that had similar needs and considered a similar approach.

    vstinner commented 4 years ago

    Carl:

    A lot of the "big data" in question here is simply lots of Python module/class/code objects resulting from importing lots of Python modules.

    And yes, this "pre-fork" model is extremely common for serving Python web applications; it is the way most Python web application servers work. (...)

    I would be interested to hear the answer to Antoine's question, which is basically: why not use the multiprocessing fork server?

    pitrou commented 4 years ago

    I'll note that this "extremely common" model can break as soon as you have hidden worker threads somewhere (this can happen in a third-party library).

    For example, the libhdfs library is not fork-safe.

    nascheme commented 4 years ago

    Eddie mentions in the PR about using memory arenas to contain immortal objects. I think it could be worth investigating further. With the current PR, the immortal status is dependent on the value of the refcnt field of the object. Using immortal arenas might be faster because you can check if an object is immortal just based on its address (no data dependency on refcnt value).

    The fastest would be to create an immortal block as part of the BSS (uninitialized data). Then, within incref/decref you use the location and size of the immortal arena (compile time constants) to test if the object is immortal. Maybe could just check if the high bits of an object address match a constant mask (if immortal region is aligned and is power of 2 in size). Slightly slower but more flexible would be to make the immortal arena size and location global variables. That way, you can set the size of the region on startup. Also would be more flexible in terms of ABI compatibility. That would introduce one or two global loads within incref/decref but those values would almost certainly be in L1 cache.
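    A tiny sketch of that address check (the constants below are made up; in practice they would come from how the immortal region is reserved):

    #include <stdint.h>

    /* If the immortal region is aligned to its (power-of-two) size, testing
     * "is this object immortal?" is a mask-and-compare on the pointer itself,
     * with no load from the object and no dependency on ob_refcnt. */
    #define IMMORTAL_REGION_BASE  ((uintptr_t)0x7f0000000000ULL)  /* hypothetical */
    #define IMMORTAL_REGION_SIZE  ((uintptr_t)1 << 30)            /* hypothetical 1 GiB */

    static inline int address_is_immortal(const void *op)
    {
        return ((uintptr_t)op & ~(IMMORTAL_REGION_SIZE - 1)) == IMMORTAL_REGION_BASE;
    }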

    Even more flexible would be to use a memory map to mark which arenas are immortal. See my radix tree implementation for obmalloc:

    https://bugs.python.org/issue37448

    I would guess the radix tree lookups are too expensive to put in incref/decref. Should probably test that though.

    I had started doing an experiment with the arena approach before I noticed Eddie's comment about it. I would like to see his version. Here is a sketch of mine (not working yet):

    By default, the immortal arenas could be enabled on Python startup. Call _PyMem_enable_immortal after startup but before running user code. There could be a command-line option to disable the automatic call to _PyMem_enable_immortal() so that users like Instagram can do their pre-fork initialization before calling it.

    Next step down the rabbit-hole could be to use something like Jeethu Rao's frozen module serializer and dump out all of the startup objects and put them into the BSS:

    https://bugs.python.org/issue34690

    carljm commented 4 years ago

    I would be interested to hear the answer to Antoine's question, which is basically: why not use the multiprocessing fork server?

    Concretely, because for a long time we have used the uWSGI application server and it manages forking worker processes (among other things), and AFAIK nobody has yet proposed trying to replace that with something built around the multiprocessing module. I'm actually not aware of any popular Python WSGI application server built on top of the multiprocessing module (but some may exist).

    What problem do you have in mind that the fork server would solve? How is it related to this issue? I looked at the docs and don't see that it does anything to help sharing Python objects' memory between forked processes without CoW.

    zooba commented 4 years ago

    What problem do you have in mind that the fork server would solve?

    My understanding is that the fork server solves the *general* problem of running arbitrary code before fork by making sure that only CPython/multiprocessing gets to run code before fork.

    In this specific case where you're carefully running only forkable code at startup, it presumably won't help.

    pitrou commented 4 years ago

    Steve is right.

    carljm commented 4 years ago

    Makes sense. Yes, caution is required about what code runs before fork, but forkserver's solution for that would be a non-starter for us, since it would ensure that we can share basically no memory at all between worker processes.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    I was able to get an equal (maybe slightly better) performance by following the advice from Steve and Neil and skipping reference counting of instances that we know are immortal.

    It seems that by getting performance parity, we should be able to ease most of the concerns raised so far.

    I've updated the PR to now immortalize:

    A quick caveat, immortalizing the runtime heap caused ~6 or so tests to fail since they depend on shutdown behavior which this now changes. In the PR I commented out the call to _PyGC_ImmortalizeHeap in pymain_main to pass the CI. Ignoring these failing tests for now, these are the benchmarks numbers that I got from pyperformance:

    Baseline (master branch):

    unpack_sequence: Mean +- std dev: 76.0 ns +- 4.9 ns
    richards: Mean +- std dev: 116 ms +- 8 ms
    fannkuch: Mean +- std dev: 764 ms +- 24 ms
    pidigits: Mean +- std dev: 261 ms +- 7 ms

    Immortalizing known immortal objects (Latest PR):

    unpack_sequence: Mean +- std dev: 74.7 ns +- 5.1 ns
    richards: Mean +- std dev: 112 ms +- 5 ms
    fannkuch: Mean +- std dev: 774 ms +- 24 ms
    pidigits: Mean +- std dev: 262 ms +- 11 ms

    Only adding immortal branch (the commit that Pablo benchmarked):

    unpack_sequence: Mean +- std dev: 93.1 ns +- 5.7 ns
    richards: Mean +- std dev: 124 ms +- 4 ms
    fannkuch: Mean +- std dev: 861 ms +- 26 ms
    pidigits: Mean +- std dev: 269 ms +- 7 ms

    The performance of Immortal Objects by default seems to now be within error of the master branch.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    Neil:

    The fastest would be to create an immortal block as part of the BSS (uninitialized data).

    That's an interesting idea, definitely worth exploring and we can probably get some perf win out of it. And yes, using the frozen modules is definitely a step forward and we can leverage that to move these instances into the rodata section of the binary.

    I had started doing an experiment with the arena approach before I noticed Eddie's comment about it. I would like to see his version.

    What I had written up is slightly different from what you mentioned. I was mostly concerned about having a small object that we did not reach through the GC roots. If this small object would get a reference count bump, the whole arena would Copy on Write.

    I added a commit to the PR with this Arena Immortalization so that you could take a look at the implementation: https://github.com/python/cpython/pull/19474/commits/b29c8ffd3faf99fc5c9885d2a4c6c3c6d5768c8c

    The idea is to walk all the arena's pools to mark them as immortal by using a new word added to pool_header. This word would help us identify if the pool is immortal on every pyalloc and pydealloc.
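    In outline, that amounts to something like the following (field and function names are invented; obmalloc's real pool_header is different):

    #include <stddef.h>

    /* Sketch: each allocator pool gets an extra word recording whether it was
     * alive at immortalization time; alloc/dealloc can then check the flag of
     * the pool an address belongs to and leave immortal pools untouched. */
    struct pool_header_sketch {
        unsigned int immortal;              /* the extra word described above */
        /* ... the existing obmalloc bookkeeping fields ... */
    };

    static void immortalize_pools(struct pool_header_sketch **pools, size_t npools)
    {
        for (size_t i = 0; i < npools; i++) {
            pools[i]->immortal = 1;
        }
    }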

    I still get some tests breaking with this; I haven't tried to debug it yet, though.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    After the added optimizations, and assuming that running this more rigorously (i.e. PGO + LTO + full CPU isolation on the speed.python.org machine) also shows the PR as perf-neutral, what are people's thoughts so far?

    Would this be enough to consider adding this change by default?

    Are there still any concerns regarding correctness? It seems like most of the questions raised so far have already been addressed.

    markshannon commented 4 years ago

    A big -1 to this.

    You are asking the whole world to take a hit on both performance and memory use, in order to save Instagram memory.

    The PR uses the term "immortal" everywhere. There is only one reference to copy-on-write in a comment. Yet this issue is about making object headers immutable. Immortality and object header immutability are not the same.

    If object header immutability is to be a requirement, that needs a PEP.

    If it is not a requirement, but immortality is, then make the obvious improvement of changing the branchy code

    if (!(obj->refcnt & IMMORTAL_BIT)) {
       obj->refcnt++;
    }

    to the branchless

    obj->refcnt += ((obj->refcnt & IMMORTAL_BIT) == 0);

    Immortality has advantages because it allows saturating reference counting and thus smaller object headers, but it is not the same as making the object header immutable.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    Mark:

    You are asking the whole world to take a hit on both performance and memory use.

    That's an explicit non-goal. That's why the code was guarded, so it can be added as an optional compilation mode.

    This should be added by default if and only if it is performance-neutral or an improvement. By the way, the latest perf numbers suggest that we are now at perf parity / within error (pending official confirmation from running this on the speed.python.org machines).

    Also, can you expand on how this is a hit on memory use?

    There is only one reference to copy-on-write in a comment. Yet this issue about making object headers immutable.

    The PR summary expands on Copy on Writes. If you think this belongs in-code, let me know and I can update the PR.

    then make the obvious improvement of changing the branchy code

    That is strictly slower. The current version makes Py_INCREF and Py_DECREF cheaper for known immortal instances (i.e. the heap after runtime init). This skips increasing and decreasing the refcount as well as the refcount check for deallocation. Using the proposed branch-less version makes all Py_INCREFs and Py_DECREFs more expensive.

    I ran a couple of benchmarks with the branch-less version to highlight this:

    Branch-less version:

    unpack_sequence: Mean +- std dev: 98.2 ns +- 4.9 ns
    richards: Mean +- std dev: 130 ms +- 5 ms
    fannkuch: Mean +- std dev: 894 ms +- 18 ms

    Branch version:

    unpack_sequence: Mean +- std dev: 82.7 ns +- 3.9 ns
    richards: Mean +- std dev: 123 ms +- 4 ms
    fannkuch: Mean +- std dev: 838 ms +- 25 ms

    Immortality has advantages because it allows saturating reference counting and thus smaller object headers, but it is not the same as making the object header immutable.

    In its current form, this change is not trying to do a strict immutability of the header. We can't guarantee that third-party extensions won't mutate the instance. Instead, this change tries to maintain an instance's immutability as long as it can.

    If the issue is with naming, I can easily rename this to something else :)

    markshannon commented 4 years ago

    Do you have more details on your comparison? I'm somewhat puzzled how a version that does no more work and has no jumps is slower.

    The memory hit is not immediate, but making the object header immutable prevents changes like https://github.com/markshannon/cpython/commit/c73407e7b5d1a0fc794d55c6bcc91cfdc958f6c4 which would potentially save the two word per object overhead that the cyclic GC currently imposes.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    I'm somewhat puzzled how a version that does no more work and has no jumps is slower.

    Oh I think I know why there's some confusion. I've updated the PR from the initial version (which is probably the one that you saw). The branching does less work in Py_INCREF and Py_DECREF for all instances marked with the immortal bit by exiting early.

    In the latest change, I added the immortal bit to a bunch of "known" immortal objects. For instance:

    Example 1)

    PyObject _Py_NoneStruct = {
      _PyObject_EXTRA_INIT
    #ifdef Py_IMMORTAL_OBJECTS
      _Py_IMMORTAL_BIT,
    #else
      1,
    #endif  /* Py_IMMORTAL_OBJECTS */
      &_PyNone_Type
    };

    Example 2)

    static int
    pymain_main(_PyArgv *args)
    {
        PyStatus status = pymain_init(args);
        if (_PyStatus_IS_EXIT(status)) {
            pymain_free();
            return status.exitcode;
        }
        if (_PyStatus_EXCEPTION(status)) {
            pymain_exit_error(status);
        }
    
    #ifdef Py_IMMORTAL_OBJECTS
        /* Most of the objects alive at this point will stay alive throughout the
         * lifecycle of the runtime. Immortalize to avoid the GC and refcnt costs */
        _PyGC_ImmortalizeHeap();
    #endif  /* Py_IMMORTAL_OBJECTS */
        return Py_RunMain();
    }

    Therefore, you are now making Py_INCREF and Py_DECREF cheaper for things like Py_RETURN_NONE and a bunch of other instances.

    Let me know if that explains it! I could also send you a patch of the branch-less version so you can compare them.

    but making the object header immutable prevents changes like

    Why would it prevent it? These changes are not mutually exclusive; you can still have an immortal bit by:

    1) Using a bit from gc_bits. Currently you only need 2 bits for the GC; even with a short you'll have space for the immortal bit.
    2) Using a bit from the ob_refcnt.

    Separately, using this allows us to experiment with branch-less and test-less code by using saturated adds. For example:

    /* Branch-less incref with saturated add */
    #define PY_REFCNT_MAX ((int)(((int)-1)>>1))
    #define _Py_INCREF(op) ({                                       \
        __asm__ (                                                   \
            "addl $0x1, %[refcnt]\n\t"                              \
            "cmovol %[refcnt_max], %[refcnt]"                       \
            : [refcnt] "+r" (((PyObject *)(op))->ob_refcnt)         \
            : [refcnt_max] "r" (PY_REFCNT_MAX)                      \
        );})
    476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago

    which would potentially save the two word per object overhead

    Btw, I misread. I thought gc_bits referred to the bits used by the GC in the reference count. In any case, you can still use a bit in the reference count :)

    nascheme commented 4 years ago

    A few thoughts

    First, the PR is well done. The changes are pretty minimal and are well localized. Thanks Eddie for writing clean code and responding to review requests.

    My feeling is that, in current state, this still should not get merged or at least not get enabled by default. The main consideration is what is the cost of this new feature vs what is the benefit.

    If the immortalize heap function is not called by default, all Python users pay a performance hit for this feature. Even after recent PR improvements, if you don't immortalize the heap, Python is slower.

    Only a small fraction of Python users will benefit from this feature. I think the "pre-fork, CoW exploiting" application architecture is more common than some people would realize (it is a successful way to achieve multi-processing with CPython, making good use of multiple CPU cores). However, I'm pretty sure it is used by a very tiny fraction of all Python programs.

    If we do immortalize by default (e.g. on Python startup), the semantics of the refcnt field are subtly changed. I suspect some programs will be broken as a result of this change. A clue about what might go wrong comes from the unicode object code. E.g. the resize_inplace() function in unicodeobject.c. It is possible that a non-trivial amount of other 3rd party code will make assumptions about refcnt that will be broken by this change. That code could be fixed but it is a matter of cost vs benefit.

    If it is disabled by default with a build flag we have an ABI compatibility problem. I.e. incref/decref are inlined functions/macros. Also, as mentioned above, these kinds of build options tend to make maintenance and testing harder.

    The frozen GC generation did not have these kinds of issues. If people don't use it, there is basically no cost. I think it was a good idea to merge the frozen generation feature. There is a small amount of ongoing maintenance cost but that's acceptable.

    Regarding Mark's comment about immutability vs immutable object headers, I would frame the goal of the PR differently. Like the GC frozen generation, this immutable instance change is about providing a way to opt-out of Python's default garbage collection mechanism. The frozen generation was to opt-out of the mark-and-sweep style cyclic collection. This PR is about opting-out of the default Python reference counting.

    Put that way, could we consider this PR as an incremental step in a process of making it possible to have a non-reference count based GC for CPython? E.g. perhaps we could eventually have a mixture of mark-and-sweep and reference counted GC, in order to support the handle (HPy) API. If we want to move towards that, we want CPython code to stop making assumptions about the refcnt field, like the unicodeobject.c file currently does.

    I think pitching the PR in that way makes it more compelling. Merge it not because it helps a tiny class of Python applications. Instead, merge it because it helps find unwarranted assumptions about how the Python GC works. Once we get bad assumptions cleaned up in 3rd party code, we have more of a chance of pursuing an alternative GC strategy for CPython.

    nascheme commented 4 years ago

    immutability vs immutable object headers

    Sorry, I meant to write "immortal vs immutable".

    nascheme commented 4 years ago

    I resurrected an old thread on Discourse that seems quite relevant to this PR:

    https://discuss.python.org/t/switching-from-refcounting-to-libgc/1641/35?u=nas

    markshannon commented 4 years ago

    Eddie, how did you compare the branching vs. non-branching implementations? I have read the code in the PR. What is the non-branching version that you used to compare it against?

    Why not use the sign bit as the immortal bit? It simplifies the test considerably, making it faster, and doesn't need any assembly code.

    Are you suggesting that making a set of common objects immortal will make things faster? Have you tested it? (I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write, a guaranteed L1 hit.)

    zooba commented 4 years ago

    I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write, a guaranteed L1 hit

    If that were true then Spectre and Meltdown wouldn't have been so interesting :)

    Pipelining processors are going to speculatively execute both paths, and will skip the write much more quickly than by doing it, and meanwhile nobody should have tried to read the value so it hasn't had to block for that path. I'm not aware of any that detect no-op writes and skip synchronising across cores - the dirty bit of the cache line is just set unconditionally.

    Benchmarking already showed that the branching version is faster. It's possible that "refcount += (refcount & IMMORTAL) ? 0 : 1" could generate different code (should be mov,test,lea,cmovz rather than mov,and,add,mov or mov,and,jz,add,mov), but it's totally reasonable for a branch to be faster than unconditionally modifying memory.

    pitrou commented 4 years ago

    Benchmarking already showed that the branching version is faster.

    But micro-benchmarks may tell you things which are not true in the real-world (for example an easily-predicted branch).