python / cpython

The Python programming language
https://www.python.org

Fixing Copy on Writes from reference counting and immortal objects #84436

Closed 476181bd-17c7-4f5b-a731-677932cb4f0c closed 7 months ago

476181bd-17c7-4f5b-a731-677932cb4f0c commented 4 years ago
BPO 40255
Nosy @tim-one, @nascheme, @gpshead, @pitrou, @vstinner, @carljm, @DinoV, @methane, @markshannon, @ericsnowcurrently, @zooba, @corona10, @pablogsal, @eduardo-elizondo, @remilapeyre, @shihai1991
PRs
  • python/cpython#19474
  • python/cpython#24828
  • python/cpython#31488
  • python/cpython#31489
  • python/cpython#31490
  • python/cpython#31491
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields: ```python assignee = None labels = ['interpreter-core', 'type-feature', '3.10'] title = 'Fixing Copy on Writes from reference counting and immortal objects' user = 'https://github.com/eduardo-elizondo' ``` bugs.python.org fields: ```python assignee = 'none' components = ['Interpreter Core'] creator = 'eelizondo' issue_num = 40255 keywords = ['patch'] message_count = 69 nosy_count = 16 pr_nums = ['19474', '24828', '31488', '31489', '31490', '31491'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' type = 'enhancement' url = 'https://bugs.python.org/issue40255' versions = ['Python 3.10'] ```


    zooba commented 4 years ago

    > But micro-benchmarks may tell you things which are not true in the real world (for example an easily-predicted branch)

    This is true, though this branch should be more easily predictable because INCREF/DECREF are inlined. If there were only one function doing the branch it'd be different.

    Wouldn't even surprise me if PGO made different optimisations to this in certain functions. Anything that normally operates on a code object, for example, is almost always going to take the branch when everything is imported early and before immortalisation.

    We also have the real world app that is Instagram. If they say that this has shown an improvement in their app, I'm inclined to believe that it's not reliant on a microbenchmark (other controlled circumstances, sure, but not a microbenchmark ;) ).

    pitrou commented 4 years ago

    > We also have the real world app that is Instagram.

    I don't think Instagram is a single app. What is Python used for at Instagram?

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 4 years ago

    > We also have the real world app that is Instagram.

    > I don't think Instagram is a single app. What is Python used for at Instagram?

    According to their engineering blog (https://instagram-engineering.com/static-analysis-at-scale-an-instagram-story-8f498ab71a0c):

    > Our server app is a monolith, one big codebase of several million lines and a few thousand Django endpoints [1], all loaded up and served together. A few services have been split out of the monolith, but we don’t have any plans to aggressively break it up.

    pitrou commented 4 years ago

    Wow. I see, thank you.

    markshannon commented 4 years ago

    On 20/04/2020 2:33 pm, Steve Dower wrote:


    >> I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write (a guaranteed L1 hit)

    > If that were true then Spectre and Meltdown wouldn't have been so interesting :)

    I really don't follow your logic here.

    > Pipelining processors are going to speculatively execute both paths, and will skip the write much more quickly than by doing it, and meanwhile nobody should have tried to read the value so it hasn't had to block for that path. I'm not aware of any that detect no-op writes and skip synchronising across cores - the dirty bit of the cache line is just set unconditionally.

    Processors only execute the one path, speculatively or otherwise; that's why branch prediction is so important. (And it is important for the Spectre attack: if what you said were true, the Spectre attack wouldn't need to pre-bias the branch predictor.) I'm assuming a modern Intel processor.

    > Benchmarking already showed that the branching version is faster. It's possible that "refcount += (refcount & IMMORTAL) ? 0 : 1" could generate different code (should be mov,test,lea,cmovz rather than mov,and,add,mov or mov,and,jz,add,mov), but it's totally reasonable for a branch to be faster than unconditionally modifying memory.

    It hasn't been shown that it is faster. I'm not claiming that it isn't, just that I haven't seen the evidence.



    methane commented 3 years ago

    I'm a big -1 too, but I am interested in the Instagram usage.

    I have two ideas to reduce CoW:

    In the long term, a multi-process-friendly pyc format could let processes share more memory with each other, even when they are not created by prefork.
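For context, CPython already ships one CoW mitigation in this family: gc.freeze() (added in Python 3.7, motivated by exactly this prefork workload) moves all currently tracked objects into a permanent generation that the collector never scans, so a collection in a forked child does not dirty the shared pages holding those objects. A rough sketch of how a prefork server might use it; the `prefork_worker` helper is hypothetical, and this only addresses CoW traffic from the collector, not from the Py_INCREF/Py_DECREF refcount writes this issue is about:

```python
import gc
import os

def prefork_worker(target):
    """Hypothetical prefork helper: freeze startup objects before forking.

    gc.freeze() moves every currently tracked object into a permanent
    generation the collector never scans, so a later collection in the
    forked child does not touch (and copy) the pages holding them.
    """
    gc.disable()   # no collections while warming up shared state
    gc.freeze()    # exempt all startup objects from future GC passes
    pid = os.fork()
    if pid == 0:                 # child process
        gc.enable()              # objects created after fork are still collected
        target()
        os._exit(0)              # skip interpreter teardown in the worker
    return pid
```

Note that even with freezing, ordinary refcount updates on the frozen objects still dirty their pages; that is the gap immortal objects aim to close.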

    vstinner commented 3 years ago

    Fast shutdown option

    You can use os._exit(0).

    methane commented 3 years ago

    > Fast shutdown option

    You can use os._exit(0).

    Yes. Instagram uses it via atexit.register(os._exit, 0).

    https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

    I think this hack could be supported by the multiprocessing module in some way.
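The hack from the linked blog post can be sketched as below; the `install_fast_shutdown` wrapper name is made up for illustration:

```python
import atexit
import os

def install_fast_shutdown():
    """Skip interpreter finalization entirely at process exit.

    os._exit() terminates the process without running the usual teardown,
    so shutdown never walks (and refcount-writes) the whole object graph,
    which would otherwise dirty copy-on-write pages shared with the parent
    of a prefork worker.  Caveats: buffered I/O is not flushed and __del__
    finalizers do not run, so flush anything important beforehand.
    """
    atexit.register(os._exit, 0)
```

Because atexit handlers run in LIFO order, registering this early means it fires after any handlers registered later, so normal cleanup callbacks still get a chance to run.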

    ericsnowcurrently commented 3 years ago

    FYI, I've just put up a similar PR [1] to Eddie's, with a focus on object immortality rather than immutability. However, if Py_IMMORTAL_CONST_REFCOUNTS is defined then it should have the behavior Eddie is after.

    [1] https://github.com/python/cpython/pull/24828

    ericsnowcurrently commented 3 years ago

    Note that my PR does make the builtin types immortal (at least those initialized in _PyTypes_Init()). However, I did not make the singletons or other candidate objects immortal in the PR. That can be done separately.

    ericsnowcurrently commented 3 years ago

    While the eventual solution may overlap, my interests are divergent enough from those here that I've created a separate issue: bpo-43503.

    gpshead commented 2 years ago

    [data] I finally dug up the old YouTube doc at work with their findings. Mostly just posting this here for future public reference if anyone wants. Nothing surprising.

    When YouTube experimented with a modified 2.7 adding "eternal refcounts" in 2015, they saw a 3-5% CPU performance regression. (Not the 10% I had in mind.)

    Their version of this simply set a high bit on the refcount as the indicator and added the obvious conditional into the Py_INCREF/Py_DECREF macros.

    Unsurprisingly, this is in line with what others have since found. For their preforked server, with their choice of what to mark eternal before forking, it saved them 10% RAM (fewer copy-on-write faults). The RAM savings vs. CPU and maintenance cost tradeoff wound up not being worthwhile for them, though. Their motivation for trying was entirely COW memory savings.

    === CPython 2.7 object.h modification they used:

    +#ifdef GOOGLE_ETERNAL_REFCOUNT_SUPPORT
    +
    +#define PY_ETERNAL_REFCOUNT (PY_SSIZE_T_MAX / 2)
    +
    +#define Py_IS_ETERNAL(op) (                       \
    +  ((PyObject*)(op))->ob_refcnt >= PY_ETERNAL_REFCOUNT)
    +
    +#define Py_SET_ETERNAL(op)                                \
    +  do {                                                    \
    +      ((PyObject*)(op))->ob_refcnt = PY_ETERNAL_REFCOUNT; \
    +      if (PyObject_IS_GC(op)) {                           \
    +        PyObject_GC_UnTrack(op);                          \
    +      }                                                   \
    +  } while (0)
    +
    +#define Py_INCREF(op) (                           \
    +      Py_IS_ETERNAL(op)                           \
    +        ?  PY_ETERNAL_REFCOUNT                    \
    +        :  (_Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA \
    +           ((PyObject*)(op))->ob_refcnt++)        \
    +  )
    +
    +#define Py_DECREF(op)                                   \
    +    do {                                                \
    +        if (Py_IS_ETERNAL(op)) break;                   \
    +        if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    +        --((PyObject*)(op))->ob_refcnt != 0)            \
    +            _Py_CHECK_REFCNT(op)                        \
    +        else                                            \
    +        _Py_Dealloc((PyObject *)(op));                  \
    +    } while (0)
    +
    +#else

    vstinner commented 2 years ago

    I just want to say that GOOGLE_ETERNAL_REFCOUNT_SUPPORT is a cool name :-D I love "eternal refcount"!

    ericsnowcurrently commented 2 years ago

    @Eddie, what can I do to push this forward? FYI, in addition to the python-dev thread a few weeks back, I've brought the matter up with the steering council. [1]

    Also, if we can get back to performance-neutral (currently at about 4% slower) then there would be a lot less controversy. Even at 2% it may be enough.

    [1] https://github.com/python/steering-council/issues/103

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 2 years ago

    @eric.snow great to hear about this update! I'll start looking at some of the techniques that we talked about to improve performance; I'm optimistic that we'll be able to close the gap to 2%.

    methane commented 2 years ago

    I think making more objects immortal by default will reduce the gap, although I am not sure it can reach 2%. (I guess 3%, and I think that is an acceptable gap.)

    To reduce the gap further, we would need to reduce Python stack operations in ceval in some way.

    methane commented 2 years ago

    All of these optimizations should be disabled by default.

    476181bd-17c7-4f5b-a731-677932cb4f0c commented 2 years ago

    It seems that we are back on track, with performance back to neutral!

    I've created 4 new PRs. Each with an optimization applied on top of the baseline introduction of instance immortalization.

    The main PR 19474 currently stands at around 4% overhead; after rebasing past Eric's PR 30928, it went down to 3%.

    This is the list of optimizations that I used to get some performance back:

    All the PRs contain the results of the pyperformance benchmarks, and each should stand on its own in case we want to go for a subset of these optimizations rather than all of them. Make sure to look at each PR to read the implementation details.

    For testing, I made sure all the tests were passing in my local environment for every PR, though I do expect some failures in non-Linux environments. I'll be fixing these over the next couple of days.

    ericsnowcurrently commented 2 years ago

    Thanks for the updates, Eddie.

    bhack commented 7 months ago

    Any update here? What is the current status?

    methane commented 7 months ago

    I think we can close this issue because we already have immortal objects.
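For reference, the immortalization that eventually landed (PEP 683, CPython 3.12) is observable from pure Python: immortal objects report a fixed, saturated refcount that reference operations leave untouched. A small heuristic check, assuming CPython 3.12 or later (the helper name is made up, not a public API):

```python
import sys

def is_effectively_immortal(obj, probe=1000):
    """Heuristic: an immortal object's reported refcount never changes,
    no matter how many new references we create (CPython 3.12+, PEP 683).
    This merely observes a consequence of immortality; it is not an API.
    """
    before = sys.getrefcount(obj)
    refs = [obj] * probe          # create many additional references
    after = sys.getrefcount(obj)
    del refs
    return before == after
```

On 3.12+ this returns True for singletons like None (whose refcount reads as a large saturated constant), while a freshly created object still shows ordinary, changing refcounts.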