python-greenlet / greenlet

Lightweight in-process concurrent programming

Restore stack introspection ability on 3.12 #393

Closed: oremanj closed this 8 months ago

oremanj commented 8 months ago

Fixes #388. The gr_frame accessor now returns a frame whose f_back chain is valid to walk. It does this by rewriting the frame list on demand to exclude C-stack frames, and undoing that rewrite the next time the greenlet is resumed. This assumes that any stack walking of a suspended greenlet happens between an access to gr_frame and the greenlet's next resumption, which should hold for most applications. For those doing something stranger, I added a new gr_frames_always_exposed attribute; setting it to True makes the rewrite happen every time the greenlet is suspended, providing full semantic parity with 3.11-and-earlier at some performance cost.
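
For concreteness, this is the kind of usage the change is meant to restore; a minimal sketch (the worker/main names are just for the example, not part of the API):

import greenlet

def worker():
    # Suspend; the worker's Python frames stay live on the greenlet's stack.
    main.switch()

main = greenlet.getcurrent()
gr = greenlet.greenlet(worker)
gr.switch()  # run worker() until it switches back to us

# While gr is suspended, gr_frame's f_back chain can be walked. On 3.12 with
# this PR, the walk should happen before gr is resumed again, unless
# gr_frames_always_exposed is set.
frame = gr.gr_frame
while frame is not None:
    print(frame.f_code.co_name, frame.f_lineno)
    frame = frame.f_back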

I did some perf testing on my laptop using

% python -m timeit -s 'import greenlet
def switcher(main):
  while True:
    main.switch()
gr = greenlet.greenlet(switcher)
gr.switch(greenlet.getcurrent())' 'gr.switch()'

It was pretty consistently 286-287 nsec before this change and 281 nsec after. (This makes sense; it's less expensive to check the frames_were_exposed bool than it is to reset-and-set the topmost frame's previous pointer, which is what we were doing previously.)

If I set the new gr_frames_always_exposed attribute to True, we're back up to 288 nsec per loop (though the penalty of this would increase with the stack depth of the greenlet, and re-exposing existing frames is much less expensive than exposing new frames because the former case doesn't need to allocate any new frame objects). If I access gr_frame on every switch, it's 302 nsec per loop. All of this seems acceptable to me. I don't know if you have any more realistic performance tests available; it would be good to run them if so.
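
For reference, the gr_frame-on-every-switch number came from the same setup with the statement extended to touch the frame; roughly:

% python -m timeit -s 'import greenlet
def switcher(main):
  while True:
    main.switch()
gr = greenlet.greenlet(switcher)
gr.switch(greenlet.getcurrent())' 'gr.switch(); gr.gr_frame'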

I marked this as 3.1.0 since it provides a new public API. Feel free to adjust based on however you think about versioning.

jamadden commented 8 months ago

Thanks for this! Just wanted to let you know I am looking at it.

I am considering dropping gr_frames_always_exposed and acting as if that were always true. I appreciate your attention to performance, but I think correctness-by-default is better than giving users a loaded footgun, and that's worth taking a minor performance hit. (I realize I may have given mixed signals there.)

oremanj commented 8 months ago

Thanks for the update! I would encourage you to do more detailed performance testing if you drop the frames_always_exposed = False mode, as I'm concerned the impact could be quite bad under workloads that aren't captured by the microbenchmarks I posted above. The cost at switching time is linear in the depth of the greenlet's Python stack, plus a new frame object must be allocated for every frame pushed onto the greenlet's stack since the last switch. I don't believe CPython has used a freelist for frame objects since switching to the interpreter-frame scheme, so that's one heap allocation per stack frame.

Here's a somewhat worst-case example where the greenlet stack has 100 new frames on each switch; the switch takes 2.5 usec longer than it would otherwise. With 10 new frames it's 350 nsec longer.

% python -m timeit -s 'import greenlet
def switcher(main):
  while True:
    recurse(100, main)
def recurse(n, main):
  if n == 0:
    main.switch()
  else:
    recurse(n - 1, main)
gr = greenlet.greenlet(switcher)
gr.switch(greenlet.getcurrent())' 'gr.switch()'
100000 loops, best of 5: 3.22 usec per loop

% python -m timeit -s 'import greenlet
def switcher(main):
  while True:
    recurse(100, main)
def recurse(n, main):
  if n == 0:
    main.switch()
  else:
    recurse(n - 1, main)
gr = greenlet.greenlet(switcher)
gr.gr_frames_always_exposed = True; gr.switch(greenlet.getcurrent())' 'gr.switch()'
50000 loops, best of 5: 5.73 usec per loop

I'm not particularly concerned about the footgun; I think use of gr_frame is a pretty advanced topic and the warning is well-signposted there, and holding onto the result of sys._getframe() is perhaps even more uncommon. I'd be comfortable flipping the default, and/or adding a global gate in addition to the per-greenlet one, but I'd worry that entirely removing the option to not-expose frames will strand someone with a significant perf regression and no way to remedy it.
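
To make the footgun concrete: the only pattern the on-demand rewrite doesn't cover is holding a suspended greenlet's frame across a resumption and walking it later, roughly like this sketch:

import greenlet

def worker():
    main.switch()  # first suspension
    main.switch()  # second suspension

main = greenlet.getcurrent()
gr = greenlet.greenlet(worker)
gr.switch()

saved = gr.gr_frame   # captured while suspended: f_back chain is walkable now
gr.switch()           # resuming gr undoes the rewrite
# Walking saved.f_back at this point is the "stranger" case; it needs
# gr.gr_frames_always_exposed = True to stay reliable.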

jamadden commented 8 months ago

I'm still running higher-level benchmarks with gevent, but at a low level, I don't think the frame manipulation is going to hurt in a significant way. Unfortunately, it's not for a great reason.

I mentioned earlier that, in general, we've been getting slower on newer versions of Python. Here's a set of micro-benchmarks focused on greenlet creation and switching, with one pure function call thrown in (getcurrent). Unless otherwise noted, this is with the released greenlet 3.0.2:

| Benchmark | py3.9 | py3.10 | py3.11 | py3.12 |
|---|---|---|---|---|
| create a greenlet | 49.5 ns | 54.1 ns: 1.09x slower | 62.0 ns: 1.25x slower | 84.9 ns: 1.72x slower |
| getcurrent single thread | 12.1 ns | 12.1 ns: 1.01x slower | 13.1 ns: 1.09x slower | 15.0 ns: 1.24x slower |
| chain(100000) | 132 ms | 136 ms: 1.04x slower | 306 ms: 2.32x slower | 309 ms: 2.35x slower |
| read 2 nested frames | 110 ms | 107 ms: 1.03x faster | 316 ms: 2.87x slower | 338 ms: 3.07x slower |
| read 20 nested frames | 344 ms | 347 ms: 1.01x slower | 443 ms: 1.29x slower | 488 ms: 1.42x slower |
| read 200 nested frames | 3.29 sec | 3.22 sec: 1.02x faster | 1.70 sec: 1.94x faster | 2.00 sec: 1.64x faster |
| Geometric mean | (ref) | 1.02x slower | 1.35x slower | 1.54x slower |

Python 3.9 and 3.10 perform about the same, but things start slowing down after that. The one place we get faster, "read 200 nested frames", is most likely due to interpreter changes and should only be compared within the same version of CPython.

Speaking of benchmarks that can only be compared within the same version of Python, I have a set that tests switching between greenlets of various stack depths (2, 200, and 400 for shallow, deep, and deeper) --- very, very similar to the examples shown above. These overall get faster on newer interpreters, probably largely as a result of reduced function call overhead (as can be seen from the second table, which is basically a benchmark of function calls).

| Benchmark | py3.9 | py3.10 | py3.11 | py3.12 |
|---|---|---|---|---|
| switch between two greenlets (shallow) | 274 ns | not significant | 296 ns: 1.08x slower | 342 ns: 1.25x slower |
| switch between two greenlets (deep) | 44.2 us | 40.9 us: 1.08x faster | 17.3 us: 2.55x faster | 20.6 us: 2.15x faster |
| switch between two greenlets (deeper) | 93.3 us | 85.0 us: 1.10x faster | 34.6 us: 2.69x faster | 42.5 us: 2.20x faster |
| Geometric mean | (ref) | 1.06x faster | 1.85x faster | 1.56x faster |

| Benchmark | py3.9 | py3.10 | py3.11 | py3.12 |
|---|---|---|---|---|
| recur (2) | 208 ns | 188 ns: 1.11x faster | 107 ns: 1.94x faster | 124 ns: 1.68x faster |
| recur (20) | 1.41 us | 1.06 us: 1.34x faster | 609 ns: 2.32x faster | 665 ns: 2.13x faster |
| recur (200) | 15.9 us | 14.2 us: 1.12x faster | 8.15 us: 1.95x faster | 8.30 us: 1.92x faster |
| recur (400) | 34.5 us | 31.4 us: 1.10x faster | 18.3 us: 1.89x faster | 19.9 us: 1.73x faster |
| Geometric mean | (ref) | 1.16x faster | 2.02x faster | 1.85x faster |
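
The recur rows are essentially a bare-recursion baseline at the given depth with no greenlets involved; not the literal benchmark code, but roughly of this shape:

% python -m timeit -s 'def recurse(n):
  if n:
    recurse(n - 1)' 'recurse(200)'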

OK, given all that, how do these benchmarks behave on 3.12 when we do not, or do, manipulate the stacks on switch?

| Benchmark | py3.12 | py3.12 (patched, always exposes) |
|---|---|---|
| create a greenlet | 84.9 ns | 83.7 ns: 1.01x faster |
| read 200 nested frames | 2.00 sec | 2.02 sec: 1.01x slower |
| read 20 nested frames | 488 ms | 492 ms: 1.01x slower |
| read 2 nested frames | 338 ms | 346 ms: 1.02x slower |
| switch between two greenlets (shallow) | 342 ns | 352 ns: 1.03x slower |
| chain(100000) | 309 ms | 324 ms: 1.05x slower |
| switch between two greenlets (deep) | 20.6 us | 29.7 us: 1.44x slower |
| switch between two greenlets (deeper) | 42.5 us | 62.2 us: 1.46x slower |
| Geometric mean | (ref) | 1.10x slower |

The first four rows are essentially unchanged, as they should be, since they don't involve much switching. The "shallow" call-stack switch is slightly slower, and the deeper the call stack gets, the bigger the hit --- which appears to be linear in the depth, also as expected. If a 200-deep call stack is towards the high end of the bell curve, then yes, the relative difference looks scary, but the absolute cost --- about 10 microseconds --- seems reasonable.

But, coming from an earlier version of Python, how much of a performance difference will there be? As mentioned above, it's hard to compare across versions, but let's try. If we compare 3.9 to 3.12 and subtract out the function-call time (taken from the "recur" benchmarks at the matching depth), this is what we get. First, the raw numbers:

| Benchmark | py3.9 | py3.12 |
|---|---|---|
| switch between two greenlets (shallow) | 274 ns | 342 ns: 1.29x slower |
| switch between two greenlets (deep) | 44.2 us | 29.7 us: 1.49x faster |
| switch between two greenlets (deeper) | 93.3 us | 62.2 us: 1.50x faster |

We look pretty decent in most of those! But taking out the function call time, we get:

| Benchmark | py3.9 (no function overhead) | py3.12 (patched, always exposes, no function overhead) |
|---|---|---|
| switch between two greenlets (shallow) | 65 ns | 228 ns: 3.51x slower |
| switch between two greenlets (deep) | 28.3 us | 21.4 us: 1.32x faster |
| switch between two greenlets (deeper) | 58.8 us | 42.3 us: 1.39x faster |

In all except the shallowest case, we're substantially faster. And the shallowest case only makes us look bad because the function call overhead is so much higher on 3.9 than it is on 3.12 --- subtracting 208 ns from 274 ns doesn't leave much (again, I'm not sure this is a valid way to attempt to normalize comparisons). Based on the trends, though, we can guess that somewhere between 2 calls on the stack and 200 calls on the stack, Python 3.12 with this patch starts out-performing Python 3.9 and 3.10.
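
Spelling out that normalization (with the same caveat that it may not be a valid comparison), using the rounded figures from the tables above; the 3.12 shallow input is the patched 352 ns number, and the 3.9 shallow entry comes out at 66 ns rather than 65 ns only because of rounding in the inputs:

# Greenlet switch time minus the bare recur() time at the same stack depth.
pairs = {
    "shallow, 3.9":  (274, 208),     # ns: shallow switch - recur(2)
    "shallow, 3.12": (352, 124),     # ns: patched, always exposes
    "deep, 3.9":     (44.2, 15.9),   # us: deep switch - recur(200)
    "deep, 3.12":    (29.7, 8.30),   # us
    "deeper, 3.9":   (93.3, 34.5),   # us: deeper switch - recur(400)
    "deeper, 3.12":  (62.2, 19.9),   # us
}
for name, (switch, recur) in pairs.items():
    print(name, round(switch - recur, 1))
# -> 66, 228, 28.3, 21.4, 58.8, 42.3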

Most real world functions do something besides call one other function, so the relative time spent switching will probably be somewhere between those two sets of results. And I haven't actually encountered a real-world application where greenlet switching time was the bottleneck, so Amdahl's law applies.

These are low-level benchmarks. But unless I'm missing something, I don't see anything too concerning here --- nothing that makes me rethink the plan.

And again, gevent benchmarks still to come.

oremanj commented 8 months ago

This is a great suite of benchmarks, thank you! I'm curious how the higher-level benchmarks turn out, but I'm convinced that safe-by-default is the right path forward. My inclination is to still expose the option to use the faster/less-safe mode, perhaps under a name like gr_fast_frames_unsafe that encourages people to read the fine print. But I guess we could also hold off until/unless someone asks for it.

jamadden commented 8 months ago

Picking 30 or so of gevent's benchmarks --- only one of which is specifically designed to test switching speed, but most or all of which do involve switching --- shows no significant performance difference between always exposing the frames and not. (The variance it does show, 2--5% or so, is well within the normal run-to-run variance I see, even with pyperf's rigorous mode.) I'll proceed with always exposing the frames.

I wanted to be comprehensive, so in addition to running the benchmarks in gevent's default mode, using all of its Cython-compiled C accelerator modules, I also ran the set in pure-Python mode, cutting the C modules down to a minimum. The results were interesting.

| Benchmark | py3.9 | py3.10 | py3.11 | py3.12 |
|---|---|---|---|---|
| Cython accelerators | (ref) | 1.03x slower | 1.36x slower | 1.54x slower |
| Pure Python | (ref) | 1.06x faster | 1.06x faster | 1.04x slower |

Using the native code libraries, there's a clear degradation of performance from 3.9 through 3.12. But without them, there's overall not much difference. So, maybe Cython is what's getting slower?

To be clear, these are relative numbers; in absolute terms, the compiled benchmarks are still notably faster than the pure-Python versions. What's especially interesting is that the difference between them appears to be shrinking. This probably bears looking into, if I can ever get around to it.

| Benchmark | py3.9 | py3.10 | py3.11 | py3.12 |
|---|---|---|---|---|
| Pure Python relative to Cython | 1.91x slower | 1.75x slower | 1.32x slower | 1.29x slower |