Accounting for possible integration with the frame stack in 3.7+

ncoghlan commented 7 years ago

While I realise logical context management is something that needs to be solved for the async framework case in at least 3.5+ (and perhaps even 2.7 if we can find a way to manage it), a possibility I've been discussing with @njsmith and @1st1 is the idea of solving this problem in the general case for Python 3.7+, such that code like:

def rounded_decimals(iterable, precision):
        with decimal.localcontext() as ctx:
            ctx.prec = precision
            for value in iterable:
                yield +value

doesn't break code like the following the way it does today.

def surpising_state_changes(iterable):
    # Original decimal context applies here
    for value in rounded_decimals(iterable, precision=10):
        # Oops, the decimal context precision change applies here
        if value > 10: break
    # Worse, the decimal context here depends on:
    #    - whether or not the generator was exhausted
    #      (if it was, precision will have been reverted)
    #    - whether or not the interpreter uses a reference counting GC
    #      (if it does, precision will have been reverted even if we broke out of the loop)

Similarly, it would be desirable to support deterministic frame-linked cleanup in cases like:

def currently_nondeterministic_cleanup(fname):
        # File cleanup here is non-deterministic if the generator isn't run to completion
        # And you're not using a refcounted implementation like CPython
        # However, you won't get a ResourceWarning about it: http://bugs.python.org/issue28629
        with open(fname) as f:
            yield from fname

The reason I think this is relevant to your API design sketch here is that it means we probably need to set out two different sets of design goals:

for 3.5+ (and hopefully 2.7), make good logical context management possible if you're diligent about creating and destroying logical contexts at the appropriate time. This will likely only happen consistently if you're using an asynchronous framework that manages them for you (e.g. asyncio, tulip, curio, Twisted, Tornado)
for 3.7+, aim to make good logical context management straightforward by implicitly linking it to execution frame lifecycles if you don't otherwise manage it explicitly, as the differences in frame lifecycle between synchronous and asynchronous code are one of the main causes of resource management and shared state manipulation problems

To better explain the latter point, one of the key differences between synchronous code and asynchronous code in Python is that:

In synchronous code, the Python level frame stack aligns 100% with the C level function call stack, so you know not only that thread local state will outlive the execution lifecycle of the current frame, but also that any changes you make to thread local state will be invisible to other code as long as you revert them before the frame finishes execution. Thus the "change system state" + "revert state change" model supported by context managers is sufficient for both resource management and thread local state management.
In asynchronous code (i.e. generators, coroutines and callback chains), there are two different Python level frame stacks - the traditional synchronous one that tracks "this is what is executing in this thread right now" (i.e. the one sys._getframe() and inspect.currentframe() grant access to), and the logical asynchronous one, which currently isn't exposed for introspection in any consistent manner.

Thus, in asynchronous code, both of those core assumptions about the relationship between frame local state and thread local state are broken, and they're broken in essentially the same way that assumptions about safely accessing process global state from a synchronous function call are broken by multi-processing and multi-threading:

you can't be sure every invocation of the frame will see the same thread local state (the multi-processing problem)
if you suspend frame execution in any way, other code may see your modifications to the thread local state even if you revert those changes before returning your final result (the multi-threading problem)

Python 3.5+ provides at least partial visibility into the logical call stack, in that given a currently running coroutine or generator, you can trace the stack down via the gi_yieldfrom and cr_await attributes, however there's currently no way to do the reverse lookup at runtime:

running frames don't have a reference back to the object that originated them (whether that's a function, coroutine, or generator-iterator): http://bugs.python.org/issue12857
there aren't any gi_back or cr_back attributes to trace the logical call stack backwards for yield from and await

Given a robust logical context management system as a foundation, I think the above problems are solvable given some additional changes to the frame management requirements defined at the language level, but I also think there are going to be limits to how easy to use a logical context management system can be without those language level changes.

mitsuhiko commented 7 years ago

The decimal example is precisely why logical call contexts are great because you don't accidentally lose it. The same is necessary for translations in Django (or in theory the entire locale and gettext module in the stdlib should use it) and more.

The other case where this is super important are security contexts where you don't accidentally lose it. Explicit passing of information does not work there. This is a long recognized issue in Unix as well where the entire process sets an umask and you don't set the chmod on every file you open.

mitsuhiko commented 7 years ago

For what it's worth this repo is more about the API and less about the implementation. In particular the call contexts could be implemented a lot more efficiently if they directly map to frames.

ncoghlan commented 7 years ago

The latest iteration of the API looks really nice. In relation to isolated_call_context() and ctx.nested(), I was considering how you might be able to detect "thread context leaks", where a nested thread level context (especially an isolated one) escapes "up the stack" and becomes visible in the caller by way of yield or await being run while the thread context was modified.

Unfortunately, I still don't see any way of resolving that in the general case without the language level changes I mentioned above - as far as I can tell, we either need some form of "frame suspended" and "frame resumed" notification protocol as @njsmith has suggested (perhaps via callbacks registered with the frame), some form of per-frame context that is implicitly created on demand (inheriting from the nearest logical parent frame, but in a copy-on-write fashion), so any given frame will automatically ignore new logical contexts created by child frames, or else support for frame weak-references, so a similar setup could be used for frames as your draft idea for asyncio.

(Frame IDs wouldn't work, since frames are aggressively recycled via the freelist in CPython, and strong references would be hard to clean up properly. And not only don't frames support weak references, neither do any of their attributes)

mitsuhiko commented 7 years ago

Can you clarify what the exact case of thread context leak would be? In theory a context should never leak if you mark data accordingly. This is also the reason why the local flag exists. The common problem of leaks is known in other environments as "the user level queue problem" where you have some main loop code that dispatches to sub tasks and they should not pick up on potentially privileged context data from outside.

The way my proposal deals with this is twofold: first data can be marked as local which indicates that it will automatically become inaccessible for such isolated children. Secondly the context data management becomes a front and center API for users so that such code becomes immediately more aware of it. The issue that .NET has for instance is that the call contexts are so well hidden that most people are not even aware they exist. With this we would make this a conceptionally a very user facing API that everybody is aware of.

For the binding to frames I think the most appropriate behavior long term is that a frame starts and assigns itself an incremental integer (64bit might be enough) and that number is propagated and updated according to the context behavior. So a coroutine might inherit this value or not depending on the rules we define. If this stuff all moves into the interpreter it becomes much simpler and also leaves potential for improving the performance of it. For instance it could become part of the interpreter state directly.

On 09/11/2016 12:58, Nick Coghlan wrote:

The latest iteration of the API looks really nice. In relation to |isolated_call_context()| and |ctx.nested()|, I was considering how you might be able to detect "thread context leaks", where a nested thread level context (especially an isolated one) escapes "up the stack" and becomes visible in the caller by way of |yield| or |await| being run while the thread context was modified.

Unfortunately, I still don't see any way of resolving that in the general case without the language level changes I mentioned above - as far as I can tell, we either need some form of "frame suspended" and "frame resumed" notification protocol as @njsmith https://github.com/njsmith has suggested (perhaps via callbacks registered with the frame), some form of per-frame context that is implicitly created on demand (inheriting from the nearest logical parent frame, but in a copy-on-write fashion), so any given frame will automatically ignore new logical contexts created by child frames, or else support for frame weak-references, so a similar setup could be used for frames as your draft idea for asyncio.

(Frame IDs wouldn't work, since frames are aggressively recycled via the freelist in CPython, and strong references would be hard to clean up properly. And not only don't frames support weak references, neither do any of their attributes)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mitsuhiko/python-logical-call-context/issues/1#issuecomment-259398560, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAc5HDJyDg0Map_2AQEoCkbLibxAI7zks5q8bV_gaJpZM4KtLh2.

ncoghlan commented 7 years ago

Based on my reading of the code (I admit I didn't actually it try thia out), it looks like both of the following cases would leak state changes from a generator to its caller:

def call_in_nested_context(iterable):
    ctx = contextlib.get_call_context()
    for f in iterable:
        with ctx.nested():
            yield f()

def call_in_isolated_context(iterable):
    for f in iterable:
        with contextlib.isolated_call_context():
            yield f()

def surprising_state_changes(iterable):
    # Original thread context is set here
    for value in call_in_nested_context(iterable):
        # contextlib.get_call_context() returns nested context here
        if value is None: break
    # Result of  contextlib.get_call_context()
    #    - whether or not the generator was exhausted
    #      (if it was, active thread context will have been reverted)
    #    - whether or not the interpreter uses a reference counting GC
    #      (if it does, thread call context will have been reverted even if we broke out of the loop)

The call_in_isolated_context would be similar, but the ambiguity would be around the state of ctx.isolated rather than the identity of the active context. To write those generators correctly, you have to carefully restore the original state before yielding each result:

def call_in_nested_context(iterable):
    ctx = contextlib.get_call_context()
    for f in iterable:
        with ctx.nested():
            value = f()
        yield value

def call_in_isolated_context(iterable):
    for f in iterable:
        with contextlib.isolated_call_context():
            value = f()
        yield value

That kind of bad implicit interaction between thread local state (of any kind) and generators is the main challenge I was referring to when I suggested that there would be some usability issues that can't be addressed without active cooperation from the interpreter's frame management machinery.

mitsuhiko commented 7 years ago

What do you mean by "leak" here? It's intentional that call contexts are propagated. If you want something not to be propagated it has to be marked as local and contextlib.isolated_call_context has to be used.

mitsuhiko commented 7 years ago

For what it's worth, if you are referring to generators not closing deterministically it might make sense to somehow disallow generators to yield when a context modification took place. There are dragons there if people do this kind of stuff.

//EDIT: removed the part about the GC, i understand where you are coming from here.

ncoghlan commented 7 years ago

The GC controls when GeneratorExit gets called on a suspended generator frame. For example, in CPython 3.5:

>>> def gen():
...     try:
...        yield from range(10)
...     finally:
...        print("Closing generator")
... 
>>> for i in gen(): break
... 
Closing generator
>>>

Here, the still suspended frame gets cleaned up eagerly as soon as the calling frame drops the for loop's internal reference.

But a comparable example from PyPy 5.0.1:

>>>> def gen():
....     try:
....         for i in range(10): yield i
....     finally:
....         print("Closing generator")
....         
>>>> for i in gen(): break
>>>> import gc
>>>> gc.collect()
Closing generator
0

Here, PyPy doesn't clean up the suspended generator frame until the next GC run, rather than cleaning it up immediately at the end of the loop.

You also can't be 100% sure that the generator's close() method will always be called from the same thread that called __next__() the first time so you may end up "reverting" the wrong thread state.

As far as @njsmith, @1st1 and I have been able to figure out, this is an inherent limitation of our current language level frame management model - all current user libraries can say is "Don't yield without first reverting any alterations you've made to thread local or process global state if you need truly deterministic resource management"

ncoghlan commented 7 years ago

Ah, our messages crossed in flight. Yes, you understood what I meant, and yes, I think "Don't do that" is the only current recommendation we can give.

It would be nice if we could detect it and complain, but I haven't thought of a way to do that yet.

mitsuhiko commented 7 years ago

The problem with this is that this is really an issue with generators more than with context management. Unless we want to start maintaining context data more like individual stacks that can be pushed and popped as stuff is yielded from generators I am not sure how much can be done about that.

mitsuhiko commented 7 years ago

For what it's worth in Node land this issue is resolved by interpreter callbacks that are invoked every time node switches execution. The downside obviously is that this is quite an expensive operation.

ncoghlan commented 7 years ago

The draft concept I came up with is to have a form of local storage that lives directly on the frame itself (which I know JIT authors won't like because it will be another case that means they can't always optimise the frame instance away entirely).

Then when the thread-based call context code looks up or sets the active thread context, it would also cache it on the current frame. Context lookups would check for a frame context before looking elsewhere (like the active thread context in the thread local state).

Operations like yield, await and function calls would then become frame context aware - if neither the frame being started, suspended or resumed, nor the one suspending or resuming it, has a context set, they don't do anything new (avoiding overhead in the typical case), but if they do, then they change the active thread context accordingly (I haven't got as far as figuring out what the exact transition rules would be, as I think that needs to be informed by what you're able to achieve without frame level integration, but I'm confident it will be possible to come up with sensible rules based on that experience and the known problems with generator resource management).

mitsuhiko commented 7 years ago

One limitations with that approach would be that we would never be able to put resource management into the context. That might be an okay enough restriction but I see that being potentially problematic if people want to put resources like database connections into it.

ncoghlan commented 7 years ago

Yeah, actually doing it that way would also require defining a way to register cleanup functions with a frame, as we'd be dealing with a 2-dimensional dynamic resource management model at that point: the synchronous context in one direction (i.e. "What's running in this thread right now", which context managers already handle) and the frame context in the other (i.e. await chains, yield from chains, callback chains).

(The use case where I really started considering the question of frame linked storage was when @njsmith started asking about ways to ensure file handles opened by generators were handed back to the operating system deterministically)

njsmith commented 7 years ago

I skimmed the source here a bit, but don't have time to read through it in detail right now. It seems like there are a number of subtleties to the semantics... I assume that at some point there will be a nice detailed PEP for me to read? :-)

In the mean time, I wanted to throw out a few things that might be worth keeping in mind in the discussion:

PEP 521 provides one mechanism for making "local modifications" that respect yields. (Right now it only talks about synchronous context managers and yields in general; after the discovery of this bit of obnoxiousness then I plan to update it so that async context managers have __asuspend__/__aresume__ methods that get called on async yields (the kind that async generators make).) Not sure if it's better or worse than the other options here, but for right now at least it seems like the best solution to some cases, to me. We might well need multiple partially-overlapping mechanisms here.
I'm a bit dubious about the whole "frame stack" concept in general -- it seems very complicated and it's not clear to me that for regular generators you even want the semantics that they always execute in the context of who-ever first started them. Certainly the obnoxious timeout example I linked above is not going to be solved by this, because timeouts need to propagate through the real call stack -- async with timeout_after(5): await some_generator.__anext__() had better not disable the timeout because it's in a different context!
For some use cases, it's important that accessing the context be very fast. E.g. numpy has to read the current errstate configuration on every arithmetic operation. Thread-local-storage is ideal for this, because modern operating systems have ridiculously optimized implementations that make it basically free to access (like, the same cost as accessing a global variable in C). So from numpy's point of view, PEP 521 + regular TLS is perfect, because the lookup cost remains the same, and the overhead for most with blocks can be partially or totally optimized away, so that for the most part the only real cost is to people who are actually combining with + np.errstate + yield.

ncoghlan commented 7 years ago

The "fast access to the context" concern applies to decimal as well (especially cdecimal). However, the other prior art we can take into account here is the settrace function and weakref - rather than thinking in terms of protocol methods on objects, we may want to be thinking in terms of callbacks on frames:

a thread level one for starting new frames
frame level ones for suspending, resuming, and terminating frames

ncoghlan commented 7 years ago

In a callback based model like that, it would be context manager __enter__ methods that gained the responsibility of registering the appropriate callbacks with the running frame (so file.__enter__ would register self.close() as a frame level termination callback when inside a non-synchronous frame, for example), and only generator-iterator and coroutine frames would permit registering suspend and resume callbacks.

mitsuhiko commented 7 years ago

So from numpy's point of view, PEP 521 + regular TLS is perfect, because the lookup cost remains the same, and the overhead for most with blocks can be partially or totally optimized away, so that for the most part the only real cost is to people who are actually combining with + np.errstate + yield.

The lookup cost is suppose to be not more expensive than a TLS lookup anyways (other than the method overhead we have in Python). The issue is context switching and the solution in PEP 521 is an example of the worst case there because the logic is entirely moved to the Python side. To use a publicly documented protocol for this seems questionable.

njsmith commented 7 years ago

The lookup cost is suppose to be not more expensive than a TLS lookup anyways (other than the method overhead we have in Python).

Not sure what this means. A Python method call is way more expensive than a TLS lookup? (A TLS lookup using __thread is literally 2 cache misses in the worst case; method calls involve, like, allocating memory and stuff.)

The issue is context switching and the solution in PEP 521 is an example of the worst case there because the logic is entirely moved to the Python side. To use a publicly documented protocol for this seems questionable.

I have no idea what this means either :-).

mitsuhiko commented 7 years ago

@njsmith the public python exposed API (ignoring interpreter internals or C API) for both thread locals as well as the context locals is a method call (or attribute lookup). Internally in a real implementation obviously the reference to the current context is implemented with a native TLS variable.

I have no idea what this means either :-).

When you use that pep as guidance you invoke __suspend__ and __resume__ on any context switch (eg: yield out/in, greenlets switch etc.). This is expensive particularly because these might be python callbacks.

njsmith commented 7 years ago

Internally in a real implementation obviously the reference to the current context is implemented with a native TLS variable

Sure, but then there's also the question of: once you have a reference to the current context, how do you find the thing you're actually interested in? It's not obvious that this method will be terribly fast...

This is expensive particularly because these might be python callbacks.

Sure, but they don't have to be (e.g. decimal and numpy can certainly decide to implement them in C). And again, this is a cost that's paid only for people who are actually using the context managers that have the expensive callbacks. My main concern is that we not add (much) overhead to existing users who aren't doing anything clever with asyncio or whatever.

(Maybe it's useful to give an example of how careful NumPy has to be about per-operation overhead... one of the things NumPy has to do on each operation is to check inputs for special override attributes that usually don't exist, like __array_priority__. Here's how we do attribute lookups, along with this helper function (also used in other places). Calling PyObject_GetAttrString is way too expensive. That said -- this is partially because there are a number of attributes we have to look up, and CPython's standard attribute lookup APIs are gratuitously expensive for missing attributes because they allocate and set an exception. Our current code for looking up the errstate is not as carefully optimized as this. But we don't want to block optimizing it in the future!)

mitsuhiko commented 7 years ago

Sure, but then there's also the question of: once you have a reference to the current context, how do you find the thing you're actually interested in?

Not sure what you mean. The reference is what you are looking for. The only thing that method does is add some sanity checks that could move elsewhere.

Sure, but they don't have to be (e.g. decimal and numpy can certainly decide to implement them in C).

It doesn't really matter, once it's an exposed API you are always guaranteeing that this sort of thing will be supported.

mitsuhiko / python-logical-call-context

Accounting for possible integration with the frame stack in 3.7+ #1