python / cpython

The Python programming language
https://www.python.org
Other
62.76k stars 30.08k forks source link

missing global names inside a class body (calling `exec` with a `ChainMap`) #121306

Open CNSeniorious000 opened 3 months ago

CNSeniorious000 commented 3 months ago

Bug report

Bug description:

I need to implement layered context running python code, so I choose ChainMap. But Python needs every globals passed to exec to be a dict, so I mixins it into ChainMap.

This patch runs well as far, but in a class it can't get the value from the outside.

I think this may because in the class context CPython uses some function like PyDict_Get instead of __getitem__, so the values didn't get copied into the class's globals context.

This bug caused this code fails:

from collections import ChainMap

class ChainMap(ChainMap, dict):  # globals must be a real dict
    pass

source = """
a = 1

class A:
    print(a)
"""

exec(source, ChainMap())  # this line raises NameError
Other reproduction approaches

This is reproduceable in [`pyodide`](https://github.com/pyodide/pyodide) (a wasm port of CPython 3.12.1), so you can run the code above by just clicking one of the following link: - [Stackblitz](https://stackblitz.com/edit/chainmap-class-locals-issue?file=main.py) - you can edit code and the result will sync instantly - [Python Online](https://py3.online/console#ZnJvbSBjb2xsZWN0aW9ucyBpbXBvcnQgQ2hhaW5NYXAKCgpjbGFzcyBDaGFpbk1hcChDaGFpbk1hcCwgZGljdCk6ICAjIGdsb2JhbHMgbXVzdCBiZSBhIHJlYWwgZGljdAogICAgcGFzcwoKCnNvdXJjZSA9ICIiIgoKY2xhc3MgQToKICAgIHByaW50KGEpCgoiIiIKCgpleGVjKHNvdXJjZSwgeyJhIjogMX0pCgoKZXhlYyhzb3VyY2UsIENoYWluTWFwKHsiYSI6IDJ9KSkgICMgdGhpcyBsaW5lIHJhaXNlcyBlcnJvcgo=) - this is faster but just has a console instead of an IDE

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

Linked PRs

picnixz commented 3 months ago

The docs say (emphasis mine):

If only globals is provided, it must be a dictionary (and not a subclass of dictionary), which will be used for both the global and the local variables. If globals and locals are given, they are used for the global and local variables, respectively. If provided, locals can be any mapping object

I don't think this one should count as a bug. But maybe it could count as a bug in the sense that it raises a NameError (like, it doesn't seem to check the exact type)

CNSeniorious000 commented 3 months ago

Thank you! I found what you quoted in the docs here. Looks like raising a TypeError in exec and eval when type(globals) is not dict is reasonable.

That's a bit sad. I've been misusing this for a year in several packages. I used to think using a mixined ChainMap as globals is the best practise to implement layered context, for usages that we need to make the input dict readonly, or lazy-loaded, until I encountered this issue. I think it would be great to loosen this constraint maybe in the future, I mean, allow any dict subclass for globals. In fact, it worked in most cases now.


On the other hand, may I ask how can I implement layered context for exec without misusing globals? My only thought is hacking builtins like this:

{..., '__builtins__': ChainMap(..., builtins.__dict__)}
gaogaotiantian commented 3 months ago

We need to address this issue for whether we should allow a dict subclass to be passed in as globals. Currently, even we have some tests (test_alt_framework_macos, test_exec_with_unusual_global etc.) that do this - something explicity disallowed in the documentation.

This will never fully work in the current interpreter as LOAD_NAME assumes globals() to be an exact dict.

Also the documentation is not completely correct - even if the user passes both globals and locals, using a non-exact-dict globals will have issues.

So, we have a couple of options:

  1. Enforce this restriction, raise a TypeError in exec() and eval().
    • Potential backwards compatibility issue, some code that works before will not anymore
    • We need to change our tests.
  2. Change LOAD_NAME so it can deal with a mapping globals().
    • The performance will be impacted
    • No backwards compatibility issue and we can remove the restriction (as far as I can tell)
  3. Change the documentation, instead of saying the user must not do that, say that the user is not supposed to do that - it may or may not work.
    • No code changes at all
    • Things that work will still work and things don't work - well we told you it might not work.

@markshannon any suggestion here as you are the owner of the interpreter now.

blhsing commented 3 months ago

Change LOAD_NAME so it can deal with a mapping globals().

  • The performance will be impacted
  • No backwards compatibility issue and we can remove the restriction (as far as I can tell)

Great analysis. I vote for option 2 because I believe there are more people who care more about versatility than performance when they choose Python over other programming languages for a task. Note that there's a significant amount of code that calls PyDict_* functions besides the LOAD_NAME handler code in bytecodes.c that need to be modified to their PyMapping_* equivalents in order to make globals() as a mapping work.

picnixz commented 3 months ago

I have a branch with the exact check and I needed to fix a lot of tests... honestly I'd prefer amending the docs. But making LOAD_NAME works for nonexact dict might be a not so good idea if you lose performance. Maybe we can have two implementations for that when you assume that it's a dictionary and otherwise a (mutable) mapping?

gaogaotiantian commented 3 months ago

I only looked at LOAD_GLOBAL so I was under the impression that LOAD_NAME is the only bytecode that needs the change. Obviously at least STORE_GLOBAL and DELETE_GLOBAL use PyDict_*. Then I don't believe this is an option, too many code changes. We have been treating globals as a dict so I think we should enforce that.

I believe there are more people who care more about versatility than performance when they choose Python over other programming languages for a task.

We had a lot of great features when all local variables are in a real dict and we moved forward to the current fast locals implementation. I don't think we should sacrifice such a common bytecode just for some super dark corner.

The reason I said it was a super dark corner is because - STORE_GLOBAL will break a non-dict globals and no one had that issue yet.

I'm leaning forward to enforcing globals to be an exact dict, and that's basically our assumption now.

picnixz commented 3 months ago

So maybe 1 is the best. I already have something on that direction so I'll push it by the end of the day, hopefully.

picnixz commented 3 months ago

After investigating a bit more tests (e.g., in test_getpath), I'm wondering whether we could just accept subtypes of dict that are just subtypes but not mixed in with other classes. Otherwise a lot of tests need to be rewritten.

I have an important question. Why does it assume that it's a dict (and it still works when it's not a dict but a subclass of a dict without a mixin?) Is it only because of PyDict_* functions it uses or is it because of some other assumptions? if it's only because of PyDict_* and that it would still work for subclasses of dict without mixin, I may suggest the following:

I really want to known which invariants are expected for it to work (at least, document it internally if it's not already the case (I don't know where to look at for that)). Why would it work for class frozendict(dict) when frozendict is read-only (I mean we have tests assessing that the exception is thrown...)? and there are some tests where you have some pre-defined dict values and where you override __missing__ so that you can use attributes instead, so I would say those are fine, isn't it?

CNSeniorious000 commented 3 months ago

As the one who posted this issue, I want to say something. Although I am not familiar with CPython, I do not want to hurt its performance. If allowing globals to be any subclass of dict or any mapping-like instance is possible with impact on performance, I don't want it to happen. My use cases are not common and I think (2) may be not worthy.

However, using a subclass does work in most cases. The easiest way is to just subclass a dict and insert a print(key) inside its __getitem__ and __setitem__, and any load name happens inside an exec got printed. So I believe currently __getitem__ and __setitem__ are called during name resolution.

The only problem I encountered is that when accessing variables in a class.

I've tried using __missing__ and subclass without mixins, like this:

class D(D):
    def __missing__(self, k):
        import builtins
        return builtins.__dict__.get(k, k)

Everytime I access a missing key, it should return the key as-is:

eval('a', D())  # -> 'a'

But in a class scope, I still can't access the value:

exec('''
def f():
    print(a)

f()

class A:
    def f(self):
        print(b)

A.f(0)
A().f()

class B:
    print(c)
''', D())

Only the last print raises:

Traceback (most recent call last):
  ...
  File "<string>", line 15, in B
NameError: name 'c' is not defined

I don't know what happens when the interpreter enters a class context. I believe some copy happens here but the type doesn't remains, so accessing 'c' is done on a pure dict instead of our D.

So, my thought is: although the docs say that it can't be a subclass of dict, it actually can. LOAD_NAME is using __getitem__ somewhere. If this is real, then (2) wont impact on performance to a certain degree. And I do understand this may influence many tests so maybe it still not worth it. Then I personally prefer (3) over (1).


edit (2024/7/6):

I found that wrapping the code in a function definition can make it support ChainMap as globals. This may be a userland fix for this issue.

Code

```py from collections import ChainMap class ChainMap(ChainMap, dict): # globals must be a real dict pass source = """ def _(): a = 1 class A: print(a) _() """ exec(source, ChainMap()) # 1 ```

gaogaotiantian commented 3 months ago

Basically, there's a giant hole here - it's about LOAD_NAME, STORE_NAME vs LOAD_GLOBAL, STORE_GLOBAL. They have different assumptions about the global dict.

Your access a "global variable" inside a class will become a LOAD_NAME, while inside a function, it's a LOAD_GLOBAL. A global assignment in a module translates to a STORE_GLOBAL, however, when it is passed to exec, it's a STORE_NAME, unless you use global a which can force it to be a STORE_GLOBAL.

Therefore, the whole thing is a mess now and we should probably set up a rule for this - do we allow a dict-like global? If so, how alike? PyDict_* APIs require the underlying C data structure for dict, basically that means the new method of the real dict has to happen - that is not really the guarantee for derived class right? We can have a "pure subclass" of a dict that does not instantiate the object as dict and potentially cause some trouble.

In any case, we need extra input here. Let's wait for @markshannon 's opinion, and possibly other core devs.

blhsing commented 3 months ago

Basically, there's a giant hole here - it's about LOAD_NAME, STORE_NAME vs LOAD_GLOBAL, STORE_GLOBAL. They have different assumptions about the global dict.

While I totally agree with you on the need to standardize the assumptions about the globals dict from the implementations of different instructions involving globals, the lack of such a standardization is not really an obstacle to allowing a mapping as the globals argument to exec and eval while maintaining full backwards compatibility.

This can be done by adding a PyDict_Check-guarded fast path to existing PyDict_* calls with a fallback to a slow path to equivalent PyMapping_*/PyObject_* calls. Note that nothing needs to be done for LOAD_GLOBAL since there is already such a slow path after failing PyDict_CheckExact: https://github.com/python/cpython/blob/59be79ae60073f7b6bdf6ce921560c279937e4ab/Python/bytecodes.c#L1612

The performance impact with this approach should be minimal since PyDict_Check is cheap.

Please feel free to poke holes at my PR #121389.

gaogaotiantian commented 2 months ago

Allowing an arbitrary mapping for PyDict_* APIs is an even more backwards incompatible change. I don't believe that would be accepted. Also even for this specific scenario, this change basically determines what's the standard behavior, without explicity discussing and documenting.

blhsing commented 2 months ago

Allowing an arbitrary mapping for PyDict_* APIs is an even more backwards incompatible change. I don't believe that would be accepted. Also even for this specific scenario, this change basically determines what's the standard behavior, without explicity discussing and documenting.

Sorry, but can you elaborate on exactly in which ways allowing a mapping to the existing API for exec and eval would cause backwards incompatibility? My PR passes all existing tests with the only modifications being that some tests that expected SystemError or TypeError no longer do.

To quote your assessment above:

So, we have a couple of options: ...

  • Change LOAD_NAME so it can deal with a mapping globals(). The performance will be impacted No backwards compatibility issue and we can remove the restriction (as far as I can tell)
gaogaotiantian commented 2 months ago

I thought you are changing PyDict_* APIs. It seems like you are changing the calls to those APIs, then that's a much smaller issue. Like I said, I don't know if it's desired to accept arbitrary mappings as globals. That would add some extra burden to maintainability. I'm not saying it's not the right way to go, it's definitely an option. I'm just saying that we need more input from core devs who are more familiar with the context.

blhsing commented 2 months ago

I thought you are changing PyDict_* APIs. It seems like you are changing the calls to those APIs, then that's a much smaller issue.

Ah I see. Yeah my modifications are strictly about adding calls to the mapping API as a fallback so what works right now should continue to work.

Like I said, I don't know if it's desired to accept arbitrary mappings as globals. That would add some extra burden to maintainability. I'm not saying it's not the right way to go, it's definitely an option. I'm just saying that we need more input from core devs who are more familiar with the context.

I think the OP's use case of a layered context is a good one, as I also had such needs for dynamic injection of namespace before and had to resort to uglier workarounds. Python is known for its customizability, and a generalization of globals as a mapping like locals and builtins already are enables users to tailor namespace behaviors closer to the mental model of such projects.

For my PR though I've limited the scope of change to the execution path of exec and eval. __globals__ of a function and __dict__ of a module continue to be dict-only so the change should be relatively manageable.

@markshannon can you help with some feedbacks when you get a chance? Thanks!

markshannon commented 2 months ago

I don't see a good reason why the globals must be a dictionary in some places, but not in others that appear very similar.

If we are going to be consistent, it must be to allow general mappings, as restricting globals to a dictionary in all cases would be a breaking change.

Since our stats show that the only instruction under discussion that is performance critical is LOAD_GLOBAL and we already specialize that, there should be no noticeable performance impact.

blhsing commented 2 months ago

I don't see a good reason why the globals must be a dictionary in some places, but not in others that appear very similar.

If we are going to be consistent, it must be to allow general mappings, as restricting globals to a dictionary in all cases would be a breaking change.

Agreed. I was hesitant to allow a module's __dict__ to be a mapping because by the same reasoning of consistency it would necessarily mean that a mapping must be allowed to be the __dict__ of any object, since _Py_module_getattro_impl gets a module's attribute by calling _PyObject_GenericGetAttrWithDict. That sounds like a change bigger than I anticipated, but I now think the consistency makes sense. Will go forward with such a modification then. Thanks for the feedback!

blhsing commented 2 months ago

I've updated the PR to allow a mapping to be the __dict__ attribute of any object. Can you help review the PR, @markshannon? Thanks!