Open 1d44e9fe-3e56-4bea-8daf-72265b4ffc8d opened 2 years ago
This works:
𝕕𝕖𝕗 = 1
This raises SyntaxError:
import ast
exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))
It looks like ast.parse creates a Name node with id='def', which is correct per PEP 3131, but ast.unparse doesn't know it needs to mangle the output somehow, as "𝕕𝕖𝕗" or a similar Unicode replacement.
I can confirm that it happens on all versions from 3.9 to 3.11 (main).
Python 3.9.9 (main, Dec 21 2021, 11:35:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
'def = 1'
>>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))) # SyntaxError
Python 3.11.0a4+ (heads/main-dirty:ef3ef6fa43, Jan 20 2022, 20:48:25)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
'def = 1'
>>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))) # SyntaxError
Technically, this is a bug, in that it breaks the only guarantee of ast.unparse:
Unparse an ast.AST object and generate a string with code that would produce an equivalent ast.AST object if parsed back with ast.parse().
But I am not really sure if it should be handled at all, since we don't have access to the original form of the identifier in the AST due to the parser's normalization behavior.
If we only want to create source that would give the same AST, then, abusing the fact that the original keywords are always basic ASCII, we could embed a map of characters that converts ASCII 'a', 'b', 'c', ... to their most similar Unicode versions (https://util.unicode.org/UnicodeJsps/confusables.jsp). But I feel like this is a terrible idea, with no real gain (the use case is very limited) and very prone to confusion.
I think just adding a warning to the documentation regarding this should be the definitive resolution, unless @pablogsal has any other idea.
I've done very little work on CPython, but I do a lot of AST construction and call ast.unparse a lot in my work on Hylang, and I think this is a wart worth fixing. The real mistake was letting the user say 𝕕𝕖𝕗 = 1, but that's been legal Python syntax for a long time, so I doubt a change to that would be welcome, especially one affecting old stable versions of Python like 3.9. Python has made its bed and now it must lie in it.
I think that with a pretty small amount of code (using code-point arithmetic instead of a dictionary with every ASCII letter), I can add Unicode "escaping" of reserved words to the part of ast.unparse that renders variable names. Would a patch of this kind be welcome?
And yes, while this behavior will look strange, the only code that will parse to AST nodes that require it will be code that uses exactly the same trick.
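For illustration, the code-point arithmetic described above could be sketched roughly as follows. This is not the code from the PR: the name escape_keyword and the choice to replace only the first lowercase ASCII letter with its mathematical double-struck counterpart (the block starting at U+1D552) are assumptions made for this sketch.

```python
import keyword
import unicodedata

# U+1D552 is MATHEMATICAL DOUBLE-STRUCK SMALL A; the 26 small letters
# of that alphabet are contiguous, so plain arithmetic is enough.
DOUBLE_STRUCK_SMALL_A = 0x1D552


def escape_keyword(name):
    """Return a spelling of `name` that the parser accepts as an identifier.

    Hypothetical helper: names that are not keywords are returned unchanged.
    For keywords, the first lowercase ASCII letter is swapped for a
    double-struck letter that NFKC-normalizes back to the original.
    """
    if not keyword.iskeyword(name):
        return name
    for i, ch in enumerate(name):
        if "a" <= ch <= "z":
            mangled = chr(DOUBLE_STRUCK_SMALL_A + ord(ch) - ord("a"))
            return name[:i] + mangled + name[i + 1:]
    return name  # no lowercase letter to replace


escaped = escape_keyword("def")
print(ascii(escaped))                                   # '\U0001d555ef'
print(unicodedata.normalize("NFKC", escaped) == "def")  # True
```

Only one character needs to change, because the keyword check happens on the spelling in the source, before normalization.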
'Reserved words' include all double underscore words, like __reserved__. Using such is allowed, but we reserve the right to break such code by adding a use for the word. 'def' is a keyword. Using identifier normalization to smuggle keywords into compiled code is a clever hack. But I am not sure that there is an actionable bug anywhere.
The Unicode normalization rules are not defined by us. Changing how we use them or creating a custom normalization form is not to be done lightly.
Should ast.parse raise? The effect is the same as "globals()['𝕕𝕖𝕗']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> 𝕕𝕖𝕗", which returns 1. Should such identifiers be outlawed?
https://docs.python.org/3/reference/lexical_analysis.html#identifiers says "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC." This does not say when an identifier is compared to the keyword set, before or after normalization. Currently it is before. Changing this to after could be considered a backwards-incompatible feature change that would require a deprecation period with syntax warnings. (Do other implementations also compare before normalization?)
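The ordering described above (keyword check first, NFKC normalization second) can be observed from Python itself. A minimal sketch, assuming a CPython version that shows the behavior reported in this issue; the variable names are only for illustration:

```python
import ast
import keyword
import unicodedata

# The double-struck spelling of "def" from the original report.
source_name = "\U0001D555\U0001D556\U0001D557"
normalized = unicodedata.normalize("NFKC", source_name)

# The spelling in the source is not a keyword, but its normal form is.
print(keyword.iskeyword(source_name))  # False
print(keyword.iskeyword(normalized))   # True

# The parser accepts the name and stores the normalized form in the AST.
tree = ast.parse(source_name + " = 1")
print(tree.body[0].targets[0].id)      # def
```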
Batuhan already quoted https://docs.python.org/3/library/ast.html#ast.unparse and I mostly agree with his comments. The "would produce" part is contingent upon the result having no syntax errors, and that cannot be guaranteed. What could be done is to check every identifier against keywords and change the first character to a chosen NFKD equivalent. Although 'fixing' the ast this way would make unparse succeed in this case, there are other fixes that might also be suggested for the same reason.
Until this is done in CPython, anyone who cares could write an AST visitor to make the same change before calling unparse. Example code could be attached to this issue.
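As a rough idea of what such a visitor might look like, here is a sketch built on ast.NodeTransformer. The names KeywordEscaper, safe_unparse and _escape are made up for this example, the set of node fields it rewrites is probably incomplete (import aliases and global/nonlocal statements are not handled, for instance), and it mutates the tree in place.

```python
import ast
import keyword

DOUBLE_STRUCK_SMALL_A = 0x1D552  # MATHEMATICAL DOUBLE-STRUCK SMALL A


def _escape(name):
    """Swap the first lowercase ASCII letter of a keyword for a double-struck one."""
    if not isinstance(name, str) or not keyword.iskeyword(name):
        return name  # also covers keyword.arg being None for **kwargs
    for i, ch in enumerate(name):
        if "a" <= ch <= "z":
            mangled = chr(DOUBLE_STRUCK_SMALL_A + ord(ch) - ord("a"))
            return name[:i] + mangled + name[i + 1:]
    return name


class KeywordEscaper(ast.NodeTransformer):
    """Rewrite identifiers that are keywords so the unparsed source parses."""

    def visit_Name(self, node):
        node.id = _escape(node.id)
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)
        node.attr = _escape(node.attr)
        return node

    def visit_keyword(self, node):
        self.generic_visit(node)
        node.arg = _escape(node.arg)
        return node

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = _escape(node.arg)
        return node


def safe_unparse(tree):
    """Escape keyword-like identifiers in `tree`, then call ast.unparse."""
    return ast.unparse(KeywordEscaper().visit(tree))


tree = ast.parse("\U0001D555\U0001D556\U0001D557 = 1")
source = safe_unparse(tree)
namespace = {}
exec(source, namespace)  # no SyntaxError this time
print(namespace["def"])  # 1
```

The escaped tree no longer compares equal to the original, but parsing the unparsed source yields the original identifiers again, which is the round trip that ast.unparse documents.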
(Hilariously, I couldn't post this comment on bugs.python.org due to some kind of Unicode bug ("Edit Error: 'utf8' codec can't decode bytes in position 208-210: invalid continuation byte"), so I've rendered "\U0001D555\U0001D556\U0001D557" as "DEF" in the below.)
Thanks for clarifying the terminology re: reserved words vs. keywords.
The effect is the same as "globals()['DEF']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> DEF", which returns 1.
This doesn't quite seem to be the case, at least on Pythons 3.9 and 3.10:
>>> globals()['DEF']=1
>>> DEF
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'def' is not defined
>>> globals()['def']=1
>>> DEF
1
It looks like the dictionary interface to globals doesn't normalize like the parser does.
The PR to fix this is a few months old. Can we get it moving?
I am not inclined to accept it. I don't want to quickly reject the issue, but I want to hear the opinions of other core developers first. @pablogsal @serhiy-storchaka any feedback on this?
But then how else would the bug be fixed? Should we make 𝕕𝕖𝕗 = 1 syntactically illegal?
Names like 𝕕𝕖𝕗 are just confusing. I would raise an error if a non-ASCII name is normalized to ASCII. Is there a valid use case for them?
In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(𝕗𝕣𝕠𝕞 = obj) instead. Lisp doesn't need most reserved words in the way that C-like languages do, and our previous strategy of trying to determine what names would be Python-illegal where was error-prone.
That, in turn, would lead to fun surprises where a Hy program that was compiled into AST and run directly would work fine, but if you tried to produce a Python program for it and run that, it wouldn't work.
All "bad" identifier characters:
>>> import unicodedata
>>> ''.join(c for c in map(chr, range(0x80, 0x110000)) if ('a'+c).isidentifier() and unicodedata.normalize('NFKC', c) < '\u0080')
'ยชยบฤฒฤณฤฟลลฟวว ววววววววฑวฒวณสฐสฒสณสทสธหกหขหฃแดฌแดฎแดฐแดฑแดณแดดแดตแดถแดทแดธแดนแดบแดผแดพแดฟแตแตแตแตแตแตแตแตแตแตแตแตแตแตแตแตขแตฃแตคแตฅแถแถ แถปแบโฑโฟโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโจโชโฌโญโฏโฐโฑโณโดโนโ โ โ โ โ โ โ กโ ขโ ฃโ คโ ฅโ ฆโ งโ จโ ฉโ ชโ ซโ ฌโ ญโ ฎโ ฏโ ฐโ ฑโ ฒโ ณโ ดโ ตโ ถโ ทโ ธโ นโ บโ ปโ ผโ ฝโ พโ ฟโฑผโฑฝ๊ฒ๊ณ๊ด๏ฌ๏ฌ๏ฌ๏ฌ๏ฌ๏ฌ ๏ฌ๏ธณ๏ธด๏น๏น๏น๏ผ๏ผ๏ผ๏ผ๏ผ๏ผ๏ผ๏ผ๏ผ๏ผ๏ผก๏ผข๏ผฃ๏ผค๏ผฅ๏ผฆ๏ผง๏ผจ๏ผฉ๏ผช๏ผซ๏ผฌ๏ผญ๏ผฎ๏ผฏ๏ผฐ๏ผฑ๏ผฒ๏ผณ๏ผด๏ผต๏ผถ๏ผท๏ผธ๏ผน๏ผบ๏ผฟ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๏ฝ๐ฅ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ข๐ฅ๐ฆ๐ฉ๐ช๐ซ๐ฌ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐ป๐ฝ๐พ๐ฟ๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐ป๐ผ๐ฝ๐พ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐ฏฐ๐ฏฑ๐ฏฒ๐ฏณ๐ฏด๐ฏต๐ฏถ๐ฏท๐ฏธ๐ฏน'
In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(𝕗𝕣𝕠𝕞 = obj) instead.
It is a common problem of interoperability between different programming languages. For example, Tkinter allows you to use from_ and class_ to specify Tk options -from and -class. You can also write it as send(**{'from': obj}), but it does not look nice.
I do not think that requiring users to type send(𝕗𝕣𝕠𝕞 = obj) in their text editors is a very good idea either.
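The two ASCII-only spellings mentioned above can be sketched like this. The send function here is a hypothetical stand-in, and the trailing-underscore handling is only modeled on the from_/class_ convention; it is not Tkinter's actual code.

```python
import keyword


def send(**options):
    """Hypothetical receiver that accepts escaped option names like from_."""
    cleaned = {}
    for name, value in options.items():
        # Map from_ -> from and class_ -> class; leave other names alone.
        if name.endswith("_") and keyword.iskeyword(name[:-1]):
            name = name[:-1]
        cleaned[name] = value
    return cleaned


print(send(from_="obj"))        # {'from': 'obj'}
print(send(**{"from": "obj"}))  # {'from': 'obj'}
```

Both calls reach the function with a 'from' option without any non-ASCII identifier in the source, at the cost of the callee (or the caller) having to know about the convention.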
Yes, which I guess goes to show it's more useful for generated code (as in Hy's case, where the user just types (send :from obj) and the compiler does the Unicode-mangling) than hand-written code.
I am of the same opinion as @serhiy-storchaka: raising an error if a non-ASCII name is normalized to ASCII
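To get a feel for what such a rule would reject, the check can be approximated outside the parser with the tokenize module. This is only a sketch of the proposed rule as a standalone linter, not a proposal for how CPython would implement it; find_ascii_lookalikes is a made-up name.

```python
import io
import tokenize
import unicodedata


def find_ascii_lookalikes(source):
    """Yield (line, column, spelling, normalized) for each non-ASCII name
    whose NFKC normal form is pure ASCII."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type != tokenize.NAME or tok.string.isascii():
            continue
        normalized = unicodedata.normalize("NFKC", tok.string)
        if normalized.isascii():
            yield tok.start[0], tok.start[1], tok.string, normalized


code = "\U0001D555\U0001D556\U0001D557 = 1\ncaf\u00e9 = 2\n"
for line, col, spelling, normalized in find_ascii_lookalikes(code):
    print(f"{line}:{col}: {spelling!a} normalizes to {normalized!r}")
```

Names such as café are left alone, because their normal form still contains non-ASCII characters; only names that collapse entirely to ASCII are reported.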
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
title = 'ast.unparse produces bad code for identifiers that become keywords'
updated_at =
user = 'https://github.com/Kodiologist'
```
bugs.python.org fields:
```python
activity =
actor = 'Kodiologist'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'Kodiologist'
dependencies = []
files = []
hgrepos = []
issue_num = 46520
keywords = ['patch']
message_count = 7.0
messages = ['411606', '411669', '411742', '411780', '411781', '412045', '412078']
nosy_count = 7.0
nosy_names = ['terry.reedy', 'benjamin.peterson', 'JelleZijlstra', 'Kodiologist', 'pablogsal', 'BTaskaya', 'sobolevn']
pr_nums = ['31012']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue46520'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```