python / cpython

The Python programming language
https://www.python.org
Other
62.98k stars 30.16k forks source link

ast.unparse produces bad code for identifiers that become keywords #90678

Open 1d44e9fe-3e56-4bea-8daf-72265b4ffc8d opened 2 years ago

1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago
BPO 46520
Nosy @terryjreedy, @benjaminp, @JelleZijlstra, @Kodiologist, @pablogsal, @isidentical, @sobolevn
PRs
  • python/cpython#31012
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'library', '3.9', '3.10', '3.11'] title = 'ast.unparse produces bad code for identifiers that become keywords' updated_at = user = 'https://github.com/Kodiologist' ``` bugs.python.org fields: ```python activity = actor = 'Kodiologist' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'Kodiologist' dependencies = [] files = [] hgrepos = [] issue_num = 46520 keywords = ['patch'] message_count = 7.0 messages = ['411606', '411669', '411742', '411780', '411781', '412045', '412078'] nosy_count = 7.0 nosy_names = ['terry.reedy', 'benjamin.peterson', 'JelleZijlstra', 'Kodiologist', 'pablogsal', 'BTaskaya', 'sobolevn'] pr_nums = ['31012'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue46520' versions = ['Python 3.9', 'Python 3.10', 'Python 3.11'] ```

    1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

    This works:

    ๐••๐•–๐•— = 1

    This raises SyntaxError:

        import ast
        exec(ast.unparse(ast.parse("๐••๐•–๐•— = 1")))

    It looks like ast.parse creates a Name node with id='def', which is correct per PEP-3131, but ast.unparse doesn't know it needs to mangle the output somehow, as "๐••๐•–๐•—" or a similar Unicode replacement.

    sobolevn commented 2 years ago

    I can confirm that it happens on all versions from 3.9 to 3.11 (main).

    Python 3.9.9 (main, Dec 21 2021, 11:35:28) 
    [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ast
    >>> ast.unparse(ast.parse("๐••๐•–๐•— = 1"))
    'def = 1'
    >>> exec(ast.unparse(ast.parse("๐••๐•–๐•— = 1")))  # SyntaxError
    Python 3.11.0a4+ (heads/main-dirty:ef3ef6fa43, Jan 20 2022, 20:48:25) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ast
    >>> ast.unparse(ast.parse("๐••๐•–๐•— = 1"))
    'def = 1'
    >>> exec(ast.unparse(ast.parse("๐••๐•–๐•— = 1")))  # SyntaxError
    isidentical commented 2 years ago

    Technically, this is a bug on the fact that it breaks the only guarantee of ast.unparse:

    Unparse an ast.AST object and generate a string with code that would produce an equivalent ast.AST object if parsed back with ast.parse().

    But I am not really sure if it should be handled at all, since we don't have access to the original form of the identifier in the AST due to the parser's normalization behavior.

    If we want to only create a source that would give the same AST, abusing the fact that original keywords are always basic ASCII we could embed a map of characters that convert ASCII 'a', 'b', 'c', ... to their most similar unicode versions (https://util.unicode.org/UnicodeJsps/confusables.jsp). But I feel like this is a terrible idea, with no possible gain (very limited use case) and very prone to a lot of confusions.

    I think just adding a warning to the documentation regarding this should be the definite resolution, unless @pablogsal has any other idea.

    1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

    I've done very little work on CPython, but I do a lot of AST construction and call ast.unparse a lot in my work on Hylang, and I think this is a wart worth fixing. The real mistake was letting the user say ๐••๐•–๐•— = 1, but that's been legal Python syntax for a long time, so I doubt a change to that would be welcome, especially one affecting old stable versions of Python like 3.9. Python has made its bed and now it must lie in it.

    I think that with a pretty small amount of code (using code-point arithmetic instead of a dictionary with every ASCII letter), I can add Unicode "escaping" of reserved words to the part of ast.unparse that renders variable names. Would a patch of this kind be welcome?

    1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

    And yes, while this behavior will look strange, the only code that will parse to AST nodes that require it will be code that uses exactly the same trick.

    terryjreedy commented 2 years ago

    'Reserved words' include all double underscore words, like __reserved__. Using such is allowed, but we reserve the right to break such code by adding a use for the word. 'def' is a keyword. Using identifier normalization to smuggle keywords into compiled code is a clever hack. But I am not sure that there is an actionable bug anywhere.

    The Unicode normalization rules are not defined by us. Changing how we use them or creating a custom normalization form is not to be done lightly.

    Should ast.parse raise?  The effect is the same as "globals()['๐••๐•–๐•—']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> ๐••๐•–๐•—", which returns 1.  Should such identifiers be outlawed?

    https://docs.python.org/3/reference/lexical_analysis.html#identifiers says "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC." This does not say when an identifier is compared to the keyword set, before or after normalization. Currently is it before. Changing this to after could be considered a backwards-incompatible feature change that would require a deprecation period with syntax warnings. (Do other implementations also compare before normalization?)

    Batuhan already quoted https://docs.python.org/3/library/ast.html#ast.unparse and I mostly agree with his comments. The "would produce" part is contingent upon the result having no syntax errors, and that cannot be guaranteed. What could be done is to check every identifier against keywords and change the first character to a chosen NFKD equivalent. Although 'fixing' the ast this way would make unparse seem to work better succeed in this case, there are other fixes that might also be suggested for the same reason.

    Until this is done in CPython, anyone who cares could write an AST visitor to make the same change before calling unparse. Example code could be attached to this issue.

    1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

    (Hilariously, I couldn't post this comment on bugs.python.org due to some kind of Unicode bug ("Edit Error: 'utf8' codec can't decode bytes in position 208-210: invalid continuation byte"), so I've rendered "\U0001D555\U0001D556\U0001D557" as "DEF" in the below.)

    Thanks for clarifying the terminology re: reserved words vs. keywords.

    The effect is the same as "globals()['DEF']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> DEF", which returns 1.

    This doesn't quite seem to be the case, at least on Pythons 3.9 and 3.10:

        >>> globals()['DEF']=1
        >>> DEF
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        NameError: name 'def' is not defined
        >>> globals()['def']=1
        >>> DEF
        1

    It looks the dictionary interface to globals doesn't normalize like the parser does.

    Kodiologist commented 2 years ago

    The PR to fix this is a few months old. Can we get it moving?

    isidentical commented 2 years ago

    I am not inclined to accept it. I don't want to quickly reject the issue, but want to listen the opinions of other core developers before. @pablogsal @serhiy-storchaka any feedback on this?

    Kodiologist commented 2 years ago

    But then how else would the bug be fixed? Should we make ๐••๐•–๐•— = 1 syntactically illegal?

    serhiy-storchaka commented 2 years ago

    Names like ๐••๐•–๐•— are just confusing. I would raise an error if a non-ASCII name is normalized to ASCII. Is there a valid use case for them?

    Kodiologist commented 2 years ago

    In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(๐•—๐•ฃ๐• ๐•ž = obj) instead. Lisp doesn't need most reserved words in the way that C-like languages do, and our previous strategy of trying to determine what names would be Python-illegal where was error-prone.

    Kodiologist commented 2 years ago

    That, in turn, would lead to fun surprises where a Hy program that was compiled into AST and run directly would work fine, but if you tried to produce a Python program for it and run that, it wouldn't work.

    serhiy-storchaka commented 2 years ago

    All "bad" identifier characters:

    >>> ''.join(c for c in map(chr, range(0x80, 0x110000)) if ('a'+c).isidentifier() and unicodedata.normalize('NFKC', c) < '\u0080')

    'ยชยบฤฒฤณฤฟล€ลฟว„ว…ว†ว‡วˆว‰วŠว‹วŒวฑวฒวณสฐสฒสณสทสธหกหขหฃแดฌแดฎแดฐแดฑแดณแดดแดตแดถแดทแดธแดนแดบแดผแดพแดฟแต€แตแต‚แตƒแต‡แตˆแต‰แตแตแตแต’แต–แต—แต˜แต›แตขแตฃแตคแตฅแถœแถ แถปแบšโฑโฟโ‚โ‚‘โ‚’โ‚“โ‚•โ‚–โ‚—โ‚˜โ‚™โ‚šโ‚›โ‚œโ„‚โ„Šโ„‹โ„Œโ„โ„Žโ„โ„‘โ„’โ„“โ„•โ„™โ„šโ„›โ„œโ„โ„คโ„จโ„ชโ„ฌโ„ญโ„ฏโ„ฐโ„ฑโ„ณโ„ดโ„นโ……โ…†โ…‡โ…ˆโ…‰โ… โ…กโ…ขโ…ฃโ…คโ…ฅโ…ฆโ…งโ…จโ…ฉโ…ชโ…ซโ…ฌโ…ญโ…ฎโ…ฏโ…ฐโ…ฑโ…ฒโ…ณโ…ดโ…ตโ…ถโ…ทโ…ธโ…นโ…บโ…ปโ…ผโ…ฝโ…พโ…ฟโฑผโฑฝ๊Ÿฒ๊Ÿณ๊Ÿด๏ฌ€๏ฌ๏ฌ‚๏ฌƒ๏ฌ„๏ฌ…๏ฌ†๏ธณ๏ธด๏น๏นŽ๏น๏ผ๏ผ‘๏ผ’๏ผ“๏ผ”๏ผ•๏ผ–๏ผ—๏ผ˜๏ผ™๏ผก๏ผข๏ผฃ๏ผค๏ผฅ๏ผฆ๏ผง๏ผจ๏ผฉ๏ผช๏ผซ๏ผฌ๏ผญ๏ผฎ๏ผฏ๏ผฐ๏ผฑ๏ผฒ๏ผณ๏ผด๏ผต๏ผถ๏ผท๏ผธ๏ผน๏ผบ๏ผฟ๏ฝ๏ฝ‚๏ฝƒ๏ฝ„๏ฝ…๏ฝ†๏ฝ‡๏ฝˆ๏ฝ‰๏ฝŠ๏ฝ‹๏ฝŒ๏ฝ๏ฝŽ๏ฝ๏ฝ๏ฝ‘๏ฝ’๏ฝ“๏ฝ”๏ฝ•๏ฝ–๏ฝ—๏ฝ˜๏ฝ™๏ฝš๐žฅ๐€๐๐‚๐ƒ๐„๐…๐†๐‡๐ˆ๐‰๐Š๐‹๐Œ๐๐Ž๐๐๐‘๐’๐“๐”๐•๐–๐—๐˜๐™๐š๐›๐œ๐๐ž๐Ÿ๐ ๐ก๐ข๐ฃ๐ค๐ฅ๐ฆ๐ง๐จ๐ฉ๐ช๐ซ๐ฌ๐ญ๐ฎ๐ฏ๐ฐ๐ฑ๐ฒ๐ณ๐ด๐ต๐ถ๐ท๐ธ๐น๐บ๐ป๐ผ๐ฝ๐พ๐ฟ๐‘€๐‘๐‘‚๐‘ƒ๐‘„๐‘…๐‘†๐‘‡๐‘ˆ๐‘‰๐‘Š๐‘‹๐‘Œ๐‘๐‘Ž๐‘๐‘๐‘‘๐‘’๐‘“๐‘”๐‘–๐‘—๐‘˜๐‘™๐‘š๐‘›๐‘œ๐‘๐‘ž๐‘Ÿ๐‘ ๐‘ก๐‘ข๐‘ฃ๐‘ค๐‘ฅ๐‘ฆ๐‘ง๐‘จ๐‘ฉ๐‘ช๐‘ซ๐‘ฌ๐‘ญ๐‘ฎ๐‘ฏ๐‘ฐ๐‘ฑ๐‘ฒ๐‘ณ๐‘ด๐‘ต๐‘ถ๐‘ท๐‘ธ๐‘น๐‘บ๐‘ป๐‘ผ๐‘ฝ๐‘พ๐‘ฟ๐’€๐’๐’‚๐’ƒ๐’„๐’…๐’†๐’‡๐’ˆ๐’‰๐’Š๐’‹๐’Œ๐’๐’Ž๐’๐’๐’‘๐’’๐’“๐’”๐’•๐’–๐’—๐’˜๐’™๐’š๐’›๐’œ๐’ž๐’Ÿ๐’ข๐’ฅ๐’ฆ๐’ฉ๐’ช๐’ซ๐’ฌ๐’ฎ๐’ฏ๐’ฐ๐’ฑ๐’ฒ๐’ณ๐’ด๐’ต๐’ถ๐’ท๐’ธ๐’น๐’ป๐’ฝ๐’พ๐’ฟ๐“€๐“๐“‚๐“ƒ๐“…๐“†๐“‡๐“ˆ๐“‰๐“Š๐“‹๐“Œ๐“๐“Ž๐“๐“๐“‘๐“’๐““๐“”๐“•๐“–๐“—๐“˜๐“™๐“š๐“›๐“œ๐“๐“ž๐“Ÿ๐“ ๐“ก๐“ข๐“ฃ๐“ค๐“ฅ๐“ฆ๐“ง๐“จ๐“ฉ๐“ช๐“ซ๐“ฌ๐“ญ๐“ฎ๐“ฏ๐“ฐ๐“ฑ๐“ฒ๐“ณ๐“ด๐“ต๐“ถ๐“ท๐“ธ๐“น๐“บ๐“ป๐“ผ๐“ฝ๐“พ๐“ฟ๐”€๐”๐”‚๐”ƒ๐”„๐”…๐”‡๐”ˆ๐”‰๐”Š๐”๐”Ž๐”๐”๐”‘๐”’๐”“๐””๐”–๐”—๐”˜๐”™๐”š๐”›๐”œ๐”ž๐”Ÿ๐” ๐”ก๐”ข๐”ฃ๐”ค๐”ฅ๐”ฆ๐”ง๐”จ๐”ฉ๐”ช๐”ซ๐”ฌ๐”ญ๐”ฎ๐”ฏ๐”ฐ๐”ฑ๐”ฒ๐”ณ๐”ด๐”ต๐”ถ๐”ท๐”ธ๐”น๐”ป๐”ผ๐”ฝ๐”พ๐•€๐•๐•‚๐•ƒ๐•„๐•†๐•Š๐•‹๐•Œ๐•๐•Ž๐•๐•๐•’๐•“๐•”๐••๐•–๐•—๐•˜๐•™๐•š๐•›๐•œ๐•๐•ž๐•Ÿ๐• ๐•ก๐•ข๐•ฃ๐•ค๐•ฅ๐•ฆ๐•ง๐•จ๐•ฉ๐•ช๐•ซ๐•ฌ๐•ญ๐•ฎ๐•ฏ๐•ฐ๐•ฑ๐•ฒ๐•ณ๐•ด๐•ต๐•ถ๐•ท๐•ธ๐•น๐•บ๐•ป๐•ผ๐•ฝ๐•พ๐•ฟ๐–€๐–๐–‚๐–ƒ๐–„๐–…๐–†๐–‡๐–ˆ๐–‰๐–Š๐–‹๐–Œ๐–๐–Ž๐–๐–๐–‘๐–’๐–“๐–”๐–•๐––๐–—๐–˜๐–™๐–š๐–›๐–œ๐–๐–ž๐–Ÿ๐– ๐–ก๐–ข๐–ฃ๐–ค๐–ฅ๐–ฆ๐–ง๐–จ๐–ฉ๐–ช๐–ซ๐–ฌ๐–ญ๐–ฎ๐–ฏ๐–ฐ๐–ฑ๐–ฒ๐–ณ๐–ด๐–ต๐–ถ๐–ท๐–ธ๐–น๐–บ๐–ป๐–ผ๐–ฝ๐–พ๐–ฟ๐—€๐—๐—‚๐—ƒ๐—„๐—…๐—†๐—‡๐—ˆ๐—‰๐—Š๐—‹๐—Œ๐—๐—Ž๐—๐—๐—‘๐—’๐—“๐—”๐—•๐—–๐——๐—˜๐—™๐—š๐—›๐—œ๐—๐—ž๐—Ÿ๐— ๐—ก๐—ข๐—ฃ๐—ค๐—ฅ๐—ฆ๐—ง๐—จ๐—ฉ๐—ช๐—ซ๐—ฌ๐—ญ๐—ฎ๐—ฏ๐—ฐ๐—ฑ๐—ฒ๐—ณ๐—ด๐—ต๐—ถ๐—ท๐—ธ๐—น๐—บ๐—ป๐—ผ๐—ฝ๐—พ๐—ฟ๐˜€๐˜๐˜‚๐˜ƒ๐˜„๐˜…๐˜†๐˜‡๐˜ˆ๐˜‰๐˜Š๐˜‹๐˜Œ๐˜๐˜Ž๐˜๐˜๐˜‘๐˜’๐˜“๐˜”๐˜•๐˜–๐˜—๐˜˜๐˜™๐˜š๐˜›๐˜œ๐˜๐˜ž๐˜Ÿ๐˜ ๐˜ก๐˜ข๐˜ฃ๐˜ค๐˜ฅ๐˜ฆ๐˜ง๐˜จ๐˜ฉ๐˜ช๐˜ซ๐˜ฌ๐˜ญ๐˜ฎ๐˜ฏ๐˜ฐ๐˜ฑ๐˜ฒ๐˜ณ๐˜ด๐˜ต๐˜ถ๐˜ท๐˜ธ๐˜น๐˜บ๐˜ป๐˜ผ๐˜ฝ๐˜พ๐˜ฟ๐™€๐™๐™‚๐™ƒ๐™„๐™…๐™†๐™‡๐™ˆ๐™‰๐™Š๐™‹๐™Œ๐™๐™Ž๐™๐™๐™‘๐™’๐™“๐™”๐™•๐™–๐™—๐™˜๐™™๐™š๐™›๐™œ๐™๐™ž๐™Ÿ๐™ ๐™ก๐™ข๐™ฃ๐™ค๐™ฅ๐™ฆ๐™ง๐™จ๐™ฉ๐™ช๐™ซ๐™ฌ๐™ญ๐™ฎ๐™ฏ๐™ฐ๐™ฑ๐™ฒ๐™ณ๐™ด๐™ต๐™ถ๐™ท๐™ธ๐™น๐™บ๐™ป๐™ผ๐™ฝ๐™พ๐™ฟ๐š€๐š๐š‚๐šƒ๐š„๐š…๐š†๐š‡๐šˆ๐š‰๐šŠ๐š‹๐šŒ๐š๐šŽ๐š๐š๐š‘๐š’๐š“๐š”๐š•๐š–๐š—๐š˜๐š™๐šš๐š›๐šœ๐š๐šž๐šŸ๐š ๐šก๐šข๐šฃ๐ŸŽ๐Ÿ๐Ÿ๐Ÿ‘๐Ÿ’๐Ÿ“๐Ÿ”๐Ÿ•๐Ÿ–๐Ÿ—๐Ÿ˜๐Ÿ™๐Ÿš๐Ÿ›๐Ÿœ๐Ÿ๐Ÿž๐ŸŸ๐Ÿ ๐Ÿก๐Ÿข๐Ÿฃ๐Ÿค๐Ÿฅ๐Ÿฆ๐Ÿง๐Ÿจ๐Ÿฉ๐Ÿช๐Ÿซ๐Ÿฌ๐Ÿญ๐Ÿฎ๐Ÿฏ๐Ÿฐ๐Ÿฑ๐Ÿฒ๐Ÿณ๐Ÿด๐Ÿต๐Ÿถ๐Ÿท๐Ÿธ๐Ÿน๐Ÿบ๐Ÿป๐Ÿผ๐Ÿฝ๐Ÿพ๐Ÿฟ๐Ÿฏฐ๐Ÿฏฑ๐Ÿฏฒ๐Ÿฏณ๐Ÿฏด๐Ÿฏต๐Ÿฏถ๐Ÿฏท๐Ÿฏธ๐Ÿฏน'

    serhiy-storchaka commented 2 years ago

    In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(๐•—๐•ฃ๐• ๐•ž = obj) instead.

    It is a common problem of interoperability between different programming languages. For example, Tkinter allows you to use from_ and class_ to specify Tk options -from and -class. You can also write it as send(**{'from': obj}), but it does not looks nice.

    I do not think that requiring users to type send(๐•—๐•ฃ๐• ๐•ž = obj) in their text editors is a very good idea too.

    Kodiologist commented 2 years ago

    Yes, which I guess comes to show it's more useful for generated code (as in Hy's case, where the user just types (send :from obj) and the compiler does the Unicode-mangling) than hand-written code.

    pablogsal commented 2 years ago

    I am of the same opinion as @serhiy-storchaka: raising an error if a non-ASCII name is normalized to ASCII