ast.unparse produces bad code for identifiers that become keywords

1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

BPO	46520
Nosy	@terryjreedy, @benjaminp, @JelleZijlstra, @Kodiologist, @pablogsal, @isidentical, @sobolevn
PRs	python/cpython#31012

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'library', '3.9', '3.10', '3.11'] title = 'ast.unparse produces bad code for identifiers that become keywords' updated_at = user = 'https://github.com/Kodiologist' ``` bugs.python.org fields: ```python activity = actor = 'Kodiologist' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'Kodiologist' dependencies = [] files = [] hgrepos = [] issue_num = 46520 keywords = ['patch'] message_count = 7.0 messages = ['411606', '411669', '411742', '411780', '411781', '412045', '412078'] nosy_count = 7.0 nosy_names = ['terry.reedy', 'benjamin.peterson', 'JelleZijlstra', 'Kodiologist', 'pablogsal', 'BTaskaya', 'sobolevn'] pr_nums = ['31012'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue46520' versions = ['Python 3.9', 'Python 3.10', 'Python 3.11'] ```

1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

This works:

𝕕𝕖𝕗 = 1

This raises SyntaxError:

    import ast
    exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))

It looks like ast.parse creates a Name node with id='def', which is correct per PEP-3131, but ast.unparse doesn't know it needs to mangle the output somehow, as "𝕕𝕖𝕗" or a similar Unicode replacement.

sobolevn commented 2 years ago

I can confirm that it happens on all versions from 3.9 to 3.11 (main).

Python 3.9.9 (main, Dec 21 2021, 11:35:28) 
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
'def = 1'
>>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))  # SyntaxError

Python 3.11.0a4+ (heads/main-dirty:ef3ef6fa43, Jan 20 2022, 20:48:25) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
'def = 1'
>>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))  # SyntaxError

isidentical commented 2 years ago

Technically, this is a bug on the fact that it breaks the only guarantee of ast.unparse:

Unparse an ast.AST object and generate a string with code that would produce an equivalent ast.AST object if parsed back with ast.parse().

But I am not really sure if it should be handled at all, since we don't have access to the original form of the identifier in the AST due to the parser's normalization behavior.

If we want to only create a source that would give the same AST, abusing the fact that original keywords are always basic ASCII we could embed a map of characters that convert ASCII 'a', 'b', 'c', ... to their most similar unicode versions (https://util.unicode.org/UnicodeJsps/confusables.jsp). But I feel like this is a terrible idea, with no possible gain (very limited use case) and very prone to a lot of confusions.

I think just adding a warning to the documentation regarding this should be the definite resolution, unless @pablogsal has any other idea.

1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

I've done very little work on CPython, but I do a lot of AST construction and call ast.unparse a lot in my work on Hylang, and I think this is a wart worth fixing. The real mistake was letting the user say 𝕕𝕖𝕗 = 1, but that's been legal Python syntax for a long time, so I doubt a change to that would be welcome, especially one affecting old stable versions of Python like 3.9. Python has made its bed and now it must lie in it.

I think that with a pretty small amount of code (using code-point arithmetic instead of a dictionary with every ASCII letter), I can add Unicode "escaping" of reserved words to the part of ast.unparse that renders variable names. Would a patch of this kind be welcome?

1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

And yes, while this behavior will look strange, the only code that will parse to AST nodes that require it will be code that uses exactly the same trick.

terryjreedy commented 2 years ago

'Reserved words' include all double underscore words, like __reserved__. Using such is allowed, but we reserve the right to break such code by adding a use for the word. 'def' is a keyword. Using identifier normalization to smuggle keywords into compiled code is a clever hack. But I am not sure that there is an actionable bug anywhere.

The Unicode normalization rules are not defined by us. Changing how we use them or creating a custom normalization form is not to be done lightly.

Should ast.parse raise?  The effect is the same as "globals()['𝕕𝕖𝕗']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> 𝕕𝕖𝕗", which returns 1.  Should such identifiers be outlawed?

https://docs.python.org/3/reference/lexical_analysis.html#identifiers says "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC." This does not say when an identifier is compared to the keyword set, before or after normalization. Currently is it before. Changing this to after could be considered a backwards-incompatible feature change that would require a deprecation period with syntax warnings. (Do other implementations also compare before normalization?)

Batuhan already quoted https://docs.python.org/3/library/ast.html#ast.unparse and I mostly agree with his comments. The "would produce" part is contingent upon the result having no syntax errors, and that cannot be guaranteed. What could be done is to check every identifier against keywords and change the first character to a chosen NFKD equivalent. Although 'fixing' the ast this way would make unparse seem to work better succeed in this case, there are other fixes that might also be suggested for the same reason.

Until this is done in CPython, anyone who cares could write an AST visitor to make the same change before calling unparse. Example code could be attached to this issue.

1d44e9fe-3e56-4bea-8daf-72265b4ffc8d commented 2 years ago

(Hilariously, I couldn't post this comment on bugs.python.org due to some kind of Unicode bug ("Edit Error: 'utf8' codec can't decode bytes in position 208-210: invalid continuation byte"), so I've rendered "\U0001D555\U0001D556\U0001D557" as "DEF" in the below.)

Thanks for clarifying the terminology re: reserved words vs. keywords.

The effect is the same as "globals()['DEF']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> DEF", which returns 1.

This doesn't quite seem to be the case, at least on Pythons 3.9 and 3.10:

    >>> globals()['DEF']=1
    >>> DEF
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'def' is not defined
    >>> globals()['def']=1
    >>> DEF
    1

It looks the dictionary interface to globals doesn't normalize like the parser does.

Kodiologist commented 2 years ago

The PR to fix this is a few months old. Can we get it moving?

isidentical commented 2 years ago

I am not inclined to accept it. I don't want to quickly reject the issue, but want to listen the opinions of other core developers before. @pablogsal @serhiy-storchaka any feedback on this?

Kodiologist commented 2 years ago

But then how else would the bug be fixed? Should we make 𝕕𝕖𝕗 = 1 syntactically illegal?

serhiy-storchaka commented 2 years ago

Names like 𝕕𝕖𝕗 are just confusing. I would raise an error if a non-ASCII name is normalized to ASCII. Is there a valid use case for them?

Kodiologist commented 2 years ago

In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(𝕗𝕣𝕠𝕞 = obj) instead. Lisp doesn't need most reserved words in the way that C-like languages do, and our previous strategy of trying to determine what names would be Python-illegal where was error-prone.

Kodiologist commented 2 years ago

That, in turn, would lead to fun surprises where a Hy program that was compiled into AST and run directly would work fine, but if you tried to produce a Python program for it and run that, it wouldn't work.

serhiy-storchaka commented 2 years ago

All "bad" identifier characters:

>>> ''.join(c for c in map(chr, range(0x80, 0x110000)) if ('a'+c).isidentifier() and unicodedata.normalize('NFKC', c) < '\u0080')

'ªºĲĳĿŀſǄǅǆǇǈǉǊǋǌǱǲǳʰʲʳʷʸˡˢˣᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁᵂᵃᵇᵈᵉᵍᵏᵐᵒᵖᵗᵘᵛᵢᵣᵤᵥᶜᶠᶻẚⁱⁿₐₑₒₓₕₖₗₘₙₚₛₜℂℊℋℌℍℎℐℑℒℓℕℙℚℛℜℝℤℨKℬℭℯℰℱℳℴℹⅅⅆⅇⅈⅉⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿⱼⱽꟲꟳꟴﬀﬁﬂﬃﬄﬅﬆ︳︴﹍﹎﹏０１２３４５６７８９ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ＿ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ𐞥𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐴𝐵𝐶𝐷𝐸𝐹𝐺𝐻𝐼𝐽𝐾𝐿𝑀𝑁𝑂𝑃𝑄𝑅𝑆𝑇𝑈𝑉𝑊𝑋𝑌𝑍𝑎𝑏𝑐𝑑𝑒𝑓𝑔𝑖𝑗𝑘𝑙𝑚𝑛𝑜𝑝𝑞𝑟𝑠𝑡𝑢𝑣𝑤𝑥𝑦𝑧𝑨𝑩𝑪𝑫𝑬𝑭𝑮𝑯𝑰𝑱𝑲𝑳𝑴𝑵𝑶𝑷𝑸𝑹𝑺𝑻𝑼𝑽𝑾𝑿𝒀𝒁𝒂𝒃𝒄𝒅𝒆𝒇𝒈𝒉𝒊𝒋𝒌𝒍𝒎𝒏𝒐𝒑𝒒𝒓𝒔𝒕𝒖𝒗𝒘𝒙𝒚𝒛𝒜𝒞𝒟𝒢𝒥𝒦𝒩𝒪𝒫𝒬𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵𝒶𝒷𝒸𝒹𝒻𝒽𝒾𝒿𝓀𝓁𝓂𝓃𝓅𝓆𝓇𝓈𝓉𝓊𝓋𝓌𝓍𝓎𝓏𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣𝓤𝓥𝓦𝓧𝓨𝓩𝓪𝓫𝓬𝓭𝓮𝓯𝓰𝓱𝓲𝓳𝓴𝓵𝓶𝓷𝓸𝓹𝓺𝓻𝓼𝓽𝓾𝓿𝔀𝔁𝔂𝔃𝔄𝔅𝔇𝔈𝔉𝔊𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔𝔖𝔗𝔘𝔙𝔚𝔛𝔜𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷𝔸𝔹𝔻𝔼𝔽𝔾𝕀𝕁𝕂𝕃𝕄𝕆𝕊𝕋𝕌𝕍𝕎𝕏𝕐𝕒𝕓𝕔𝕕𝕖𝕗𝕘𝕙𝕚𝕛𝕜𝕝𝕞𝕟𝕠𝕡𝕢𝕣𝕤𝕥𝕦𝕧𝕨𝕩𝕪𝕫𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟𝖠𝖡𝖢𝖣𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹𝖺𝖻𝖼𝖽𝖾𝖿𝗀𝗁𝗂𝗃𝗄𝗅𝗆𝗇𝗈𝗉𝗊𝗋𝗌𝗍𝗎𝗏𝗐𝗑𝗒𝗓𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡𝘢𝘣𝘤𝘥𝘦𝘧𝘨𝘩𝘪𝘫𝘬𝘭𝘮𝘯𝘰𝘱𝘲𝘳𝘴𝘵𝘶𝘷𝘸𝘹𝘺𝘻𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕𝙖𝙗𝙘𝙙𝙚𝙛𝙜𝙝𝙞𝙟𝙠𝙡𝙢𝙣𝙤𝙥𝙦𝙧𝙨𝙩𝙪𝙫𝙬𝙭𝙮𝙯𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉𝚊𝚋𝚌𝚍𝚎𝚏𝚐𝚑𝚒𝚓𝚔𝚕𝚖𝚗𝚘𝚙𝚚𝚛𝚜𝚝𝚞𝚟𝚠𝚡𝚢𝚣𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹'

serhiy-storchaka commented 2 years ago

In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(𝕗𝕣𝕠𝕞 = obj) instead.

It is a common problem of interoperability between different programming languages. For example, Tkinter allows you to use from_ and class_ to specify Tk options -from and -class. You can also write it as send(**{'from': obj}), but it does not looks nice.

I do not think that requiring users to type send(𝕗𝕣𝕠𝕞 = obj) in their text editors is a very good idea too.

Kodiologist commented 2 years ago

Yes, which I guess comes to show it's more useful for generated code (as in Hy's case, where the user just types (send :from obj) and the compiler does the Unicode-mangling) than hand-written code.

pablogsal commented 2 years ago

I am of the same opinion as @serhiy-storchaka: raising an error if a non-ASCII name is normalized to ASCII

python / cpython

ast.unparse produces bad code for identifiers that become keywords #90678