python / cpython

The Python programming language
https://www.python.org
Other
63.35k stars 30.34k forks source link

Different error messages for same error - invalid string prefixes #84427

Closed lysnikolaou closed 4 years ago

lysnikolaou commented 4 years ago
BPO 40246
Nosy @gvanrossum, @terryjreedy, @vstinner, @ericvsmith, @encukou, @serhiy-storchaka, @hroncok, @lysnikolaou, @pablogsal, @aeros
PRs
  • python/cpython#19476
  • python/cpython#19888
  • python/cpython#20402
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core', 'type-bug', '3.9'] title = 'Different error messages for same error - invalid string prefixes' updated_at = user = 'https://github.com/lysnikolaou' ``` bugs.python.org fields: ```python activity = actor = 'lys.nikolaou' assignee = 'none' closed = True closed_date = closer = 'lys.nikolaou' components = ['Interpreter Core'] creation = creator = 'lys.nikolaou' dependencies = [] files = [] hgrepos = [] issue_num = 40246 keywords = ['patch'] message_count = 41.0 messages = ['366143', '366146', '366164', '366165', '366187', '366197', '366202', '366215', '366260', '367586', '367587', '367588', '367591', '367592', '367597', '367599', '367611', '367628', '367649', '367658', '367660', '367664', '367672', '367689', '367690', '367691', '367693', '367698', '367977', '368038', '368039', '368040', '368045', '368062', '368191', '368919', '368920', '368925', '369077', '369217', '369931'] nosy_count = 10.0 nosy_names = ['gvanrossum', 'terry.reedy', 'vstinner', 'eric.smith', 'petr.viktorin', 'serhiy.storchaka', 'hroncok', 'lys.nikolaou', 'pablogsal', 'aeros'] pr_nums = ['19476', '19888', '20402'] priority = 'normal' resolution = 'wont fix' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue40246' versions = ['Python 3.9'] ```

    lysnikolaou commented 4 years ago

    While testing pegen, we found this out:

    Python 3.9.0a5+ (heads/pegen:502dfb719e, Apr 10 2020, 20:57:05) 
    [GCC 9.2.1 20191008] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> fur''
      File "<stdin>", line 1
        fur''
           ^
    SyntaxError: invalid syntax
    >>> eval("fur''")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        fur''
           ^
    SyntaxError: unexpected EOF while parsing
    lysnikolaou commented 4 years ago

    After some investigating, I found out that this is happening because when we execute fur'' in the interactive interpreter there is an implicit newline, which means that EOF does not get reached.

    OTOH there is no newline when passing it as a string inside an eval call. After the tokenizer encounters the first quote, it calls tok_nextc twice more, in order to differentiate between a single-quoted and a triple-quoted string, thus reaching EOF and proapgating an EOF error up through the perrdetail struct.

    gvanrossum commented 4 years ago

    So a slightly shorter example uses ru''. This is an error because you can't combine the r prefix and the u prefix (in fact you can't combine anything with the r prefix).

    I declare that this is a bug to report EOF here and the code should *first check for a valid combination (e.g. fr'' or rf'') and only *then try to distinguish between single and triple quotes.

    lysnikolaou commented 4 years ago

    At the moment, if the prefix is invalid, it gets tokenized as a NAME node and the string following is tokenized as a STRING node with no prefix.

    ╰─ echo "ur''" | ./python -m tokenize 1,0-1,2: NAME 'ur'
    1,2-1,4: STRING "''"
    1,4-1,5: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''

    ╰─ echo "f''" | ./python -m tokenize
    1,0-1,3: STRING "f''"
    1,3-1,4: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''

    gvanrossum commented 4 years ago

    But that is the tokenize.py module. It does not have 100% the same behavior as the builtin tokenizer.c module. In this case it probably doesn't.

    lysnikolaou commented 4 years ago

    In my understanding of the C code that's what the C tokenizer is doing as well.

    Here's the relevant snippet of the tokenizer (https://github.com/python/cpython/blob/4b222c9491d1700e9bdd98e6889b8d0ea1c7321e/Parser/tokenizer.c#L1358): When the tokenizer sees a valid identifier start, it goes into a loop that checks for a valid combination of string prefixes. If the combination is valid and it sees a quote directly after that, it goto's to the STRING-handling code. If not, then it breaks out of the loop and returns a NAME node.

    Am I missing something?

    lysnikolaou commented 4 years ago

    I have working code that checks if there is a quotation mark right after an identifier. Here is an example:

    ╰─ ./python
    Python 3.9.0a5+ (heads/pegen-dirty:502dfb719e, Apr 11 2020, 14:43:12) 
    [GCC 9.2.1 20191008] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> ur''
      File "<stdin>", line 1
        ur''
         ^
    SyntaxError: invalid string prefix

    One observation about this is that it has precedence over an EOL error:

    >>> ur'
      File "<stdin>", line 1
        ur'
         ^
    SyntaxError: invalid string prefix

    Would that be acceptable?

    gvanrossum commented 4 years ago

    yes

    On Sat, Apr 11, 2020 at 04:46 Lysandros Nikolaou \report@bugs.python.org\ wrote:

    Lysandros Nikolaou \lisandrosnik@gmail.com\ added the comment:

    I have working code that checks if there is a quotation mark right after an identifier. Here is an example:

    ╰─ ./python Python 3.9.0a5+ (heads/pegen-dirty:502dfb719e, Apr 11 2020, 14:43:12) [GCC 9.2.1 20191008] on linux Type "help", "copyright", "credits" or "license" for more information. >>> ur'' File "\<stdin>", line 1 ur'' ^ SyntaxError: invalid string prefix

    One observation about this is that it has precedence over an EOL error:

    >>> ur' File "\<stdin>", line 1 ur' ^ SyntaxError: invalid string prefix

    Would that be acceptable?

    ----------


    Python tracker \report@bugs.python.org\ \https://bugs.python.org/issue40246\


    -- --Guido (mobile)

    pablogsal commented 4 years ago

    New changeset 41d5b94af44e34ac05d4cd57460ed104ccf96628 by Lysandros Nikolaou in branch 'master': bpo-40246: Report a better error message for invalid string prefixes (GH-19476) https://github.com/python/cpython/commit/41d5b94af44e34ac05d4cd57460ed104ccf96628

    lysnikolaou commented 4 years ago

    On bpo-40334 @vstinner reported that this change broke some code in turtledemo. The code looks like this:

    print(dict(state=clear, bg="#d00" if clear else"#fca"))

    Due to the absence of a space between else and ", the else keyword is now interpreted as an invalid string prefix. Is that acceptable?

    vstinner commented 4 years ago

    I reopen the issue.

    The following code works on Python 3.8, but fails with SyntaxError on Python 3.9.0a6 with the old and the new parser (see bpo-40431):

        clear = "NORMAL"
        print(dict(state=clear, bg="#d00" if clear else"#fca"))

    Well, the code looks like a typo error... but it worked previously.

    Not sure if we should keep the SyntaxError or not. Fixing the code is trivial: see PR 19777 attached to bpo-40431; replace >else"#fca"\< with >else "#fca"\<.

    Note: I first reported the issue to https://bugs.python.org/issue40334#msg367580

    pablogsal commented 4 years ago

    I personally think the new behaviour is more consistent and more likely to catch errors.

    vstinner commented 4 years ago

    I personally think the new behaviour is more consistent and more likely to catch errors.

    Would it be possible to emit a SyntaxWarning in 3.9 and only emit a SyntaxError in 3.10?

    gvanrossum commented 4 years ago

    Personally I don't think that else"stuff" should become a syntax error. It's no different from other edge cases where for some reason the space seems optional, like 1and 2.

    vstinner commented 4 years ago

    I think that a linter can be very pedantic on such code. My concern is about backward compatibility.

    The >else"#fca"\< code was added 6 years ago (commit 8b95d5e0bf00d9d0098579d29fd6bb9322071879) and nobody complained, it "just worked". If we keep the SyntaxError, random projects will be broken, but I don't see much benefits for the users.

    In short, I concur with Guido :-)

    lysnikolaou commented 4 years ago

    A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b'). This would still output a nicer error message, but would not break code like the one of the example. What do you think about this?

    aeros commented 4 years ago

    For addressing the backwards compatibility concern, I think we should just convert it to a SyntaxWarning for cases like the above to indicate that it's not really correct syntax, but not harmful enough to justify code breakage. I think it fits the documented description of SyntaxWarning well, which is to address "dubious syntax".

    Lysandros Nikolaou wrote:

    A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b'). This would still output a nicer error message, but would not break code like the one of the example. What do you think about this?

    That would certainly help to minimize the breakage, so I'd be in favor of that over a SyntaxError for all invalid prefixes. But, I'm not certain that it's additionally harmful if an invalid string prefix proceeds a valid one. Is there any additional harm, other than from a visual perspective?

    serhiy-storchaka commented 4 years ago

    It's no different from other edge cases where for some reason the space seems optional, like 1and 2.

    The parser is not consistent here. 0or[] is an error, while 0and[] and 1or[] are valid. See https://mail.python.org/archives/list/python-dev@python.org/message/D2WPCITHG2LBQAP7DBTC6CY26WQUBAKP/

    A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b').

    This would not help for if'a'<=x<='z'. And we will get more breakage when add more string prefixes.

    We should either require a whitespace between an identifier and a string literal (with SyntaxWarning in meantime) or allow to omit it.

    encukou commented 4 years ago

    reportlab reported failures on code like:

    norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')

    Note that or has a r in it.

    ericvsmith commented 4 years ago

    From earlier in this issue: https://bugs.python.org/msg366164

    So a slightly shorter example uses ru''. This is an error because you can't combine the r prefix and the u prefix (in fact you can't combine anything with the r prefix).

    That's not true: 'rf' is a valid prefix: rf'{1}'. And so are all of the wacky combinations rF, Rf, RF, fr, fR, Fr, FR.

    serhiy-storchaka commented 4 years ago

    I think it was a typo. You cannot combine anything with the u prefix.

    ericvsmith commented 4 years ago

    I think you're right, since rb obviously works, too. I just wanted to make sure we're covering all the bases. There's some code in either the stdlib or the test suite where I generate all of the valid prefixes (it's something like 80 prefixes). I can dig it up if anyone needs it.

    lysnikolaou commented 4 years ago

    What's also possible is to handle keywords at tokenizer level and return a different token type for each one of them (like with async and await), which is currently handled at parser level. This would enable us to allow reserved keywords right before a STRING token, thus covering all the possible broken code cases, but continue disallowing arbitrary NAMEs, which would mean better error-reporting in the case of invalid prefixes.

    pablogsal commented 4 years ago

    What's also possible is to handle keywords at tokenizer level

    Sadly, the tokenizer is unaware of keywords (maybe except async and await because their history as soft keywords) as that distinction usually is up to the parser as the parser is the one who knows about the grammar. Introducing keywords in the tokenizer would couple them too much IMHO apart from other possible problems.

    gvanrossum commented 4 years ago

    If it's all handled by the tokenizer, how come it's different in the new parser?

    Anyway, it seems we really have to support what the old parser supported here. I don't know exactly what rules it uses. Maybe only look for a string quote following a name that can definitely be a string prefix?

    pablogsal commented 4 years ago

    If it's all handled by the tokenizer, how come it's different in the newparser?

    Is not different in the new parser: both parsers have analogous behaviour now:

    ~/github/python/master master
    ❯ ./python
    Python 3.9.0a6+ (heads/master:84724dd239, Apr 29 2020, 20:26:52)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
      File "<stdin>", line 1
        norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
                                             ^
    SyntaxError: invalid string prefix
    >>>
    
    ~/github/python/master master
    ❯ ./python -X oldparser
    Python 3.9.0a6+ (heads/master:84724dd239, Apr 29 2020, 20:26:52)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
      File "<stdin>", line 1
        norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
                                           ^
    SyntaxError: invalid string prefix

    This issue is exclusively due to the changes in https://github.com/python/cpython/pull/19476 if I understand correctly.

    pablogsal commented 4 years ago

    Indeed, reverting commit 41d5b94af44e34ac05d4cd57460ed104ccf96628 makes it work with both parsers:

    ~/github/python/master master*
    ❯ ./python -X oldparser
    Python 3.9.0a6+ (heads/master-dirty:84724dd239, Apr 29 2020, 20:29:53)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
    >>>
    
    ~/github/python/master master*
    ❯ ./python
    Python 3.9.0a6+ (heads/master-dirty:84724dd239, Apr 29 2020, 20:29:53)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
    >>>
    serhiy-storchaka commented 4 years ago

    The original issue was about different error messages in REPL and eval(). But it is not related to string prefixes. We have the same difference without involving strings:

    >>> a b
      File "<stdin>", line 1
        a b
          ^
    SyntaxError: invalid syntax
    >>> eval("a b")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        a b
          ^
    SyntaxError: unexpected EOF while parsing

    I suggest to revert this change and close the issue.

    bc75918c-a209-4fa3-b6cf-28cfb7317f76 commented 4 years ago

    We also get:

    File "/usr/lib/python3.9/site-packages/dnf/comps.py", line 111 return'.'.join(lcl) ^ SyntaxError: invalid string prefix

    pablogsal commented 4 years ago

    New changeset 846d8b28ab9bb6197ee81372820311c0abe509c0 by Lysandros Nikolaou in branch 'master': bpo-40246: Revert reporting of invalid string prefixes (GH-19888) https://github.com/python/cpython/commit/846d8b28ab9bb6197ee81372820311c0abe509c0

    lysnikolaou commented 4 years ago

    The revert is in. Now the question is if we want to take additional action to address the original issue of this.

    bc75918c-a209-4fa3-b6cf-28cfb7317f76 commented 4 years ago

    I will soon come back with what Fedora package were affected by the problem. That could give some data about how to handle this.

    bc75918c-a209-4fa3-b6cf-28cfb7317f76 commented 4 years ago

    Not that many:

    cpython itself (fixed via PR) demjson (fixed via PR) asn1crypto (fixed via PR) dnf (fixed via PR) freeipa (fixed via PR)

    I gave up sending PRs at this point.

    waf weasyprint virt-who thrift salt wxpython4 rosdistro mne pycairo libstoragemgmt (possibly via bundled lsm/external/xmltodict) dput-ng ddupdate aubio (via bundled waf)

    vstinner commented 4 years ago

    Lysandros:

    The revert is in. Now the question is if we want to take additional action to address the original issue of this.

    If someone cares of that, I suggest to open an issue in pylint, pyflakes and similar tools to emit a warning in linters.

    lysnikolaou commented 4 years ago

    Ok, I'm closing this, after consulting with Guido.

    vstinner commented 4 years ago

    The following change broke test_fstring when using the old parser. I reopen the issue.

    commit 846d8b28ab9bb6197ee81372820311c0abe509c0 (refs/bisect/bad) Author: Lysandros Nikolaou \lisandrosnik@gmail.com\ Date: Mon May 4 14:32:18 2020 +0300

    bpo-40246: Revert reporting of invalid string prefixes (GH-19888)
    
    Due to backwards compatibility concerns regarding keywords immediately followed by a string without whitespace between them (like in `bg="#d00" if clear else"#fca"`) will fail to parse,
    commit 41d5b94af44e34ac05d4cd57460ed104ccf96628 has to be reverted.
    $ ./python -X oldparser -m test -v test_fstring 
    (...)

    ====================================================================== FAIL: test_invalid_string_prefixes (test.test_fstring.TestCase) (str='BF""') ---------------------------------------------------------------------- File "\<string>", line 1 BF"" ^ SyntaxError: invalid syntax

    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/vstinner/python/master/Lib/test/test_fstring.py", line 29, in assertAllRaise
        eval(str)
    AssertionError: "unexpected EOF while parsing" does not match "invalid syntax (<string>, line 1)"
    (...)
    vstinner commented 4 years ago

    The error can be seen on the new cool AMD64 Arch Linux VintageParser 3.x, vintage is the new cool: https://buildbot.python.org/all/#builders/648/builds/185

    gvanrossum commented 4 years ago

    We should run the tests with the old parser in at least one build.

    terryjreedy commented 4 years ago

    The >else"#fca"\< code was added 6 years ago (commit 8b95d5e0bf00d9d0098579d29fd6bb9322071879)

    That was my typo in turtledemo.__main__ (similar lines before had the space) and I wish that it had been caught then.

    Why did compileall in the test suite not catch this SyntaxError before 3.6.0a6 was released? Does it intentionally skip .__main__ files? Should I add a test_turtledemo file (with checks for the presence of tkinter and turtle)?

    vstinner commented 4 years ago

    "make install" ignores compileall errors no? If I recall correctly, only an error is logged into stdlib, "make install" doesn't fail.

    pablogsal commented 4 years ago

    New changeset 6cb0ad20396116b5076a58b05b55286d6d5e0c94 by Lysandros Nikolaou in branch '3.9': bpo-40246: Fix test_fstring when run with the old parser (GH-20402) https://github.com/python/cpython/commit/6cb0ad20396116b5076a58b05b55286d6d5e0c94