Closed lysnikolaou closed 4 years ago
While testing pegen, we found this out:
Python 3.9.0a5+ (heads/pegen:502dfb719e, Apr 10 2020, 20:57:05)
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> fur''
File "<stdin>", line 1
fur''
^
SyntaxError: invalid syntax
>>> eval("fur''")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 1
fur''
^
SyntaxError: unexpected EOF while parsing
After some investigating, I found out that this is happening because when we execute fur''
in the interactive interpreter there is an implicit newline, which means that EOF does not get reached.
OTOH there is no newline when passing it as a string inside an eval
call. After the tokenizer encounters the first quote, it calls tok_nextc
twice more, in order to differentiate between a single-quoted and a triple-quoted string, thus reaching EOF and proapgating an EOF error up through the perrdetail
struct.
So a slightly shorter example uses ru''. This is an error because you can't combine the r prefix and the u prefix (in fact you can't combine anything with the r prefix).
I declare that this is a bug to report EOF here and the code should *first check for a valid combination (e.g. fr'' or rf'') and only *then try to distinguish between single and triple quotes.
At the moment, if the prefix is invalid, it gets tokenized as a NAME node and the string following is tokenized as a STRING node with no prefix.
╰─ echo "ur''" | ./python -m tokenize
1,0-1,2: NAME 'ur'
1,2-1,4: STRING "''"
1,4-1,5: NEWLINE '\n'
2,0-2,0: ENDMARKER ''
╰─ echo "f''" | ./python -m tokenize
1,0-1,3: STRING "f''"
1,3-1,4: NEWLINE '\n'
2,0-2,0: ENDMARKER ''
But that is the tokenize.py module. It does not have 100% the same behavior as the builtin tokenizer.c module. In this case it probably doesn't.
In my understanding of the C code that's what the C tokenizer is doing as well.
Here's the relevant snippet of the tokenizer (https://github.com/python/cpython/blob/4b222c9491d1700e9bdd98e6889b8d0ea1c7321e/Parser/tokenizer.c#L1358): When the tokenizer sees a valid identifier start, it goes into a loop that checks for a valid combination of string prefixes. If the combination is valid and it sees a quote directly after that, it goto's to the STRING-handling code. If not, then it breaks out of the loop and returns a NAME node.
Am I missing something?
I have working code that checks if there is a quotation mark right after an identifier. Here is an example:
╰─ ./python
Python 3.9.0a5+ (heads/pegen-dirty:502dfb719e, Apr 11 2020, 14:43:12)
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ur''
File "<stdin>", line 1
ur''
^
SyntaxError: invalid string prefix
One observation about this is that it has precedence over an EOL error:
>>> ur'
File "<stdin>", line 1
ur'
^
SyntaxError: invalid string prefix
Would that be acceptable?
yes
On Sat, Apr 11, 2020 at 04:46 Lysandros Nikolaou \report@bugs.python.org\ wrote:
Lysandros Nikolaou \lisandrosnik@gmail.com\ added the comment:
I have working code that checks if there is a quotation mark right after an identifier. Here is an example:
╰─ ./python Python 3.9.0a5+ (heads/pegen-dirty:502dfb719e, Apr 11 2020, 14:43:12) [GCC 9.2.1 20191008] on linux Type "help", "copyright", "credits" or "license" for more information. >>> ur'' File "\<stdin>", line 1 ur'' ^ SyntaxError: invalid string prefix
One observation about this is that it has precedence over an EOL error:
>>> ur' File "\<stdin>", line 1 ur' ^ SyntaxError: invalid string prefix
Would that be acceptable?
----------
Python tracker \report@bugs.python.org\ \https://bugs.python.org/issue40246\
-- --Guido (mobile)
New changeset 41d5b94af44e34ac05d4cd57460ed104ccf96628 by Lysandros Nikolaou in branch 'master': bpo-40246: Report a better error message for invalid string prefixes (GH-19476) https://github.com/python/cpython/commit/41d5b94af44e34ac05d4cd57460ed104ccf96628
On bpo-40334 @vstinner reported that this change broke some code in turtledemo. The code looks like this:
print(dict(state=clear, bg="#d00" if clear else"#fca"))
Due to the absence of a space between else
and "
, the else keyword is now interpreted as an invalid string prefix. Is that acceptable?
I reopen the issue.
The following code works on Python 3.8, but fails with SyntaxError on Python 3.9.0a6 with the old and the new parser (see bpo-40431):
clear = "NORMAL"
print(dict(state=clear, bg="#d00" if clear else"#fca"))
Well, the code looks like a typo error... but it worked previously.
Not sure if we should keep the SyntaxError or not. Fixing the code is trivial: see PR 19777 attached to bpo-40431; replace >else"#fca"\< with >else "#fca"\<.
Note: I first reported the issue to https://bugs.python.org/issue40334#msg367580
I personally think the new behaviour is more consistent and more likely to catch errors.
I personally think the new behaviour is more consistent and more likely to catch errors.
Would it be possible to emit a SyntaxWarning in 3.9 and only emit a SyntaxError in 3.10?
Personally I don't think that else"stuff"
should become a syntax error. It's no different from other edge cases where for some reason the space seems optional, like 1and 2
.
I think that a linter can be very pedantic on such code. My concern is about backward compatibility.
The >else"#fca"\< code was added 6 years ago (commit 8b95d5e0bf00d9d0098579d29fd6bb9322071879) and nobody complained, it "just worked". If we keep the SyntaxError, random projects will be broken, but I don't see much benefits for the users.
In short, I concur with Guido :-)
A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b'). This would still output a nicer error message, but would not break code like the one of the example. What do you think about this?
For addressing the backwards compatibility concern, I think we should just convert it to a SyntaxWarning for cases like the above to indicate that it's not really correct syntax, but not harmful enough to justify code breakage. I think it fits the documented description of SyntaxWarning well, which is to address "dubious syntax".
Lysandros Nikolaou wrote:
A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b'). This would still output a nicer error message, but would not break code like the one of the example. What do you think about this?
That would certainly help to minimize the breakage, so I'd be in favor of that over a SyntaxError for all invalid prefixes. But, I'm not certain that it's additionally harmful if an invalid string prefix proceeds a valid one. Is there any additional harm, other than from a visual perspective?
It's no different from other edge cases where for some reason the space seems optional, like
1and 2
.
The parser is not consistent here. 0or[]
is an error, while 0and[]
and 1or[]
are valid. See https://mail.python.org/archives/list/python-dev@python.org/message/D2WPCITHG2LBQAP7DBTC6CY26WQUBAKP/
A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b').
This would not help for if'a'<=x<='z'
. And we will get more breakage when add more string prefixes.
We should either require a whitespace between an identifier and a string literal (with SyntaxWarning in meantime) or allow to omit it.
reportlab reported failures on code like:
norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
Note that or
has a r
in it.
From earlier in this issue: https://bugs.python.org/msg366164
So a slightly shorter example uses ru''. This is an error because you can't combine the r prefix and the u prefix (in fact you can't combine anything with the r prefix).
That's not true: 'rf' is a valid prefix: rf'{1}'. And so are all of the wacky combinations rF, Rf, RF, fr, fR, Fr, FR.
I think it was a typo. You cannot combine anything with the u prefix.
I think you're right, since rb obviously works, too. I just wanted to make sure we're covering all the bases. There's some code in either the stdlib or the test suite where I generate all of the valid prefixes (it's something like 80 prefixes). I can dig it up if anyone needs it.
What's also possible is to handle keywords at tokenizer level and return a different token type for each one of them (like with async and await), which is currently handled at parser level. This would enable us to allow reserved keywords right before a STRING token, thus covering all the possible broken code cases, but continue disallowing arbitrary NAMEs, which would mean better error-reporting in the case of invalid prefixes.
What's also possible is to handle keywords at tokenizer level
Sadly, the tokenizer is unaware of keywords (maybe except async and await because their history as soft keywords) as that distinction usually is up to the parser as the parser is the one who knows about the grammar. Introducing keywords in the tokenizer would couple them too much IMHO apart from other possible problems.
If it's all handled by the tokenizer, how come it's different in the new parser?
Anyway, it seems we really have to support what the old parser supported here. I don't know exactly what rules it uses. Maybe only look for a string quote following a name that can definitely be a string prefix?
If it's all handled by the tokenizer, how come it's different in the newparser?
Is not different in the new parser: both parsers have analogous behaviour now:
~/github/python/master master
❯ ./python
Python 3.9.0a6+ (heads/master:84724dd239, Apr 29 2020, 20:26:52)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
File "<stdin>", line 1
norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
^
SyntaxError: invalid string prefix
>>>
~/github/python/master master
❯ ./python -X oldparser
Python 3.9.0a6+ (heads/master:84724dd239, Apr 29 2020, 20:26:52)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
File "<stdin>", line 1
norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
^
SyntaxError: invalid string prefix
This issue is exclusively due to the changes in https://github.com/python/cpython/pull/19476 if I understand correctly.
Indeed, reverting commit 41d5b94af44e34ac05d4cd57460ed104ccf96628 makes it work with both parsers:
~/github/python/master master*
❯ ./python -X oldparser
Python 3.9.0a6+ (heads/master-dirty:84724dd239, Apr 29 2020, 20:29:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
>>>
~/github/python/master master*
❯ ./python
Python 3.9.0a6+ (heads/master-dirty:84724dd239, Apr 29 2020, 20:29:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
>>>
The original issue was about different error messages in REPL and eval(). But it is not related to string prefixes. We have the same difference without involving strings:
>>> a b
File "<stdin>", line 1
a b
^
SyntaxError: invalid syntax
>>> eval("a b")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 1
a b
^
SyntaxError: unexpected EOF while parsing
I suggest to revert this change and close the issue.
We also get:
File "/usr/lib/python3.9/site-packages/dnf/comps.py", line 111 return'.'.join(lcl) ^ SyntaxError: invalid string prefix
New changeset 846d8b28ab9bb6197ee81372820311c0abe509c0 by Lysandros Nikolaou in branch 'master': bpo-40246: Revert reporting of invalid string prefixes (GH-19888) https://github.com/python/cpython/commit/846d8b28ab9bb6197ee81372820311c0abe509c0
The revert is in. Now the question is if we want to take additional action to address the original issue of this.
I will soon come back with what Fedora package were affected by the problem. That could give some data about how to handle this.
Not that many:
cpython itself (fixed via PR) demjson (fixed via PR) asn1crypto (fixed via PR) dnf (fixed via PR) freeipa (fixed via PR)
I gave up sending PRs at this point.
waf weasyprint virt-who thrift salt wxpython4 rosdistro mne pycairo libstoragemgmt (possibly via bundled lsm/external/xmltodict) dput-ng ddupdate aubio (via bundled waf)
Lysandros:
The revert is in. Now the question is if we want to take additional action to address the original issue of this.
If someone cares of that, I suggest to open an issue in pylint, pyflakes and similar tools to emit a warning in linters.
Ok, I'm closing this, after consulting with Guido.
The following change broke test_fstring when using the old parser. I reopen the issue.
commit 846d8b28ab9bb6197ee81372820311c0abe509c0 (refs/bisect/bad) Author: Lysandros Nikolaou \lisandrosnik@gmail.com\ Date: Mon May 4 14:32:18 2020 +0300
bpo-40246: Revert reporting of invalid string prefixes (GH-19888)
Due to backwards compatibility concerns regarding keywords immediately followed by a string without whitespace between them (like in `bg="#d00" if clear else"#fca"`) will fail to parse,
commit 41d5b94af44e34ac05d4cd57460ed104ccf96628 has to be reverted.
$ ./python -X oldparser -m test -v test_fstring
(...)
====================================================================== FAIL: test_invalid_string_prefixes (test.test_fstring.TestCase) (str='BF""') ---------------------------------------------------------------------- File "\<string>", line 1 BF"" ^ SyntaxError: invalid syntax
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vstinner/python/master/Lib/test/test_fstring.py", line 29, in assertAllRaise
eval(str)
AssertionError: "unexpected EOF while parsing" does not match "invalid syntax (<string>, line 1)"
(...)
The error can be seen on the new cool AMD64 Arch Linux VintageParser 3.x, vintage is the new cool: https://buildbot.python.org/all/#builders/648/builds/185
We should run the tests with the old parser in at least one build.
The >else"#fca"\< code was added 6 years ago (commit 8b95d5e0bf00d9d0098579d29fd6bb9322071879)
That was my typo in turtledemo.__main__ (similar lines before had the space) and I wish that it had been caught then.
Why did compileall in the test suite not catch this SyntaxError before 3.6.0a6 was released? Does it intentionally skip .__main__ files? Should I add a test_turtledemo file (with checks for the presence of tkinter and turtle)?
"make install" ignores compileall errors no? If I recall correctly, only an error is logged into stdlib, "make install" doesn't fail.
New changeset 6cb0ad20396116b5076a58b05b55286d6d5e0c94 by Lysandros Nikolaou in branch '3.9': bpo-40246: Fix test_fstring when run with the old parser (GH-20402) https://github.com/python/cpython/commit/6cb0ad20396116b5076a58b05b55286d6d5e0c94
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at =
created_at =
labels = ['interpreter-core', 'type-bug', '3.9']
title = 'Different error messages for same error - invalid string prefixes'
updated_at =
user = 'https://github.com/lysnikolaou'
```
bugs.python.org fields:
```python
activity =
actor = 'lys.nikolaou'
assignee = 'none'
closed = True
closed_date =
closer = 'lys.nikolaou'
components = ['Interpreter Core']
creation =
creator = 'lys.nikolaou'
dependencies = []
files = []
hgrepos = []
issue_num = 40246
keywords = ['patch']
message_count = 41.0
messages = ['366143', '366146', '366164', '366165', '366187', '366197', '366202', '366215', '366260', '367586', '367587', '367588', '367591', '367592', '367597', '367599', '367611', '367628', '367649', '367658', '367660', '367664', '367672', '367689', '367690', '367691', '367693', '367698', '367977', '368038', '368039', '368040', '368045', '368062', '368191', '368919', '368920', '368925', '369077', '369217', '369931']
nosy_count = 10.0
nosy_names = ['gvanrossum', 'terry.reedy', 'vstinner', 'eric.smith', 'petr.viktorin', 'serhiy.storchaka', 'hroncok', 'lys.nikolaou', 'pablogsal', 'aeros']
pr_nums = ['19476', '19888', '20402']
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40246'
versions = ['Python 3.9']
```