mgorny closed this issue 9 months ago.
I've dug in a bit, since there were some regressions in Python 3.12's tokenizer, but this doesn't seem to be one of them. From what I can see, Babel is decoding the BOM into U+FEFF and then passing it into generate_tokens().
Note that in Python 3.11 this returned ERRORTOKEN:
```pycon
>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
[TokenInfo(type=60 (ERRORTOKEN), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```
whereas in Python 3.12 it raises a SyntaxError:
```pycon
>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/tokenize.py", line 451, in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
  File "/usr/lib/python3.12/tokenize.py", line 542, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
  File "<string>", line 1
    ^
SyntaxError: invalid non-printable character U+FEFF
```
CPython itself strips the BOM as part of encoding detection, before it starts decoding the source for tokenization. Babel probably needs to do the same.
Are you still able to reproduce the issue with the just-released Python 3.12.0rc3? The issue was created on May 25, so I assume Python 3.12.0 beta 1 was tested, but bugs have been fixed in the meantime.
I get different behavior with the https://github.com/python-babel/babel/issues/1005#issuecomment-1566238052 example on Python 3.12.0rc2.
bug.py:

```python
import io, tokenize
print(list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline)))
```
Output:

```console
$ python3.11 bug.py
[TokenInfo(type=60 (ERRORTOKEN), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
$ python3.12 bug.py
[TokenInfo(type=1 (NAME), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
$ python3.11 -VV
Python 3.11.5 (main, Aug 28 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)]
$ python3.12 -VV
Python 3.12.0rc2 (main, Sep 6 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)]
```
I don't get a SyntaxError.
Using the REPL:
```pycon
$ python3.12
Python 3.12.0rc2 (main, Sep 6 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize, io
>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
[TokenInfo(type=1 (NAME), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```
The first three failures seem to be gone. These two seem to remain (plus the missing setuptools dependency):
```
___________________________________________________ ExtractTestCase.test_f_strings ____________________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings>

    def test_f_strings(self):
        buf = BytesIO(br"""
t1 = _('foobar')
t2 = _(f'spameggs' f'feast') # should be extracted; constant parts only
t2 = _(f'spameggs' 'kerroshampurilainen') # should be extracted (mixing f with no f)
t3 = _(f'''whoa! a ''' # should be extracted (continues on following lines)
f'flying shark'
'... hello'
)
t4 = _(f'spameggs {t1}') # should not be extracted
""")
        messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>       assert len(messages) == 4
E       AssertionError: assert 3 == 4
E        +  where 3 = len([(2, 'foobar', [], None), (4, 'kerroshampurilainen', [], None), (5, '... hello', [], None)])

tests/messages/test_extract.py:544: AssertionError
_______________________________________________ ExtractTestCase.test_f_strings_non_utf8 _______________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings_non_utf8>

    def test_f_strings_non_utf8(self):
        buf = BytesIO(b"""
# -*- coding: latin-1 -*-
t2 = _(f'\xe5\xe4\xf6' f'\xc5\xc4\xd6')
""")
        messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>       assert len(messages) == 1
E       assert 0 == 1
E        +  where 0 = len([])

tests/messages/test_extract.py:556: AssertionError
```
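The remaining two failures likely trace back to PEP 701 in Python 3.12, where f-strings are no longer emitted as a single STRING token but as FSTRING_START / FSTRING_MIDDLE / FSTRING_END sequences, so an extractor matching only STRING tokens misses the constant parts. A small version-agnostic probe (not Babel code) shows which scheme the running interpreter uses:

```python
import io
import tokenize

# Tokenize a trivial f-string assignment and list the token type names.
src = "t = f'spameggs'\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
# Python <= 3.11: the f-string appears as one 'STRING' token.
# Python >= 3.12: it appears as 'FSTRING_START', 'FSTRING_MIDDLE', 'FSTRING_END'.
```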
To make distutils available on Python 3.12 (where it is provided by setuptools), you can use this change:
```diff
diff --git a/tox.ini b/tox.ini
index 11cca0c..7c4d56a 100644
--- a/tox.ini
+++ b/tox.ini
@@ -11,6 +11,7 @@ deps =
     backports.zoneinfo;python_version<"3.9"
     tzdata;sys_platform == 'win32'
     pytz: pytz
+    setuptools;python_version>="3.12"
 allowlist_externals = make
 commands = make clean-cldr test
 setenv =
```
Here's a PR for the f-string parsing: https://github.com/python-babel/babel/pull/1027
Released in https://pypi.org/project/Babel/2.13.0/ just now 🎉
Regarding https://github.com/python-babel/babel/issues/1005#issuecomment-1728105742, adding the "setuptools" dependency only for CI was not the correct solution, because it's the package itself that depends on it, so CI of other projects (and actual local usages) will still break. I opened issue #1031 and a pull request accordingly.
Overview Description
The test suite fails when run with Python 3.12.0b1:
Furthermore, `tox -e py312` fails by default because of the missing distutils module (installing setuptools can work around that, but distutils use should be removed altogether).

Steps to Reproduce

```console
tox -e py312
```
Actual Results
Expected Results
Passing tests (or at least passing as well as py3.11 did).
Reproducibility
Always.
Additional Information
Confirmed with git 8b152dbe47cb830f66ad12bd3057e6128aeac072.