python-babel / babel

The official repository for Babel, the Python Internationalization Library
http://babel.pocoo.org/
BSD 3-Clause "New" or "Revised" License

Test failures with Python 3.12.0b1 #1005

Closed: mgorny closed this issue 9 months ago

mgorny commented 1 year ago

Overview Description

The test suite fails when run with Python 3.12.0b1:

FAILED tests/messages/test_extract.py::ExtractPythonTestCase::test_utf8_message_with_utf8_bom -   File "<string>", line 1
FAILED tests/messages/test_extract.py::ExtractPythonTestCase::test_utf8_message_with_utf8_bom_and_magic_comment -   File "<string>", line 1
FAILED tests/messages/test_extract.py::ExtractPythonTestCase::test_utf8_raw_strings_match_unicode_strings -   File "<string>", line 1
FAILED tests/messages/test_extract.py::ExtractTestCase::test_f_strings - AssertionError: assert 3 == 4
FAILED tests/messages/test_extract.py::ExtractTestCase::test_f_strings_non_utf8 - assert 0 == 1

Furthermore, tox -e py312 fails by default because of the missing distutils module (installing setuptools can work around that, but the use of distutils should be removed altogether).

Steps to Reproduce

  1. tox -e py312

Actual Results

________________________________________ ExtractPythonTestCase.test_utf8_message_with_utf8_bom ________________________________________

self = <tests.messages.test_extract.ExtractPythonTestCase testMethod=test_utf8_message_with_utf8_bom>

        def test_utf8_message_with_utf8_bom(self):
            buf = BytesIO(codecs.BOM_UTF8 + """
    # NOTE: hello
    msg = _('Bonjour à tous')
    """.encode('utf-8'))
>           messages = list(extract.extract_python(buf, ('_',), ['NOTE:'], {}))

tests/messages/test_extract.py:367: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
babel/messages/extract.py:500: in extract_python
    for tok, value, (lineno, _), _, _ in tokens:
/usr/lib/python3.12/tokenize.py:451: in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

source = "\ufeff\n# NOTE: hello\nmsg = _('Bonjour à tous')\n", extra_tokens = True

    def _generate_tokens_from_c_tokenizer(source, extra_tokens=False):
        """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
        import _tokenize as c_tokenizer
>       for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
E         File "<string>", line 1
E           
E           ^
E       SyntaxError: invalid non-printable character U+FEFF

/usr/lib/python3.12/tokenize.py:542: SyntaxError
_______________________________ ExtractPythonTestCase.test_utf8_message_with_utf8_bom_and_magic_comment _______________________________

self = <tests.messages.test_extract.ExtractPythonTestCase testMethod=test_utf8_message_with_utf8_bom_and_magic_comment>

        def test_utf8_message_with_utf8_bom_and_magic_comment(self):
            buf = BytesIO(codecs.BOM_UTF8 + """# -*- coding: utf-8 -*-
    # NOTE: hello
    msg = _('Bonjour à tous')
    """.encode('utf-8'))
>           messages = list(extract.extract_python(buf, ('_',), ['NOTE:'], {}))

tests/messages/test_extract.py:376: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
babel/messages/extract.py:500: in extract_python
    for tok, value, (lineno, _), _, _ in tokens:
/usr/lib/python3.12/tokenize.py:451: in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

source = "\ufeff# -*- coding: utf-8 -*-\n# NOTE: hello\nmsg = _('Bonjour à tous')\n", extra_tokens = True

    def _generate_tokens_from_c_tokenizer(source, extra_tokens=False):
        """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
        import _tokenize as c_tokenizer
>       for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
E         File "<string>", line 1
E           # -*- coding: utf-8 -*-
E           ^
E       SyntaxError: invalid non-printable character U+FEFF

/usr/lib/python3.12/tokenize.py:542: SyntaxError
__________________________________ ExtractPythonTestCase.test_utf8_raw_strings_match_unicode_strings __________________________________

self = <tests.messages.test_extract.ExtractPythonTestCase testMethod=test_utf8_raw_strings_match_unicode_strings>

        def test_utf8_raw_strings_match_unicode_strings(self):
            buf = BytesIO(codecs.BOM_UTF8 + """
    msg = _('Bonjour à tous')
    msgu = _(u'Bonjour à tous')
    """.encode('utf-8'))
>           messages = list(extract.extract_python(buf, ('_',), ['NOTE:'], {}))

tests/messages/test_extract.py:393: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
babel/messages/extract.py:500: in extract_python
    for tok, value, (lineno, _), _, _ in tokens:
/usr/lib/python3.12/tokenize.py:451: in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

source = "\ufeff\nmsg = _('Bonjour à tous')\nmsgu = _(u'Bonjour à tous')\n", extra_tokens = True

    def _generate_tokens_from_c_tokenizer(source, extra_tokens=False):
        """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
        import _tokenize as c_tokenizer
>       for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
E         File "<string>", line 1
E           
E           ^
E       SyntaxError: invalid non-printable character U+FEFF

/usr/lib/python3.12/tokenize.py:542: SyntaxError
___________________________________________________ ExtractTestCase.test_f_strings ____________________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings>

        def test_f_strings(self):
            buf = BytesIO(br"""
    t1 = _('foobar')
    t2 = _(f'spameggs' f'feast')  # should be extracted; constant parts only
    t2 = _(f'spameggs' 'kerroshampurilainen')  # should be extracted (mixing f with no f)
    t3 = _(f'''whoa! a '''  # should be extracted (continues on following lines)
    f'flying shark'
        '... hello'
    )
    t4 = _(f'spameggs {t1}')  # should not be extracted
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 4
E           AssertionError: assert 3 == 4
E            +  where 3 = len([(2, 'foobar', [], None), (4, 'kerroshampurilainen', [], None), (5, '... hello', [], None)])

tests/messages/test_extract.py:544: AssertionError
_______________________________________________ ExtractTestCase.test_f_strings_non_utf8 _______________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings_non_utf8>

        def test_f_strings_non_utf8(self):
            buf = BytesIO(b"""
    # -- coding: latin-1 --
    t2 = _(f'\xe5\xe4\xf6' f'\xc5\xc4\xd6')
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 1
E           assert 0 == 1
E            +  where 0 = len([])

tests/messages/test_extract.py:556: AssertionError

Expected Results

Passing tests (or at least tests passing as well as they did on Python 3.11).

Reproducibility

Always.

Additional Information

Confirmed with git commit 8b152dbe47cb830f66ad12bd3057e6128aeac072.

mgorny commented 1 year ago

I've dug into this a bit, since there were some regressions in Python 3.12's tokenizer, but this doesn't seem to be one of them. From what I can see, Babel is decoding the BOM into U+FEFF and then passing it on to generate_tokens().

Note that in Python 3.11, tokenizing a leading U+FEFF returned an ERRORTOKEN:

>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
[TokenInfo(type=60 (ERRORTOKEN), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

whereas in Python 3.12 it raises a SyntaxError:

>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/tokenize.py", line 451, in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
  File "/usr/lib/python3.12/tokenize.py", line 542, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
  File "<string>", line 1
    
    ^
SyntaxError: invalid non-printable character U+FEFF

CPython itself strips the BOM as part of encoding detection, before it starts decoding the source for tokenization. Babel probably needs to do the same.
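
A minimal sketch of that approach (the helper name and placement are hypothetical; the real change would go wherever babel/messages/extract.py decodes the input before handing it to tokenize.generate_tokens()):

def _strip_bom(source):
    # A decoded UTF-8 BOM survives as a leading U+FEFF character in the
    # str; drop it before tokenizing, mirroring CPython's own encoding
    # detection step.
    return source[1:] if source.startswith('\ufeff') else source

Alternatively, decoding the raw bytes with the utf-8-sig codec strips the BOM as part of decoding itself.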

vstinner commented 9 months ago

Are you still able to reproduce the issue with the just-released Python 3.12.0rc3? The issue was created on May 25, so I suppose Python 3.12.0 beta 1 was tested, but bugs have been fixed in the meantime.

I get different behavior with the example from https://github.com/python-babel/babel/issues/1005#issuecomment-1566238052 on Python 3.12.0rc2.

bug.py:

import io, tokenize
print(list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline)))

Output:

$ python3.11 bug.py 
[TokenInfo(type=60 (ERRORTOKEN), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

$ python3.12 bug.py 
[TokenInfo(type=1 (NAME), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

$ python3.11 -VV
Python 3.11.5 (main, Aug 28 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)]

$ python3.12 -VV
Python 3.12.0rc2 (main, Sep  6 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)]

I don't get a SyntaxError.

Using the REPL:

$ python3.12
Python 3.12.0rc2 (main, Sep  6 2023, 00:00:00) [GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize, io
>>> list(tokenize.generate_tokens(io.StringIO('\ufeff\n').readline))
[TokenInfo(type=1 (NAME), string='\ufeff', start=(1, 0), end=(1, 1), line='\ufeff\n'), TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='\ufeff\n'), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

mgorny commented 9 months ago

The first three failures seem to be gone. These two seem to remain (plus the missing setuptools dependency):

___________________________________________________ ExtractTestCase.test_f_strings ____________________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings>

        def test_f_strings(self):
            buf = BytesIO(br"""
    t1 = _('foobar')
    t2 = _(f'spameggs' f'feast')  # should be extracted; constant parts only
    t2 = _(f'spameggs' 'kerroshampurilainen')  # should be extracted (mixing f with no f)
    t3 = _(f'''whoa! a '''  # should be extracted (continues on following lines)
    f'flying shark'
        '... hello'
    )
    t4 = _(f'spameggs {t1}')  # should not be extracted
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 4
E           AssertionError: assert 3 == 4
E            +  where 3 = len([(2, 'foobar', [], None), (4, 'kerroshampurilainen', [], None), (5, '... hello', [], None)])

tests/messages/test_extract.py:544: AssertionError
_______________________________________________ ExtractTestCase.test_f_strings_non_utf8 _______________________________________________

self = <tests.messages.test_extract.ExtractTestCase testMethod=test_f_strings_non_utf8>

        def test_f_strings_non_utf8(self):
            buf = BytesIO(b"""
    # -- coding: latin-1 --
    t2 = _(f'\xe5\xe4\xf6' f'\xc5\xc4\xd6')
    """)
            messages = list(extract.extract('python', buf, extract.DEFAULT_KEYWORDS, [], {}))
>           assert len(messages) == 1
E           assert 0 == 1
E            +  where 0 = len([])

tests/messages/test_extract.py:556: AssertionError
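
For context, these two failures line up with Python 3.12's new f-string tokenization (PEP 701): an f-string is no longer emitted as a single STRING token but as FSTRING_START / FSTRING_MIDDLE / FSTRING_END tokens, so an extractor that only matches STRING tokens never sees the constant parts. A minimal illustration (my reading of the tokenizer change, not taken from Babel's code):

import io, tokenize

# Python 3.11 yields one STRING token for an f-string; Python 3.12
# (PEP 701) splits it into FSTRING_START / FSTRING_MIDDLE / FSTRING_END.
for tok in tokenize.generate_tokens(io.StringIO("f'spameggs'\n").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# 3.11: STRING "f'spameggs'", NEWLINE, ENDMARKER
# 3.12: FSTRING_START "f'", FSTRING_MIDDLE 'spameggs', FSTRING_END "'", ...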
vstinner commented 9 months ago

To make distutils available on Python 3.12, you can use this change:

diff --git a/tox.ini b/tox.ini
index 11cca0c..7c4d56a 100644
--- a/tox.ini
+++ b/tox.ini
@@ -11,6 +11,7 @@ deps =
     backports.zoneinfo;python_version<"3.9"
     tzdata;sys_platform == 'win32'
     pytz: pytz
+    setuptools;python_version>="3.12"
 allowlist_externals = make
 commands = make clean-cldr test
 setenv =
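
With this in place, setuptools is installed into the py312 tox environment; setuptools ships a distutils shim (wired up via its distutils-precedence.pth hook) that makes import distutils resolve again on 3.12, where the stdlib module was removed by PEP 632.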
encukou commented 9 months ago

Here's a PR for the f-string parsing: https://github.com/python-babel/babel/pull/1027

akx commented 9 months ago

#1027 was just merged, and we're now running CI on 3.12 too as of #1028. Thanks all! ❤️

akx commented 9 months ago

Released in https://pypi.org/project/Babel/2.13.0/ just now 🎉

oprypin commented 9 months ago

Regarding https://github.com/python-babel/babel/issues/1005#issuecomment-1728105742, adding the "setuptools" dependency only for CI was not the correct solution, because it's the package itself that depends on distutils, so other projects' CI (and actual local usage) will still break. I opened issue #1031 and a pull request accordingly.
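
One common pattern for dropping the hard dependency is a guarded import, so the setuptools re-export is used where the stdlib module is gone (a sketch only, not necessarily what the linked pull request does):

try:
    # setuptools re-exports the distutils Command machinery and works on 3.12+
    from setuptools import Command
except ImportError:
    # stdlib fallback for environments without setuptools (Python <= 3.11)
    from distutils.cmd import Command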