python / cpython

The Python programming language
https://www.python.org

test_re.py fails #45950

Closed aa2c5943-8264-4a78-97ed-7013d2cb52f6 closed 16 years ago

aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago
BPO 1609
Nosy @gvanrossum, @loewis, @amauryfa
Files
  • test2.py
  • test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = 'https://github.com/loewis'
    closed_at =
    created_at =
    labels = ['invalid', 'type-bug', 'tests']
    title = 'test_re.py fails'
    updated_at =
    user = 'https://bugs.python.org/donmez'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'donmez'
    assignee = 'loewis'
    closed = True
    closed_date =
    closer = 'gvanrossum'
    components = ['Tests']
    creation =
    creator = 'donmez'
    dependencies = []
    files = ['9005', '9006']
    hgrepos = []
    issue_num = 1609
    keywords = []
    message_count = 34.0
    messages = ['58527', '58542', '58548', '58553', '58556', '58559', '58565', '58585', '58587', '58639', '58700', '58824', '58825', '58826', '58830', '58831', '58832', '58833', '58834', '58835', '58837', '58843', '58844', '58847', '58848', '58849', '58862', '58869', '58884', '58887', '58888', '58890', '58927', '58928']
    nosy_count = 4.0
    nosy_names = ['gvanrossum', 'loewis', 'amaury.forgeotdarc', 'donmez']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1609'
    versions = ['Python 2.5']
    ```

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Using python 2.5 revision 59479 from release25-maint branch,

    [~/python-2.5]> LD_LIBRARY_PATH=/home/cartman/python-2.5: ./python ./Lib/test/test_re.py
    test_anyall (__main__.ReTests) ... ok
    test_basic_re_sub (__main__.ReTests) ... ok
    test_bigcharset (__main__.ReTests) ... ok
    test_bug_113254 (__main__.ReTests) ... ok
    test_bug_1140 (__main__.ReTests) ... ok
    test_bug_114660 (__main__.ReTests) ... ok
    test_bug_117612 (__main__.ReTests) ... ok
    test_bug_418626 (__main__.ReTests) ... ok
    test_bug_448951 (__main__.ReTests) ... ok
    test_bug_449000 (__main__.ReTests) ... ok
    test_bug_449964 (__main__.ReTests) ... ok
    test_bug_462270 (__main__.ReTests) ... ok
    test_bug_527371 (__main__.ReTests) ... ok
    test_bug_545855 (__main__.ReTests) ... ok
    test_bug_581080 (__main__.ReTests) ... ok
    test_bug_612074 (__main__.ReTests) ... ok
    test_bug_725106 (__main__.ReTests) ... ok
    test_bug_725149 (__main__.ReTests) ... ok
    test_bug_764548 (__main__.ReTests) ... ok
    test_bug_817234 (__main__.ReTests) ... ok
    test_bug_926075 (__main__.ReTests) ... ok
    test_bug_931848 (__main__.ReTests) ... ok
    test_category (__main__.ReTests) ... ok
    test_constants (__main__.ReTests) ... ok
    test_empty_array (__main__.ReTests) ... ok
    test_expand (__main__.ReTests) ... ok
    test_finditer (__main__.ReTests) ... ok
    test_flags (__main__.ReTests) ... ok
    test_getattr (__main__.ReTests) ... ok
    test_getlower (__main__.ReTests) ... ok
    test_groupdict (__main__.ReTests) ... ok
    test_ignore_case (__main__.ReTests) ... ok
    test_non_consuming (__main__.ReTests) ... ok
    test_not_literal (__main__.ReTests) ... ok
    test_pickling (__main__.ReTests) ... ok
    test_qualified_re_split (__main__.ReTests) ... ok
    test_qualified_re_sub (__main__.ReTests) ... ok
    test_re_escape (__main__.ReTests) ... ok
    test_re_findall (__main__.ReTests) ... ok
    test_re_groupref (__main__.ReTests) ... ok
    test_re_groupref_exists (__main__.ReTests) ... ok
    test_re_match (__main__.ReTests) ... ok
    test_re_split (__main__.ReTests) ... ok
    test_re_subn (__main__.ReTests) ... ok
    test_repeat_minmax (__main__.ReTests) ... ok
    test_scanner (__main__.ReTests) ... ok
    test_search_coverage (__main__.ReTests) ... ok
    test_search_star_plus (__main__.ReTests) ... ok
    test_special_escapes (__main__.ReTests) ... ok
    test_sre_character_class_literals (__main__.ReTests) ... ok
    test_sre_character_literals (__main__.ReTests) ... ok
    test_stack_overflow (__main__.ReTests) ... ok
    test_sub_template_numeric_escape (__main__.ReTests) ... ok
    test_symbolic_refs (__main__.ReTests) ... ok
    test_weakref (__main__.ReTests) ... ok

    ----------------------------------------------------------------------
    Ran 55 tests in 0.194s

    OK
    Running re_tests test suite
    === Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
    === Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4')

    gvanrossum commented 16 years ago

    Can't reproduce.

    Like before, what platform, compiler etc.? Does using ./configure --with-pydebug make a difference? What's the LD_LIBRARY_PATH for?

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    gcc 4.3, Linux 2.6.18, 32bit.

    Without LD_LIBRARY_PATH it would use the system libraries rather than the freshly compiled ones, which is not what I want.

    The configure line used is (damn, I forgot to specify this before, sorry):

    --with-fpectl \
    --enable-shared \
    --enable-ipv6 \
    --with-threads \
    --enable-unicode=ucs4 \
    --with-wctype-functions

    --enable-pydebug doesn't help.

    gvanrossum commented 16 years ago

    > Without LD_LIBRARY_PATH it would use the system libraries and not the compiled ones which anyway is not wanted.

    What system libraries?

    Does it make a difference if you don't specify either of

    --enable-unicode=ucs4 \ --with-wctype-functions

    ?

    Is GCC 4.3 released yet?

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    > What system libraries?

    libpython2.5.so.1.0; this is a shared-library build after all.

    > Does it make a difference if you don't specify either of
    >
    > --enable-unicode=ucs4 \ --with-wctype-functions

    Removing --with-wctype-functions fixes the issue.

    > Is GCC 4.3 released yet?

    Not yet, but soon. It's less buggy than 4.1 and 4.2 at the moment.

    amauryfa commented 16 years ago

    >> Is GCC 4.3 released yet?

    > Not yet but soon, it's less buggy compared to 4.1 and 4.2 at the moment.

    Not quite yet, gcc 4.3 had a big inlining bug that was just corrected two weeks ago: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434 You may have encountered this bug, or another similar one...

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    > Not quite yet, gcc 4.3 had a big inlining bug that was just corrected two weeks ago: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434 You may have encountered this bug, or another similar one...

    Two weeks ago is too old for me; I am using an SVN snapshot from yesterday :-)

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    In total, removing --with-wctype-functions fixes the following regression tests:

    test_codecs, test_re, test_ucn, test_unicodedata

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Remove test_ucn from the list; it still fails, but that is for another bug report.

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Any ideas/comments on how to move forward with this?

    Thanks, ismail

    gvanrossum commented 16 years ago

    Focus on how using --with-wctype-functions changes things and how this could affect the regex implementation. (I wouldn't be surprised if the other failing tests were due to the regex bugs.)

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    The Python README says --with-wctype-functions is deprecated and will be removed in Python 2.6, so I don't think it's worth fixing now. Also, test failures with --with-wctype-functions seem to be known, according to Google.

    What I wonder is whether removing --with-wctype-functions causes any regressions under the Turkish locale. I will do some research on that.

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Indeed, there seem to be regressions:

    Python 2.4 :

    [~]> python
    Python 2.4.4 (#1, Oct 23 2007, 11:25:50)
    [GCC 3.4.6] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL,"")
    'tr_TR.UTF-8'
    >>> print unicode("ıııı")
    ıııı
    >>> print unicode("ıııı").upper()
    IIII
    >>> print unicode("iiiii").upper()
    İİİİİ
    >>> print unicode("İİİİİ").lower()
    iiiii
    >>> print unicode("IIIIIII").lower()
    ııııııı

    Python 2.5 (incorrect) :

    >>> import locale
    >>> locale.setlocale(locale.LC_ALL,"")
    'tr_TR.UTF-8'
    >>> print unicode("iiiii").upper()
    IIIII
    >>> print unicode("ıııı").upper()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
    ordinal not in range(128)
    >>> print unicode("iiii").upper()
    IIII

    Looks like wctypes should not be dropped.

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    The situation is even more complicated. The following calls behave _correctly_ when wctypes is enabled:

    >>> print unicode("iiiii").upper()
    İİİİİ
    >>> print unicode("IIII").lower()
    ıııı

    The following don't work even if wctypes is enabled:

    >>> print unicode("ıııı").upper()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
    ordinal not in range(128)
    >>> print unicode("İİİİİ").lower()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
    ordinal not in range(128)

    All four of these calls work fine in Python 2.4 when wctypes is enabled.

    gvanrossum commented 16 years ago

    Martin, can you have a look at this?

    Cartman, can you produce a unittest for the correct behavior that only uses ASCII input (using \u.... instead of just typing Turkish characters)?

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    The test works fine when using the \u syntax. You have to call unicode() on literal Turkish characters to get the error. See the attached test2.py.

    With Python 2.4:

    [~]> python test2.py
    Following should print I
    I
    Following should print i
    i

    With Python 2.5 SVN:

    [~/python-2.5]> ./python ~/test2.py
    Following should print I
    Got a unicode decode error
    Following should print i
    Got a unicode decode error

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    So in conclusion,

    The attached test.py tests the Turkish corner cases of lower()/upper(). The correct output is what Python 2.4 gives:

    Following should print I
    I
    Following should print i
    i
    Following should print İ
    İ
    Following should print ı
    ı

    gvanrossum commented 16 years ago

    Hm. The test2.py file, when I download it, contains the two bytes "\xc4\xb1" in the first unicode() call, and "\xc4\xb0" in the second one. This is *always* supposed to produce a UnicodeDecodeError, since it would use the default encoding which is ASCII. So I don't understand how you get this to pass with 2.4 at all.

    When you replace the arguments with these hex escapes, does it still pass for you? Or does that break it?
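The decoding behaviour described in this comment can be reproduced in isolation. The following is only a minimal sketch, assuming Python 3, where converting bytes to text always requires an explicit decode step. The variable names are illustrative and are not taken from test2.py.

```python
# The two byte sequences reported in test2.py.
# They are the UTF-8 encodings of U+0131 and U+0130.
dotless_i_bytes = b"\xc4\xb1"
dotted_I_bytes = b"\xc4\xb0"

# Decoding as ASCII has to fail, because 0xc4 is outside range(128).
try:
    dotless_i_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the actual encoding makes the conversion succeed.
dotless_i = dotless_i_bytes.decode("utf-8")
dotted_I = dotted_I_bytes.decode("utf-8")

print(ascii_failed)
print(hex(ord(dotless_i)), hex(ord(dotted_I)))
```

This mirrors the later suggestion in the thread to pass the encoding explicitly, as in unicode("....", "utf-8").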

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Replacing Turkish characters with hex versions in test2.py still results in UnicodeDecodeError and works with python 2.4.

    gvanrossum commented 16 years ago

    > Replacing Turkish characters with hex versions in test2.py still results in UnicodeDecodeError and works with python 2.4.

    I'm hoping Martin can confirm this, but I suspect that this is due to a tightening of the rules for converting from 8-bit strings to unicode strings.

    What happens if you change to unicode("....", "utf-8")?

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    OK, that was because we had modified the default encoding in Lib/site.py to be utf-8. Sorry!

    The only problem left is that the last two conversions in test.py give wrong results when wctypes is disabled, that is:

    print u"\u0069".upper()

    should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)

    print u"\u0049".lower()

    should give \u0131 (LATIN SMALL LETTER DOTLESS I)

    These transformations work fine with python2.5 when --with-wctype-functions is used.

    gvanrossum commented 16 years ago

    > print u"\u0069".upper()
    >
    > should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
    >
    > print u"\u0049".lower()
    >
    > should give \u0131 (LATIN SMALL LETTER DOTLESS I)
    >
    > These transformations work fine with python2.5 when --with-wctype-functions is used.

    I think that is rather a bug in the wctype functions. Those are ASCII letters 'i' and 'I' and their upper/lower versions are fixed by the Unicode standard to be the corresponding ASCII letters ('I' and 'i'). The Unicode case conversions are not affected by locale.
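The locale-independent mappings described in this comment can be checked directly. This is a small sketch assuming Python 3, where str is the Unicode type. It does not call locale.setlocale(), since a Turkish locale may not be installed on a given machine, and the variable names are illustrative only.

```python
import unicodedata

# Default case mappings come from the Unicode database, not from the locale.
upper_i = "i".upper()              # ASCII 'i' maps to ASCII 'I'
lower_I = "I".lower()              # ASCII 'I' maps to ASCII 'i'
upper_dotless = "\u0131".upper()   # dotless i also uppercases to ASCII 'I'

print(upper_i, lower_I, upper_dotless)

# The two Turkish-specific code points discussed in the thread.
print(unicodedata.name("\u0130"))
print(unicodedata.name("\u0131"))
```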

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    But it should be affected by locale; that's the point of the locale.setlocale call. This is how libc's wc functions behave.

    gvanrossum commented 16 years ago

    > But it should be affected by locale, that's the point of locale.setlocale call. This is how libc's wc functions behave.

    No, the locale should only affect 8-bit string operations, never unicode operations.

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    OK, then what is the suggested way to get back the Turkish way of doing upper/lower on i and I?

    gvanrossum commented 16 years ago

    > Ok then what is the suggested way to get back the Turkish way of doing upper/lower on i & I ?

    That's a question for Martin von Loewis. I suppose you could use 8-bit strings exclusively. Or you could use .translate() with a custom dict.
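The .translate() suggestion could look roughly like the sketch below. It is an illustration only, assuming Python 3 (where str is the Unicode type). The names TR_UPPER, TR_LOWER, tr_upper and tr_lower are made up for this example and are not part of any library.

```python
# Hypothetical translation tables for the two Turkish special cases.
# Everything else is left to the default Unicode case mappings.
TR_UPPER = {ord("i"): "\u0130", ord("\u0131"): "I"}
TR_LOWER = {ord("I"): "\u0131", ord("\u0130"): "i"}


def tr_upper(text):
    """Uppercase text using the Turkish rules for i and dotless i."""
    return text.translate(TR_UPPER).upper()


def tr_lower(text):
    """Lowercase text using the Turkish rules for I and dotted I."""
    return text.translate(TR_LOWER).lower()


print(tr_upper("istanbul"))
print(tr_lower("I\u015eIK"))
```

Applying the table before the built-in conversion means the four special characters are already in their target case when upper() or lower() runs, so the default mappings leave them alone.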

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 16 years ago

    I think too many issues get mixed in this report. I would like to ignore all but one issue, but I don't understand what the one issue is that this report should deal with.

    cartman, when you compare Python 2.4 and 2.5, could it be that the 2.4 Python was compiled --with-wctype-functions, and the 2.5 Python --without-wctype-functions? That would surely explain the difference.

    The Unicode lower/upper implementations are, by default, locale-unaware. That is correct behavior, and by design. If you want locale-dependent behavior, use 8-bit strings as Guido says.

    ISTM that the original report was resolved: the tests don't support --with-wctype-functions. This is because they assume that they know that LATIN CAPITAL LETTER A WITH DIAERESIS is a letter, which may not be the case if the isletter test is locale-specific. If this is to be fixed, the proper fix would be to just remove the test, which I advise against.
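The assumption built into the two failing re_tests entries can be illustrated as follows. This sketch assumes Python 3, where str patterns use Unicode character classes by default. The re.ASCII flag is used here only to imitate a classification in which the character is not a letter; it is not the wctype code path discussed in this issue.

```python
import re

a_umlaut = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

# With Unicode classification the character counts as a word character,
# so both patterns from the failing tests find it.
unicode_word = re.search(r"\w", a_umlaut)
unicode_boundary = re.search(r"\b.\b", a_umlaut)

# If the character is not classified as a letter, \w no longer matches.
ascii_word = re.search(r"\w", a_umlaut, re.ASCII)

print(unicode_word is not None)
print(unicode_boundary is not None)
print(ascii_word is not None)
```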

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Hi Martin,

    Actually, the only remaining question is how I can get the wctype functionality with 8-bit strings. Any example is appreciated.

    This bug itself is invalid because --with-wctype-functions is deprecated. But as I said, I just hope removing it doesn't regress Turkish functionality.

    gvanrossum commented 16 years ago

    Two easy ways to get the functionality using 8-bit strings, assuming you've already set your locale properly:

    (1) If your data is already an 8-bit string (i.e. isinstance(data, str)), simply use data.upper() or data.lower()

    (2) If your data is Unicode (i.e. isinstance(data, unicode)), convert to 8-bit using encode, apply upper/lower, and convert back to unicode. E.g. data.encode("Latin-1").upper().decode("Latin-1"). (I don't know which encoding to use though, so substitute whatever you have for Latin-1, but don't use UTF-8.)

    PS Martin: the 2.4/2.5 differences were caused by Cartman having hacked his 2.4 installation to change the default encoding.

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    Funnily,

    print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()

    works, but

    print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")

    does not.

    gvanrossum commented 16 years ago

    > Funnily,
    >
    > print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()
    >
    > works, but
    >
    > print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
    >
    > not.

    You'll have to debug this yourself.

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    I guess so. I will no longer spam this bug. Thanks for the suggestions.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 16 years ago

    > print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9") does not

    Please get your types right. "iiii" is a byte string (in Python 2.x).

    encode: unicode -> string
    decode: string -> unicode

    That you can still apply .encode to the byte string is a bug/pitfall in Python 2.x, which gets fixed in 3.x (by only supporting .encode on the unicode type).
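The type rules described here can be seen in Python 3, where the two directions are kept strictly apart. This is a minimal sketch with illustrative variable names. The codec name iso-8859-9 is the one used earlier in the thread.

```python
text = "iiii"                               # str: the Unicode type in Python 3

encoded = text.encode("iso-8859-9")         # encode: text -> bytes
roundtrip = encoded.decode("iso-8859-9")    # decode: bytes -> text

# In Python 3 the mismatched methods no longer exist at all.
bytes_has_encode = hasattr(encoded, "encode")
str_has_decode = hasattr(text, "decode")

print(type(encoded).__name__, type(roundtrip).__name__)
print(bytes_has_encode, str_has_decode)
```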

    aa2c5943-8264-4a78-97ed-7013d2cb52f6 commented 16 years ago

    I tried it like this:

    unicode("iii").encode("iso-8859-9").upper()

    It doesn't work. I'll ask on the Python users list. Thanks.