Closed aa2c5943-8264-4a78-97ed-7013d2cb52f6 closed 16 years ago
Using python 2.5 revision 59479 from release25-maint branch,
[~/python-2.5]> LD_LIBRARY_PATH=/home/cartman/python-2.5: ./python ./Lib/test/test_re.py
test_anyall (__main__.ReTests) ... ok
test_basic_re_sub (__main__.ReTests) ... ok
test_bigcharset (__main__.ReTests) ... ok
test_bug_113254 (__main__.ReTests) ... ok
test_bug_1140 (__main__.ReTests) ... ok
test_bug_114660 (__main__.ReTests) ... ok
test_bug_117612 (__main__.ReTests) ... ok
test_bug_418626 (__main__.ReTests) ... ok
test_bug_448951 (__main__.ReTests) ... ok
test_bug_449000 (__main__.ReTests) ... ok
test_bug_449964 (__main__.ReTests) ... ok
test_bug_462270 (__main__.ReTests) ... ok
test_bug_527371 (__main__.ReTests) ... ok
test_bug_545855 (__main__.ReTests) ... ok
test_bug_581080 (__main__.ReTests) ... ok
test_bug_612074 (__main__.ReTests) ... ok
test_bug_725106 (__main__.ReTests) ... ok
test_bug_725149 (__main__.ReTests) ... ok
test_bug_764548 (__main__.ReTests) ... ok
test_bug_817234 (__main__.ReTests) ... ok
test_bug_926075 (__main__.ReTests) ... ok
test_bug_931848 (__main__.ReTests) ... ok
test_category (__main__.ReTests) ... ok
test_constants (__main__.ReTests) ... ok
test_empty_array (__main__.ReTests) ... ok
test_expand (__main__.ReTests) ... ok
test_finditer (__main__.ReTests) ... ok
test_flags (__main__.ReTests) ... ok
test_getattr (__main__.ReTests) ... ok
test_getlower (__main__.ReTests) ... ok
test_groupdict (__main__.ReTests) ... ok
test_ignore_case (__main__.ReTests) ... ok
test_non_consuming (__main__.ReTests) ... ok
test_not_literal (__main__.ReTests) ... ok
test_pickling (__main__.ReTests) ... ok
test_qualified_re_split (__main__.ReTests) ... ok
test_qualified_re_sub (__main__.ReTests) ... ok
test_re_escape (__main__.ReTests) ... ok
test_re_findall (__main__.ReTests) ... ok
test_re_groupref (__main__.ReTests) ... ok
test_re_groupref_exists (__main__.ReTests) ... ok
test_re_match (__main__.ReTests) ... ok
test_re_split (__main__.ReTests) ... ok
test_re_subn (__main__.ReTests) ... ok
test_repeat_minmax (__main__.ReTests) ... ok
test_scanner (__main__.ReTests) ... ok
test_search_coverage (__main__.ReTests) ... ok
test_search_star_plus (__main__.ReTests) ... ok
test_special_escapes (__main__.ReTests) ... ok
test_sre_character_class_literals (__main__.ReTests) ... ok
test_sre_character_literals (__main__.ReTests) ... ok
test_stack_overflow (__main__.ReTests) ... ok
test_sub_template_numeric_escape (__main__.ReTests) ... ok
test_symbolic_refs (__main__.ReTests) ... ok
test_weakref (__main__.ReTests) ... ok
----------------------------------------------------------------------
Ran 55 tests in 0.194s

OK
Running re_tests test suite
=== Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
=== Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4')
Can't reproduce.
Like before, what platform, compiler etc.? Does using ./configure --with-pydebug make a difference? What's the LD_LIBRARY_PATH for?
gcc 4.3, Linux 2.6.18, 32bit.
Without LD_LIBRARY_PATH it would use the system libraries and not the compiled ones which anyway is not wanted.
Configure line used is (damn I forgot to specify this before, sorry)
--with-fpectl \
--enable-shared \
--enable-ipv6 \
--with-threads \
--enable-unicode=ucs4 \
--with-wctype-functions
--with-pydebug doesn't help.
> Without LD_LIBRARY_PATH it would use the system libraries and not the compiled ones which anyway is not wanted.
What system libraries?
Does it make a difference if you don't specify either of
--enable-unicode=ucs4 \ --with-wctype-functions
?
Is GCC 4.3 released yet?
> What system libraries?
libpython2.5.so.1.0 , this is a shared lib build after all.
> Does it make a difference if you don't specify either of
> --enable-unicode=ucs4 \ --with-wctype-functions
Removing --with-wctype-functions fixes the issue.
> Is GCC 4.3 released yet?
Not yet, but soon. It's less buggy compared to 4.1 and 4.2 at the moment.
>> Is GCC 4.3 released yet?
> Not yet, but soon. It's less buggy compared to 4.1 and 4.2 at the moment.
Not quite yet, gcc 4.3 had a big inlining bug that was just corrected two weeks ago: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434 You may have encountered this bug, or another similar one...
> Not quite yet, gcc 4.3 had a big inlining bug that was just corrected two weeks ago: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434 You may have encountered this bug, or another similar one...
Two weeks ago is too old for me, I am using SVN snapshot from yesterday :-)
In total, removing --with-wctype-functions fixes the following regression tests:
test_codecs test_re test_ucn test_unicodedata
Remove test_ucn from the list; it still fails, but that's for another bug report.
Any ideas/comments on how to move forward with this?
Thanks, ismail
Focus on how using --with-wctype-functions changes things and how this could affect the regex implementation. (I wouldn't be surprised if the other failing tests were due to the regex bugs.)
The Python README says --with-wctype-functions is deprecated and will be removed in Python 2.6, so I don't think it's worth fixing now. Also, test failures with --with-wctype-functions seem to be known, according to Google.
What I wonder is whether removing --with-wctype-functions causes any regressions under the Turkish locale. I will do some research on that.
Indeed, there seem to be regressions:
Python 2.4 :
[~]> python
Python 2.4.4 (#1, Oct 23 2007, 11:25:50)
[GCC 3.4.6] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("ıııı")
ıııı
>>> print unicode("ıııı").upper()
IIII
>>> print unicode("iiiii").upper()
İİİİİ
>>> print unicode("İİİİİ").lower()
iiiii
>>> print unicode("IIIIIII").lower()
ııııııı
Python 2.5 (incorrect) :
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("iiiii").upper()
IIIII
>>> print unicode("ıııı").upper()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
ordinal not in range(128)
>>> print unicode("iiii").upper()
IIII
Looks like wctypes should not be dropped.
The situation is even more complicated. The following calls behave _correctly_ when wctypes is enabled:
>>> print unicode("iiiii").upper()
İİİİİ
>>> print unicode("IIII").lower()
ıııı
The following don't work even if wctypes is enabled:
>>> print unicode("ıııı").upper()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
ordinal not in range(128)
>>> print unicode("İİİİİ").lower()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
ordinal not in range(128)
All four of these calls work fine in python 2.4 when wctypes is enabled.
Martin, can you have a look at this?
Cartman, can you produce a unittest for the correct behavior that only uses ASCII input (using \u.... instead of just typing Turkish characters)?
Test works fine when using the \u syntax. You have to use the unicode() with Turkish characters to get the error. See attached test2.py
With python 2.4 :
[~]> python test2.py
Following should print I
I
Following should print i
i
With python 2.5 SVN :
[~/python-2.5]> ./python ~/test2.py
Following should print I
Got a unicode decode error
Following should print i
Got a unicode decode error
So in conclusion,
The attached test.py tests Turkish corner cases of lower()/upper(). The correct output is what python 2.4 gives:
Following should print I
I
Following should print i
i
Following should print İ
İ
Following should print ı
ı
Hm. The test2.py file, when I download it, contains the two bytes "\xc4\xb1" in the first unicode() call, and "\xc4\xb0" in the second one. This is *always* supposed to produce a UnicodeDecodeError, since it would use the default encoding which is ASCII. So I don't understand how you get this to pass with 2.4 at all.
When you replace the arguments with these hex escapes, does it still pass for you? Or does that break it?
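The point about the default codec can be checked directly. What follows is a minimal sketch, written for Python 3 (where an explicit `bytes.decode` call replaces the implicit conversion that Python 2's `unicode()` performs), using the two byte pairs quoted above. The variable names are only illustrative and do not come from test2.py.

```python
# The two UTF-8 byte sequences reported to be in test2.py.
dotless_i_bytes = b"\xc4\xb1"  # U+0131 LATIN SMALL LETTER DOTLESS I
dotted_I_bytes = b"\xc4\xb0"   # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

# Decoding with an explicit UTF-8 codec succeeds.
dotless_i = dotless_i_bytes.decode("utf-8")
dotted_I = dotted_I_bytes.decode("utf-8")

# Decoding as ASCII fails, because byte 0xc4 is outside range(128).
try:
    dotless_i_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(ascii_failed)
```

In Python 2 terms, this corresponds to the difference between `unicode("...", "utf-8")` and a bare `unicode("...")` under the stock ASCII default encoding.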
Replacing Turkish characters with hex versions in test2.py still results in UnicodeDecodeError and works with python 2.4.
> Replacing Turkish characters with hex versions in test2.py still results in UnicodeDecodeError and works with python 2.4.
I'm hoping Martin can confirm this, but I suspect that this is due to a tightening of the rules for converting from 8-bit strings to unicode strings.
What happens if you change to unicode("....", "utf-8")?
Ok that was because we had modified default encoding in Lib/site.py to be utf-8. Sorry!
The only problem left is that the last 2 conversions in test.py give wrong results when wctypes is disabled, that is:
print u"\u0069".upper()
should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
print u"\u0049".lower()
should give \u0131 (LATIN SMALL LETTER DOTLESS I)
These transformations work fine with python2.5 when --with-wctype-functions is used.
> print u"\u0069".upper()
> should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
> print u"\u0049".lower()
> should give \u0131 (LATIN SMALL LETTER DOTLESS I)
> These transformations work fine with python2.5 when --with-wctype-functions is used.
I think that is rather a bug in the wctype functions. Those are ASCII letters 'i' and 'I' and their upper/lower versions are fixed by the Unicode standard to be the corresponding ASCII letters ('I' and 'i'). The Unicode case conversions are not affected by locale.
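The fixed mapping described here can be illustrated with a short sketch. It is written for Python 3, whose `str` type plays the role of Python 2's `unicode`, and it only inspects the two ASCII letters under discussion; the variable names are illustrative.

```python
import unicodedata

# str.upper()/str.lower() follow the Unicode case mappings and do not
# consult the C locale.
upper_of_i = "\u0069".upper()  # 'i' -> U+0049
lower_of_I = "\u0049".lower()  # 'I' -> U+0069

upper_name = unicodedata.name(upper_of_i)
lower_name = unicodedata.name(lower_of_I)
print(upper_name)
print(lower_name)
```

Neither result is U+0130 or U+0131, which matches the claim that the Unicode case conversions for these two letters stay within ASCII.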
But it should be affected by locale; that's the point of the locale.setlocale call. This is how libc's wc functions behave.
> But it should be affected by locale; that's the point of the locale.setlocale call. This is how libc's wc functions behave.
No, the locale should only affect 8-bit string operations, never unicode operations.
Ok then what is the suggested way to get back the Turkish way of doing upper/lower on i & I ?
> Ok then what is the suggested way to get back the Turkish way of doing upper/lower on i & I ?
That's a question for Martin von Loewis. I suppose you could use 8-bit strings exclusively. Or you could use .translate() with a custom dict.
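The `.translate()` suggestion could look roughly like the sketch below. It is written for Python 3, and the helper names `tr_upper` and `tr_lower` as well as the two tables are hypothetical, made up for this illustration; they are not part of Python or of the attached test files. The idea is to remap only the four Turkish-specific letters first and let the ordinary Unicode case conversion handle everything else.

```python
# Hypothetical tables: remap only the letters whose Turkish casing differs
# from the default Unicode mapping.
TR_UPPER = {ord("i"): "\u0130",   # i -> İ (capital I with dot above)
            ord("\u0131"): "I"}   # ı -> I
TR_LOWER = {ord("I"): "\u0131",   # I -> ı (dotless i)
            ord("\u0130"): "i"}   # İ -> i


def tr_upper(text):
    """Uppercase text using Turkish rules for i and ı (illustrative helper)."""
    return text.translate(TR_UPPER).upper()


def tr_lower(text):
    """Lowercase text using Turkish rules for I and İ (illustrative helper)."""
    return text.translate(TR_LOWER).lower()


print(tr_upper("istanbul"))
print(tr_lower("I\u015eIK"))
```

Doing the translation before the generic `upper()`/`lower()` call keeps the special cases in one place and leaves all other characters to the standard mapping.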
I think too many issues get mixed in this report. I would like to ignore all but one issue, but I don't understand what the one issue is that this report should deal with.
cartman, when you compare Python 2.4 and 2.5, could it be that the 2.4 Python was compiled --with-wctype-functions, and the 2.5 Python --without-wctype-functions? That would surely explain the difference.
The Unicode lower/upper implementations are, by default, locale-unaware. That is correct behavior, and by design. If you want locale-dependent behavior, use 8-bit strings as Guido says.
ISTM that the original report was resolved - the tests don't support --with-wctype-functions. This is because they assume that they know that LATIN CAPITAL LETTER A WITH DIAERESIS is a letter - which may not be the case if the isletter test is locale-specific. If this is to be fixed, the proper fix would be to just remove the test, which I advise against.
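The assumption built into the two failing re_tests entries can be sketched as follows. This is Python 3 code, where `str` patterns use Unicode classification by default (the role of the `(?u)` flag in the report). The `re.ASCII` line is only an analogy, assumed here to stand in for a classification under which U+00C4 is not treated as a letter; it does not reproduce what `--with-wctype-functions` does.

```python
import re
import unicodedata

ch = "\xc4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

# Unicode database classification of the character.
category = unicodedata.category(ch)

# With Unicode classification, the character is a word character,
# so both patterns from the failing tests find it.
word_match = re.search(r"\w", ch)
boundary_match = re.search(r"\b.\b", ch)

# Analogy only: with ASCII-only classification the same \w pattern fails.
ascii_match = re.search(r"\w", ch, re.ASCII)

print(category, word_match is not None, ascii_match is None)
```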
Hi Martin,
Actually the only problem is how can I get wctype functionality with 8-bit strings, any example is appreciated.
This bug itself is invalid because --with-wctype-functions is deprecated. But as I said I just hope removing that doesn't regress Turkish functionality.
Two easy ways to get the functionality using 8-bit strings, assuming you've already set your locale properly:
(1) If your data is already an 8-bit string (i.e. isinstance(data, str)), simply use data.upper() or data.lower()
(2) If your data is Unicode (i.e. isinstance(data, unicode)), convert to 8-bit using encode, apply upper/lower, and convert back to unicode. E.g. data.encode("Latin-1").upper().decode("Latin-1"). (I don't know which encoding to use though -- So substitute whatever you have for Latin-1, but don't use UTF-8.)
PS Martin: the 2.4/2.5 differences were caused by Cartman having hacked his 2.4 installation to change the default encoding.
Funnily,
print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()
works, but
print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
not.
> Funnily,
> print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()
> works, but
> print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
> not.
You'll have to debug this yourself.
I guess so, I will no longer spam this bug. Thanks for the suggestions.
> print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9") does not
Please get your types right. "iiii" is a byte string (in Python 2.x). encode: unicode -> string decode: string -> unicode
That you still can apply .encode to the byte string is a bug/pit fall in Python 2.x, which gets fixed in 3.x (by only supporting .encode on the unicode type).
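The direction of the two conversions, and the stricter behavior mentioned for 3.x, can be checked with a small Python 3 sketch. The variable names are illustrative.

```python
text = "iiii"                       # a text (unicode) string in Python 3

# encode: text -> bytes
data = text.encode("iso-8859-9")

# decode: bytes -> text
round_trip = data.decode("iso-8859-9")

# In Python 3, bytes objects have no .encode method and str objects
# have no .decode method, so the mix-up cannot happen silently.
bytes_has_encode = hasattr(data, "encode")
str_has_decode = hasattr(text, "decode")

print(type(data).__name__, round_trip, bytes_has_encode, str_has_decode)
```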
I tried it like this:
unicode("iii").encode("iso-8859-9").upper()
It doesn't work; I'll ask on the python users list. Thanks.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/loewis'
closed_at =
created_at =
labels = ['invalid', 'type-bug', 'tests']
title = 'test_re.py fails'
updated_at =
user = 'https://bugs.python.org/donmez'
```
bugs.python.org fields:
```python
activity =
actor = 'donmez'
assignee = 'loewis'
closed = True
closed_date =
closer = 'gvanrossum'
components = ['Tests']
creation =
creator = 'donmez'
dependencies = []
files = ['9005', '9006']
hgrepos = []
issue_num = 1609
keywords = []
message_count = 34.0
messages = ['58527', '58542', '58548', '58553', '58556', '58559', '58565', '58585', '58587', '58639', '58700', '58824', '58825', '58826', '58830', '58831', '58832', '58833', '58834', '58835', '58837', '58843', '58844', '58847', '58848', '58849', '58862', '58869', '58884', '58887', '58888', '58890', '58927', '58928']
nosy_count = 4.0
nosy_names = ['gvanrossum', 'loewis', 'amaury.forgeotdarc', 'donmez']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1609'
versions = ['Python 2.5']
```