Closed 13 years ago
On 9/8/2011 4:32 AM, Ezio Melotti wrote:
So to summarize a bit, there are different possible levels of strictness:

1) all the possible encodable values, including the ones >10FFFF;
2) values in range 0..10FFFF;
3) values in range 0..10FFFF except surrogates (aka scalar values);
4) values in range 0..10FFFF except surrogates and noncharacters;

and this is what is currently available in Python:

1) not available, probably it will never be;
2) available through the 'surrogatepass' error handler;
3) default behavior (i.e. with the 'strict' error handler);
4) currently not available.
Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about). The possible options are:
- add a new codec (actually one for each UTF encoding);
- add a new error handler that explicitly disallows noncharacters;
- change the meaning of 'strict' to match option 4;
If 'strict' meant option 4, then 'scalarpass' could mean option 3. 'surrogatepass' would then mean 'pass surrogates also, in addition to non-char scalars'.
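For reference, the availability of levels 2-4 can be checked against current CPython directly; a quick sketch of today's Python 3 behavior:

```python
# Level 2: the 'surrogatepass' error handler lets surrogate code points through UTF-8
assert '\ud800'.encode('utf-8', 'surrogatepass') == b'\xed\xa0\x80'

# Level 3: the default 'strict' handler rejects surrogates on encode
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError:
    pass  # expected
else:
    raise AssertionError("strict should reject lone surrogates")

# Level 4 is the missing one: noncharacters currently encode without complaint
assert '\ufdd0'.encode('utf-8') == b'\xef\xb7\x90'
```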
"Terry J. Reedy" <report@bugs.python.org> wrote on Thu, 08 Sep 2011 18:56:11 -0000:
On 9/8/2011 4:32 AM, Ezio Melotti wrote:

> So to summarize a bit, there are different possible levels of strictness:
> 1) all the possible encodable values, including the ones >10FFFF;
> 2) values in range 0..10FFFF;
> 3) values in range 0..10FFFF except surrogates (aka scalar values);
> 4) values in range 0..10FFFF except surrogates and noncharacters;
>
> and this is what is currently available in Python:
> 1) not available, probably it will never be;
> 2) available through the 'surrogatepass' error handler;
> 3) default behavior (i.e. with the 'strict' error handler);
> 4) currently not available.
>
> Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about). The possible options are:
> * add a new codec (actually one for each UTF encoding);
> * add a new error handler that explicitly disallows noncharacters;
> * change the meaning of 'strict' to match option 4;
> If 'strict' meant option 4, then 'scalarpass' could mean option 3. 'surrogatepass' would then mean 'pass surrogates also, in addition to non-char scalars'.
I'm pretty sure that anything that claims to be UTF-{8,16,32} needs
to reject both surrogates *and* noncharacters. Here's something from the
published Unicode Standard's p.24 about noncharacter code points:
• Noncharacter code points are reserved for internal use, such as for
sentinel values. They should never be interchanged. They do, however,
have well-formed representations in Unicode encoding forms and survive
conversions between encoding forms. This allows sentinel values to be
preserved internally across Unicode encoding forms, even though they are
not designed to be used in open interchange.
And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
C2 A process shall not interpret a noncharacter code point as an
abstract character.
• The noncharacter code points may be used internally, such as for
sentinel values or delimiters, but should not be exchanged publicly.
I'd have to check the fine print, but I am pretty sure that "shall not" is an imperative form. We understand that to mean that a conforming process *must*not* do that. It's because of that wording that in Perl, using either of {en,de}code() with any of the "UTF-{8,16,32}" encodings, including the LE/BE versions as appropriate, will neither produce nor accept a noncharacter code point like FDD0 or FFFE.
Do you think we may perhaps have misread that conformance clause?
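For concreteness, the set those clauses cover is exactly 66 code points: U+FDD0..U+FDEF, plus the last two code points of every plane. A minimal Python sketch of the test (`is_noncharacter` is a hypothetical helper name, not a stdlib function):

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacter code points:
    U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) >= 0xFFFE

assert is_noncharacter(0xFDD0)
assert is_noncharacter(0xFFFE)
assert is_noncharacter(0x10FFFF)       # last code point of the last plane
assert not is_noncharacter(0xFFFD)     # U+FFFD REPLACEMENT CHARACTER is a real character
```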
Using Perl's special, loose-fitting "utf8" encoding, you can get it to emit noncharacter code points and even surrogates, but you have to suppress certain warnings to make that happen quietly. You can only do this with "utf8", not with any of the UTF-16 or UTF-32 flavors. There we give users no choice, so you must be strict. I agree this is not fully orthogonal.
Note that this is the normal thing that people do:
binmode(STDOUT, ":utf8");
which is the *loose* version. The strict one is "utf8-strict" or "UTF-8":
open(my $fh, "< :encoding(UTF-8)", $pathname)
So it is a bit too easy to get the loose one. We felt we had to do this because we were already using the loose definition (and allowing up to chr(2**32) etc) when the Unicode Consortium made clear what sorts of things must not be accepted, or perhaps, before we made ourselves clear on this. This will have been back in 2003, when I wasn't paying very close attention.
I think that, just like Perl, Python has a legacy of the original loose definition. So some way to accommodate that legacy while still allowing for a conformant application should be devised. My concern with Python is that people tend to make their own manual calls to encode/decode a lot more often than they do in Perl. That means that if you only catch this at the stream-encoding level, you'll miss cases, because people will use binary I/O and bypass the check.
--tom
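Tom's point about binary I/O is easy to demonstrate in Python: bytes written to a binary stream never pass through a codec, so nothing checks them; only an explicit decode (or a text-mode stream) applies the check. A sketch using an in-memory buffer:

```python
import io

raw = io.BytesIO()
raw.write(b'\xed\xa0\x80')   # an encoded lone surrogate; binary I/O accepts it as-is

# The check only happens when the bytes finally go through a codec:
try:
    raw.getvalue().decode('utf-8')
except UnicodeDecodeError:
    pass  # the default 'strict' handler rejects the surrogate
else:
    raise AssertionError("expected UnicodeDecodeError")
```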
Below I show a bit of how this works in Perl. Currently the builtin
utf8 encoding is controlled somewhat differently from how the Encode
module's encode/decode functions are. Yes, this is not my idea of good.
This shows that noncharacters and surrogates do not survive the
encoding/decoding process for UTF-16:
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFDD0)))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFFFE)))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xD800)))' | uniquote -v
UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
If you pass a third argument to encode/decode, you can tell it what to
do on error; an argument of 1 raises an exception. Not supplying a
third argument gets the "default" behavior, which varies by encoding.
(The careful programmer is apt to want to pass in an appropriate
bit mask of things like DIE_ON_ERR, WARN_ON_ERR, RETURN_ON_ERR,
LEAVE_SRC, PERLQQ, HTMLCREF, or XMLCREF.)
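Python's rough analogue of that third argument is the `errors` parameter, which plays the part of those flags (raise, substitute, or escape):

```python
# The default is 'strict': raise on anything unencodable
try:
    '\ud800'.encode('utf-8', errors='strict')
except UnicodeEncodeError:
    pass  # expected

# 'replace' swaps in a substitute, roughly like Perl's default replacement behavior
assert '\ud800'.encode('utf-8', errors='replace') == b'?'

# 'backslashreplace' keeps the information as an escape, a bit like PERLQQ
assert '\ud800'.encode('utf-8', errors='backslashreplace') == b'\\ud800'
```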
With "utf8" vs "UTF-8" using encode(), the default behavior is to swap in
the Unicode replacement character for things that don't map to the given
encoding, as you saw above with UTF-16:
% perl -C0 -MEncode -wle 'print encode("utf8", chr(0xFDD0))' | uniquote -v
\N{U+FDD0}
% perl -C0 -MEncode -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -C0 -MEncode -wle 'print encode("utf8", chr(0xD800))' | uniquote -v
\N{U+D800}
% perl -C0 -MEncode -wle 'print encode("UTF-8", chr(0xD800))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -C0 -MEncode=:all -wle 'print encode("utf8", chr(0x100_0000))' | uniquote -v
\N{U+1000000}
% perl -C0 -MEncode=:all -wle 'print encode("UTF-8", chr(0x100_0000))' | uniquote -v
\N{REPLACEMENT CHARACTER}
With the builtin "utf8" encoding, which does *not* go through the
Encode module, you instead control all this through lexical
warnings/exceptions categories. By default, you get a warning if
you try to use noncharacter, surrogate, or nonunicode code points
even on a loose utf8 stream (which is what -CS gets you):
% perl -CS -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
Code point 0x1000000 is not Unicode, may not be portable at -e line 1.
\N{U+FDD0}
\N{U+D800}
\N{U+1000000}
Notice I didn't ask for warnings there, but I still got them. This
promotes all utf8 warnings into exceptions, thus dying on the first one
it finds:
% perl -CS -Mwarnings=FATAL,utf8 -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
You can control these separately. For example, these all die of an
exception:
% perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xFDD0)'
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
% perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xD800)'
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
% perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0x100_0000)'
Code point 0x1000000 is not Unicode, may not be portable at -e line 1.
While these do not:
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "nonchar"; print chr(0xFDD0)' | uniquote
\N{U+FDD0}
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "surrogate"; print chr(0xD800)' | uniquote
\N{U+D800}
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "non_unicode"; print chr(0x100_0000)' | uniquote
\N{U+1000000}
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings qw(nonchar surrogate non_unicode);
print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
\N{U+FDD0}
\N{U+D800}
\N{U+1000000}
My long-ago memory is that 'should not' is slightly looser in w3c parlance than 'must not'. However, it is a moot point if we decide to follow the 'should' in 3.3 for the default 'strict' mode, which both Ezio and I think we 'should' ;-). Our 'errors' parameter makes it easy to request something else, but it has to be explicit.
We could also look at what other languages do and/or ask the Unicode Consortium.
Ezio Melotti <report@bugs.python.org> wrote on Mon, 19 Sep 2011 11:11:48 -0000:
> We could also look at what other languages do and/or ask the Unicode Consortium.
I will look at what Java does a bit later on this morning, which is the only other commonly used language besides C that I feel even reasonably competent at. I seem to recall that Java changed its default behavior on certain Unicode decoding issues from warnings to exceptions between one release and the next, but can't remember any details.
As the Perl Foundation is a member of the Unicode Consortium and I am on the mailing list, I suppose I could just ask them. I feel a bit timid though, because the last thing I brought up there was based on a subtle misunderstanding of mine regarding the IDC and Pattern_Syntax properties. I hate looking dumb twice in a row. :)
--tom
Tom Christiansen wrote:
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p. 24 about noncharacter code points:
>
> • Noncharacter code points are reserved for internal use, such as for
> sentinel values. They should never be interchanged. They do, however,
> have well-formed representations in Unicode encoding forms and survive
> conversions between encoding forms. This allows sentinel values to be
> preserved internally across Unicode encoding forms, even though they are
> not designed to be used in open interchange.
>
> And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
>
> C2 A process shall not interpret a noncharacter code point as an
> abstract character.
>
> • The noncharacter code points may be used internally, such as for
> sentinel values or delimiters, but should not be exchanged publicly.
You have to remember that Python is used to build applications. It's up to the applications to conform to Unicode or not and the application also defines what "exchange" means in the above context.
Python itself needs to be able to deal with assigned non-character code points as well as unassigned code points or code points that are part of special ranges such as the surrogate ranges.
I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because we have a way to optionally allow these via an error handler, but -1 on making changes that cause full range round-trip safety of the UTF encodings to be lost without a way to turn the functionality back on.
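The round-trip safety mentioned above is exactly what 'surrogatepass' preserves today; a quick sketch:

```python
# 'surrogatepass' restores full round-trip for lone surrogates through UTF-8
s = '\ud800\udfff'
data = s.encode('utf-8', 'surrogatepass')
assert data.decode('utf-8', 'surrogatepass') == s

# while the default 'strict' handler refuses them outright
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    pass  # expected
```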
No good news on the Java front. They do all kinds of things wrong.
For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8
input stream, which is illegal. There's more they do wrong, including
in their documentation, but I won't bore you with their errors.
I'm going to seek clarification on some matters here.
--tom
It appears that I'm right about surrogates, but wrong about noncharacters. I'm seeking a clarification there.
--tom
Closing this bug as PEP 393 is now implemented and makes so-called "narrow builds" obsolete. Python now has an adaptive internal representation that is able to fit all Unicode characters.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['expert-regex', 'type-feature']
title = 'Python lib re cannot handle Unicode properly due to narrow/wide bug'
updated_at =
user = 'https://bugs.python.org/tchrist'
```
bugs.python.org fields:
```python
activity =
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date =
closer = 'pitrou'
components = ['Regular Expressions']
creation =
creator = 'tchrist'
dependencies = []
files = ['23006', '23025']
hgrepos = []
issue_num = 12729
keywords = []
message_count = 59.0
messages = ['141917', '141930', '141992', '141995', '142005', '142024', '142036', '142037', '142038', '142039', '142041', '142042', '142043', '142044', '142047', '142053', '142054', '142064', '142066', '142069', '142070', '142076', '142079', '142084', '142085', '142089', '142091', '142093', '142096', '142097', '142098', '142102', '142107', '142121', '142749', '142877', '143033', '143054', '143055', '143056', '143061', '143088', '143111', '143123', '143392', '143432', '143433', '143446', '143702', '143720', '143735', '144256', '144260', '144266', '144287', '144289', '144307', '144312', '147776']
nosy_count = 15.0
nosy_names = ['lemburg', 'gvanrossum', 'terry.reedy', 'belopolsky', 'pitrou', 'vstinner', 'jkloth', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'v+python', 'r.david.murray', 'zbysz', 'abacabadabacaba', 'tchrist']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12729'
versions = ['Python 3.3']
```