
Python lib re cannot handle Unicode properly due to narrow/wide bug #56938

Closed 5c59cbd7-8186-4351-8391-b403f3a3a73f closed 12 years ago

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago
BPO 12729
Nosy @malemburg, @gvanrossum, @terryjreedy, @abalkin, @pitrou, @vstinner, @jkloth, @ezio-melotti, @bitdancer
Files
  • utf16.py: Revised prototype w/ codepoint indexing/slicing
  • utf16.py: 3rd version with improved iteration
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields: assignee: none · labels: expert-regex, type-feature · title: 'Python lib re cannot handle Unicode properly due to narrow/wide bug'

bugs.python.org fields: issue: 12729 · components: Regular Expressions · creator: tchrist · closer: pitrou · status: closed · resolution: out of date · stage: resolved · type: enhancement · versions: Python 3.3 · url: https://bugs.python.org/issue12729

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization like UTF‑*". Because sometimes you must specify ".." to match a single Unicode character -- whenever those code points are above the BMP and you are on a narrow build -- Python regexes cannot be reliably used for Unicode text.

     % python3.2
     Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
     [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import re
     >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
     >>> print(g)
    ᾲ
     >>> print(re.search(r'\w', g))
     <_sre.SRE_Match object at 0x10051f988>
     >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
     >>> print(p)
    𝒫
     >>> print(re.search(r'\w', p))
    None
     >>> print(re.search(r'..', p))   # ← 𝙏𝙃𝙄𝙎 𝙄𝙎 𝙏𝙃𝙀 𝙑𝙄𝙊𝙇𝘼𝙏𝙄𝙊𝙉 𝙍𝙄𝙂𝙃𝙏 𝙃𝙀𝙍𝙀 
    <_sre.SRE_Match object at 0x10051f988>
     >>> print(len(chr(0x1D4AB)))
    2

    That is illegal in Unicode regular expressions.
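A minimal sketch of the usual workaround (the helper name is illustrative; nothing like it is in the stdlib): pair up surrogates by hand so that narrow builds iterate logical code points.

    import re
    import sys

    # Match a surrogate pair as one unit, else any single character.
    _unit = re.compile('[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)

    def code_points(s):
        if sys.maxunicode > 0xFFFF:   # wide build: strings are code points
            return list(s)
        return _unit.findall(s)       # narrow build: join surrogate pairs

    p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
    print(len(p))                     # 2 on a narrow build, 1 on a wide build
    print(len(code_points(p)))        # 1 on either build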

    bitdancer commented 13 years ago

    This is an acknowledged problem with Python narrow builds, and applies to much more than just regex processing.

    terryjreedy commented 13 years ago

    Does the regex module handle these particular issues better?

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    "Terry J. Reedy" \report@bugs.python.org\ wrote on Fri, 12 Aug 2011 22:21:59 -0000:

    Does the regex module handle these particular issues better?

    No, it currently does not. One would have to ask Matthew directly, but I believe it was because he was trying to stay compatible with re, sometimes apparently even if that means being bug compatible. I have brought it to his attention though, and at last report he was pondering the matter.

In contrast to how Python behaves on narrow builds, even though Java uses UTF-16 for its internal representation of strings, its Pattern class is quite adamant about dealing in logical code points alone. Besides running afoul of tr18, it is senseless to do otherwise. A dot is one Unicode code point, no matter whether you have 8-bit code units, 16-bit code units, or 32-bit code units. Similarly, character classes and their negations only match entire code points, never pieces of the same.

    ICU's regexes work the same way the normal Java Pattern library does.
    So too do Perl, Ruby, and Go. Python is really the odd man out here.

    Almost.

One interesting counterexample is the vim editor. It has dot match a complete grapheme no matter how many code points that requires, because we're dealing with user-visible characters now, not programmer-visible ones.

It is an unreasonable burden to make the programmer deal with the fine-grained details of low-level serialization schemes instead of *at least* the code point level of operations, which is the minimum for getting real work done. (Note that tr18 admits that accessing text at the code point level meets only programmer expectations, not those of the user, and therefore to meet user expectations much more elaborate patterns must necessarily be constructed than if logical groups of coarser granularity than code points alone are supported.)

    Python should not be subject to changing its behavior from one build to the next. This astonishing narrow-vs-wide build behavior makes it virtually impossible to write portable code to work on arbitrary Unicode text. You cannot even know whether you need to match one dot or two to get a single code point, and similarly for character indexing, etc. Even identifiers come into play. Surrogates should be utterly nonexistent/invisible at this, the normal level of operation.

    An API that minimally but uniformly deals with logical code points and nothing finer in granularity is the only way to go here. Please trust me on this one. Graphemes (tr18 Level 2) and collation elements (Level 3) will someday build on that, but one must first support code points properly. That's why it's a Level 1 requirement.

    --tom

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 13 years ago

In a narrow build, a codepoint in the astral plane is encoded as a surrogate pair.

    I could implement a workaround for it in the regex module, but I think that the proper place to fix it is in the language as a whole, perhaps by implementing PEP-393 ("Flexible String Representation").

    bitdancer commented 13 years ago

    Tom, note that nobody is arguing that what you are requesting is a bad thing :)

    As far as I know, Matthew is the only one currently working on the regex support in Python. (Other developers will commit small fixes if someone proposes a patch, but no one that I've seen other than Matthew is working on the deeper issues.) If you want to help out that would be great.

And as far as this particular issue goes, yes the difference between the narrow and wide build has been a known issue for a long time, but has become less and less ignorable as unicode adoption has grown. Martin's PEP that Matthew references is the only proposed fix that I know of. There is a GSoC project working on it, but I have no idea what the status is.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

David Murray <report@bugs.python.org> wrote:

    Tom, note that nobody is arguing that what you are requesting is a bad thing :)

There looked to be some minor resistance, based on absolute backwards compatibility even if wrong, to changing anything *at all* in re, even things that to my jaded eye seem like actual bugs.

    There are bugs, and then there are bugs.

    In my survey of Unicode support across 7 programming languages for OSCON

    http://training.perl.com/OSCON/index.html

I came across a lot of weirdnesses, especially at first, when the learning curve was high. Sure, I found it odd that unlike Java, Perl, and Ruby, Python didn't offer regular casemapping on strings, only the simple character-based mapping. But that doesn't make it a bug, which is why I filed it as a feature/enhancement request/wish, not as a bug.

    I always count as bugs not handling Unicode text the way Unicode says it must be handled. Such things would be:

    Emitting CESU-8 when told to emit UTF-8.
    
    Violating the rule that UTF-8 must be in the shortest possible encoding.
    
    Not treating a code point as a letter when the supported version of the
    UCD says it is.  (This can happen if internal rules get out of sync
    with the current UCD.)
    
    Claiming one does "the expected thing on Unicode" for case-insensitive
    matches when not doing what Unicode says you must minimally do: use at
    least the simple casefolds, if not in fact the full ones.
    
    Saying \w matches Unicode word characters when one's definition of
    word characters differs from that of the supported version of the UCD.

Supporting Unicode vX.Y.Z is more than adding more characters. All the behaviors specified in the UCD have to be updated too, or else you are just ISO 10646. I believe some of Python's Unicode bugs happened because folks weren't aware which things in Python were defined by the UCD or by various UTS reports yet were not being directly tracked that way. That's why it's important to always fully state which version of these things you follow.

    Other bugs, many actually, are a result of the narrow/wide-build untransparency.

There is wiggle room in some of these. For example, in the one that applies to re, you could -- in a sense -- remove the bug by no longer claiming to do case-insensitive matches on Unicode. I do not find that very useful. Javascript works this way: it doesn't do Unicode casefolding. In Java you have to ask nicely with the extra UNICODE_CASE flag, aka "(?u)", used with CASE_INSENSITIVE, aka "(?i)".

    Sometimes languages provide different but equivalent interfaces to the same functionality. For example, you may not support the Unicode property \p{NAME=foobar} in patterns but instead support \N{foobar} in patterns and hopefully also in strings. That's just fine. On slightly shakier ground but still I think defensible is how one approaches support for the standard UCD properties:

     Case_Folding         Simple_Case_Folding
     Titlecase_Mapping    Simple_Titlecase_Mapping
     Uppercase_Mapping    Simple_Uppercase_Mapping
     Lowercase_Mapping    Simple_Lowercase_Mapping

One can support folding, for example, via (?i) and not have to directly support a Case_Folding property like \p{Case_Folding=s}, since "(?i)s" should be the same thing as "\p{Case_Folding=s}".
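For a concrete taste of simple versus full mappings, here is a hedged sketch using str.casefold(), which arrived later in Python 3.3 (it implements Unicode full case folding, while lower() is the plain lowercase mapping):

    print("Straße".lower())                  # 'straße' -- ß survives
    print("Straße".casefold())               # 'strasse' -- full fold: ß -> ss
    print("ſ".casefold() == "s".casefold())  # True: U+017F LONG S folds to 's'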

    As far as I know, Matthew is the only one currently working on the regex support in Python. (Other developers will commit small fixes if someone proposes a patch, but no one that I've seen other than Matthew is working on the deeper issues.) If you want to help out that would be great.

Yes, I actually would. At least as I find time for it. I'm a competent C programmer and Matthew's C code is very well documented, but that's very time consuming. For bang-for-buck, I do best on test and doc work, making sure things are actually working the way they say they do.

    I was pretty surprised and disappointed by how much trouble I had with Unicode work in Python. A bit of that is learning curve, a bit of it is suboptimal defaults, but quite a bit of it is that things either don't work the way Unicode says, or because something is altogether missing. I'd like to help at least make the Python documentation clearer about what it is or is not doing in this regard.

    But be warned: one reason that Java 1.7 handles Unicode more according to the published Unicode Standard in its Character, String, and Pattern classes is because when they said they'd be supporting Unicode 6.0.0, I went through those classes and every time I found something in violation of that Standard, I filed a bug report that included a documentation patch explaining what they weren't doing right. Rather than apply my rather embarrassing doc patches, they instead fixed the code. :)

    And as far as this particular issue goes, yes the difference between the narrow and wide build has been a known issue for a long time, but has become less and less ignorable as unicode adoption has grown.

Little would make me happier than for Python to move to a logical character model the way Go and Perl treat them. I find getting bogged down by code units to be abhorrently low level, and it doesn't matter whether these are 8-bit code units like in PHP or 16-bit code units like in Java. The Nth character of a string should always be its Nth logical code point, not its Nth physical code unit.

In regular expressions, this is a clearly stated requirement in the Unicode Standard (see tr18 RL1.1 and RL1.7). However, it is more than that. The entire character processing model really really should be logical not physical. That's because you need to have whole code points not broken-up code units before you can build on the still higher-level components needed to meet user expectations. These include user-visible graphemes (like an E with both a combining acute and a combining up tack below) and linguistic collating elements (like the letter <ch> in Spanish, <dzs> in Hungarian, or <dd> in Welsh).

Martin's PEP that Matthew references is the only proposed fix that I know of. There is a GSoC project working on it, but I have no idea what the status is.

    Other possible fixes are using UTF-8 or UTF-32.

One reason I don't like that PEP is because if you are really that concerned with storage space, it is too prone to spoilers. Neither UTF-8 nor UTF-32 have any spoilers, but that PEP does. What I mean is that just one lone code point in a huge string is enough to change the representation of everything in that string. I think of these as spoilers; strings that are mostly non-spoilers with just a bit of spoiler in them are super super common. Often it's all ASCII plus just a few ISO-8859-1 or other non-ASCII Latin characters. Or it's all Latin with a couple of non-BMP mathematical alphanumerics thrown in. That kind of thing.

    Consider this mail message. It contains exactly six non-ASCII code points.

    % uniwc `mhpath cur +drafts`
    Paras    Lines    Words   Graphs    Chars    Bytes File
       79      345     2796    16899    16899    16920 /home/tchrist/Mail/drafts/1183

Because it is in UTF-8, its memory profile in bytes grows only very slightly over its character count. However, if you adopt the PEP, then you pay and pay and pay, very nearly quadrupling the memory profile for six particular characters. Now it takes 67596 bytes instead of 16920, just for the sake of six code points. Ouch!!

    Why would you want to do that? You say you are worried about memory, but then you would do this sort of thing. I just don't understand.
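For what it's worth, the effect is easy to measure on an interpreter that implements PEP 393 (Python 3.3+); the numbers below are approximate and vary by platform:

    import sys

    # One astral "spoiler" widens the representation of the whole string.
    plain   = "a" * 10000
    spoiled = "a" * 9999 + "\N{MATHEMATICAL SCRIPT CAPITAL P}"
    print(sys.getsizeof(plain))     # roughly 1 byte per character
    print(sys.getsizeof(spoiled))   # roughly 4 bytes per character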

I may be wrong here, not least because I can think of possible extenuating circumstances, but it is my impression that there is an underlying assumption in the Python community and many others that being able to access the Nth character in a string in constant time for arbitrary N is the most important of all possible considerations.

    I don't believe that makes as much sense as people think, because I don't believe character strings really are accessed in that fashion very often at all. Sure, if you have a 2D matrix of strings where a given row-column pair yields one character and you're taking the diagonal you might want that, but how often does that actually happen? Virtually never: these are strings and not matrices we're running FFTs on after all. We don't need to be able to load them into vector registers or anything the way the number-crunching people do.

    That's because strings are a sequence of characters: they're text. Whether reading text left to right, right to left, or even boustrophedonically, you're always going one past the character you're currently at. You aren't going to the Nth character forward or back for arbitrary N. That isn't how people deal with text. Sometimes they do look at the end, or a few in from the far end, but even that can be handled in other fashions.

I need to see firm use-case data justifying this overwhelming need for O(1) access to the Nth character before I will believe it. I think it is too rare to be as concerned with as so many people bizarrely appear to be. This attachment has serious consequences. It is because of this attachment that the whole narrow/wide build thing occurs, where people are willing to discard a clean, uniform processing model in search of what I do not believe a reasonable or realistic goal. Even if they *were* correct, *far* more bugs are caused by unreliability than by performance.

    I promise that nobody ever had a BMP vs non-BMP bug who was working with either UTF-8 or UTF-32. This only happens with UTF-16 and UCS-2, which have all the disadvantages of both UTF-8 and UTF-32 combined yet none of the advantages of either. It's the worst of both worlds.

    Because you're using UTF-16, you're already paying quite a bit of memory for text processing in Python compared to doing so in Go or in Perl, which are both UTF-8 languages. Since you're already used to paying extra, what's so wrong with going to purely UTF-32? That too would solve things.

    UTF-8 is not the memory pig people allege it is on Asian text. Consider:

    I saved the Tokyo Wikipedia page for each of these languages as NFC text and generated the following table comparing them. I've grouped the languages into Western Latin, Western non-Latin, and Eastern.

    Paras Lines Words Graphs Chars UTF16 UTF8 8:16 16:8 Language

     519  1525  6300  43120 43147  86296 44023   51% 196%  English
     343   728  1202   8623  8650  17302  9173   53% 189%  Welsh
     541  1722  9013  57377 57404 114810 59345   52% 193%  Spanish
     529  1712  9690  63871 63898 127798 67016   52% 191%  French
     321   837  2442  18999 19026  38054 21148   56% 180%  Hungarian
    
     202   464   976   7140  7167  14336 11848   83% 121%  Greek
     348   937  2938  21439 21467  42936 36585   85% 117%  Russian
    
     355   788   613   6439  6466  12934 13754  106%  94%  Chinese, simplified
     209   419   243   2163  2190   4382  3331   76% 132%  Chinese, traditional
     461  1127  1030  25341 25368  50738 65636  129%  77%  Japanese
     410   925  2955  13942 13969  27940 29561  106%  95%  Korean

(UTF16 and UTF8 are the size of each file in bytes under those encodings; 8:16 is the UTF-8 size as a percentage of the UTF-16 size, and 16:8 the inverse.)

    So UTF-8 isn't even too bad on Asian languages.
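This kind of comparison is easy to reproduce for any sample text (the file name below is hypothetical):

    with open("tokyo.txt", encoding="utf-8") as f:
        text = f.read()
    u8  = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-be"))   # -be so no BOM is counted
    print(len(text), u8, u16, "8:16 = {:.0%}".format(u8 / u16))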

    But you howl that it's variable width. So? You're already using a variable-width encoding in Python on narrow builds. I know you think otherwise, but I'll prove this below.

Variable width isn't as bad as people claim, partly because fixed width is not as good as they claim. Think of the kind of operations that you normally do on strings. You want to go to the next user-visible grapheme, or to the end of the current word, or go to the start of the next line. UTF-32 helps you with none of those, and UTF-8 does not hurt them. You cannot instantly go to a particular address in memory for any of those unless you build up a table of offsets as some text editors sometimes do, especially for lines. You simply have to parse it out as you go.

    Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. Perhaps someone could tell me why the Python documentation says it uses UCS-2 on a narrow build. I believe this to be an error, because otherwise I cannot explain how you can have non-BMP code points in your UTF-8 literals in your source code. And you clearly can.

    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-
    super = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢"
    print(super)

    This is with a narrow build on Darwin:

    % python3.2 -c 'import sys; print(sys.maxunicode)'
    65535
    
    % export PYTHONIOENCODING=UTF-8 PERLUNICODE=S
    
    % python3.2 supertest | uniwc
       Paras    Lines    Words   Graphs    Chars    Bytes File
           0        1        1        8        8       29 standard input
    
    % python3.2 supertest | uniquote -x
    \x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}

    Observations and conclusion:

    Yet you are telling people you are using UCS-2. Why is that? Since you are already using a variable-width encoding, why the supercilious attitude toward UTF-8? UTF-16 has the same properties but costs you a lot more. As I said before, UTF-16 puts you in the worst of all worlds.

But even with UTF-16 you *can* present to the user a view of logical characters that doesn't care about the underlying clunkish representation. The Java regex engine proves that, since "." always matches a single code point no matter whether it is in the BMP or not. Similarly, ICU's classes operate on logical characters -- code points not units -- even though they use UTF-16. The Nth code point does not care and should not care how many units it takes to get there. It is fine to have both a byte interface *and* a character interface, but I don't believe having something that falls in between those two is of any use whatsoever. And if you don't have a code point interface, you don't have a character interface.

    This is my biggest underlying complaint about Python's string model, but I believe it fixable, even if doing so exceeds my own personal background.

    --tom

    pitrou commented 13 years ago

    Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. Perhaps someone could tell me why the Python documentation says it uses UCS-2 on a narrow build.

    There's a disagreement on that point between several developers. See an example sub-thread at: http://mail.python.org/pipermail/python-dev/2010-November/105751.html

    Since you are already using a variable-width encoding, why the supercilious attitude toward UTF-8?

    I think you are reading too much into these decisions. It's simply that no-one took the time to write an alternative implementation and demonstrate its superiority. I also believe the original implementation was UCS-2 and surrogate support was added progressively during the years. Hence the terminological mess and the ad-hoc semantics.

    I agree that going with UTF-8 and a clever indexing scheme would be a better solution.

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 13 years ago

    There are occasions when you want to do string slicing, often of the form:

    pos = my_str.index(x)
    endpos = my_str.index(y)
    substring = my_str[pos : endpos]

    To me that suggests that if UTF-8 is used then it may be worth profiling to see whether caching the last 2 positions would be beneficial.
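A hedged sketch of that two-position cache (all names illustrative; this is not how CPython represents strings): a UTF-8 buffer that remembers its last two index-to-offset translations, so the index()/index()/slice pattern resumes from a nearby cached position instead of rescanning from the start.

    class CachedUtf8String:
        def __init__(self, data):
            self._data = data               # assumed well-formed UTF-8
            self._cache = [(0, 0), (0, 0)]  # (char_index, byte_offset) pairs

        def _byte_offset(self, char_index):
            # Resume from the nearest cached position at or before char_index.
            ci, bo = max([(0, 0)] +
                         [c for c in self._cache if c[0] <= char_index])
            data = self._data
            while ci < char_index and bo < len(data):
                bo += 1
                while bo < len(data) and (data[bo] & 0xC0) == 0x80:
                    bo += 1                 # skip UTF-8 continuation bytes
                ci += 1
            self._cache = [self._cache[1], (ci, bo)]  # keep 2 most recent
            return bo

        def __getitem__(self, char_index):
            start = self._byte_offset(char_index)
            end = self._byte_offset(char_index + 1)
            return self._data[start:end].decode("utf-8")

    s = CachedUtf8String("naïve café".encode("utf-8"))
    print(s[3], s[9])   # 'v' 'é' -- second lookup starts from the cache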

    pitrou commented 13 years ago

    There are occasions when you want to do string slicing, often of the form:

pos = my_str.index(x)
endpos = my_str.index(y)
substring = my_str[pos : endpos]

    To me that suggests that if UTF-8 is used then it may be worth profiling to see whether caching the last 2 positions would be beneficial.

And/or a lookup table giving the byte offset of, say, every 16th character. It gives you an O(1) lookup with a relatively reasonable constant cost (you have to scan for less than 16 characters after the lookup).

On small strings (< 256 UTF-8 bytes) the space overhead for the lookup table would be 1/16. It could also be constructed lazily whenever more than 2 positions are cached.
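And a sketch of that stride table (again illustrative, pure Python): record the byte offset of every 16th code point in a UTF-8 buffer, then index with one table lookup plus a scan of at most 15 characters.

    STRIDE = 16

    def build_index(data):
        offsets, ci = [], 0
        for bo, byte in enumerate(data):
            if (byte & 0xC0) != 0x80:       # a lead byte starts a character
                if ci % STRIDE == 0:
                    offsets.append(bo)
                ci += 1
        return offsets

    def char_to_byte(data, offsets, char_index):
        bo = offsets[char_index // STRIDE]
        for _ in range(char_index % STRIDE):   # scan < STRIDE chars forward
            bo += 1
            while bo < len(data) and (data[bo] & 0xC0) == 0x80:
                bo += 1
        return bo

    data = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢 is fun".encode("utf-8")
    offsets = build_index(data)
    bo = char_to_byte(data, offsets, 8)
    print(data[bo:bo + 2].decode("utf-8"))     # 'is'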

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Matthew Barnett <report@bugs.python.org> wrote on Sat, 13 Aug 2011 20:57:40 -0000:

    There are occasions when you want to do string slicing, often of the form:

pos = my_str.index(x)
endpos = my_str.index(y)
substring = my_str[pos : endpos]

Me, I would probably give the second call to index the first index position to guarantee the end comes after the start:

        str  = "for finding the biggest of all the strings"
        x_at = str.index("big")
        y_at = str.index("the", x_at)
        some = str[x_at:y_at]
        print("GOT", some)

But here's a serious question: is that *actually* a common usage pattern for accessing strings in Python? I ask because it wouldn't even *occur* to me to go at such a problem in that way. I would have always just written it this way instead:

        import re
        str  = "for finding the biggest of all the strings"
        some = re.search("(big.*?)the", str).group(1)
        print("GOT", some)

    I know I would use the pattern approach, just because that's how I always do such things in Perl:

    $str  = "for finding the biggest of all the strings";
    ($some) = $str =~ /(big.*?)the/;
    print "GOT $some\n";

    Which is obviously a *whole* lot simpler than the index approach:

    $str  = "for finding the biggest of all the strings";
    $x_at = index($str, "big");
    $y_at = index($str, "the", $x_at);
    $len  = $y_at - $x_at;
    $some = substr($str, $x_at, $len);
    print "GOT $some\n";

With no arithmetic and no need for temporary variables (you can't really escape needing x_at to pass to the second call to index), it's all a lot more WYSIWYG. See how much easier that is?

    Sure, it's a bit cleaner and less noisy in Perl than it is in Python by virtue of Perl's integrated pattern matching, but I would still use patterns in Python for this, not index.

    I honestly find the equivalent pattern operations a lot easier to read and write and maintain than I find the index/substring version. It's a visual thing.
    I find patterns a win in maintainability over all that busy index monkeywork.
    The index/rindex and substring approach is one I almost never ever turn to. I bet I use pattern matching 100 or 500 times for each time I use index, and maybe even more.

I happen to think in patterns. I don't expect other people to do so. But because of this, I usually end up picking patterns even if they might be a little bit slower, because I think the gain in flexibility and especially maintainability more than makes up for any minor performance concerns.

    This might also show you why patterns are so important to me: they're one of the most important tools we have for processing text. Index isn't, which is why I really don't care about whether it has O(1) access.

    To me that suggests that if UTF-8 is used then it may be worth profiling to see whether caching the last 2 positions would be beneficial.

Notice how with the pattern approach, which is inherently sequential, you don't have all that concern about running over the string more than once. Once you have the first piece (here, "big"), you proceed directly from there looking for the second piece in a straightforward, WYSIWYG way. There is no need to keep an extra index or even two around on the string structure itself, going at it this way.

    I would be pretty surprised if Perl could gain any speed by caching a pair of MRU index values against its UTF-8 [but see footnote], because again, I think the normal access pattern wouldn't make use of them. Maybe Python programmers don't think of strings the same way, though. That, I really couldn't tell you.

    But here's something to think about:

If it *is* true that you guys do all this index stuff that Perl programmers just never see or do because of our differing comfort levels with regexes, and so you think Python might still benefit from that sort of caching because its culture has promoted a different access pattern, then that caching benefit would still apply even if you were to retain the current UTF-16 representation instead of going to UTF-8 (which might want it) or to UTF-32 (which wouldn't).

    After all, you have the same variable-width caching issue with UTF-16 as with UTF-8, so if it makes sense to have an MRU cache mapping character indices to byte indices, then it doesn't matter whether you use UTF-8 or UTF-16!

    However, I'd want some passive comparative benchmarks using real programs with real data, because I would be suspicious of incurring the memory cost of two more pointers in every string in the whole program. That's serious.

    --tom

FOOTNOTE: The Perl 6 people are thinking about clever ways to set up byte offset indices. You have to do this if you want O(1) access to the Nth element for elements that are not simple code points even if you use UTF-32. That's because they want the default string element to be a user visible grapheme, not a code point. I know they have clever ideas, but I don't know how critical O(1) access truly is, nor whether it's worth the overhead this would require. But perhaps it would be extensible for other sorts of string elements, like locale-based alphabetic units (like <ch>, <dz>, <ll>) or even words, and so would prove interesting to try nonetheless.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Antoine Pitrou <report@bugs.python.org> wrote on Sat, 13 Aug 2011 21:09:52 -0000:

    And/or a lookup table giving the byte offset of, say, every 16th character. It gives you a O(1) lookup with a relatively reasonable constant cost (you have to scan for less than 16 characters after the lookup).

On small strings (< 256 UTF-8 bytes) the space overhead for the lookup table would be 1/16. It could also be constructed lazily whenever more than 2 positions are cached.

    You really should talk to the Perl 6 people to see whether their current strategy for caching offset maps for grapheme positions might be of use to you. Larry explained it to me once but I no longer recall any details.

    I notice though that they don't seem to think it worth doing for UTF-8 or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form) strings, where it would be needed even if they used UTF-32 underneath.

    --tom

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 13 years ago

    You're right about starting the second search from where the first finished. Caching the position would be an advantage there.

    The memory cost of extra pointers wouldn't be so bad if UTF-8 took less space than the current format.

    Regex isn't used as much as in Perl. BTW, the current re module was introduced in Python 1.5, the previous regex and regsub modules being removed in Python 2.5.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. Perhaps someone could tell me why the Python documentation says it uses UCS-2 on a narrow build.

    There's a disagreement on that point between several developers. See an example sub-thread at:

    http://mail.python.org/pipermail/python-dev/2010-November/105751.html

    Some of those folks know what they're talking about, and some do not.

    Most of the postings miss the mark.

    Python uses UTF-16 for its narrow builds. It does not use UCS-2.

    The argument that it must be UCS-2 because it can store lone surrogates in memory is spurious.

You have to read The Unicode Standard very *very* closely, but it is not necessary that all internal buffers always be in well-formed UTF-whatever. Otherwise it would be impossible to append a code unit at a time to a buffer. I could pull out the reference if I worked at it, because I've had to find it before. It's in there. Trust me. I know.

    It is also spurious to pretend that because you can produce illegal output when telling it to generate something in UTF-16 that it is somehow not using UTF-16. You have simply made a mistake. You have generated something that you have promised you would not generate. I have more to say about this below.

    Finally, it is spurious to argue against UTF-16 because of the code unit interface. Java does exactly the same thing as Python does *in all regards* here, and no one pretends that Java is UCS-2. Both are UTF-16.

    It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2.

    You are not. Python uses UTF-16 on narrow builds.

The ugly terrible design error is disgusting and wrong, just as much in Python as in Java, and perhaps moreso because of the idiocy of narrow builds even existing. But that doesn't make them UCS-2.

    If I could wave a magic wand, I would have Python undo its code unit blunder and go back to code points, no matter what. That means to stop talking about serialization schemes and start talking about logical code points. It means that slicing and index and length and everything only report true code points. This horrible code unit botch from narrow builds is most easily cured by moving to wide builds only.

    However, there is more.

I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken in a bunch of ways. You should be raising an exception in all kinds of places and you aren't. I can see I need to bug report this stuff too.
I don't mean to be mean about this. HONEST! It's just the way it is.

    Unicode currently reserves 66 code points as noncharacters, which it guarantees will never be in a legal UTF-anything stream. I am not talking about surrogates, either.

To start with, no code point which when bitwise ANDed with 0xFFFE yields 0xFFFE can ever appear in a valid UTF-* stream, but Python allows these through without any error.

    That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all planes, where NN is 00 through 10 in hex. So that's 2 noncharacters times 17 planes = 34 code points illegal for interchange that Python is passing through illegally.

    The remaining 32 nonsurrogate code points illegal for open interchange are 0xFDD0 through 0xFDEF. Those are not allowed either, but Python doesn't seem to care.

    You simply cannot say you are generating UTF-8 and then generate a byte sequence that UTF-8 guarantees can never occur. This is a violation.
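A tiny checker for these rules is easy to write; a hedged sketch (nothing like this exists in the stdlib):

    def is_noncharacter(cp):
        """True for the 66 code points Unicode reserves as noncharacters."""
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    def encode_utf8_strict(s):
        """Refuse noncharacters; Python 3 already refuses lone surrogates."""
        for i, ch in enumerate(s):
            if is_noncharacter(ord(ch)):
                raise ValueError("noncharacter U+%04X at %d" % (ord(ch), i))
        return s.encode("utf-8")

    print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
    print("\uFFFE".encode("utf-8"))   # b'\xef\xbf\xbe' -- passes unchecked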

    ***SIGH***

    --tom

    ezio-melotti commented 13 years ago

    It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2.

If you are referring to the value returned by len(unicode_string), it is the number of code units. This is a matter of "practicality beats purity". Returning the number of code units is O(1) (num_of_bytes/2). To calculate the number of characters it's instead necessary to scan the whole string looking for surrogates and then count any surrogate pair as 1 character. It was therefore decided that it was not worth slowing down the common case just to be 100% accurate in the "uncommon" case.

    That said it would be nice to have an API (maybe in unicodedata or as new str methods?) able to return the number of code units, code points, graphemes, etc, but I'm not sure that it should be the default behavior of len().
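A hedged sketch of what such helpers might look like (illustrative only, not a proposed API):

    import sys

    def code_unit_len(s):
        """UTF-16 code units: what len() returns on a narrow build."""
        return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)

    def code_point_len(s):
        """Logical code points: what len() returns on a wide build."""
        if sys.maxunicode > 0xFFFF:
            return len(s)
        # Narrow build: count each surrogate pair as one code point.
        return len(s) - sum(1 for a, b in zip(s, s[1:])
                            if '\ud800' <= a <= '\udbff'
                            and '\udc00' <= b <= '\udfff')

    print(code_unit_len("a\U0001D4AB"))    # 3
    print(code_point_len("a\U0001D4AB"))   # 2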

The ugly terrible design error is disgusting and wrong, just as much in Python as in Java, and perhaps moreso because of the idiocy of narrow builds even existing.

Again, wide builds use twice as much space as narrow ones, but on the other hand you can have fast and correct behavior with e.g. len(). If people don't care about/don't need to use non-BMP chars and would rather use less space, they can do so. Until we agree that the difference in space used/speed is no longer relevant and/or that non-BMP characters become common enough to prefer the "correct behavior" over the "fast-but-inaccurate" one, we will probably keep both.

I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken in a bunch of ways. You should be raising an exception in all kinds of places and you aren't.

    I am aware of some problems of the UTF-8 codec on Python 2.  It used to follow RFC 2279 until last year and now it's been updated to follow RFC 3629.
    However, for backward compatibility, it still encodes/decodes surrogate pairs.  This broken behavior has been kept because on Python 2, you can encode every code point with UTF-8, and decode it back without errors:
    >>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
    >>>
and breaking this invariant would probably do more harm than good.  I proposed to add a "real" utf-8 codec on Python 2, but no one seems to care enough about it.
    
    Also note that this is fixed in Python3:
    >>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

I can see I need to bug report this stuff too.

    If you find other places where it's broken (both on Python 2 and/or Python 3), please do and feel free to add me to the nosy. If you can also provide a failing test case and/or point to the relevant parts of the Unicode standard, it would be great.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2.

If you are referring to the value returned by len(unicode_string), it is the number of code units. This is a matter of "practicality beats purity". Returning the number of code units is O(1) (num_of_bytes/2). To calculate the number of characters it's instead necessary to scan the whole string looking for surrogates and then count any surrogate pair as 1 character. It was therefore decided that it was not worth slowing down the common case just to be 100% accurate in the "uncommon" case.

    If speed is more important than correctness, I can make any algorithm infinitely fast. Given the choice between correct and quick, I will take correct every single time.

Plus your strings are immutable! You know how long they are and they never change. Correctness comes at a negligible cost.

    It was a bad choice to return the wrong answer.

    That said it would be nice to have an API (maybe in unicodedata or as new str methods?) able to return the number of code units, code points, graphemes, etc, but I'm not sure that it should be the default behavior of len().

Always code points, never code units. I even use a class whose length method returns the grapheme count, because even code points aren't good enough. Yes of course graphemes have to be counted. Big deal. How would you like it if you said to move three to the left in vim and it *didn't* count each grapheme as one position? Madness.
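Such a class is straightforward to sketch on top of Matthew's third-party regex module, whose \X escape matches one extended grapheme cluster (assuming regex is installed; re has no \X):

    import regex

    class GraphemeString(str):
        def __len__(self):
            return len(regex.findall(r'\X', self))

    s = GraphemeString("e\u0301\u031d")   # e + combining acute + up tack
    print(str.__len__(s), len(s))         # 3 code points, 1 grapheme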

The ugly terrible design error is disgusting and wrong, just as much in Python as in Java, and perhaps moreso because of the idiocy of narrow builds even existing.

Again, wide builds use twice as much space as narrow ones, but on the other hand you can have fast and correct behavior with e.g. len(). If people don't care about/don't need to use non-BMP chars and would rather use less space, they can do so. Until we agree that the difference in space used/speed is no longer relevant and/or that non-BMP characters become common enough to prefer the "correct behavior" over the "fast-but-inaccurate" one, we will probably keep both.

    Which is why I always put loud warnings in my Unicode-related Python programs that they do not work right on Unicode if running under a narrow build. I almost feel I should just exit.

I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken in a bunch of ways. You should be raising an exception in all kinds of places and you aren't.

    I am aware of some problems of the UTF-8 codec on Python 2. It used to follow RFC 2279 until last year and now it's been updated to follow RFC 3629.

    Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever.

Perl has an encoding form, which it does not call "UTF-8", that you can use the UTF-8 algorithm on for any code point, including noncharacters and surrogates and even non-Unicode code points far above 0x10_FFFF, up to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines. It's the internal format we use in memory. But we don't call it real UTF-8, either.

    It sounds like this is the kind of thing that would be useful to you.

    However, for backward compatibility, it still encodes/decodes surrogate pairs. This broken behavior has been kept because on Python 2, you can encode every code point with UTF-8, and decode it back without errors:

    No, that's not UTF-8 then. By definition. See the Unicode Standard.

>>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
>>>

and breaking this invariant would probably do more harm than good.

    Why? Create something called utf8-extended or utf8-lax or utf8-nonstrict or something. But you really can't call it UTF-8 and do that.

    We actually equate "UTF-8" and "utf8-strict". Our internal extended UTF-8 is something else. It seems like you're still doing the old relaxed version we used to have until 2003 or so. It seems useful to be able to have both flavors, the strict and the relaxed one, and to call them different things.

Perl defaults to the relaxed one, which gives warnings not exceptions, if you do things like setting PERLUNICODE to S or SD and such for the default I/O encoding. If you actually use "UTF-8" as the encoding on the stream, though, you get the version that gives exceptions instead.

    "UTF-8" = "utf8-strict"     strictly by the standard, raises exceptions otherwise
    "utf8"          loosely only, emits warnings on encoding illegal things

We currently only emit warnings or raise exceptions on I/O, not on chr operations and such. We used to raise exceptions on things like chr(0xD800), but that was a mistake caused by misunderstanding the in-memory requirements being different from stream requirements. It's really really subtle and you have to read the standard very closely to realize this.

    So you are perfectly free to use chr(0x20FFFF) in your own code. This is really useful for out-of-band sentinels and such. However, if you try to send it out a loose utf8 stream, you get a mandatory warning, and if you try to send it out a strict UTF-8 stream, you get an exception.

    In fact, if you remember the old "migrate ASCII trick" from the tr program, doing something like this to turn on the high bit to mark characters in some way:

    tr[\0-\x7F][\x80-\xFF]
    
        (that's what killed WordStar BTW, as they used that trick
         on their ASCII internally so couldn't port to 8-bit
         encodings.  Ooops.)

    Given the full 32- or 64-bit (or higher) character range for internal use, you can actually do this as a sort of corresponding transform:

    tr[\0-\x{10_FFFF}][\x{20_0000}-\x{3F_FFFF}]

    Just don't try to output it. :) For internal use only. Blah blah.

    (Hm, I just realized you couldn't ever do that sort of thing at all on a narrow build because you're stuck with UTF-16. On a wide build, though, you could, because you'd have UTF-32. Not that I'm suggesting it!!!)

Yes, that's not necessarily the best way to do most of what one might naively try using it for, but there are all kinds of interesting things you can do when your characters' internal code points don't have the same upper bound as Unicode.

It's taken us years to unravel all this Unicode stuff so it's usable. We used to do a lot of really, um, unfortunate things, whether too many errors or too few. I'm certainly not suggesting you go down those roads. In some ways, Python's Unicode support reminds me of ours from rather a long time ago.

    We've worked pretty hard at Unicode in Perl for the last few years, although even ten years ago we already supported \X, all regex properties, and full casemapping and full casefolding. So there's always been a strong Unicode sensitivity in Perl. It's just taken us a long long long long time to get all the kinks out.

    I don't imagine most of the Python devel team knows Perl very well, and maybe not even Java or ICU. So I get the idea that there isn't as much awareness of Unicode in your team as there tends to be in those others. From my point of view, learning from other people's mistakes is a way to get ahead without incurring all the learning-bumps oneself, so if there's a way to do that for you, that could be to your benefit, and I'm very happy to share some of our blunders so you can avoid them yourselves.

    I proposed to add a "real" utf-8 codec on Python 2, but no one seems to care enough about it.

    Hm. See previous paragraph. :)

Also note that this is fixed in Python3:

>>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

    Yes, I've noticed that Python3 is better about some of this, but it doesn't detect the 66 noncharacter code points.

    I haven't checked on decoding yet, but I bet it's the same.

    I think having something that does the lax Python2 way and something else that does the stricter Standard way makes the most sense.
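Python 3 does end up exposing exactly this strict/lax split, per call, through codec error handlers:

    lone = "\ud800"
    try:
        lone.encode("utf-8")              # strict: refuses lone surrogates
    except UnicodeEncodeError as e:
        print("strict:", e.reason)        # surrogates not allowed
    print(lone.encode("utf-8", "surrogatepass"))   # lax: b'\xed\xa0\x80'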

I can see I need to bug report this stuff too.

    If you find other places where it's broken (both on Python 2 and/or Python 3), please do and feel free to add me to the nosy. If you can also provide a failing test case and/or point to the relevant parts of the Unicode standard, it would be great.

    I'll wait to report it till I have all my references at ready.

    I can probably pretty easily find the part of the Unicode Standard where it says no UTF can contain code points that are illegal for interchange. Finding the part that explains that/why you can and indeed must be able to have them internally is going to be harder, but I know it's there.

    Also, there is a tr18 update that adds a bit of clarification about how it is sometimes ok to allow a regex engine in a UTF-16 language to find unpaired surrogates, like checking whether "foo\x{D800}bar" matches the pattern /\p{Cs}/. You could never have that string read in from a valid UTF-{8,16,32} stream, but because it can happen in your program, you have to be able to match it. So they finally admit this in the next tr18 update. But it's still a bit odd, eh? (And no, that doesn't make it UCS-2! :)

    --tom

    ezio-melotti commented 13 years ago

    If speed is more important than correctness, I can make any algorithm infinitely fast. Given the choice between correct and quick, I will take correct every single time.

    It's a trade-off. Using non-BMP chars is fairly unusual (many real-world applications hardly use non-ASCII chars). Slowing everything down just to allow non-BMP chars on narrow builds is not a good idea IMHO. Wide builds can be used if one really wants len() and other methods to work properly with non-BMP chars.

Plus your strings are immutable! You know how long they are and they never change. Correctness comes at a negligible cost.

    Sure, we can cache the len, but we still have to compute it at least once. Also it's not just len(), but many other operations like slicing that are affected.

    Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever.

    The UTF-8 codec described by RFC 2279 didn't say so, so, since our codec was following RFC 2279, it was producing valid UTF-8. With RFC 3629 a number of things changed in a non-backward compatible way. Therefore we couldn't just change the behavior of the UTF-8 codec nor rename it to something else in Python 2. We had to wait till Python 3 in order to fix it.

    Perl has an encoding form, which it does not call "UTF-8", that you can use the UTF-8 algorithm on for any code point, include non-characters and surrogates and even non-Unicode code points far above 0x10_FFFF, up to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines. It's the internal format we use in memory. But we don't call it real UTF-8, either.

This sounds like RFC 2279 UTF-8. It allowed up to 6 bytes (following the same encoding scheme) and had no restrictions about surrogates (at the time I think only BMP chars existed, so there were no surrogates and the Unicode consortium hadn't yet decided that the limit was 0x10FFFF).

    It sounds like this is the kind of thing that would be useful to you.

    I believe this is what the surrogateescape error handler does (up to 0x10FFFF).
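Concretely, surrogateescape smuggles undecodable bytes through str as lone low surrogates and restores them on the way back out:

    raw = b"caf\xe9"                      # Latin-1 bytes, not valid UTF-8
    s = raw.decode("utf-8", "surrogateescape")
    print(repr(s))                        # 'caf\udce9'
    print(s.encode("utf-8", "surrogateescape") == raw)   # True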

    Why? Create something called utf8-extended or utf8-lax or utf8-nonstrict or something. But you really can't call it UTF-8 and do that.

That's what we did in Python 3, but on Python 2 it's too late to fix it, especially in a point release. (Just to clarify, I don't think any of these things will be fixed in 2.7. There won't be any 2.8, and major changes (especially backwards-incompatible ones) are unlikely to happen in a point release (e.g. 2.7.3), so it's better to focus on Python 3. Minor bug fixes can still be done even in 2.7 though.)

    Perl defaults to the relaxed one, which gives warnings not exceptions, if you do things like setting PERLUNICODE to S or SD and such for the default I/I encoding. If you actually use "UTF-8" as the encoding on the stream, though, you get the version that gives exceptions instead.

In Python we don't usually use warnings for this kind of thing (also we don't have things like "use strict").

    I don't imagine most of the Python devel team knows Perl very well, and maybe not even Java or ICU. So I get the idea that there isn't as much awareness of Unicode in your team as there tends to be in those others.

I would say there are at least 5-10 Unicode "experts" in our team. It might be true though that we don't always follow closely what other languages and the Unicode consortium do, but if people report problems we are willing to fix them (so thanks for reporting them!).

    From my point of view, learning from other people's mistakes is a way to get ahead without incurring all the learning-bumps oneself, so if there's a way to do that for you, that could be to your benefit, and I'm very happy to share some of our blunders so you can avoid them yourselves.

    While I really appreciate the fact that you are sharing with us your experience, the solution found and applied in Perl might not always be the best one for Python (but it's still good to learn from others' mistakes). For example I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things. Also regular expressions are not part of the core and are not used that often, so I consider problems with narrow/wide builds, codecs and the unicode type much more important than problems with the re/regex module (they should be fixed too, but have lower priority IMHO).

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Ezio Melotti <report@bugs.python.org> wrote on Sun, 14 Aug 2011 07:15:09 -0000:

Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever.

    The UTF-8 codec described by RFC 2279 didn't say so, so, since our codec was following RFC 2279, it was producing valid UTF-8. With RFC 3629 a number of things changed in a non-backward compatible way. Therefore we couldn't just change the behavior of the UTF-8 codec nor rename it to something else in Python 2. We had to wait till Python 3 in order to fix it.

    I'm a bit confused on this. You no longer fix bugs in Python 2?

    I've dug out the references that state that you are not allowed to do things the way you are doing them. This is from the published Unicode Standard version 6.0.0, chapter 3, Conformance. It is a very important chapter.

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

    Python is in violation of that published Standard by interpreting noncharacter code points as abstract characters and tolerating them in character encoding forms like UTF-8 or UTF-16. This explains that conformant processes are forbidden from doing this.

    Code Points Unassigned to Abstract Characters
    
     C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point
         as an abstract character.
       · The high-surrogate and low-surrogate code points are designated for surrogate
         code units in the UTF-16 character encoding form. They are unassigned to any
         abstract character.

==> C2 A process shall not interpret a noncharacter code point as an abstract character.
       · The noncharacter code points may be used internally, such as for sentinel
         values or delimiters, but should not be exchanged publicly.

     C3 A process shall not interpret an unassigned code point as an abstract character.
       · This clause does not preclude the assignment of certain generic semantics to
         unassigned code points (for example, rendering with a glyph to indicate the
         position within a character block) that allow for graceful behavior in the pres-
         ence of code points that are outside a supported subset.
       · Unassigned code points may have default property values. (See D26.)
       · Code points whose use has not yet been designated may be assigned to abstract
         characters in future versions of the standard. Because of this fact, due care in
         the handling of generic semantics for such code points is likely to provide better
         robustness for implementations that may encounter data based on future ver-
         sions of the standard.

Next we have exactly how something you call UTF-{8,16,32} must be formed. *This* is the Standard against which these things are measured; it is not the RFC.

    You are of course perfectly free to say you conform to this and that RFC, but you must not say you conform to the Unicode Standard when you don't. These are different things. I feel it does users a grave disservice to ignore the Unicode Standard in this, and sheer casuistry to rely on an RFC definition while ignoring the Unicode Standard whence it originated, because this borders on being intentionally misleading.

    Character Encoding Forms
    
     C8 When a process interprets a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall interpret that code unit sequence according to the corre-
         sponding code point sequence.

==>    · The specification of the code unit sequences for UTF-8 is given in D92.
       · The specification of the code unit sequences for UTF-16 is given in D91.
       · The specification of the code unit sequences for UTF-32 is given in D90.

     C9 When a process generates a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall not emit ill-formed code unit sequences.
       · The definition of each Unicode character encoding form specifies the ill-
         formed code unit sequences in the character encoding form. For example, the
         definition of UTF-8 (D92) specifies that code unit sequences such as <C0 AF>
         are ill-formed.

==> C10 When a process interprets a code unit sequence which purports to be in a Unicode
         character encoding form, it shall treat ill-formed code unit sequences as an error
         condition and shall not interpret such sequences as characters.
       · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
         by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
         is ill-formed and must never be generated. When faced with this ill-formed
         code unit sequence while transforming or interpreting text, a conformant
         process must treat the first code unit 110xxxxx₂ as an illegally terminated
         code unit sequence--for example, by signaling an error, filtering the code
         unit out, or representing the code unit with a marker such as U+FFFD
         replacement character.
       · Conformant processes cannot interpret ill-formed code unit sequences. However,
         the conformance clauses do not prevent processes from operating on code unit
         sequences that do not purport to be in a Unicode character encoding form. For
         example, for performance reasons a low-level string operation may simply
         operate directly on code units, without interpreting them as characters. See,
         especially, the discussion under D89.
       · Utility programs are not prevented from operating on "mangled" text. For
         example, a UTF-8 file could have had CRLF sequences introduced at every 80
         bytes by a bad mailer program. This could result in some UTF-8 byte sequences
         being interrupted by CRLFs, producing illegal byte sequences. This mangled
         text is no longer UTF-8. It is permissible for a conformant program to repair
         such text, recognizing that the mangled text was originally well-formed UTF-8
         byte sequences. However, such repair of mangled data is a special case, and it
         must not be used in circumstances where it would cause security problems.
         There are important security issues associated with encoding conversion,
         especially with the conversion of malformed text. For more information, see
         Unicode Technical Report #36, "Unicode Security Considerations."
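For the <C0 AF> example specifically, CPython's decoder does behave as C10 requires, rejecting the overlong (non-shortest-form) encoding of '/':

    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)    # 'utf-8' codec can't decode byte 0xc0 in position 0 ...
    print(b"\xc0\xaf".decode("utf-8", "replace"))   # two U+FFFD replacements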

    Here is the part that explains why Python narrow builds are actually UTF-16 not UCS-2, and why its documentation needs to be updated:

    D89 In a Unicode encoding form: A Unicode string is said to be in a particular Unicode
           encoding form if and only if it consists of a well-formed Unicode code unit sequence
           of that Unicode encoding form.
        · A Unicode string consisting of a well-formed UTF-8 code unit sequence is said
           to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a
           UTF-8 string for short.
        · A Unicode string consisting of a well-formed UTF-16 code unit sequence is said
           to be in UTF-16. Such a Unicode string is referred to as a valid UTF-16 string,
           or a UTF-16 string for short.
        · A Unicode string consisting of a well-formed UTF-32 code unit sequence is said
           to be in UTF-32. Such a Unicode string is referred to as a valid UTF-32 string,
           or a UTF-32 string for short.

    ==> Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

        · For example, it is perfectly reasonable to talk about an operation that takes the
           two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which
           contains an ill-formed UTF-16 code unit sequence, and concatenates them to
           form another Unicode string <004D D800 DF02 004D>, which contains a well-
           formed UTF-16 code unit sequence. The first two Unicode strings are not in
           UTF-16, but the resultant Unicode string is.
    
    [...]
    
     D14 Noncharacter: A code point that is permanently reserved for internal use and that
           should never be interchanged. Noncharacters consist of the values U+nFFFE and
           U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
         · For more information, see Section 16.7, Noncharacters.
         · These code points are permanently reserved as noncharacters.
    
     D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
           future assignment. Also known as an unassigned code point.
         · Surrogate code points and noncharacters are considered assigned code points,
           but not assigned characters.
         · For a summary classification of reserved and other types of code points, see
           Table 2-3.
    
    In general, a conforming process may indicate the presence of a code point whose use has
    not been designated (for example, by showing a missing glyph in rendering or by signaling
    an appropriate error in a streaming protocol), even though it is forbidden by the standard
    from interpreting that code point as an abstract character.

    Here's how I read all that.

    The noncharacters and the unpaired surrogates are illegal for interchange, and their presence in a UTF means that that UTF is not conformant to the requirements for what a UTF shall contain. Nonetheless, internally it is necessary that all code points, even noncharacter code points and surrogates, be representable, and doing so does not mean that you are no longer in that encoding form. However, you must not allow such things into a UTF stream, because doing so means that that stream is no longer a UTF stream.

    That's why I say that you are out of conformance by having encoders and decoders of UTF streams tolerate noncharacters. You are not allowed to call something a UTF and do non-UTF things with it, because this is in violation of conformance requirement C2. Therefore you must either (1) change what you call the thing you are doing the nonconforming thing to, or (2) change it to no longer do the nonconforming thing. If you do neither, then Python no longer conforms to the formal requirements for handling such things as these are defined by the Unicode Standard, and therefore that version of Python is no longer conformant to the version of the Unicode Standard that it purports conformance to. And yes, that's a long way of saying it's lying.

    It's also why having noncharacters, including surrogates, in memory does *not* suddenly mean that they are not stored in a UTF, because you have to be able to do that to build up buffers, per the concatenation example under D89. Therefore, Python uses UTF-16 internally and should not say it uses UCS-2, because that is inaccurate and incorrect; in short, it's wrong. That doesn't help anybody.

    At least, that's how I read the Unicode Standard. Perhaps a more careful reading than mine would admit alternate interpretations. If you have not reread Chapter 3 in its entirety of late, you probably want to do so. There is quite a bit of material there that is fundamental to any process that claims to be conformant with the Unicode Standard.

    I hope that makes sense. These things can be extremely difficult to read, for they are subtle and quick to anger. :)

    --tom

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Ezio Melotti <report@bugs.python.org> wrote on Sun, 14 Aug 2011 07:15:09 -0000:

    For example I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things.

    I agree entirely. That's why I appended a triple exclamation point to where I said I certainly do not expect this. It can only work fully on UTF-8ish systems and up to 32 bits on UTF-32, and it is most emphatically *not* Unicode. Yes, there are things you can do with it, but it risks serious misunderstanding and even nonconformance if not done very carefully. The Standard does not forbid such things internally, but you are not allowed to pass them around in noninternal streams claiming they are real UTF streams.

    Also regular expressions are not part of the core and are not used that often, so I consider problems with narrow/wide builds, codecs and the unicode type much more important than problems with the re/regex module (they should be fixed too, but have lower priority IMHO).

    One advantage of having an external library is the ability to update it asynchronously. Another is the possibility to swap it out altogether. Perl only gained that ability, which Python has always had, some four years ago with its 5.10 release. To my knowledge, the only thing people tend to use this for is to get Russ Cox's re2 library, which has very different performance characteristics and guarantees that allow it to be used in potential starvation denial-of-service situations that the normal Perl, Python, Java, etc regex engine cannot be safely used for.

    -tom

    pitrou commented 13 years ago

    > The UTF-8 codec described by RFC 2279 didn't say so, so, since our
    > codec was following RFC 2279, it was producing valid UTF-8. With RFC
    > 3629 a number of things changed in a non-backward compatible way.
    > Therefore we couldn't just change the behavior of the UTF-8 codec nor
    > rename it to something else in Python 2. We had to wait till Python 3
    > in order to fix it.

    I'm a bit confused on this. You no longer fix bugs in Python 2?

    In general, we try not to introduce changes that have a high probability of breaking existing code, especially when what is being "fixed" is a minor issue which almost nobody complains about.

    This is even truer for stable branches, and Python 2 is very much a stable branch now (no more feature releases after 2.7).

    That's why I say that you are out of conformance by having encoders and decoders of UTF streams tolerate noncharacters. You are not allowed to call something a UTF and do non-UTF things with it, because this is in violation of conformance requirement C2.

    Perhaps, but it is not Python's fault if the IETF and the Unicode consortium have disagreed on what UTF-8 should be. I'm not sure what people called "UTF-8" when support for it was first introduced in Python, but you can't blame us for maintaining a consistent behaviour across releases.

    ezio-melotti commented 13 years ago

    I'm a bit confused on this. You no longer fix bugs in Python 2?

    We do, but it's unlikely that we will introduce major changes in behavior. Even if we had to get rid of narrow builds and/or fix len(), we would probably only do it in the next 3.x version (i.e. 3.3), and not in the next bug fix release of 3.2 (i.e. 3.2.2).

    That's why I say that you are out of conformance by having encoders and decoders of UTF streams tolerate noncharacters. You are not allowed to call something a UTF and do non-UTF things with it, because this is in violation of conformance requirement C2.

    This IMHO should be fixed, but it's another issue.

    If you have not reread its Chapter 3 of late in its entirety, you probably want to do so. There is quite a bit of material there that is fundamental to any process that claims to be conformant with the Unicode Standard.

    I am familiar with Chapter 3, but admittedly I only read the parts that were relevant to the bugs I was fixing. I never went through it checking that everything in Python matches the described behavior. Thanks for pointing out the parts where Python doesn't follow the specs.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Ezio Melotti <report@bugs.python.org> wrote on Sun, 14 Aug 2011 17:46:55 -0000:

    > I'm a bit confused on this. You no longer fix bugs in Python 2?

    We do, but it's unlikely that we will introduce major changes in behavior.

    Even if we had to get rid of narrow builds and/or fix len(), we would probably only do it in the next 3.x version (i.e. 3.3), and not in the next bug fix release of 3.2 (i.e. 3.2.2).

    Antoine Pitrou <report@bugs.python.org> wrote on Sun, 14 Aug 2011 17:36:42 -0000:

    This is even truer for stable branches, and Python 2 is very much a stable branch now (no more feature releases after 2.7).

    Does that mean you now go to 2.7.1, 2.7.2, etc?

    I had thought that 2.6 was going to be the last, but then 2.7 came out. I think I remember Guido said something about there never being a 2.10, so I wasn't too surprised to see 2.7.

    --tom

    ezio-melotti commented 13 years ago

    2.7 is the last 2.x. There won't be any 2.8 (also I never heard that 2.6 was supposed to be the last). We already have 2.7.2, and we will continue with 2.7.3, 2.7.4, etc for a few more years. Eventually 2.7 will only get security fixes and the development will be focused on 3.x only.

    terryjreedy commented 13 years ago

    Tom, I appreciate your taking the time to help us improve our Unicode story. I agree that the compromises made a decade ago need to be revisited and revised.

    I think it will help if you better understand our development process. Our current *intent* is that 'Python x.y' be a fixed language and that 'CPython x.y.0', '.1', '.2', etc be increasingly (and strictly -- no regressions) better implementations of Python x.y. (Of course, the distribution and installation names and up-to-now dropping of '.0' may confuse the distinction, but it is real.) As a consequence, correct Python x.y code that runs correctly on the CPython x.y.z implementation should run correctly on x.y.(z+1).

    For the purpose of this tracker, a behavior issue ('bug') is a discrepancy between the documented intent of a supported Python x.y and the behavior of the most recent CPython x.y.z implementation thereof. A feature request is a design issue, a request for a change in the language definition (and in the corresponding .0 implementation). Most people (including you, obviously) that file feature requests regard them as addressing design bugs. But still, language definition bugs are different from implementation bugs.

    Of course, this all assumes that the documents are correct and unambiguous. But accomplishing that can be as difficult as correct code. Obvious mistakes are quickly corrected. Ambiguities in relation to uncontroversial behavior are resolved by more exactly specifying the actual behavior. But ambiguities about behavior that some consider wrong, are messy. We can consult the original author if available, consult relevant tests if present, take votes, but some fairly arbitrary decision may be needed. A typical response may be to clarify behavior in the docs for the current x.y release and consider behavior changes for the next x.(y+1) release.

    So the answer to your question, "Do we fix bugs?", is that we fix doc and implementation behavior bugs in the next micro x.y.z behavior bug-fix release and language design bugs in the next minor x.y language release. But note that language changes merely have to be improvements for Python in the future without necessarily worrying about whether a design decision made years ago was or is a 'bug'.

    The purpose of me discussing or questioning the 'type' of some of your issues is to *facilitate* change by getting the issue on the right track, in relation to our development process, as soon as possible.

    terryjreedy commented 13 years ago

    This is off-topic, but there was discussion on whether or not to have a 2.7. The decision was to focus on back-porting things that would make the eventual transition to 3.x easier.

    terryjreedy commented 13 years ago

    Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ in the thread (and many or most others disagree), pretty well explains his viewpoint. The positive side is that we deliver more than we promise. The negative side is that by not promising what perhaps we should, we allow ourselves not to deliver what perhaps we should.

    *While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it so for the future, for revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints. Here is my current idea:

    If the code unit stream contains any non-BMP characters (ie, surrogate pair of 16-bit code units), construct a sequence of *indexes* of such characters (pairs). The fixed length of the string in codepoints is n-k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence and the number of pairs. For indexing, look up the character index in the list of indexes by binary search and increment the codepoint index by the index of the index found to get the corresponding code unit index. (I have omitted the details needed to avoid off-by-1 errors.)

    This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in Python.
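    A bare-bones sketch of that lookup (toy names, ints standing in for 16-bit code units; the fixed-point loop below is one way to handle the off-by-1 details I omitted):

    import bisect

    class UTF16View:
        # Sketch only: wrap a list of 16-bit code units and index it
        # by codepoint instead of by code unit.
        def __init__(self, units):
            self.units = list(units)
            # auxiliary sequence: code-unit index of each pair's lead surrogate
            self.pairs = [i for i, u in enumerate(self.units)
                          if 0xD800 <= u <= 0xDBFF]

        def __len__(self):
            return len(self.units) - len(self.pairs)   # n - k codepoints

        def _unit_index(self, i):
            # The code-unit index j satisfies j = i + (pairs starting
            # before j); find that fixed point, starting from j = i.
            j = i
            while True:
                j2 = i + bisect.bisect_left(self.pairs, j)
                if j2 == j:
                    return j
                j = j2

        def __getitem__(self, i):
            if not 0 <= i < len(self):
                raise IndexError(i)
            j = self._unit_index(i)
            u = self.units[j]
            if 0xD800 <= u <= 0xDBFF:                  # lead of a pair
                trail = self.units[j + 1]
                return chr(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00))
            return chr(u)

    # 'A', U+1043C (one surrogate pair), 'B'  ->  3 codepoints
    v = UTF16View([0x41, 0xD801, 0xDC3C, 0x42])
    assert len(v) == 3 and v[1] == '\U0001043c' and v[2] == 'B'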

    PS: The OSCON link in msg142036 currently gives me 404 not found

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 13 years ago

    Have a look here: http://98.245.80.27/tcpc/OSCON2011/gbu/index.html

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    "Terry J. Reedy" \report@bugs.python.org\ wrote on Mon, 15 Aug 2011 00:26:53 -0000:

    PS: The OSCON link in msg142036 currently gives me 404 not found

    Sorry, I wrote

     http://training.perl.com/OSCON/index.html

    but meant

     http://training.perl.com/OSCON2011/index.html

    I'll fix it on the server in a short spell.

    I am trying to keep the document up to date as I learn more, so it isn't precisely the talk I gave in Portland.

    Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

    So I'm finding. Perhaps that's why I keep getting confused. I do have a pretty firm notion of what UCS-2 and UTF-16 are, and so I sometimes get self-contradictory results. Can you think of anywhere that Python acts like UCS-2 and not UTF-16? I'm not sure I have found one, although the regex thing might count.

    Thank you guys for being so helpful and understanding.

    They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints.

    That's what Java did, too, and for the same reason. Because they had a UCS-2 implementation for Unicode 1.1 so when Unicode 2.0 came out and they learned that they would need more than 16 bits, they piggybacked UTF-16 onto the top of it instead of going for UTF-8 or UTF-32, and they're still paying that price, and to my mind, heavily and continually.

    Do you use Java? It is very like Python in many of its 16-bit character issues. Most of the length and indexing type functions address things by code unit only, not codepoint. But they would never claim to be UCS-2.

    Oh, I realize why they did it. For one thing, they had bytecode out there that they had to support. For another, they had some pretty low-level APIs that didn't have enough flexibility of abstraction, so old source had to keep working as before, even though this penalized the future. Forever, kinda.

    While I wish they had done better, and kinda think they could have, it isn't my place to say. I wasn't there (well, not paying attention) when this was all happening, because I was so underwhelmed by how annoyingly overhyped it was. A billion dollars of marketing can't be wrong, you know? I know that smart people looked at it, seriously. I just find the cure they devised to be more in the problem set than the solution set.

    I like how Python works on wide builds, especially with Python3. I was pretty surprised that the symbolic names weren't working right on the earlier version of the 2.6 wide build I tried them on.

    I now have both wide and narrow builds installed of both 2.7 and 3.2, so that shouldn't happen again.

    They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ in the thread (and many or most others disagree), pretty well explains his viewpoint.

    Count me as one of those many/most others who disagree. :)

    The positive side is that we deliver more than we promise. The negative side is that by not promising what perhaps we should, we allow ourselves not to deliver what perhaps we should.

    It is always better to deliver more than you say than to deliver less.

    • While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it so for the future, for revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints.

    Yes, I think so, too. If you look at the growth curve of UTF-8 alone, it has followed a mathematically exponential growth curve in the first decade of this century. I suspect that will turn into an S curve with asymptotic shoulders any time now. I haven't looked at it lately, so maybe it already has. I know that the huge corpora I work with at work are all absolutely 100% Unicode now. Thank XML for that.

    Here is my current idea:

    If the code unit stream contains any non-BMP characters (ie, surrogate pair of 16-bit code units), construct a sequence of *indexes* of such characters (pairs). The fixed length of the string in codepoints is n-k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence and the number of pairs. For indexing, look up the character index in the list of indexes by binary search and increment the codepoint index by the index of the index found to get the corresponding code unit index. (I have omitted the details needed to avoid off-by-1 errors.)

    This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in Python.

    You are a brave man, and good. Bravo!

    It may be that that was the sort of thing that Larry was talking to me about 6-8 months ago regarding how to construct a better way to access strings by grapheme index.

    Everyone always talks about how important they're sure O(1) access must be, and how they therefore absolutely have to have it no matter the tradeoffs. But without two implementations to compare real-world access patterns against, one really can't know.
    I know that index/rindex and substring operations are very rare in how I myself process strings, but I have seen how Python people turn to those all the time when I would reflexively use a pattern match. So usage patterns may vary; hence the desire for real comparisons. I'm perfectly willing to be convinced, but I want to see real data.

    If I get time, I'll check into whether the Perl 6 people have any real data about that. I had thought that Parrot was currently using ICU4C for its string handling, which may mean they're afflicted by UTF-16, something I wouldn't think they would tolerate, especially since they need code points above 0x10_FFFF for their Normalization Form G (Grapheme). Piggybacking that on UTF-16 would require stealing some Private Use code point to act as multilevel surrogates so that UTF-16 is infinitely extensible the way UTF-8 is. Not sure what I think about that, but it's been mentioned as a loophole escape for when Unicode has to renege on its 21-bit promise. I sure hope everyone has stopped using UTF-16 by then myself. It's trouble enough right now.

    Hm, now that I think about it, ICU just might use int sequences internally, so UTF-32, for its own strings, so that might be it. Yes, I see they too are going for O(1) access. Nonetheless, a careful enough UTF-16 implementation with a rich enough API is able to access all of Unicode with no trouble. It's just that the Java API from Sun is not one such. The Perl 6 spec is all about graphemes, and graphemes are all about code points, which means an implementation could work around 16-bit code units so the user never has to think about them. That would be just like the way the Perl 5 implementation works around 8-bit code units and never lets the user notice it.

    Hey, if you can build TCP out of IP, anything is possible. :)

    --tom

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    I wrote:

    > Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

    So I'm finding. Perhaps that's why I keep getting confused. I do have a pretty firm notion of what UCS-2 and UTF-16 are, and so I sometimes get self-contradictory results. Can you think of anywhere that Python acts like UCS-2 and not UTF-16? I'm not sure I have found one, although the regex thing might count.

    I just thought of one. The casemapping functions don't work right on Deseret, which is a non-BMP case-changing script. That's one I submitted as a bug, because I figure that if the UTF-8 decoder can decode the non-BMP code points into paired UTF-16 surrogates, then the casing functions had jolly well better be able to deal with it. If the UTF-8 decoder knows it is only going to UCS-2, then it should have raised an exception on my non-BMP source. Since it went to UTF-16, the rest of the language should have behaved accordingly. Java does this right, BTW, despite its UTF-16ness.
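    (Concretely, the check I mean looks like this; the narrow-build half is the misbehavior I reported, not something I claim of every version:)

    # U+10428 DESERET SMALL LETTER LONG I uppercases to U+10400.
    # On a wide build the string is one code point, and this holds:
    assert '\U00010428'.upper() == '\U00010400'
    # On a narrow build the same literal is the surrogate pair
    # '\ud801\udc28', and case-mapping each 16-bit unit separately
    # leaves the text unchanged -- which is the bug.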

    --tom

    terryjreedy commented 13 years ago

    It is always better to deliver more than you say than to deliver less.

    Except when promising too little is a copout.

    Everyone always talks about how important they're sure O(1) access must be,

    I thought that too until your challenge. But now that you mention it, indexing is probably not the bottleneck in most document processing. We are optimizing without measuring! We all know that is bad.

    If done transparently, non-O(1) indexing should only be done when it is *needed*. And if it is a bottleneck, switch to a wide build -- or get a newer, faster machine.

    I first used Python 1.3 on a 10 megahertz DOS machine. I just got a multicore 3.+ gigahertz machine. Tradeoffs have changed and just as we use cycles (and space) for nice graphical interfaces, we should use some for global text support. In the same pair of machines, core memory jumped from 2 megabytes to 24 gigabytes. (And the new machine cost perhaps as much in adjusted dollars.) Of course, better unicode support should come standard with the OS and not have to be re-invented by every language and app.

    Having promised to actually 'work on a prototype in Python', I decided to do so before playing. I wrote the following test:

    tucs2 = 'A\U0001043cBC\U0001042f\U00010445DE\U00010428H'
    tutf16= UTF16(tucs2)
    tlist = ['A', '\U0001043c','B','C','\U0001042f','\U00010445',
             'D','E','\U00010428','H']
    tlis2 = [tutf16[i] for i in range(len(tlist))]
    assert tlist == tlis2

    and in a couple hours wrote and debugged the class to make it pass (and added a couple of length tests). See the uploaded file.

    Adding an __iter__ method to iterate by characters (with hi chars returned as wrapped length-1 surrogate pairs) instead of code units would be trivial. Adding the code to __getitem__ to handle slices should not be too hard. Slices containing hi characters should be wrapped. The cpdex array would make that possible without looking at the whole slice.

    The same idea could be used to index by graphemes. For European text that used codepoints for pre-combined (accented) characters as much as possible, the overhead should not be too much.

    This may not be the best issue to attach this to, but I believe that improving the narrow build would allow fixing of the re/regex problems reported here.

    ezio-melotti commented 13 years ago

    Keep in mind that we should be able to access and use lone surrogates too, therefore:

    s = '\ud800'  # should be valid
    len(s)  # should this raise an error? (or return 0.5 ;)?
    s[0]  # error here too?
    list(s)  # here too?

    p = s + '\udc00'
    len(p)  # 1?
    p[0]  # '\U00010000' ?
    p[1]  # IndexError?
    list(p + 'a')  # ['\ud800\udc00', 'a']?

    We can still decide that strings with lone surrogates work only with a limited number of methods/functions but: 1) it's not backward compatible; 2) it's not very consistent

    Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
    >>> p
    '\ud800\udc00'
    >>> len(p)
    2
    >>> p.encode('utf-16').decode('utf-16')
    '𐀀'
    >>> len(_)
    1
    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Ezio Melotti <report@bugs.python.org> wrote on Mon, 15 Aug 2011 04:56:55 -0000:

    Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":

    >>> p
    '\ud800\udc00'
    >>> len(p)
    2
    >>> p.encode('utf-16').decode('utf-16')
    '𐀀'
    >>> len(_)
    1

    (For those who may not immediately realize from reading the surrogates,
     '𐀀' is code point 0x10000, the first non-BMP code point.  I piped it 
     through `uniquote -x` just to make sure.)

    Yes, that makes perfect sense. It's something of a buggy feature or featureful bug that UTF-16 does this.

    When you are thinking of arbitrary sequences of code points, which is something you have to be able to do in memory but not in a UTF stream, then one can say that one has four code points of anything in the 0 .. 0x10FFFF range. Those can be any arbitrary code points only (1) *while in memory*, and (2) assuming a non-UTF-16 (i.e., UTF-32 or UTF-8) representation. You cannot do that with UTF-16, which is why it works only on a Python wide build. Otherwise they join up.

    The reason they "join up" in UTF-16 is also the reason why unlike in regular memory where you might be able to use an alternate representation like UTF-8 or UTF-32, UTF streams cannot contain unpaired surrogates: because if that stream were in UTF-16, you would never be able to tell the difference between a sequence of a lead surrogate followed by a tail surrogate and the same thing meaning just one non-BMP code point. Since you would not be able to tell the difference, it always only means the latter, and the former sense is illegal. This is why lone surrogates are illegal in UTF streams.
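    (To make "join up" concrete, here is the pairing arithmetic in Python; it is exactly because this mapping is one-to-one that the two readings collide:)

    def join_pair(lead, trail):
        # Each lead/trail combination maps to exactly one supplementary
        # code point, so a UTF-16 decoder can never see "two lone
        # surrogates" where a pair sits.
        assert 0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF
        return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

    assert join_pair(0xD800, 0xDC00) == 0x10000    # the first non-BMP code point
    assert join_pair(0xDBFF, 0xDFFF) == 0x10FFFF   # the last one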

    In case it isn't obvious, *this* is the source of the [𝒜--𝒵] bug in all the UTF-16 or UCS-2 regex languages. It is why Java 7 added \x{...}, so that they can rewrite that as [\x{1D49C}--\x{1D4B5}] to pass the regex compiler, so that it sees something indirect, not just surrogates.

    That's why I always check it in my cross-language regex tests. A 16-bit language has to have a workaround, somehow, or it will be in trouble.

    The Java regex compiler doesn't generate UTF-16 for itself, either. It generates UTF-32 for its pattern. You can see this right at the start of the source code. This is from the Java Pattern class:

        /**
         * Copies regular expression to an int array and invokes the parsing
         * of the expression which will create the object tree.
         */
        private void compile() {
            // Handle canonical equivalences
            if (has(CANON_EQ) && !has(LITERAL)) {
                normalize();
            } else {
                normalizedPattern = pattern;
            }
            patternLength = normalizedPattern.length();
            // Copy pattern to int array for convenience
            // Use double zero to terminate pattern
            temp = new int[patternLength + 2];
            hasSupplementary = false;
            int c, count = 0;
            // Convert all chars into code points
            for (int x = 0; x < patternLength; x += Character.charCount(c)) {
                c = normalizedPattern.codePointAt(x);
                if (isSupplementary(c)) {
                    hasSupplementary = true;
                }
                temp[count++] = c;
            }
            patternLength = count;   // patternLength now in code points

    See how that works? They use an int(-32) array, not a char(-16) array! It's reasonably clever, and necessary. Because it does that, it can now compile \x{1D49C} or erstwhile embedded UTF-8 non-BMP literals into UTF-32, and not get upset by the stormy sea of troubles that surrogates are. You can't have surrogates in ranges if you don't do something like this in a 16-bit language.

    Java couldn't fix the [𝒜--𝒵] bug except by doing the \x{...} indirection trick, because they are stuck with UTF-16. However, they actually can match the string "𝒜" against the pattern "^.$", and have it fail on "^..$". Yes, I know: the code-unit length of that string is 2, but its regex count is just one dot worth.

    I *believe* they did it that way because tr18 says it has to work that way, but they may also have done it just because it makes sense. My current contact at Oracle doing regex support is not the guy who originally wrote the class, so I am not sure. (He's very good, BTW. For Java 7, he also added named captures, script properties, and brought the class up to conformance with tr18's "level 1" requirements.)

    I'm thinking Python might be able to do in the regex engine on narrow builds the sort of thing that Java does. However, I am also thinking that that might be a lot of work for a situation more readily addressed by phasing out narrow builds or at least telling people they should use wide builds to get that thing to work.

    --tom

    ======================================================================

        ============================================
        ===>  QUASI OFF TOPIC ADDENDUM FOLLOWS  <===
        ============================================

    The rest of this message is just a comparison demo of how Perl treats the same sort of surrogate stuff that you were showing with Python. I'm not sure that it is as relevant to your problem as the Java part just was, since we don't have UTF-16 the way Java does.

    One place it *may* be relevant to Python, which is the only reason I'm including it, is that it shows what sort of elastic boundaries you have with the old loose form utf8 that you no longer have with the current strict definition of UTF-8 from the Unicode Standard.

    It turns out that you can't play those surrogate halvsies games in Perl, because it thinks you're nuts. :)

    (Whether this is a bug or a feature may depend on where you're standing, but we're trying to conform with our reading of the Standard. Things like the Unicode Collation Algorithm have third-party conformance suites, but I don't know of any for encoders/decoders, so one is back to deciding whether one has correctly understood the Standard. As we all know, this isn't always all that easy.)

    This is the clincher at the far end, so you can quit reading now if you want:

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00
    
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00
    
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

    (Meanwhile, back to the quasi-off-topic monologue.)

    Oh, you can have isolated surrogates in memory just fine. We use abstract code points not code units, and since we aren't UTF-16, there's no problem there:

    % perl -le 'print length("\x{D800}")'
    1
    % perl -le 'print length("\x{D800}\x{DC00}")'
    2
    % perl -le 'print length("\x{DC00}\x{D800}")'
    2

    But you aren't allowed to *encode* those abstract characters into a UTF encoding form.

        (The -CS switch is like having PERLUNICODE=S or PYTHONIOENCODING=utf8)
    
    % perl -CS -le 'print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    ??????

    Now, you might think my encoder did that, but it didn't. Those "??????" are from my dumb Mac Terminal, *not* from the Perl utf8 encoder. This is yet another reason, BTW, why using "?" as the replacement for a non-encodable code point is strongly discouraged by Unicode. Otherwise it is too hard to tell whether it is a "?" because it was in the original data or a "?" because it was a replacement char. Some of our encoders do use

     �  FFFD        REPLACEMENT CHARACTER
        * used to replace an incoming character whose value is unknown or unrepresentable in Unicode
        * compare the use of 001A as a control character to indicate the substitute function

    But that isn't triggering here. Running it through uniquote shows that you got illegal surrogates in your (loose) utf8 stream:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \x{D800}\x{DC00}

    Use -b on uniquote to get bytes not hex chars:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \xED\xA0\x80\xED\xB0\x80

    Yes, I'm not fond of the dribbled-out warning, either. It should of course be an exception. Part of what is going on here is that Perl's regular "utf8" encoding is (I believe) the same loose one that Python also uses, the old one from before the restrictions. If you use the strict "UTF-8", then you get...

    % perl -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"' 
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    "\x{d800}" does not map to utf8.
    "\x{dc00}" does not map to utf8.
    \x{D800}\x{DC00}

    Which is pretty whacky. I'm *not* uniquoting that this time around, the encoder is. It did the \x{} substitution because it is absolutely forbidden from ever emitting surrogates in a valid strict UTF-8 stream, the way it will begrudgingly do in a loose utf8 stream (although the result is not valid as strict UTF-8).

    If you run with all warnings promoted into exceptions (as I almost always do), then you get more reasonable behavior, maybe. I'm going to use a command-line switch that's just like saying

    use warnings "FATAL" => "all";

    in the top-level scope. First with loose utf8:

    % perl -Mwarnings=FATAL,all -CS -le 'print "\x{D800}\x{DC00}"' 
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 255

    And next with strict UTF-8:

    % perl -Mwarnings=FATAL,all -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"' 
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 25

    So when you make them fatal, you get an exception -- which by not being caught, causes the program to exit.

    (The "Exit" is printed by my shell, and no, I'm sorry but I don't
     know why the status value differs.  Well, kinda I know.  It's
     because errno was 25 (ENOTTY) in the second case, but had been 0 in
     the first case so that turned into an unsigned char -1.)

    If you *promise* Perl you know what you're doing and full speed ahead, you can disable utf8 warnings, which will work to get you your two surrogates in the outbound *loose* utf8 stream, as six bytes that look like UTF-8 but actually "can't" be (called that) due to the Standard's rules:

    % perl -M-warnings=utf8  -CS -le 'print "\x{D800}\x{DC00}"'
    ??????
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    \x{D800}\x{DC00}
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    \xED\xA0\x80\xED\xB0\x80
    
    (The weird module import is like saying 
        no warnings "utf8";
    as the outer scope.)

    You can't get it with strict UTF-8 no matter how hard you try though. Only the loose one will put out invalid UTF-8.

    So what about UTF-16? Can we build things up piecemeal? Turns out that no, we can't. I *think* this is a feature, but I'm not 100.000% sure.

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00
    
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00
    
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

    You get all nulls because it refused to give you anything for surrogates. Although you can force it for UTF-8 if you use the loose utf8 version that is out of spec, there is just nothing you can do to encode code points that are surrogates into UTF-16. Just can't. There is no loose UTF-16.

    Or that's what it looks like to me. You can have surrogates in memory, but those cannot make it into a UTF-16 stream. Or a UTF-32 stream either.

    % perl -MEncode -le 'print encode("UTF-32LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

    You have to encode integral code points, not halvsies:

    % perl -C0 -MEncode -le 'print encode("UTF-16BE", "\x{10000}")' | uniquote -b
    \xD8\x00\xDC\x00
    % perl -C0 -MEncode -le 'print encode("UTF-16LE", "\x{10000}")' | uniquote -b
    \x00\xD8\x00\xDC
    % perl -C0 -MEncode -le 'print encode("UTF-16",   "\x{10000}")' | uniquote -b
    \xFE\xFF\xD8\x00\xDC\x00
    
    % perl -C0 -MEncode -le 'print encode("UTF-32BE", "\x{10000}")' | uniquote -b
    \x00\x01\x00\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32LE", "\x{10000}")' | uniquote -b
    \x00\x00\x01\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32",   "\x{10000}")' | uniquote -b
    \x00\x00\xFE\xFF\x00\x01\x00\x00
    
    % perl -C0 -MEncode -le 'print encode("UTF-8",    "\x{10000}")' | uniquote -b
    \xF0\x90\x80\x80
    % perl -C0 -MEncode -le 'print encode("UTF-8",    "\x{10000}")' | uniquote -x
    \x{10000}
    
    % perl -C0 -MEncode -le 'print encode("UTF-8",    "\x{10000}")' 
    𐀀
    % perl -CS          -le 'print                    "\x{10000}" ' 
    𐀀

    tom++

    malemburg commented 13 years ago

    Keep in mind that we should be able to access and use lone surrogates too, therefore:

    s = '\ud800'  # should be valid
    len(s)  # should this raise an error? (or return 0.5 ;)?
    s[0]  # error here too?
    list(s)  # here too?

    p = s + '\udc00'
    len(p)  # 1?
    p[0]  # '\U00010000' ?
    p[1]  # IndexError?
    list(p + 'a')  # ['\ud800\udc00', 'a']?

    We can still decide that strings with lone surrogates work only with a limited number of methods/functions but: 1) it's not backward compatible; 2) it's not very consistent

    Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":

    >>> p
    '\ud800\udc00'
    >>> len(p)
    2
    >>> p.encode('utf-16').decode('utf-16')
    '𐀀'
    >>> len(_)
    1

    Hi Tom,

    welcome to Python land :-) Here's some more background information on how Python's Unicode implementation works:

    You need to differentiate between Unicode code points stored in Unicode objects and ones encoded in transfer formats by codecs.

    We generally do allow lone surrogates, unassigned code points, lone combining code points, etc. in Unicode objects since Python needs to be able to work on all Unicode code points and build strings with them.

    The transfer format codecs do try to combine surrogates on decoding data on UCS4 builds. On UCS2 builds they create surrogate pairs as necessary. On output, those pairs will again be joined to get round-trip safety.

    It helps if you think of Python's Unicode objects using UCS2 and UCS4 instead of UTF-16/32. Python does try to make working with UCS2 easy and in many cases behaves as if it were using UTF-16 internally, but there are, of course, limits to this. In practice, you only rarely get to see any of these special cases, since non-BMP code points are usually not found in everyday use. If they do become a problem for you, you have the option of switching to a UCS4 build of Python.

    You also have to be aware of the fact that Python started Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the terminology of those versions, some of which has changed in more recent versions of Unicode.

    For more background information, you might want take a look at this talk from 2002:

    http://www.egenix.com/library/presentations/#PythonAndUnicode

    Related to the other tickets you opened: you'll also find that collation and compression were already on the plate back then, but since no one stepped forward, they weren't implemented.

    Cheers, -- Marc-Andre Lemburg eGenix.com



    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 13 years ago

    For what it's worth, I've had an idea about string storage, roughly based on how *nix stores data on disk.

    If a string is small, point to a block of codepoints.

    If a string is medium-sized, point to a block of pointers to codepoint blocks.

    If a string is large, point to a block of pointers to pointer blocks.

    This means that a large string doesn't need a single large allocation.

    The level of indirection can be increased as necessary.

    For simplicity, all codepoint blocks contain the same number of codepoints, except the final codepoint block, which may contain fewer.

    A codepoint block may use the minimum width necessary (1, 2 or 4 bytes) to store all of its codepoints.

    This means that there are no surrogates and that different sections of the string can be stored in different widths to reduce memory usage.
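    A toy sketch of the flat (medium-sized) case, with made-up names and an arbitrary block size, showing the per-block width selection:

    BLOCK = 64  # codepoints per block; arbitrary for this sketch

    def _width(cps):
        # narrowest of 1/2/4 bytes that holds every codepoint in the block
        m = max(cps)
        return 1 if m < 0x100 else 2 if m < 0x10000 else 4

    class BlockString:
        def __init__(self, text):
            self.length = len(text)
            self.blocks = []
            for i in range(0, len(text), BLOCK):
                cps = [ord(c) for c in text[i:i + BLOCK]]
                w = _width(cps)
                self.blocks.append(
                    (w, b''.join(cp.to_bytes(w, 'little') for cp in cps)))

        def __len__(self):
            return self.length

        def __getitem__(self, i):
            if not 0 <= i < self.length:
                raise IndexError(i)
            w, data = self.blocks[i // BLOCK]   # O(1): blocks are fixed-size
            off = (i % BLOCK) * w
            return chr(int.from_bytes(data[off:off + w], 'little'))

    s = BlockString('abc' + chr(0x1D4AB))
    assert len(s) == 4 and s[3] == chr(0x1D4AB)   # no surrogates anywhere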

    terryjreedy commented 13 years ago

    I improved UTF16.__getitem__ to handle negative indexes and slices. The latter uses the same adjustment as for indexes. An __iter__ method is not needed as str.__iter__ uses __getitem__. I will take further discussion of this prototype to the python-ideas list.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    In msg142098 Ezio said:

    Keep in mind that we should be able to access and use lone surrogates too, therefore:

    s = '\ud800'  # should be valid
    len(s)  # should this raise an error? (or return 0.5 ;)?

    I say: For streams and data types in which lone surrogates are permitted, a lone surrogate should be treated as and counted as a character (codepoint).

    For streams and data types in which lone surrogates are not permitted, the assignment should be invalid, and raise an error; len would then never see it, and has no quandary.

    gvanrossum commented 13 years ago

    Wow. A very educational discussion. We will be referencing this issue for many years to come.

    As long as the buck stops with me, I feel strongly that *today* changing indexing from O(1) to O(log N) is a bad idea, partly for technical reasons, partly because the Python culture isn't ready. In 5 or 10 years we need to revisit this, and it wouldn't hurt if in the mean time we started seriously thinking about how to change our APIs so that O(1) indexing is not relied upon so much. This may include rewriting tutorials to nudge users in the direction of using different idioms for text processing.

    In the meantime, I think our best option is to switch CPython to the PEP-393 string implementation. Despite its disadvantages (I understand the "spoiler" issue) it is generally no worse than a wide build, and there is working code today that we can optimize before 3.3 is released.

    For Python implementations where this is not an option (I'm thinking Jython and IronPython, both of which are closely tied to a system string type that behaves like UTF-16) I hope that at least the regular expression behavior can be fixed so that "." matches a surrogate pair. (Possibly they already behave that way, if they use a native regex library.)

    In all cases, for future Python versions, we should tighten the codecs to reject data that the Unicode standard considers invalid (and we should offer separate non-strict codecs for situations where such invalid data needs to be processed).

    I wish we could fix the codecs and the regex "." issue on narrow builds for Python versions before 3.3 (esp. 3.2 and 2.7), but I fear that this is considered too backwards incompatible (though for each specific fix we should consider this carefully).

    terryjreedy commented 13 years ago

    My proposal is better than log(N) in 2 respects.

    1) There need only be a time penalty when there are non-BMP chars and indexing currently gives the 'wrong' answer and therefore when a time-penalty should be acceptable. Lookup for normal all-BMP strings could remain the same.

    2) The penalty is O(log K), where K is the number of non-BMP chars. In theory, O(log K) is as 'bad' as O(log N), for any fixed ratio K/N. In practice, the difference should be noticeable when there are just a few (say .01%) extended-range chars.

    I am aware that this is an idea for the future, not now.

    ---

    Fixing string iteration on narrow builds to produce code points the same as with wide builds is easy and costs O(1) per code point (character), which is the same as the current cost. Then

    >>> from unicodedata import name
    >>> name('\U0001043c')
    'DESERET SMALL LETTER DEE'
    >>> for c in 'a\U0001043c': name(c)
    'LATIN SMALL LETTER A'
    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        for c in 'a\U0001043c': name(c)
    ValueError: no such name

    would work like it does on wide builds instead of failing.

    I admit that it would be strange to have default iteration produce different items than default indexing (and indeed, str currently iterates by sequential indexing). But keeping them in sync means that buggy iteration is another cost of O(1) indexing.
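    For concreteness, a toy version of the iterator I have in mind (assuming well-formed pairs; on a narrow build, iterating a str yields 16-bit code units):

    def iter_codepoints(s):
        # Yield whole code points, joining a lead surrogate with its
        # trail unit; on a wide build this is a no-op pass-through.
        it = iter(s)
        for unit in it:
            if '\ud800' <= unit <= '\udbff':
                unit += next(it)
            yield unit

    On a narrow build, list(iter_codepoints('a\U0001043c')) then yields two items, 'a' and the intact two-unit pair, instead of three.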

    gvanrossum commented 13 years ago

    To me, making (default) iteration deviate from indexing is anathema.

    However, there is nothing wrong with providing a library function that takes a string and returns an iterator that iterates over code points, joining surrogate pairs as needed. You could even have one that iterates over characters (I think Tom calls them graphemes), if that is well-defined and useful.

    terryjreedy commented 13 years ago

    PEP-393 will take care of iterating by code points. Where would you have other iterators go? The string module? Something else I have not thought of? Or something new?

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Guido van Rossum <report@bugs.python.org> wrote on Sat, 27 Aug 2011 03:26:21 -0000:

    To me, making (default) iteration deviate from indexing is anathema.

    So long as there's a way to iterate through a string some other way than by code unit, that's fine. However, the Java way of 16-bit code units is so annoying because there often aren't code point APIs, and so you get a lot of niggling errors creeping in. This is part of why I strongly prefer wide builds, so that code point and code unit are the same thing again.

    However, there is nothing wrong with providing a library function that takes a string and returns an iterator that iterates over code points, joining surrogate pairs as needed. You could even have one that iterates over characters (I think Tom calls them graphemes), if that is well-defined and useful.

    "Character" can sometimes be a confusing term when it means something different to us programmers as it does to users. Code point to mean the integer is a lot clearer to us but to no one else. At work I often just give in and go along with the crowd and say character for the number that sits in a char or wchar_t or Character variable, even though of course that's a code point. I only rebel when they start calling code units characters, which (inexperienced) Java people tend to do, because that leads to surrogate splitting and related errors.

    By grapheme I mean something the user perceives as a single character. In full Unicodese, this is an extended grapheme cluster. These are code point sequences that start with a grapheme base and have zero or more grapheme extenders following it. For our purposes, that's *mostly* like saying you have a non-Mark followed by any number of Mark code points, the main exception being that a CR followed by a LF also counts as a single grapheme in Unicode.

    If you are in an editor and wanted to swap two "characters", the one under the user's cursor and the one next to it, you have to deal with graphemes not individual code points, or else you'd get the wrong answer. Imagine swapping the last two characters of the first string below, or the first two characters of second one:

    contrôlée    contro\x{302}le\x{301}e
    élève        e\x{301}le\x{300}ve        

    While you can sometimes fake a correct answer by considering things in NFC not NFD, that doesn't work in the general case, as there are only a few compatibility glyphs for round-tripping for legacy encodings (like ISO 8859-1) compared with infinitely many combinations of combining marks. Particularly in mathematics and in phonetics, you often end up using marks on characters for which no pre-combined variant glyph exists. Here's the IPA for a couple of Spanish words with their tight (phonetic, not phonemic) transcriptions:

        anécdota    [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
        rincón      [rĩŋˈkõ̞n]
    
    NFD:
        ane\x{301}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinco\x{301}n      [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]
    
    NFC:
        an\x{E9}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinc\x{F3}n      [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

    So combining marks don't "just go away" in NFC, and you really do have to deal with them. Notice that to get the tabs right (your favorite subject :), you have to deal with print widths, which is another place that you get into trouble if you only count code points.

    BTW, did you know that the stress mark used in the phonetics above is actually a (modifier) letter in Unicode, not punctuation?

    # uniprops -a 2c8
    U+02C8 ‹ˈ› \N{MODIFIER LETTER VERTICAL LINE}
        \w \pL \p{L_} \p{Lm}
    All Any Alnum Alpha Alphabetic Assigned InSpacingModifierLetters Case_Ignorable CI Common Zyyy Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Print Spacing_Modifier_Letters Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Spacing_Modifier_Letters Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=BB Line_Break=Break_Before LB=BB Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE Word_Break=LE _Case_Ignorable _X_Begin

    That means those would all be matched by \w+, as unlike \p{alpha}, \p{word} includes not just \pL etc but also all the combining marks. That's how you want it to work, although I think you have to use regex not re in Python to get that.

    Iterating by grapheme is easy in a regex engine that supports \X. Instead of using "." to match a code point, you use a \X to match a grapheme. So the swapping problem goes away, and many others. To capture a pair of graphemes for swapping you'd use (\X)(\X), and to grab the first 6 graphemes without breaking them up you'd use \X{6}. That means to iterate by grapheme you just split up your string one \X at a time, as in the example below.
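    For instance, with Matthew's regex module (a toy check of mine; re has no \X):

    import regex

    regex.findall(r'\X', 'e\u0301le\u0300ve')   # NFD élève
    # -> ['e\u0301', 'l', 'e\u0300', 'v', 'e']: five graphemes, seven code points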

    Here's a real-world example:

    In the vim editor, when you're editing UTF-8 as I am this mail message, because it is all about user-perceived characters, they actually use "." to match an entire grapheme. This is different from the way Perl and everybody else use "." for a code point, not a grapheme. If I did s/^.// or s/.$// in vim, I would need s/^\X// or s/\X$// in perl. Similarly, to swap "characters" with the "xp" command, it will grab the entire \X. Put some of those phonetic transcriptions above into a vim buffer and play with them to see what I mean.

    Imagine using a format like "%-6.6s" on "contrôlée": that should produce "contrô" not "contro". That's because code points with the property Bidi_Class=Non_Spacing_Mark (BC=NSM) do not advance the cursor, they just stack up.

    It gets even worse in that some code points advance the cursor by two, not by zero or one. These include those with the East_Asian_Width property value Full or Wide. And they aren't always Asian characters, either. For example, these code points all have the EA=W property, so take up two print columns (see the sketch after the list):

     〈  U+2329 LEFT-POINTING ANGLE BRACKET
     〉  U+232A RIGHT-POINTING ANGLE BRACKET
     〃  U+3003 DITTO MARK
     〜  U+301C WAVE DASH
     〝  U+301D REVERSED DOUBLE PRIME QUOTATION MARK
     〞  U+301E DOUBLE PRIME QUOTATION MARK
     〟  U+301F LOW DOUBLE PRIME QUOTATION MARK
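
    A sketch of checking that property from Python with the stdlib:

        import unicodedata

        for cp in (0x2329, 0x232A, 0x3003, 0x301C):
            print('U+%04X %s' % (cp, unicodedata.east_asian_width(chr(cp))))   # all 'W'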

    Perl's built-in string indexing, and hence its substrings, is strictly by code point and not by grapheme. This is really frustrating at times, because something like this screws up:

    printf "%-6.6", "contrôlée";
    printf "%-6.6", "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";

    Logically, those should produce "contrô" and "a̠ˈne̞ɣ̞ð̞", but of course when considering only code points, they won't. Well, not unless the 1st is in NFC, but there's no hope for the second.
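
    Python's string formatting counts code points the same way, so a sketch of the same failure:

        print(format("contr\u00f4l\u00e9e", '.6'))    # 'contrô' -- NFC happens to survive
        print(format("contro\u0302le\u0301e", '.6'))  # 'contro' -- the NFD circumflex is cut off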

    Perl does have a grapheme cluster string class which provides a way to figure out the columns and also allows for substring operation by grapheme. But it is not at all integrated into anything, which makes it tedious to use.

    use Unicode::GCString;  # on CPAN only, not yet in core

        my $string   = "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";
        my $gcstring = Unicode::GCString->new($string);
        my $colwidth = $gcstring->columns;
        if ($colwidth > 6) {
            print $gcstring->substr(0, 6);
        } else {
            print $gcstring;
            print " " x (6 - $colwidth);   # pad on the right, as %-6.6s would
        }

    Isn't that simply horrible? You *will* get the right answer that way, but what a pain! Really, there needs to be a way for the built-in formatters to understand graphemes. But first, I think, you have to have the regex engine understand them. Matthew's regex does, because it supports \X.

    There's a lot more to dealing with Unicode text than just extending the character repertoire. How much should be fundamental to the language and how much should be relegated to modules isn't always clear. I do know I've had to rewrite a *lot* of standard Unix tools to deal with Unicode properly. For the wc(1) rewrite I only needed to consider graphemes with \X and Unicode line break sequences with \R, but other tools need better smarts. For example, just getting the fmt(1) rewrite to wrap lines in paragraphs correctly requires understanding not just graphemes but the Unicode Linebreak Algorithm, which in turn relies upon understanding the print widths for grapheme cluster strings and East Asian wide or full characters.

    It's something you only want to do once and never think about again. :(

    --tom

    terryjreedy commented 13 years ago

    Python makes it easy to transform a sequence with a generator as long as no look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a surrogate is found, grab the matching one.
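
    (The utf16.py attachment isn't reproduced here; a minimal sketch of that idea, pairing surrogates while iterating, might look like this:)

        def codepoints(s):
            # Yield one item per code point, pairing up surrogates.
            # Assumes well-formed input; on a narrow build the yielded item
            # for an astral code point is itself a 2-unit string.
            it = iter(s)
            for ch in it:
                if '\ud800' <= ch <= '\udbff':     # high surrogate: grab its mate
                    low = next(it)
                    ch = chr(0x10000 + ((ord(ch) - 0xD800) << 10)
                                     + (ord(low) - 0xDC00))
                yield ch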

    However, grapheme clustering does require look-ahead, which is a bit trickier. Assume s is a sanitized sequence of code points with unicode database entries. Ignoring line endings the following should work (I tested it with a toy definition of mark()):

    def graphemes(s):
        sit = iter(s)
        try:
            graph = [next(sit)]
        except StopIteration:
            graph = []

        for cp in sit:
            if mark(cp):
                graph.append(cp)
            else:
                yield combine(graph)
                graph = [cp]

        if graph:                      # don't yield an empty cluster for empty input
            yield combine(graph)

    I tested this with several inputs, using the toy definitions def mark(cp): return cp == '.' and def combine(l): return ''.join(l).
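
    A less toy-like pair, assuming it is enough to treat every general-category Mark as part of the preceding cluster (full UAX #29 segmentation handles more cases, e.g. Hangul jamo and CR LF):

        import unicodedata

        def mark(cp):
            return unicodedata.category(cp).startswith('M')   # Mn, Mc, Me

        def combine(chars):
            return ''.join(chars)

        print(list(graphemes("a\u0320\u02c8ne\u031e")))       # ['a̠', 'ˈ', 'n', 'e̞']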

    Python's object orientation makes formatting easy for the user. Assume someone does the hard work of writing (once ;-) a GCString class with a .__format__ method that interprets the format mini-language for graphemes, using a generalized version of your 'simply horrible' code. That might be done by adapting str.__format__ to use the grapheme iterator above. Then users should be able to write:

    >>> '{:6.6}'.format(GCString("a̠ˈne̞ɣ̞ð̞o̞t̪a̠"))
    "a̠ˈne̞ɣ̞ð̞"
    (Note: Thunderbird properly displays the characters with their marks beneath, even though Firefox does not, either above or in its display of your message.)
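
    A minimal sketch of such a class, assuming the graphemes() generator above and honoring only the precision field of the mini-language (width, fill, and column counting are left out):

        class GCString:
            def __init__(self, s):
                self._g = list(graphemes(s))

            def __format__(self, spec):
                # Simplified: a spec like '6.6' keeps at most 6 graphemes.
                if '.' in spec:
                    return ''.join(self._g[:int(spec.split('.')[1])])
                return ''.join(self._g)
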
    gvanrossum commented 13 years ago

    PEP-393 will take care of iterating by code points.

    Only for CPython. IronPython/Jython will still need a separate solution.

    Where would you have other iterators go? The string module? Something else I have not thought of? Or something new?

    Undecided. But I think we may want to create a new module which provides various APIs specifically for apps that need care when dealing with Unicode.

    terryjreedy commented 13 years ago

    But I think we may want to create a new module which provides various APIs specifically for apps that need care when dealing with Unicode.

    I have started thinking that way too -- perhaps "unitools"? It could contain the code point iterator for the benefit of other implementations. Actually, since 393 still allows insertion of surrogate values, it might not be completely useless for CPython either.

    ezio-melotti commented 13 years ago

    To start with, no code point which, when bitwise ANDed with 0xFFFE, yields 0xFFFE can ever appear in a valid UTF-* stream, but Python allows this without any error.

    That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all planes, where NN is 00 through 10 in hex. So that's 2 noncharacters times 17 planes = 34 code points illegal for interchange that Python is passing through illegally.

    The remaining 32 nonsurrogate code points illegal for open interchange are 0xFDD0 through 0xFDEF. Those are not allowed either, but Python doesn't seem to care.
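
    In Python terms, the set being described is (a sketch):

        def is_noncharacter(cp):
            return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

        # 32 contiguous ones plus 2 per plane times 17 planes = 66 in all:
        assert sum(map(is_noncharacter, range(0x110000))) == 66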

    It's not entirely clear to me what the UTF-8 codec is supposed to do with this.

    For example U+FFFE is <EF BF BE> in UTF-8, and this is valid according to Table 3-7 of Chapter 03:

        Code points        1st byte    2nd byte    3rd byte
        U+E000..U+FFFF     EE..EF      80..BF      80..BF
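
    And indeed, Python's UTF-8 codec currently round-trips it without complaint:

        >>> '\ufffe'.encode('utf-8')
        b'\xef\xbf\xbe'
        >>> b'\xef\xbf\xbe'.decode('utf-8')
        '\ufffe'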

    Chapter 16, section 16.7 "Noncharacters" says: """ Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data. """

    and """ Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. """ seem to suggest that encoding them is forbidden.

    """ If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters. """ here decoding seems allowed, possibly with a replacement (that would depend on the error handler used though, so the default 'strict' would turn this in an error).

    Chapter 03, after D14, says: """ In general, a conforming process may indicate the presence of a code point whose use has not been designated (for example, by showing a missing glyph in rendering or by signaling an appropriate error in a streaming protocol), even though it is forbidden by the standard from interpreting that code point as an abstract character. """

    and in C7: """ If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text. In general, a noncharacter should be treated as an unassigned code point. """

    This doesn't mention clearly what the codec is supposed to do. On one hand, it suggests that an error can be raised, i.e. consider the noncharacter invalid like out-of-range codepoints (>U+10FFFF) or lone surrogates. On the other hand it says that they should be treated as an unassigned code point, i.e. encoded/decoded normally, and doesn't list them as invalid in table 3-7.

    terryjreedy commented 13 years ago

    Ezio, that is a lot of nice work to track down those pieces of the standard. I think the operative phrase in many of those quotes is 'open interchange'. Codecs are also used for private storage. If I use the unassigned or private-use code points in a private project, I would use utf-8 to save the work between active sessions. That is quite fine under the standard. But I should not put files with such codings on the net for 'open interchange'. And if I receive them, the one thing I should not do is interpret them as meaningful abstract characters.

    So the codec should allow for both public and private use. I have the impression that it does so now. A Python programmer should know whether the codec is being used for private (or local group) files or random stuff from the net, and therefore, what the appropriate error handling is. If they do not know, the docs could suggest that public text should normally be decoded with 'strict' or 'replace' and that 'ignore' should normally be reserved for local text that is known to intentionally have 'errors'.

    I am pretty sure that the intent of prohibiting non-standard interpretation of code points as abstract characters is to prevent 'organic' evolution of the code point -- abstract character mapping in which anyone (or any company) who wants to do so creates a new pairing and promotes its wide recognition around the world. Conforming implementations are strict in both what they produce (publicly) *and* in what they accept (from public sources). Many now think that liberal acceptance of sloppy html promoted sloppy production of html.

    gvanrossum commented 13 years ago

    So the codec should allow for both public and private use.

    IIUC we have (or are planning) codecs that support the private use. They are not called "utf-8" though.

    ezio-melotti commented 13 years ago

    Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass. The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all the invalid data, both during encoding and decoding.

    Chapter 03, D79 also says: """ To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values. """

    and this seems to imply that the only unencodable codepoints are the non-scalar values, i.e. surrogates and codepoints >U+10FFFF. Noncharacters shouldn't thus receive any special treatment (at least during encoding).

    Tom, do you agree with this? What does Perl do with them?

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Ezio Melotti <report@bugs.python.org> wrote on Sat, 03 Sep 2011 00:28:03 -0000:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass. The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all the invalid data, both during encoding and decoding.

    Chapter 03, D79 also says:

       To ensure that the mapping for a Unicode encoding form is one-to-one,
       all Unicode scalar values, including those corresponding to
       noncharacter code points and unassigned code points, must be mapped to
       unique code unit sequences. Note that this requirement does not extend
       to high-surrogate and low-surrogate code points, which are excluded by
       definition from the set of Unicode scalar values.

    and this seems to imply that the only unencodable codepoint are the non-scalar values, i.e. surrogates and codepoints >U+10FFFF. Noncharacters shouldn't thus receive any special treatment (at least during encoding).

    Tom, do you agree with this? What does Perl do with them?

    I agree that one needs to be able to encode any scalar value and store it in memory in a designated character encoding form.

    This is different from streams, though.

    The 3 different Unicode "character encoding *forms*" -- UTF-8, UTF-16, and UTF-32 -- certainly need to support all possible scalar values. These are the forms used to store code points in memory. They do not have BOMs, because one knows one's memory layout. These are specifically allowed to contain the noncharacters:

    http://www.unicode.org/reports/tr17/#CharacterEncodingForm
    
    The third type is peculiar to the Unicode Standard: the noncharacter.
    This is a kind of internal-use user-defined character, not intended for
    public interchange.

    The problem is that one must make a clean distinction between character encoding *forms* and character encoding *schemes*.

    http://www.unicode.org/reports/tr17/#CharacterEncodingScheme
    
    It is important not to confuse a Character Encoding Form (CEF) and a CES.
    
    1. The CEF maps code points to code units, while the CES transforms
       sequences of code units to byte sequences.
    2. The CES must take into account the byte-order serialization of
       all code units wider than a byte that are used in the CEF.
    3. Otherwise identical CESs may differ in other aspects, such as the
       number of user-defined characters allowed.
    
    Some of the Unicode encoding schemes have the same labels as the three
    Unicode encoding forms. [...]
    
    As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, for
    example the serialized bytes for streaming data or in files; they may have
    either byte orientation, and a single BOM may be present at the start of the
    data. When the usage of the abbreviated designators UTF-16 or UTF-32 might
    be misinterpreted, and where a distinction between their use as referring to
    Unicode encoding forms or to Unicode encoding schemes is important, the full
    terms should be used. For example, use UTF-16 encoding form or UTF-16
    encoding scheme. They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
    respectively.
    
    The Unicode Standard has seven character encoding schemes: UTF-8, UTF-16,
    UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
    
    * UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF32-LE are simple CESs.
    
        * UTF-16 and UTF-32 are compound CESs, consisting of a single, optional
          byte order mark at the start of the data followed by a simple CES.

    I believe that what this comes down to is that you can have noncharacters in memory as a CEF, but that you cannot have them in a CES meant for open interchange. And what you do privately is a different, third matter.

    What Perl does differs somewhat depending on whether you are just playing around with encodings in memory versus using streams that have particular encodings associated with them. I believe that you can think of the first as being for CEF stuff and the second as for CES stuff.

    Streams are strict. Memory isn't.

    Perl will never ever produce nor accept one of the 66 noncharacters on any stream marked as one of the 7 character encoding schemes. However, we aren't always good about whether we generate an exception or whether we return replacement characters.

    Here the first process created a (for the nonce, nonfatal) warning, whereas the second process raised an exception:

     %   perl -wle 'binmode(STDOUT, "encoding(UTF-16)")|| die; print chr(0xFDD0)' | 
     perl -wle 'binmode(STDIN, "encoding(UTF-16)")||die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    UTF-16:Unicode character fdd0 is illegal at -e line 1.
    Exit 255

    Here the first again makes a warning, and the second returns a replacement string: the bad code point comes through as the literal text "\x{fdd0}", whose first character, a backslash, is ord 92:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-8)")|| die; print chr(0xFDD0)' | 
    perl -wle 'binmode(STDIN, "encoding(UTF-8)")||die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    "\\x{fdd0}" does not map to utf8.
    92

    If you call encode() manually, you have a lot clearer control over this, because you can specify what to do with invalid characters (exceptions, replacements, etc).

    We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8", that can produce and accept illegal characters, although by default it is still going to generate a warning:

    %   perl -wle 'binmode(STDOUT, "encoding(utf8)")|| die; print chr(0xFDD0)' | 
    perl -wle 'binmode(STDIN, "encoding(utf8)")||die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    64976

    I could talk about ways to control whether it's a warning or an exception or a replacement string or nothing at all, but suffice to say such mechanisms do exist. I just don't know that I agree with the defaults.

    I think a big problem here is that the Python culture doesn't use stream encodings enough. People are always making their own repeated and tedious calls to encode and then sending stuff out a byte stream, by which time it is too late to check. This is a real problem, because now you cannot be permissive for the CES but conservative for the CEF.

    In Perl this doesn't in practice happen because in Perl people seldom send the result of encode() out a byte stream; they send things out character streams that have proper encodings affiliated with them. Yes, you can do it, but then you lose the checks. That's not a good idea.

    Anything that deals with streams should have an encoding argument. But often/many? things in Python don't. For example, subprocess.Popen doesn't even seem to take an encoding argument. This makes people do things by hand too often. In fact, subprocess.Popen won't even accept normal (Python 3 Unicode) strings, which is a real pain. I do think the culture of calling .encode("utf8") all over the place needs to be replaced with a more stream-based approach in Python. I had another place where this happens too much in Python besides subprocess.Popen but I can't remember where it is right now.
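
    A sketch of that stream-based style in Python, attaching the encoding to the pipe once with io.TextIOWrapper (the Unix cat child process is just for illustration):

        import io, subprocess

        proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE)
        stdin = io.TextIOWrapper(proc.stdin, encoding='utf-8', errors='strict')
        stdin.write('contrôlée\n')    # text goes out; the wrapper does the encoding
        stdin.close()
        proc.wait()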

    Perl's internal name for the strict utf stuff is for example "utf-8-strict". I think you probably want to distinguish these, and make the default strict the way we do with "UTF-8". We do not ever allow nonstrict UTF-16 or UTF-32, only sometimes nonstrict UTF-8 if you call it "utf8".

    I quote a bit of the perlunicode manpage below which talks about this a bit.

    Sorry it's taken me so long to get back to you on this. I'd be happy to answer any further questions you might have.

    --tom

    PERLUNICODE(1)   Perl Programmers Reference Guide  PERLUNICODE(1)
    
       Non-character code points
           66 code points are set aside in Unicode as "non-character code
           points".  These all have the Unassigned (Cn) General Category, and
           they never will be assigned.  These are never supposed to be in
           legal Unicode input streams, so that code can use them as sentinels
           that can be mixed in with character data, and they always will be
           distinguishable from that data.  To keep them out of Perl input
           streams, strict UTF-8 should be specified, such as by using the
           layer ":encoding('UTF-8')".  The non-character code points are the
           32 between U+FDD0 and U+FDEF, and the 34 code points U+FFFE,
           U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.  Some people are
           under the mistaken impression that these are "illegal", but that is
           not true.  An application or cooperating set of applications can
           legally use them at will internally; but these code points are
           "illegal for open interchange". Therefore, Perl will not accept
           these from input streams unless lax rules are being used, and will
           warn (using the warning category "nonchar", which is a sub-category
           of "utf8") if an attempt is made to output them.
    
       Beyond Unicode code points
           The maximum Unicode code point is U+10FFFF.  But Perl accepts code
           points up to the maximum permissible unsigned number available on
           the platform.  However, Perl will not accept these from input
           streams unless lax rules are being used, and will warn (using the
           warning category "non_unicode", which is a sub-category of "utf8")
           if an attempt is made to operate on or output them.  For example,
           "uc(0x11_0000)" will generate this warning, returning the input
           parameter as its result, as the upper case of every non-Unicode
           code point is the code point itself.
    
    perl v5.14.0                2011-05-07                         26

    ezio-melotti commented 13 years ago

    So to summarize a bit, there are different possible levels of strictness:

    1) all the possible encodable values, including the ones >10FFFF;
    2) values in range 0..10FFFF;
    3) values in range 0..10FFFF except surrogates (aka scalar values);
    4) values in range 0..10FFFF except surrogates and noncharacters.

    And this is what is currently available in Python:

    1) not available, probably it will never be;
    2) available through the 'surrogatepass' error handler;
    3) default behavior (i.e. with the 'strict' error handler);
    4) currently not available.

    (note: this refers to the utf-8 codec in Python 3, but it should be true for the utf-16/32 codecs too once bpo-12892 is fixed. This whole message refers to codecs only and what they should (dis)allow. What we use internally seems to work fine and doesn't need to be changed.)
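
    A quick sketch of levels 2 and 3, and the missing level 4, with the current utf-8 codec:

        >>> '\ud800'.encode('utf-8', 'surrogatepass')   # level 2
        b'\xed\xa0\x80'
        >>> '\ud800'.encode('utf-8')                    # level 3: 'strict' rejects surrogates
        Traceback (most recent call last):
          ...
        UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
        >>> '\ufdd0'.encode('utf-8')                    # noncharacters pass: level 4 is missing
        b'\xef\xb7\x90'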

    Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about).

    This depends on what the default behavior while dealing with noncharacters should be. If they are rejected by default, then 'strict' should reject them. However, this would leave us without option 3 (something to encode all and only scalar values), and 'surrogatepass' would be misnamed if it also allowed noncharacters (and if it didn't, we would end up without option 2 too). This is apparently what Perl does:

    Perl will never ever produce nor accept one of the 66 noncharacters on any stream marked as one of the 7 character encoding schemes.

    Implementation-wise, I think the 'strict' error handler should be the strictest one, because the codec must detect all the "problematic" chars and send them to the error handler, which then decides what to do with them. I.e. if the codec detects noncharacters and sends them to the error handler, and the error handler is strict, an error will be raised; if it doesn't detect them, the error handler won't be able to do anything with them.
    Another option is to provide another codec that specifically detects them, but this means re-implementing a slightly different version of each codec (or possibly add an extra argument to the PyUnicode_{Encode,Decode}UTF* functions).
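
    A sketch of that detection point (the handler name 'warnreplace' is made up for the example): a registered error handler only ever sees what the codec flags, so today it fires for lone surrogates but never for noncharacters:

        import codecs

        def warn_and_replace(exc):
            print('codec flagged: %r' % exc.object[exc.start:exc.end])
            return ('\ufffd', exc.end)   # substitute U+FFFD and continue

        codecs.register_error('warnreplace', warn_and_replace)
        print('\ud800'.encode('utf-8', 'warnreplace'))   # handler runs and replaces
        print('\ufdd0'.encode('utf-8', 'warnreplace'))   # handler is never invoked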

    We could also decide to leave the handling of noncharacters as it is -- after all the Unicode standard doesn't seem to explicitly forbid them as it does with e.g. surrogates.

    We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8", that can produce and accept illegal characters, although by default it is still going to generate a warning

    How did Perl implement this? With two (or more) slightly different versions of the same codec? And how does Perl handle errors? With some global options that turn (possibly specific) warnings into errors (like python -We)?

    Python has different codecs that encode/decode str/bytes and whenever they find a "problematic" char they send it to the error handler that might decide to raise an error, remove the char, replace it with something else, sending it back unchanged, generate a warning and so on. In this way you can have different combinations of codecs and error handlers to get the desired behaviors. (and FWIW in Python 'utf8' is an alias for 'UTF-8'.)

    I think a big problem here is that the Python culture doesn't use stream encodings enough. People are always making their own repeated and tedious calls to encode and then sending stuff out a byte stream, by which time it is too late to check. [...] Anything that deals with streams should have an encoding argument. But often/many? things in Python don't.

    Several objects have .encoding and .errors attributes (e.g. sys.stdin/out), and they are used to encode/decode the text/bytes sent/read to/from them. In other places we prefer the "explicit is better than implicit" approach and require the user (or some other higher-level layer) to encode/decode manually.

    I'm not sure why you are saying that it's too late to check, and since the encoding/decoding happens only in a few places I don't think it's tedious at all (and often it's automatic too).