
Py_UNICODE_NEXT and other macros for surrogates #54751

Closed abalkin closed 12 years ago

abalkin commented 13 years ago
BPO 10542
Nosy @malemburg, @loewis, @doerwalter, @birkenfeld, @rhettinger, @amauryfa, @abalkin, @pitrou, @vstinner, @ericvsmith, @benjaminp, @ezio-melotti
Files
  • unicode-next.diff
  • issue10542-put-next.diff
  • issue10542.diff
  • issue10542a.diff
  • unicode_macros.patch
  • issue10542b.diff: Patch against 3.3
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.



    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

    A PEP-393 draft implementation is available at https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 3.3, this issue will be outdated: there won't be "narrow" builds of Python anymore (nor will there be "wide" builds).

    ezio-melotti commented 12 years ago

    That's really good news. Some Unicode issues can still be fixed on 2.7 and 3.2 though. FWIW I was planning to look at this and bpo-9200 in the following days and see if I can fix them.

    malemburg commented 12 years ago

    Martin v. Löwis wrote:

    A PEP-393 draft implementation is available at https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 3.3, this issue will be outdated: there won't be "narrow" builds of Python anymore (nor will there be "wide" builds).

    Even if PEP-393 should go into Py4k one day (I don't believe that such major changes can be done in a minor release), we will still have to deal with surrogates in codecs, which is where these macros will get used, so I don't see how PEP-393 relates to the idea of adding helper macros to simplify the code.

    ezio-melotti commented 12 years ago
    I think the 4 macros:
     #define _Py_UNICODE_ISSURROGATE
     #define _Py_UNICODE_ISHIGHSURROGATE
     #define _Py_UNICODE_ISLOWSURROGATE
     #define _Py_UNICODE_JOIN_SURROGATES
    are quite straightforward and can avoid using the trailing _.

    Since I would like to see bpo-9200 fixed on 3.2 (and possibly 2.7 too), would it be ok to:
    1) commit the patch with the trailing _ for all the macros on 3.2(/2.7);
    2) commit the patch with the trailing _ only for the _NEXT macros in 3.3;
    3) fix bpo-9200 on all these branches using the new macros (with or without _);
    4) remove the trailing _ from the _NEXT macros in 3.4 if it turns out to work well;

    we will still have to deal with surrogates in codecs, which is where these macros will get used

    They will also be used in many str methods and afaiu PEP-393 should address that. I'm not sure it addresses codecs and builtin functions like chr() and ord() too.

    pitrou commented 12 years ago

    I think the 4 macros:

    #define _Py_UNICODE_ISSURROGATE
    #define _Py_UNICODE_ISHIGHSURROGATE
    #define _Py_UNICODE_ISLOWSURROGATE
    #define _Py_UNICODE_JOIN_SURROGATES

    are quite straightforward and can avoid using the trailing _.

    I don't want to bikeshed, but can we have proper consistent word separation? _Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE (etc.)

    > we will still have to deal with surrogates in codecs,
    > which is where these macros will get used

    They will also be used in many str methods and afaiu PEP-393 should address that. I'm not sure it addresses codecs and builtin functions like chr() and ord() too.

    AFAIU, PEP-393 avoids producing surrogate pairs in the canonical internal representation (that's one of its selling points). Only the UTF-16 codecs would need to deal with surrogate pairs, in the encoded form.

    ezio-melotti commented 12 years ago

    All the other macros follow the same convention, e.g. Py_UNICODE_ISLOWER and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them more readable though.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    I think the 4 macros:

    define _Py_UNICODE_ISSURROGATE

    define _Py_UNICODE_ISHIGHSURROGATE

    define _Py_UNICODE_ISLOWSURROGATE

    define _Py_UNICODE_JOIN_SURROGATES

    are quite straightforward and can avoid using the trailing _.

    For what it's worth, I've seen Unicode documentation that prefers the terms "lead surrogate" and "trail surrogate" as being clearer than the terms "high surrogate" and "low surrogate".

    For example, from the Unicode BOM FAQ at http://unicode.org/faq/utf_bom.html

    Q: What are surrogates?
    
    A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and
       trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆,
       and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not
       represent characters directly, but only as a pair.
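    As a minimal, self-contained illustration of how those two ranges pair up (the helper names below are placeholders of my own, not macros proposed in this issue):

        #include <assert.h>
        #include <stdint.h>

        /* Placeholder helpers matching the FAQ ranges quoted above. */
        #define IS_LEAD(ch)   ((ch) >= 0xD800 && (ch) <= 0xDBFF)   /* high/lead surrogate */
        #define IS_TRAIL(ch)  ((ch) >= 0xDC00 && (ch) <= 0xDFFF)   /* low/trail surrogate */

        /* Each surrogate of a pair carries 10 bits; adding 0x10000 yields the
           supplementary code point in U+10000..U+10FFFF. */
        static uint32_t join_surrogates(uint16_t lead, uint16_t trail)
        {
            return (((uint32_t)(lead - 0xD800) << 10) |
                     (uint32_t)(trail - 0xDC00)) + 0x10000;
        }

        int main(void)
        {
            /* U+1F600 is the pair D83D DE00 in UTF-16. */
            assert(IS_LEAD(0xD83D) && IS_TRAIL(0xDE00));
            assert(join_surrogates(0xD83D, 0xDE00) == 0x1F600);
            return 0;
        }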

    BTW, considering recent discussions, you might want to read:

    Q: Are there any 16-bit values that are invalid?
    
    A: The two values FFFE₁₆ and FFFF₁₆ as well as the 32 values from FDD0₁₆ to FDEF₁₆ represent noncharacters. They are
       invalid in interchange, but may be freely used internal to an implementation. Unpaired surrogates are invalid as
       well, i.e. any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any
       value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆. [AF]

    and also the answer to:

    Q: Are there any paired surrogates that are invalid?

    whose answer I here omit for brevity, as it is a table.

    I suspect that you guys are now increasingly sold on the answer to the next FAQ right after that one. :)

    Q: Because supplementary characters are uncommon, does that mean I can ignore them?
    
    A: Just because supplementary characters (expressed with surrogate pairs in UTF-16) are uncommon does 
       not mean that they should be neglected. They include:
    
        * emoji symbols and emoticons, for interoperating with Japanese mobile phones
        * uncommon (but not unused) CJK characters, important for personal and place names
        * variation selectors for ideographic variation sequences
        * important symbols for mathematics
        * numerous minority scripts and historic scripts, important for some user communities

    Another example of using "lead" and "trail" surrogates is in the first sentence from http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html

    * Naming: For clarity, High and Low surrogates are called Lead and Trail in the API, which gives a better sense of
      their ordering in a string. offset16 and offset32 are used to distinguish offsets to UTF-16 boundaries vs offsets
      to UTF-32 boundaries. int char32 is used to contain UTF-32 characters, as opposed to char16, which is a UTF-16
      code unit.
    * Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a UTF-16 offset and back. Because of the
      difference in structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and back if and only if
      bounds(string, offset16) != TRAIL.
    * Exceptions: The error checking will throw an exception if indices are out of bounds. Other than that, all
      methods will behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32 values are present.
      UCharacter.isLegal() can be used to check for validity if desired.
    * Unmatched Surrogates: If the string contains unmatched surrogates, then these are counted as one UTF-32 value.
      This matches their iteration behavior, which is vital. It also matches common display practice as missing glyphs
      (see the Unicode Standard Section 5.4, 5.5).
    * Optimization: The method implementations may need optimization if the compiler doesn't fold static final methods.
      Since surrogate pairs will form an exceedingly small percentage of all the text in the world, the singleton case
      should always be optimized for.

    You can also see this reflected in the utf.h file from the ICU project as part of their C API in ICU4C:

        #define     U_SENTINEL   (-1)
                This value is intended for sentinel values for APIs that (take or) return single code points (UChar32). 
        #define     U_IS_UNICODE_NONCHAR(c)
                Is this code point a Unicode noncharacter? 
        #define     U_IS_UNICODE_CHAR(c)
                Is c a Unicode code point value (0..U+10ffff) that can be assigned a character? 
        #define     U_IS_BMP(c)   ((uint32_t)(c)<=0xffff)
                Is this code point a BMP code point (U+0000..U+ffff)? 
        #define     U_IS_SUPPLEMENTARY(c)   ((uint32_t)((c)-0x10000)<=0xfffff)
                Is this code point a supplementary code point (U+10000..U+10ffff)? 
        #define     U_IS_LEAD(c)   (((c)&0xfffffc00)==0xd800)
                Is this code point a lead surrogate (U+d800..U+dbff)? 
        #define     U_IS_TRAIL(c)   (((c)&0xfffffc00)==0xdc00)
                Is this code point a trail surrogate (U+dc00..U+dfff)? 
        #define     U_IS_SURROGATE(c)   (((c)&0xfffff800)==0xd800)
                Is this code point a surrogate (U+d800..U+dfff)? 
        #define     U_IS_SURROGATE_LEAD(c)   (((c)&0x400)==0)
                Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a lead surrogate? 
        #define     U_IS_SURROGATE_TRAIL(c)   (((c)&0x400)!=0)
                Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a trail surrogate?

    Another one is:

    http://www.opensource.apple.com/source/WebCore/WebCore-1C25/icu/unicode/utf16.h

    which contains:

        #define U16_IS_SINGLE(c) !U_IS_SURROGATE(c)
        #define U16_IS_LEAD(c) (((c)&0xfffffc00)==0xd800)
        #define U16_IS_TRAIL(c) (((c)&0xfffffc00)==0xdc00)
        #define U16_IS_SURROGATE(c) U_IS_SURROGATE(c)
        #define U16_IS_SURROGATE_LEAD(c) (((c)&0x400)==0)
        #define U16_SURROGATE_OFFSET ((0xd800<<10UL)+0xdc00-0x10000)
        #define U16_GET_SUPPLEMENTARY(lead, trail) \
        #define U16_LEAD(supplementary) (UChar)(((supplementary)>>10)+0xd7c0)
        #define U16_TRAIL(supplementary) (UChar)(((supplementary)&0x3ff)|0xdc00)
        #define U16_LENGTH(c) ((uint32_t)(c)<=0xffff ? 1 : 2)

    In fact, you might want to read over that file, as it has embedded documentation for these, and has other macros for being careful about surrogates. For example, here's one in full:

        /**
         * Get a code point from a string at a random-access offset,
         * without changing the offset.
         * "Unsafe" macro, assumes well-formed UTF-16.
         *
         * The offset may point to either the lead or trail surrogate unit
         * for a supplementary code point, in which case the macro will read
         * the adjacent matching surrogate as well.
         * The result is undefined if the offset points to a single, unpaired surrogate.
         * Iteration through a string is more efficient with U16_NEXT_UNSAFE or U16_NEXT.
         *
         * @param s const UChar * string
         * @param i string offset
         * @param c output UChar32 variable
         * @see U16_GET
         * @stable ICU 2.4
         */
        #define U16_GET_UNSAFE(s, i, c) { \
        (c)=(s)[i]; \
        if(U16_IS_SURROGATE(c)) { \
            if(U16_IS_SURROGATE_LEAD(c)) { \
            (c)=U16_GET_SUPPLEMENTARY((c), (s)[(i)+1]); \
            } else { \
            (c)=U16_GET_SUPPLEMENTARY((s)[(i)-1], (c)); \
            } \
        } \
        }
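    To make the usage concrete, here is a self-contained sketch of the iteration pattern those macros support; it re-declares the relevant ICU-style definitions locally (UChar as a 16-bit code unit, UChar32 as a code point) rather than pulling in the ICU headers.

        #include <stdint.h>
        #include <stdio.h>

        typedef uint16_t UChar;     /* UTF-16 code unit, as in ICU */
        typedef int32_t  UChar32;   /* code point */

        #define U16_IS_LEAD(c)       (((c) & 0xFFFFFC00) == 0xD800)
        #define U16_SURROGATE_OFFSET ((0xD800 << 10UL) + 0xDC00 - 0x10000)
        #define U16_GET_SUPPLEMENTARY(lead, trail) \
            (((UChar32)(lead) << 10UL) + (UChar32)(trail) - U16_SURROGATE_OFFSET)

        /* Read one code point at offset i and advance i by one or two units;
           "unsafe" because it assumes well-formed UTF-16, as in utf16.h. */
        #define U16_NEXT_UNSAFE(s, i, c) { \
            (c) = (s)[(i)++]; \
            if (U16_IS_LEAD(c)) { \
                (c) = U16_GET_SUPPLEMENTARY((c), (s)[(i)++]); \
            } \
        }

        int main(void)
        {
            const UChar s[] = { 0x0041, 0xD83D, 0xDE00, 0x0042 };  /* "A", U+1F600, "B" */
            int32_t i = 0;
            while (i < 4) {
                UChar32 c;
                U16_NEXT_UNSAFE(s, i, c);
                printf("U+%04X\n", (unsigned)c);   /* prints 0041, 1F600, 0042 */
            }
            return 0;
        }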

    So keeping your preamble bits, I might have considered doing it this way if it were me doing it:

        #define _Py_UNICODE_IS_SURROGATE
        #define _Py_UNICODE_IS_LEAD_SURROGATE
        #define _Py_UNICODE_IS_TRAIL_SURROGATE
        #define _Py_UNICODE_JOIN_SURROGATES

    But I also come from a culture that uses more underscores than you guys tend to, as shown in some of the macro names below from the utf8.h file. I find that most projects use more underscores in uppercase names than Python does. :)

    --tom

    #define UTF_START_MARK(len) (((len) >  7) ? 0xFF : (0xFE << (7-(len))))
    #define UTF_START_MASK(len) (((len) >= 7) ? 0x00 : (0x1F >> ((len)-2)))
    #define UTF_CONTINUATION_MARK           0x80
    #define UTF_ACCUMULATION_SHIFT          6
    #define UTF_CONTINUATION_MASK           ((U8)0x3f)
    #define UNISKIP(uv) ( (uv) < 0x80           ? 1 : \
    #define UNISKIP(uv) ( (uv) < 0x80           ? 1 : \
    #define NATIVE_IS_INVARIANT(c)          UNI_IS_INVARIANT(NATIVE8_TO_UNI(c))
    #define IN_BYTES (CopHINTS_get(PL_curcop) & HINT_BYTES)
    #define UNICODE_SURROGATE_FIRST         0xD800
    #define UNICODE_SURROGATE_LAST          0xDFFF
    #define UNICODE_REPLACEMENT             0xFFFD
    #define UNICODE_BYTE_ORDER_MARK         0xFEFF
    #define PERL_UNICODE_MAX        0x10FFFF
    #define UNICODE_WARN_SURROGATE     0x0001       /* UTF-16 surrogates */
    #define UNICODE_WARN_NONCHAR       0x0002       /* Non-char code points */
    #define UNICODE_WARN_SUPER         0x0004       /* Above 0x10FFFF */
    #define UNICODE_WARN_FE_FF         0x0008       
    #define UNICODE_DISALLOW_SURROGATE 0x0010
    #define UNICODE_DISALLOW_NONCHAR   0x0020
    #define UNICODE_DISALLOW_SUPER     0x0040
    #define UNICODE_DISALLOW_FE_FF     0x0080
    #define UNICODE_WARN_ILLEGAL_INTERCHANGE \
    #define UNICODE_DISALLOW_ILLEGAL_INTERCHANGE \
    #define UNICODE_ALLOW_SURROGATE 0
    #define UNICODE_ALLOW_SUPER     0
    #define UNICODE_ALLOW_ANY       0
    #define UNICODE_IS_SURROGATE(c)         ((c) >= UNICODE_SURROGATE_FIRST && \
    #define UNICODE_IS_REPLACEMENT(c)       ((c) == UNICODE_REPLACEMENT)
    #define UNICODE_IS_BYTE_ORDER_MARK(c)   ((c) == UNICODE_BYTE_ORDER_MARK)
    #define UNICODE_IS_NONCHAR(c)           ((c >= 0xFDD0 && c <= 0xFDEF) \
    #define UNICODE_IS_SUPER(c)             ((c) > PERL_UNICODE_MAX)
    #define UNICODE_IS_FE_FF(c)             ((c) > 0x7FFFFFFF)
    #define UNICODE_GREEK_CAPITAL_LETTER_SIGMA      0x03A3
    #define UNICODE_GREEK_SMALL_LETTER_FINAL_SIGMA  0x03C2
    #define UNICODE_GREEK_SMALL_LETTER_SIGMA        0x03C3
    #define GREEK_SMALL_LETTER_MU                   0x03BC
    #define GREEK_CAPITAL_LETTER_MU 0x039C  /* Upper and title case of MICRON */
    #define LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS 0x0178    /* Also is title case */
    #define LATIN_CAPITAL_LETTER_SHARP_S    0x1E9E
    #define UNI_DISPLAY_ISPRINT     0x0001
    #define UNI_DISPLAY_BACKSLASH   0x0002
    #define UNI_DISPLAY_QQ          (UNI_DISPLAY_ISPRINT|UNI_DISPLAY_BACKSLASH)
    #define UNI_DISPLAY_REGEX       (UNI_DISPLAY_ISPRINT|UNI_DISPLAY_BACKSLASH)
    #define LATIN_SMALL_LETTER_SHARP_S      0x00DF
    #define LATIN_SMALL_LETTER_Y_WITH_DIAERESIS 0x00FF
    #define MICRO_SIGN 0x00B5
    #define LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE 0x00C5
    #define LATIN_SMALL_LETTER_A_WITH_RING_ABOVE 0x00E5
    #define ANYOF_FOLD_SHARP_S(node, input, end)    \
    #define SHARP_S_SKIP 2

    PS: Those won't always make sense for lack of continuation lines and enclosing ifdefs.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

    I now see there are lots of good things in the BOM FAQ that have come up lately regarding surrogates and other illegal characters, and about what can go in data streams.

    I quote a few of these from http://unicode.org/faq/utf_bom.html below:

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? 
    
    A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. 
       By representing such an *unpaired* surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream
       would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires
       that encoding form conversion always results in valid data stream. Therefore a converter *must* treat this
       as an error.
    
    Q: How do I convert an unpaired UTF-16 surrogate to UTF-32? 
    
    A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must
       treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream
       would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that
       encoding form conversion always results in valid data stream.
    
    Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining
       UTF-8 bytes are in big-endian order?
    
    A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8
       always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise
       unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8
       is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format
       that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix
       shell scripts.
    
    Q: What should I do with U+FEFF in the middle of a file?
    
    A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF
       should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE
       (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly
       preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When
       designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In
       that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.
    
    Q: How do I tag data that does not interpret U+FEFF as a BOM?
    
    A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. 
       If you do use a BOM, tag the text as simply UTF-16. 
    
    Q: Why wouldn’t I always use a protocol that requires a BOM?
    
    A: Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, 
       if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary *nor
       permitted*. Any U+FEFF would be interpreted as a ZWNBSP.  Do not tag every string in a database or set of fields
       with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields
       may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).

    Somewhat frustratingly, I am now almost more confused than ever by the last two sentences here:

    Q: What is a UTF?
    
    A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate
       code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for
       UTF; the two terms are merely synonyms for the same concept.
    
       Each UTF is reversible, thus every UTF supports *lossless round tripping*: mapping from any Unicode coded
       character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF
       mapping *must also* map all code points that are not valid Unicode characters to unique byte sequences. These
       invalid code points are the 66 *noncharacters* (including FFFE and FFFF), as well as unpaired surrogates.

    My confusion is about the invalid code points. The first two FAQs I cite at the top are quite clear that it is illegal to have unpaired surrogates in a UTF stream. I don’t understand therefore what it is saying about “must also” mapping all code points that aren’t valid Unicode characters to “unique byte sequences” to ensure roundtripping. At first reading, I’d almost say those appear to contradict each other. I must just be being boneheaded though. It’s very early morning yet, and maybe it will become clearer upon a fifth or sixth reading. Maybe it has to do with replacement characters? No, that can’t be right. Muddle muddle. Sigh.

    Important material is also found in http://www.unicode.org/faq/basic_q.html:

    Q: Are surrogate characters the same as supplementary characters?
    
    A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range
       U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate
       code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.
    
       There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but
       there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate
       code point).
    
    Q: What is the difference between UCS-2 and UTF-16?
    
    A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code
       points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
    
       UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange.
       Both are 16-bit, and have exactly the same code unit representation.
    
       Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary
       characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not
       handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.

    And in reference to UTF-16 being slower by code point than by code unit:

    Q: How about using UTF-32 interfaces in my APIs?
    
    A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16
       APIs  the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or
       words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the
       required functionality at the high levels.
    
        If its [sic] ever necessary to locate the nᵗʰ character, indexing by character can be implemented as a high
        level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa
        is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run,
        for example, accessing UTF-16 storage as characters, instead of code units resulted in a 10× degradation. While
        there are some interesting optimizations that can be performed, it will always be slower on average. Therefore
        locating other boundaries, such as grapheme, word, line or sentence boundaries proceeds directly from the code
        unit index, not indirectly via an intermediate character code index.

    I am somewhat amused by this summary:

    Q: What does Unicode conformance require?
    
    A: Chapter 3, Conformance discusses this in detail. Here's a very informal version: 
    
        * Unicode characters don't fit in 8 bits; deal with it.
        * 2 [sic] Byte order is only an issue in I/O.
        * If you don't know, assume big-endian.
        * Loose surrogates have no meaning.
        * Neither do U+FFFE and U+FFFF.
        * Leave the unassigned codepoints alone.
        * It's OK to be ignorant about a character, but not plain wrong.
        * Subsets are strictly up to you.
        * Canonical equivalence matters.
        * Don't garble what you don't understand.
        * Process UTF-* by the book.
        * Ignore illegal encodings.
        * Right-to-left scripts have to go by bidi rules. 

    And I don’t know what I think about this, except that there sure are a lot of screw‐ups out there if it is truly as easy as they would have you believe:

    Given that any industrial-strength text and internationalization support API has to be able to handle sequences of
    characters, it makes little difference whether the string is internally represented by a sequence of [...] code
    units, or by a sequence of code-points [...]. Both UTF-16 and UTF-8 are designed to make working with substrings
    easy, by the fact that the sequence of code units for a given code point is unique.

    Take this all with a grain of salt, since I found various typos in these FAQs and occasionally also language that seems to reflect an older nomenclature than is now seen in the current published Unicode Standard, meaning 6.0.0. Probably best then to take only general directives from their FAQs and leave language‐lawyering to the formal printed Standard, insofar as that is possible — which sometimes it is not, because they do make mistakes from time to time, and even less frequently, correct these. :)

    --tom

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

    Antoine Pitrou <report@bugs.python.org> wrote on Tue, 16 Aug 2011 09:18:46 -0000:

    > I think the 4 macros:
    >  #define _Py_UNICODE_ISSURROGATE
    >  #define _Py_UNICODE_ISHIGHSURROGATE
    >  #define _Py_UNICODE_ISLOWSURROGATE
    >  #define _Py_UNICODE_JOIN_SURROGATES
    > are quite straightforward and can avoid using the trailing _.

    I don't want to bikeshed, but can we have proper consistent word separation? _Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE (etc.)

    Oh good, I thought it was only me whohadtroublereadingthose. :)

    --tom

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

    Ezio Melotti <report@bugs.python.org> wrote on Tue, 16 Aug 2011 09:23:50 -0000:

    All the other macros follow the same convention, e.g. Py_UNICODE_ISLOWER and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them more readable though.

    I am guessing that that is not quite why those don't have underscores in them. I bet it is actually something else. Watch:

        % unigrep '^\s*#\s*define\s+Py_[\p{Lu}_]+\b' unicodeobject.h
        #define Py_UNICODEOBJECT_H
        #define Py_USING_UNICODE
        #define Py_UNICODE_WIDE
        #define Py_UNICODE_ISSPACE(ch) \
        #define Py_UNICODE_ISLOWER(ch) _PyUnicode_IsLowercase(ch)
        #define Py_UNICODE_ISUPPER(ch) _PyUnicode_IsUppercase(ch)
        #define Py_UNICODE_ISTITLE(ch) _PyUnicode_IsTitlecase(ch)
        #define Py_UNICODE_ISLINEBREAK(ch) _PyUnicode_IsLinebreak(ch)
        #define Py_UNICODE_TOLOWER(ch) _PyUnicode_ToLowercase(ch)
        #define Py_UNICODE_TOUPPER(ch) _PyUnicode_ToUppercase(ch)
        #define Py_UNICODE_TOTITLE(ch) _PyUnicode_ToTitlecase(ch)
        #define Py_UNICODE_ISDECIMAL(ch) _PyUnicode_IsDecimalDigit(ch)
        #define Py_UNICODE_ISDIGIT(ch) _PyUnicode_IsDigit(ch)
        #define Py_UNICODE_ISNUMERIC(ch) _PyUnicode_IsNumeric(ch)
        #define Py_UNICODE_ISPRINTABLE(ch) _PyUnicode_IsPrintable(ch)
        #define Py_UNICODE_TODECIMAL(ch) _PyUnicode_ToDecimalDigit(ch)
        #define Py_UNICODE_TODIGIT(ch) _PyUnicode_ToDigit(ch)
        #define Py_UNICODE_TONUMERIC(ch) _PyUnicode_ToNumeric(ch)
        #define Py_UNICODE_ISALPHA(ch) _PyUnicode_IsAlpha(ch)
        #define Py_UNICODE_ISALNUM(ch) \
        #define Py_UNICODE_COPY(target, source, length)                         \
        #define Py_UNICODE_FILL(target, value, length) \
        #define Py_UNICODE_MATCH(string, offset, substring) \
        #define Py_UNICODE_REPLACEMENT_CHARACTER ((Py_UNICODE) 0xFFFD)

    It looks like what is actually happening there is that you started out with names of the normal ctype(3) macroish thingies:

     isalpha isupper islower isdigit isxdigit isalnum isspace ispunct
     isprint isgraph iscntrl isblank isascii  toupper isblank isascii
     toupper tolower toascii

    and wanted to preserve those, which would lead to Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER, since there are no functions in the original C versions those seem to mirror. Then when you wanted more of that ilk, you sensibly kept to the same naming convention.

    I eyeball few exceptions to that style here:

    % perl -nle '/^\s*#\s*define\s+(Py_[\p{Lu}_]+)\b/ and print $1' Include/*.h | sort -dfu | fmt -150
    Py_ABSTRACTOBJECT_H Py_ALIGNED Py_ALLOW_RECURSION Py_ARITHMETIC_RIGHT_SHIFT Py_ASDL_H Py_AST_H Py_ATOMIC_H Py_BEGIN_ALLOW_THREADS Py_BITSET_H
    Py_BLOCK_THREADS Py_BLTINMODULE_H Py_BOOLOBJECT_H Py_BYTEARRAYOBJECT_H Py_BYTES_CTYPE_H Py_BYTESOBJECT_H Py_CAPSULE_H Py_CELLOBJECT_H Py_CEVAL_H
    Py_CHARMASK Py_CLASSOBJECT_H Py_CLEANUP_SUPPORTED Py_CLEAR Py_CODECREGISTRY_H Py_CODE_H Py_COMPILE_H Py_COMPLEXOBJECT_H Py_CURSES_H Py_DECREF
    Py_DEPRECATED Py_DESCROBJECT_H Py_DICTOBJECT_H Py_DTSF_ALT Py_DTSF_SIGN Py_DTST_FINITE Py_DTST_INFINITE Py_DTST_NAN Py_END_ALLOW_RECURSION
    Py_END_ALLOW_THREADS Py_ENUMOBJECT_H Py_EQ Py_ERRCODE_H Py_ERRORS_H Py_EVAL_H Py_FILEOBJECT_H Py_FILEUTILS_H Py_FLOATOBJECT_H Py_FORCE_DOUBLE
    Py_FORCE_EXPANSION Py_FORMAT_PARSETUPLE Py_FRAMEOBJECT_H Py_FUNCOBJECT_H Py_GCC_ATTRIBUTE Py_GE Py_GENOBJECT_H Py_GETENV Py_GRAMMAR_H Py_GT
    Py_HUGE_VAL Py_IMPORT_H Py_INCREF Py_INTRCHECK_H Py_INVALID_SIZE Py_ISALNUM Py_ISALPHA Py_ISDIGIT Py_IS_FINITE Py_IS_INFINITY Py_ISLOWER Py_IS_NAN
    Py_ISSPACE Py_ISUPPER Py_ISXDIGIT Py_ITEROBJECT_H Py_LE Py_LISTOBJECT_H Py_LL Py_LOCAL Py_LOCAL_INLINE Py_LONGINTREPR_H Py_LONGOBJECT_H Py_LT
    Py_MARSHAL_H Py_MARSHAL_VERSION Py_MATH_E Py_MATH_PI Py_MEMCPY Py_MEMORYOBJECT_H Py_METAGRAMMAR_H Py_METHODOBJECT_H Py_MODSUPPORT_H Py_MODULEOBJECT_H
    Py_NAN Py_NE Py_NODE_H Py_OBJECT_H Py_OBJIMPL_H Py_OPCODE_H Py_OSDEFS_H Py_OVERFLOWED Py_PARSETOK_H Py_PGEN_H Py_PGENHEADERS_H Py_PRINT_RAW
    Py_PYARENA_H Py_PYDEBUG_H Py_PYFPE_H Py_PYGETOPT_H Py_PYMATH_H Py_PYMEM_H Py_PYPORT_H Py_PYSTATE_H Py_PYTHON_H Py_PYTHONRUN_H Py_PYTHREAD_H
    Py_PYTIME_H Py_RANGEOBJECT_H Py_REFCNT Py_REF_DEBUG Py_RETURN_FALSE Py_RETURN_INF Py_RETURN_NAN Py_RETURN_NONE Py_RETURN_TRUE Py_SAFE_DOWNCAST
    Py_SET_ERANGE_IF_OVERFLOW Py_SET_ERRNO_ON_MATH_ERROR Py_SETOBJECT_H Py_SIZE Py_SLICEOBJECT_H Py_STRCMP_H Py_STRTOD_H Py_STRUCTMEMBER_H Py_STRUCTSEQ_H
    Py_SYMTABLE_H Py_SYSMODULE_H Py_TOKEN_H Py_TOLOWER Py_TOUPPER Py_TPFLAGS_BASE_EXC_SUBCLASS Py_TPFLAGS_BASETYPE Py_TPFLAGS_BYTES_SUBCLASS
    Py_TPFLAGS_DEFAULT Py_TPFLAGS_DICT_SUBCLASS Py_TPFLAGS_HAVE_GC Py_TPFLAGS_HAVE_STACKLESS_EXTENSION Py_TPFLAGS_HAVE_VERSION_TAG Py_TPFLAGS_HEAPTYPE
    Py_TPFLAGS_INT_SUBCLASS Py_TPFLAGS_IS_ABSTRACT Py_TPFLAGS_LIST_SUBCLASS Py_TPFLAGS_LONG_SUBCLASS Py_TPFLAGS_READY Py_TPFLAGS_READYING
    Py_TPFLAGS_TUPLE_SUBCLASS Py_TPFLAGS_TYPE_SUBCLASS Py_TPFLAGS_UNICODE_SUBCLASS Py_TPFLAGS_VALID_VERSION_TAG Py_TRACEBACK_H Py_TRACE_REFS
    Py_TRASHCAN_SAFE_BEGIN Py_TRASHCAN_SAFE_END Py_TUPLEOBJECT_H Py_TYPE Py_UCNHASH_H Py_ULL Py_UNBLOCK_THREADS Py_UNICODE_COPY Py_UNICODE_FILL
    Py_UNICODE_ISALNUM Py_UNICODE_ISALPHA Py_UNICODE_ISDECIMAL Py_UNICODE_ISDIGIT Py_UNICODE_ISLINEBREAK Py_UNICODE_ISLOWER Py_UNICODE_ISNUMERIC
    Py_UNICODE_ISPRINTABLE Py_UNICODE_ISSPACE Py_UNICODE_ISTITLE Py_UNICODE_ISUPPER Py_UNICODE_MATCH Py_UNICODEOBJECT_H Py_UNICODE_REPLACEMENT_CHARACTER
    Py_UNICODE_TODECIMAL Py_UNICODE_TODIGIT Py_UNICODE_TOLOWER Py_UNICODE_TONUMERIC Py_UNICODE_TOTITLE Py_UNICODE_TOUPPER Py_UNICODE_WIDE Py_USING_UNICODE
    Py_VA_COPY Py_VISIT Py_WARNINGS_H Py_WEAKREFOBJECT_H Py_XDECREF Py_XINCREF

    See what I mean? Most of them that remain tend to be things that one could construe as compound words, like "RANGEOBJECT" or "CODEREGISTRY", though some people might find a few a bit on the longish side to read unaided by underscores, like "BYTEARRAYOBJECT".

    'Nuff bikeshedding. :)

    --tom

    malemburg commented 12 years ago

    Tom Christiansen wrote:

    So keeping your preamble bits, I might have considered doing it this way if it were me doing it:

    #define _Py_UNICODE_IS_SURROGATE
    #define _Py_UNICODE_IS_LEAD_SURROGATE
    #define _Py_UNICODE_IS_TRAIL_SURROGATE
    #define _Py_UNICODE_JOIN_SURROGATES

    But I also come from a culture that uses more underscores than you guys tend to, as shown in some of the macro names shown below from utf8.h file. I find that most projects use more underscores in uppercase names than Python does. :)

    The reasoning behind e.g. "ISSURROGATE" is that those names originate from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE macros which in turn stem from the C APIs of the same names (see unicodeobject.h for reference).

    Regarding low/high vs. lead/trail: The Unicode database uses the terms low/high and we do in Python as well, so let's stick with those.

    What I don't understand is why those macros should be declared private to Python (with the leading underscore). They are quite useful for extensions implementing codecs or other transformations as well.

    BTW: I think the other issues mentioned in the discussion are more important to get right, than the names of those macros.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

    Marc-Andre Lemburg <report@bugs.python.org> wrote on Tue, 16 Aug 2011 12:11:22 -0000:

    The reasoning behind e.g. "ISSURROGATE" is that those names originate from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE macros which in return stem from the C APIs of the same names (see unicodeobject.h for reference).

    I eventually figured that part out in the larger context.
    Makes sense looked at that way.

    Regarding low/high vs. lead/trail: The Unicode database uses the terms low/high and we do in Python as well, so let's stick with those.

    Yes, those are their block assignments, Block=High_Surrogates and Block=Low_Surrogates. I just thought I should mention that in the time since those were invented (which cannot be changed), after using them in real code for some years, their lingo seems to have evolved away from those initial names and toward lead/trail as less confusing.

    What I don't understand is why those macros should be declared private to Python (with the leading underscore). They are quite useful for extensions implementing codecs or other transformations as well.

    I was wondering about that myself. Beyond there being a lot fewer of those private macros in the Python *.h files, they also seem to be of rather different character than the iswhatever() macros:

        $ perl -nle '/^\s*#\s*define\s+(_Py_[\p{Lu}_]+)\b/ and print $1' *.h | sort -dfu | fmt -160
        _Py_ANNOTATE_BARRIER_DESTROY _Py_ANNOTATE_BARRIER_INIT _Py_ANNOTATE_BARRIER_WAIT_AFTER _Py_ANNOTATE_BARRIER_WAIT_BEFORE _Py_ANNOTATE_BENIGN_RACE
        _Py_ANNOTATE_BENIGN_RACE_SIZED _Py_ANNOTATE_BENIGN_RACE_STATIC _Py_ANNOTATE_CONDVAR_LOCK_WAIT _Py_ANNOTATE_CONDVAR_SIGNAL _Py_ANNOTATE_CONDVAR_SIGNAL_ALL
        _Py_ANNOTATE_CONDVAR_WAIT _Py_ANNOTATE_ENABLE_RACE_DETECTION _Py_ANNOTATE_EXPECT_RACE _Py_ANNOTATE_FLUSH_STATE _Py_ANNOTATE_HAPPENS_AFTER
        _Py_ANNOTATE_HAPPENS_BEFORE _Py_ANNOTATE_IGNORE_READS_AND_WRITES_BEGIN _Py_ANNOTATE_IGNORE_READS_AND_WRITES_END _Py_ANNOTATE_IGNORE_READS_BEGIN
        _Py_ANNOTATE_IGNORE_READS_END _Py_ANNOTATE_IGNORE_SYNC_BEGIN _Py_ANNOTATE_IGNORE_SYNC_END _Py_ANNOTATE_IGNORE_WRITES_BEGIN _Py_ANNOTATE_IGNORE_WRITES_END
        _Py_ANNOTATE_MUTEX_IS_USED_AS_CONDVAR _Py_ANNOTATE_NEW_MEMORY _Py_ANNOTATE_NO_OP _Py_ANNOTATE_PCQ_CREATE _Py_ANNOTATE_PCQ_DESTROY _Py_ANNOTATE_PCQ_GET
        _Py_ANNOTATE_PCQ_PUT _Py_ANNOTATE_PUBLISH_MEMORY_RANGE _Py_ANNOTATE_PURE_HAPPENS_BEFORE_MUTEX _Py_ANNOTATE_RWLOCK_ACQUIRED _Py_ANNOTATE_RWLOCK_CREATE
        _Py_ANNOTATE_RWLOCK_DESTROY _Py_ANNOTATE_RWLOCK_RELEASED _Py_ANNOTATE_SWAP_MEMORY_RANGE _Py_ANNOTATE_THREAD_NAME _Py_ANNOTATE_TRACE_MEMORY
        _Py_ANNOTATE_UNPROTECTED_READ _Py_ANNOTATE_UNPUBLISH_MEMORY_RANGE _Py_AS_GC _Py_CHECK_REFCNT _Py_COUNT_ALLOCS_COMMA _Py_DEC_REFTOTAL _Py_DEC_TPFREES
        _Py_INC_REFTOTAL _Py_INC_TPALLOCS _Py_INC_TPFREES _Py_PARSE_PID _Py_REF_DEBUG_COMMA _Py_SET_EDOM_FOR_NAN

    BTW: I think the other issues mentioned in the discussion are more important to get right, than the names of those macros.

    Yup. Just paint it red. :)

    --tom

    vstinner commented 12 years ago

    I'm reposting my patch from bpo-12751. I think that it's simpler than belopolsky's patch: it doesn't add public macros in unicodeobject.h and doesn't add the complex Py_UNICODE_NEXT() macro. My patch only adds private macros in unicodeobject.c to factorize the code.

    I don't want to add public macros because with the stable API and with the PEP-393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.

    Copy/paste of the initial message of my issue bpo-12751 (msg142108): --------------- A lot of code is duplicated in unicodeobject.c to manipulate ("encode/decode") surrogates. Each function has from one to three different implementations. The new decode_ucs4() function adds a new implementation. Attached patch replaces this code by macros.

    I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE are important for speed. ((ch & 0xFFFFFC00UL) == 0xD800) (from decode_ucs4) is *a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 @ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

    These 3 macros have to be checked, I wrote the first one:

    #define IS_SURROGATE(ch) (((ch) & 0xFFFFF800UL) == 0xD800)
    #define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
    #define IS_LOW_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xDC00)
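    As a side note, the mask test and the range test it replaces agree for every code point, because D800-DBFF and DC00-DFFF are 1024-aligned blocks; a throwaway check of that equivalence (not part of any patch here):

        #include <assert.h>

        int main(void)
        {
            unsigned long ch;
            for (ch = 0; ch <= 0x10FFFF; ch++) {
                int mask_high  = ((ch & 0xFFFFFC00UL) == 0xD800);   /* clears the low 10 bits */
                int range_high = (0xD800 <= ch && ch <= 0xDBFF);
                assert(mask_high == range_high);
            }
            return 0;
        }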

    I added a cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if Py_UNICODE is 16 bits (narrow build). It's maybe useless.

    #define COMBINE_SURROGATES(ch1, ch2) \
     (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

    HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been preprocessed to fit in [0; 0xFFFF]. I added this requirement in the comment of these macros. It would be better to have only one macro to do the two operations, but because "*p++" (dereference and increment) is usually used, I prefer to avoid one unique macro (I don't like passing *p++ in a macro using its argument more than once).

    Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.

    I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid a useless test on ch2 on narrow builds.

    I also added an IS_NONBMP macro just because I prefer macros over hardcoded constants. ---------------
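    For illustration, here is a rough, self-contained sketch of how macros of this shape get used for encoding and decoding a surrogate pair. The HIGH_SURROGATE/LOW_SURROGATE definitions below are guesses consistent with the stated precondition (ordinal already reduced by 0x10000), not copies from unicode_macros.patch.

        #include <assert.h>
        #include <stdint.h>

        typedef uint32_t Py_UCS4;   /* stand-in for the real typedef */

        #define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
        #define IS_LOW_SURROGATE(ch)  (((ch) & 0xFFFFFC00UL) == 0xDC00)
        #define COMBINE_SURROGATES(ch1, ch2) \
            (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

        /* Guessed definitions: ch must already have had 0x10000 subtracted. */
        #define HIGH_SURROGATE(ch) (0xD800 | ((ch) >> 10))
        #define LOW_SURROGATE(ch)  (0xDC00 | ((ch) & 0x3FF))

        int main(void)
        {
            Py_UCS4 cp = 0x10FFFF;              /* a non-BMP code point */
            Py_UCS4 reduced = cp - 0x10000;     /* the required preprocessing */
            uint16_t hi = (uint16_t)HIGH_SURROGATE(reduced);
            uint16_t lo = (uint16_t)LOW_SURROGATE(reduced);

            assert(IS_HIGH_SURROGATE(hi) && IS_LOW_SURROGATE(lo));
            assert(COMBINE_SURROGATES(hi, lo) == cp);   /* round trip */
            return 0;
        }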

    malemburg commented 12 years ago

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    I'm reposting my patch from bpo-12751. I think that it's simpler than belopolsky's patch: it doesn't add public macros in unicodeobject.h and don't add the complex Py_UNICODE_NEXT() macro. My patch only adds private macros in unicodeobject.c to factorize the code.

    I don't want to add public macros because with the stable API and with the PEP-393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.

    PEP-393 is an optional feature for extension writers. If they don't need PEP-393 style stable ABIs and want to use the macros, they should be able to. I'm therefore -1 on making them private.

    Regarding separating adding the various surrogate macros and the next-macros: I don't see a problem with adding both in Python 3.3.

    malemburg commented 12 years ago

    Marc-Andre Lemburg wrote:

    Marc-Andre Lemburg <mal@egenix.com> added the comment:

    STINNER Victor wrote:
    >
    > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
    >
    > I'm reposting my patch from bpo-12751. I think that it's simpler than belopolsky's patch: it doesn't add public macros in unicodeobject.h and don't add the complex Py_UNICODE_NEXT() macro. My patch only adds private macros in unicodeobject.c to factorize the code.
    >
    > I don't want to add public macros because with the stable API and with the PEP-393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.

    PEP-393 is an optional feature for extension writers. If they don't need PEP-393 style stable ABIs and want to use the macros, they should be able to. I'm therefore -1 on making them private.

    Sorry, I mean PEP-384, not PEP-393. Whether PEP-393 will turn out to be a workable solution has yet to be seen, but that's a different subject. In any case, Py_UNICODE and access macros for PyUnicodeObject are in wide-spread use, so trying to hide them won't work until we reach Py4k.

    Regarding separating adding the various surrogate macros and the next-macros: I don't see a problem with adding both in Python 3.3.

    vstinner commented 12 years ago

    (oops, msg142225 was for issue bpo-12326)

    abalkin commented 12 years ago

    The code review links point to something weird. Victor, can you upload your patch for review?

    My first impression is that your patch does not accomplish much beyond replacing some literal expressions with macros. What I wanted to achieve with this issue was to enable writing code without #ifdef Py_UNICODE_WIDE branches. In your patch these branches seem to still be there and in fact it appears that new code is longer than the old one (I am not sure why, but I see more '+' than '-'s in your patch.)
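    For readers unfamiliar with the proposal, a hedged sketch of the Py_UNICODE_NEXT idea follows (this is not the definition from unicode-next.diff; the types and helper macros are stand-ins simulating a narrow, 16-bit build): the surrogate handling lives inside the macro, so call sites need no #ifdef Py_UNICODE_WIDE branches of their own.

        #include <stdint.h>
        #include <stdio.h>

        typedef uint16_t Py_UNICODE;   /* narrow-build stand-in */
        typedef uint32_t Py_UCS4;

        #define IS_HIGH_SURROGATE(ch)  (((ch) & 0xFC00) == 0xD800)
        #define IS_LOW_SURROGATE(ch)   (((ch) & 0xFC00) == 0xDC00)
        #define JOIN_SURROGATES(hi, lo) \
            (((((Py_UCS4)(hi) & 0x3FF) << 10) | ((Py_UCS4)(lo) & 0x3FF)) + 0x10000)

        /* Read one code point and advance ptr by one or two code units. */
        #define UNICODE_NEXT(ptr, end)                                  \
            (IS_HIGH_SURROGATE(*(ptr)) && (ptr) + 1 < (end) &&          \
             IS_LOW_SURROGATE((ptr)[1])                                 \
                ? ((ptr) += 2, JOIN_SURROGATES((ptr)[-2], (ptr)[-1]))   \
                : (Py_UCS4)*(ptr)++)

        int main(void)
        {
            Py_UNICODE buf[] = { 0x0041, 0xD83D, 0xDE00, 0x0042 };   /* "A", U+1F600, "B" */
            const Py_UNICODE *p = buf, *end = buf + 4;
            while (p < end)                       /* no #ifdef at the call site */
                printf("U+%04X\n", (unsigned)UNICODE_NEXT(p, end));
            return 0;
        }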

    vstinner commented 12 years ago

    The code review links point to something weird.

    That's because I posted a patch for another issue. It's the patch set 5, not the patch set 6 :-)

    Direct link: http://bugs.python.org/review/10542/patch/3174/9874

    My first impression is that your patch does not accomplish much beyond replacing some literal expressions with macros.

    Yes, and it avoids the duplication of some code patterns, as explained in my message. I would like to avoid constants in the code. Some macros are *a little bit* faster than the current code.

    What I wanted to achieve with this issue was to enable writing code without #ifdef Py_UNICODE_WIDE branches.

    Yes, and I think that it's better to split this issue in two steps:

    1- add macros for the surrogates (test, join, ...)
    2- Py_UNICODE_NEXT()

    In your patch these branches seem to still be there and in fact it appears that new code is longer than the old one

    Yes, the code adds more lines than it removes. Is it a problem? My goal is to have more readable code (easier to maintain).

    ezio-melotti commented 12 years ago

    As I said in msg142175 I think the Py_UNICODE_IS_{HIGH|LOW|}_SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without trailing _ in 3.3 and with trailing _ in 2.7/3.2. They should go in unicodeobject.h and be public in 3.3+.

    Regarding the name, it would be fine with me to use PyUNICODE_IS_HIGH_SURROGATE. Other IS* macros don't use spaces, but JOIN_SURROGATES and other proposed macros (e.g. PUT_NEXT/WRITE_NEXT) do. Also these macros are not related to any existing API like e.g. isalpha. I think HIGH/LOW are fine, we can mention lead/trail in the doc.

    Regarding the implementation, we could use Victor's one if it's faster and it has no other side effects.

    Regarding the other macros:

    • _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in. They can be private in all the 3 branches and made public in 3.4 if they work well;

    • IS_NONBMP doesn't simplify much the code but makes it more readable. ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;

    • I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT. If they are they should get a better name because the current one is not clear about what they do.

    Unless someone disagrees I'll prepare a patch with PyUNICODEIS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using Victor's implementation, and commit it (after a review).

    We can think about the rest later.

    vstinner commented 12 years ago

    On 17/08/2011 07:04, Ezio Melotti wrote:

    As I said in msg142175 I think the Py_UNICODE_IS_{HIGH|LOW|}_SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without trailing _ in 3.3 and with trailing _ in 2.7/3.2. They should go in unicodeobject.h

    For Python 2.7 and 3.2, I would prefer to not touch a public header, and so add the macros in unicodeobject.c.

    and be public in 3.3+.

    If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will have to subtract 0x10000 themselves (whereas my macros require the ordinal to be preprocessed).

    • _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in. They can be private in all the 3 branches and made public in 3.4 if they work well;

    Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

    • IS_NONBMP doesn't simplify much the code but makes it more readable. ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;

    If you want to make it public, it's better to call it PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

    • I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT. If they are they should get a better name because the current one is not clear about what they do.

    They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in unicodeobject.c.

    Unless someone disagrees I'll prepare a patch with PyUNICODEIS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).

    Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not Py_UNICODE_JOIN_SURROGATES). I used the verb "combine", taken from a comment in unicodeobject.c. "combine" is maybe better than "join"?

    malemburg commented 12 years ago

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    On 17/08/2011 07:04, Ezio Melotti wrote:
    > As I said in msg142175 I think the Py_UNICODE_IS_{HIGH|LOW|}_SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without trailing _ in 3.3 and with trailing _ in 2.7/3.2. They should go in unicodeobject.h

    Ezio used two different naming schemes in his email. Please always use PyUNICODE... or _PyUNICODE (not PyUNICODE or _PyUNICODE_).

    For Python 2.7 and 3.2, I would prefer to not touch a public header, and so add the macros in unicodeobject.c.

    Why would you want to touch Python 2.7 at all ?

    > and be public in 3.3+.

    If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will use to substract 0x10000 themself (whereas my macros require the ordinal to be preproceed).

    This can be done by having two definitions of the macros: one set for UCS2 builds and one for UCS4.

    > * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in. They can be private in all the 3 branches and made public in 3.4 if they work well;

    Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

    Certainly not into Python 2.7. Adding macros in patch level releases is also not such a good idea.

    > * IS_NONBMP doesn't simplify much the code but makes it more readable. ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;

    If you want to make it public, it's better to call it PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

    Py_UNICODE_IS_BMP() please.

    > * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT. If they are they should get a better name because the current one is not clear about what they do.

    They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in unicodeobject.c.

    > Unless someone disagrees I'll prepare a patch with PyUNICODEIS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).

    Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not Py_UNICODE_JOIN_SURROGATES). I used the verb "combine", taken from a comment in unicodeobject.c. "combine" is maybe better than "join"?

    No, PyUNICODE... please !

    Thanks, -- Marc-Andre Lemburg eGenix.com



    ezio-melotti commented 12 years ago

    For Python 2.7 and 3.2, I would prefer to not touch a public header, and so add the macros in unicodeobject.c.

    Is there some reason for this? I think it's better if we have them in the same place rather than renaming and moving them in another file between 3.2 and 3.3.

    If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will use to substract 0x10000 themself (whereas my macros require the ordinal to be preproceed).

    If they turn out to be useful and we find a clearer name we can even make them public in 3.3, but we'll have to see about that.

    Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

    If they don't it won't be possible to fix bpo-9200 in those branches (unless we decide that the bug shouldn't be fixed there, but I would rather fix it).

    If you want to make it public, it's better to call it PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

    Yes, public APIs will follow the naming conventions. Not sure if it's better to check if it's a BMP char, or if it's not.

    They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in unicodeobject.c.

    What are the naming convention for private macros in the same .c file where they are used? Shouldn't they get at least a trailing _?

    Unless someone disagrees I'll prepare a patch with PyUNICODEIS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).

    Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not Py_UNICODE_JOIN_SURROGATES).

    All the other macros use PyUNICODE_*.

    I used the verb "combine", taken from a comment in unicodeobject.c. "combine" is maybe better than "join"?

    I like join, it's clear enough and shorter.

    vstinner commented 12 years ago

    Ah yes, the correct prefix for functions working on Py_UNICODE characters/strings is "Py_UNICODE", not "PyUNICODE", sorry.

    > For Python 2.7 and 3.2, I would prefer to not touch a public header,
    > and so add the macros in unicodeobject.c.

    Is there some reason for this?

    We don't add new features to stable releases.

    > If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros
    > public, they will have to subtract 0x10000 themselves (whereas my
    > macros require the ordinal to be preprocessed).

    If they turn out to be useful and we find a clearer name we can even make them public in 3.3, but we'll have to see about that.

    I don't think that they are useful outside unicodeobject.c.

    > Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

    If they don't it won't be possible to fix bpo-9200 in those branches

    I don't think that bpo-9200 is a bug, but more a feature request.

    Not sure if it's better to check if it's a BMP char, or if it's not.

    I prefer a shorter name and avoiding double negation: !Py_UNICODE_IS_NON_BMP(ch).

    What are the naming convention for private macros in the same .c file where they are used?

    Hopefully, there is no convention for private macros :-)

    Shouldn't they get at least a trailing _?

    Nope.

    ezio-melotti commented 12 years ago

    Ezio used two different naming schemes in his email. Please always use PyUNICODE... or _PyUNICODE (not PyUNICODE or _PyUNICODE_).

    Indeed, that was a typo + copy/paste. I meant to say PyUNICODE and _PyUNICODE. Sorry about the confusion.

    Why would you want to touch Python 2.7 at all ? [...] Certainly not into Python 2.7. Adding macros in patch level releases is also not such a good idea.

    Because it has the bug and we can fix it (the macros will be private so that we don't add any feature). Also what about 3.2? Are you saying that we should fix the bug in 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be fixed in all the bug-fix releases (i.e. 2.7/3.2)? My idea is to fix the bug in 2.7/3.2/3.3 using the macros, but only make them public in 3.3 so that new features are exposed only in 3.3.

    malemburg commented 12 years ago

    Ezio Melotti wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    > Ezio used two different naming schemes in his email. Please always
    > use PyUNICODE... or _PyUNICODE (not PyUNICODE or _PyUNICODE_).

    Indeed, that was a typo + copy/paste. I meant to say PyUNICODE and _PyUNICODE. Sorry about the confusion.

    Good :-)

    > Why would you want to touch Python 2.7 at all ?
    > [...]
    > Certainly not into Python 2.7. Adding macros in patch level releases
    > is also not such a good idea.

    Because it has the bug and we can fix it (the macros will be private so that we don't add any feature). Also what about 3.2? Are you saying that we should fix the bug in 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be fixed in all the bug-fix releases (i.e. 2.7/3.2)? My idea is to fix the bug in 2.7/3.2/3.3 using the macros, but only make them public in 3.3 so that new features are exposed only in 3.3.

    For bug fixes, you can put the macros straight into unicodeobject.c, but please leave unicodeobject.h untouched - otherwise people will mess around with these macros (even if they are private) and users will start to wonder about linker errors if they use old patch level releases of Python 2.7/3.2.

    Also note that some of these macros change the behavior of Python - that's good if it fixes a bug (obviously :-)), but bad if it changes areas that are correctly implemented and then suddenly expose new behavior.

    ezio-melotti commented 12 years ago

    For bug fixes, you can put the macros straight into unicodeobject.c, but please leave unicodeobject.h untouched - otherwise people will mess around with these macros (even if they are private) and users will start to wonder about linker errors if they use old patch level releases of Python 2.7/3.2.

    OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them in unicodeobject.c.

    Regarding the name, other macros in unicodeobject.c don't have any prefix, so we can do the same (e.g. IS_SURROGATE) for 2.7/3.2 if that's fine.

    Also note that some of these macros change the behavior of Python

    • that's good if it fixes a bug (obviously :-)), but bad if it changes areas that are correctly implemented and then suddenly expose new behavior.

    After this we can fix bpo-9200 and make narrow builds behave correctly (i.e. like wide ones) with non-BMP chars (at least in some places).

    malemburg commented 12 years ago

    Ezio Melotti wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    > For bug fixes, you can put the macros straight into unicodeobject.c,
    > but please leave unicodeobject.h untouched - otherwise people will
    > mess around with these macros (even if they are private) and users
    > will start to wonder about linker errors if they use old patch
    > level releases of Python 2.7/3.2.

    OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them in unicodeobject.c.

    Regarding the name, other macros in unicodeobject.c don't have any prefix, so we can do the same (e.g. IS_SURROGATE) for 2.7/3.2 if that's fine.

    Sure.

    > Also note that some of these macros change the behavior of Python
    > - that's good if it fixes a bug (obviously :-)), but bad if it
    > changes areas that are correctly implemented and then suddenly expose
    > new behavior.

    After this we can fix bpo-9200 and make narrow builds behave correctly (i.e. like wide ones) with non-BMP chars (at least in some places).

    Ok.

    ericvsmith commented 12 years ago

    On 8/17/2011 6:30 AM, Ezio Melotti wrote:

    OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them in unicodeobject.c.

    I believe the second file should be unicodeobject.h, correct?

    ezio-melotti commented 12 years ago

    Correct.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

    Also what about 3.2? Are you saying that we should fix the bug in 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be fixed in all the bug-fix releases (i.e. 2.7/3.2)?

    Notice that the macros themselves don't fix any bugs. As for the bugs you apparently want to fix using these macros: they should be considered on a case-by-case basis. Some of your planned bug fixes may introduce incompatibilities that rule out fixing them.

    vstinner commented 12 years ago

    OK, so in 2.7/3.2 I'll put them in unicodeobject.c

    It looks like bpo-9200 only needs Py_UNICODE_NEXT, which can be implemented without the other Py_UNICODE surrogate macros.

    ezio-melotti commented 12 years ago

    I attached a patch to fix the str.is* methods on bpo-9200 that also includes the macro.

    Since they are not public there, I don't see a reason to do 2 separate commits on 2.7/3.2 (one for the feature and one for the fix).

    ezio-melotti commented 12 years ago

    The attached patch adds the following 4 public macros to unicodeobject.h and documents them:
    Py_UNICODE_IS_SURROGATE(ch)
    Py_UNICODE_IS_HIGH_SURROGATE(ch)
    Py_UNICODE_IS_LOW_SURROGATE(ch)
    Py_UNICODE_JOIN_SURROGATES(high, low)

    Since _Py_UNICODE_NEXT is still private, I'll commit it later as part of bpo-9200.
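    For reference, a short sketch of how the four public macros described above might be used from extension code compiled against the 3.3 headers; the surrounding function is hypothetical, only the macro names and signatures come from the attached patch.

        #include <Python.h>

        /* Hypothetical helper (assumes len >= 1): return the first code point
           of a Py_UNICODE buffer, joining a well-formed surrogate pair on
           narrow builds and substituting U+FFFD for a lone surrogate. */
        static Py_UCS4
        first_code_point(const Py_UNICODE *p, Py_ssize_t len)
        {
            if (len >= 2 && Py_UNICODE_IS_HIGH_SURROGATE(p[0])
                         && Py_UNICODE_IS_LOW_SURROGATE(p[1]))
                return Py_UNICODE_JOIN_SURROGATES(p[0], p[1]);
            if (Py_UNICODE_IS_SURROGATE(p[0]))
                return 0xFFFD;
            return (Py_UCS4)p[0];
        }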

    malemburg commented 12 years ago

    Ezio Melotti wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    The attached patch adds the following 4 public macros to unicodeobjects.h: Py_UNICODE_IS_SURROGATE(ch) Py_UNICODE_IS_HIGH_SURROGATE(ch) Py_UNICODE_IS_LOW_SURROGATE(ch) Py_UNICODE_JOIN_SURROGATES(high, low) and documents them.

    Since _Py_UNICODE_NEXT is still private, I'll commit it later as part as bpo-9200.

    Looks good.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 12 years ago

    New changeset 77171f993bf2 by Ezio Melotti in branch 'default': bpo-10542: Add 4 macros to work with surrogates: Py_UNICODE_IS_SURROGATE, Py_UNICODE_IS_HIGH_SURROGATE, Py_UNICODE_IS_LOW_SURROGATE, Py_UNICODE_JOIN_SURROGATES. http://hg.python.org/cpython/rev/77171f993bf2

    vstinner commented 12 years ago

    PEP-393 has been accepted and merged into Python 3.3. Python 3.3 doesn't need the Py_UNICODE_NEXT macro anymore. But my macros (unicode_macros.patch) are still useful.

    ezio-melotti commented 12 years ago

    Py_UNICODE_NEXT has been removed from 3.3 but it's still available and used in 2.7/3.2 (even if it's private). In order to fix bpo-10521 on 2.7/3.2 the _Py_UNICODE_PUT_NEXT macro attached to this patch is required.

    benjaminp commented 12 years ago

    Closing now.