Py_UNICODE_NEXT and other macros for surrogates

abalkin commented 13 years ago

BPO	10542
Nosy	@malemburg, @loewis, @doerwalter, @birkenfeld, @rhettinger, @amauryfa, @abalkin, @pitrou, @vstinner, @ericvsmith, @benjaminp, @ezio-melotti
Files	unicode-next.diff issue10542-put-next.diff issue10542.diff issue10542a.diff unicode_macros.patch issue10542b.diff: Patch against 3.3

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/abalkin' closed_at = created_at = labels = ['extension-modules', 'interpreter-core', 'type-feature', 'expert-unicode'] title = 'Py_UNICODE_NEXT and other macros for surrogates' updated_at = user = 'https://github.com/abalkin' ``` bugs.python.org fields: ```python activity = actor = 'benjamin.peterson' assignee = 'belopolsky' closed = True closed_date = closer = 'benjamin.peterson' components = ['Extension Modules', 'Interpreter Core', 'Unicode'] creation = creator = 'belopolsky' dependencies = [] files = ['19825', '19845', '20186', '20190', '22915', '23000'] hgrepos = [] issue_num = 10542 keywords = ['patch'] message_count = 88.0 messages = ['122464', '122489', '122490', '122492', '122494', '122495', '122497', '122500', '122501', '122502', '122503', '122504', '122518', '122564', '122567', '122568', '122571', '122573', '122578', '122588', '122591', '122592', '122594', '122595', '123283', '123290', '123569', '123570', '123757', '124174', '124839', '124842', '124849', '124852', '124854', '124856', '124860', '124864', '124866', '124868', '124869', '124874', '124883', '124897', '124902', '124903', '124910', '124911', '124914', '142117', '142133', '142134', '142173', '142175', '142177', '142178', '142183', '142184', '142185', '142187', '142188', '142189', '142222', '142223', '142224', '142227', '142230', '142231', '142253', '142256', '142258', '142259', '142260', '142261', '142262', '142263', '142265', '142267', '142268', '142269', '142270', '142317', '142731', '142732', '142735', '144629', '144631', '150692'] nosy_count = 16.0 nosy_names = ['lemburg', 'loewis', 'doerwalter', 'georg.brandl', 'rhettinger', 'amaury.forgeotdarc', 'belopolsky', 'Rhamphoryncus', 'pitrou', 'vstinner', 'eric.smith', 'benjamin.peterson', 'stutzbach', 'ezio.melotti', 'python-dev', 'tchrist'] pr_nums = [] priority = 'normal' resolution = 'out of date' stage = 'commit review' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue10542' versions = ['Python 3.3'] ```

abalkin commented 13 years ago

As discussed in bpo-10521 and the sprawling "len(chr(i)) = 2?" thread [1] on python-dev, many functions in python library behave differently on narrow and wide builds. While there are unavoidable differences such as the length of strings with non-BMP characters, many functions can work around these differences. For example, the ord() function already produces integers over 0xFFFF when given a surrogate pair as a string of length two on a narrow build. Other functions such as str.isalpha(), are not yet aware of surrogates. See also bpo-9200.

A consensus is developing that non-BMP characters support on narrow builds is here to stay and that naive functions should be fixed. Unfortunately, working with surrogates in python code is tricky because unicode C-API does not provide much support and existing examples of surrogate processing look like this:

while (u != uend && w != wend) {
if (0xD800 \<= u[0] && u[0] \<= 0xDBFF
&& 0xDC00 \<= u[1] && u[1] \<= 0xDFFF)
{
*w = (((u[0] & 0x3FF) \<\< 10) | (u[1] & 0x3FF)) + 0x10000;
u += 2;
}
else {
*w = *u;
u++;
}
w++;
}

The attached patch introduces a Py_UNICODE_NEXT() macro that allows replacing the code above with two lines:

+ while (u != uend && w != wend) + *w++ = Py_UNICODE_NEXT(u, uend);

The patch also introduces a set of macros for manipulating the surrogates, but I have not started replacing more instances of verbose surrogate processing because I would like to first look for higher level abstractions such as Py_UNICODE_NEXT(). For example, there are many instances that can benefit from Py_UNICODE_PUT_NEXT(ptr, ch) macro that would put a UCS4 character ch into Py_UNICODE buffer pointed by ptr and advance ptr by 1 or 2 units as necessary.

[1] http://mail.python.org/pipermail/python-dev/2010-November/105908.html

ericvsmith commented 13 years ago

In addition to the proposed Py_UNICODE_NEXT and Py_UNICODE_PUTNEXT, str.\_format__ would also need a function that tells it how many Py_UNICODEs are needed to store a given Py_UCS4.

abalkin commented 13 years ago

On Fri, Nov 26, 2010 at 7:27 PM, Eric Smith \report@bugs.python.org\ wrote: ..

In addition to the proposed Py_UNICODE_NEXT and Py_UNICODE_PUTNEXT, > str.\_format__ would also need a function that tells it how many Py_UNICODEs are needed to store a given Py_UCS4.

Yes, this functionality is currently hidden in

unicode_aswidechar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size):

/* Helper function for PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(): convert a Unicode object to a wide character string.

If w is NULL: return the number of wide characters (including the nul character) required to convert the unicode object. Ignore size argument. .. */

and I believe is reimplemented in a few other places.

ericvsmith commented 13 years ago

I'd need access to this without having to build a PyUnicodeObject, for efficiency. But it sounds like it does have the basic functionality I need.

For my use I'd really need it to take the result of Py_UNICODE_NEXT. Something like:
Py_ssize_t
Py_UNICODE_NUM_NEEDED(Py_UCS4 c)
and it would always return 1 or 2. Always 1 for a wide build, and for a narrow build 1 if c is in the BMP else 2. Choose a better name, of course.

abalkin commented 13 years ago

On Fri, Nov 26, 2010 at 7:45 PM, Eric Smith \report@bugs.python.org\ wrote: ..

For my use I'd really need it to take the result of Py_UNICODE_NEXT. Something like: Py_ssize_t Py_UNICODE_NUM_NEEDED(Py_UCS4 c) and it would always return 1 or 2. Always 1 for a wide build, and for a narrow build 1 if c is in the BMP else 2. Choose a better name, of course.

Can you describe your use case in more detail? Would Py_UNICODE_PUT_NEXT() combined with Py_UNICODE_CODEPOINT_COUNT(Py_UNICODE *begin, Py_UNICODE *end) solve it?

vstinner commented 13 years ago

I don't like macro having a result and using multiple instructions using the evil magic trick (the ","). It's harder to maintain the code and harder to debug than a classical function.

Don't you think that modern compilers are able to inline the code? (If not, we may add the right C attribute/keyword)

ericvsmith commented 13 years ago

The code will basically be:

  Py_UCS4 fill;

parse_format_string(fmt, ..., &fill, ...);

/ lots more code \/

  if (fill_needed) {
    /* compute how many characters to reserve */
    space_needed = Py_UNICODE_NUM_NEEDED(fill) *
                      number_of_characters_to_fill;
  }

It would be most convenient (and require the fewest changes) if the computation could just use fill, instead of remembering the pointers to the beginning and end of fill.

Py_UNICODE_CODEPOINT_COUNT could be implemented with a primitive that does what I want.

abalkin commented 13 years ago

On Fri, Nov 26, 2010 at 8:41 PM, STINNER Victor \report@bugs.python.org\ wrote: ..

I don't like macro having a result and using multiple instructions using the evil magic trick (the ","). It's harder to maintain the code and harder to debug than a classical function.

You are preaching to the choir. In fact, my first version (bpo-10521-unicode-next.diff attached to bpo-10521) used a function. I would not worry about implementation at this point, though. Let's find the best abstraction first.

Don't you think that modern compilers are able to inline the code? (If not, we may add the right C attribute/keyword)

Not in C. In C++, I could use a reference to the pointer incremented by the macro, but in C, I have to use an address. Once you take an address of a variable, the compiler will refuse to put it in a register. So no, I don't think we can write an ANSI C function that will be as efficient as the macro.

ericvsmith commented 13 years ago

The compiler's decision to inline something should not be related to its ability to put variables in a register.

But I definitely agree that we should get the abstraction right first and worry about the implementation later.

abalkin commented 13 years ago

On Fri, Nov 26, 2010 at 9:22 PM, Eric Smith \report@bugs.python.org\ wrote: ..

But I definitely agree that we should get the abstraction right first and worry about the implementation later.

I am fairly happy with Py_UNICODE_NEXT() abstraction. It's semantics should be natural for users familiar with python iterators and the fact that it expands to simply *ptr++ on wide builds makes it easy to explain its usage. I am note very happy about the end argument for the following reasons:

Builtin "next()" takes the default value as a second argument. Extension writers may expect the same from Py_UNICODE_NEXT(). The name "end" should be self-explainatory though, especially to those with an exposure to STL.
If Py_UNICODE_NEXT() stays as a macro, an innocent looking Py_UNICODE_NEXT(p, p + size) will have a hard to detect bug. Can be fixed by making Py_UNICODE_NEXT() a function.

I wonder whether it is best to prefix the new macros with an underscore. On one hand, we want to make this available to extension writers, on the other hand, once more people start dealing with non-BMP issues, a better abstraction may be found and we man not want to maintain Py_UNICODE_NEXT() indefinitely.

abalkin commented 13 years ago

Raymond,

I wonder if you would like to comment on the iterator analogy and/or on adding public names to C API.

rhettinger commented 13 years ago

Mark, can you opine on this?

malemburg commented 13 years ago

Raymond Hettinger wrote:

Raymond Hettinger \rhettinger@users.sourceforge.net\ added the comment:

Mark, can you opine on this?

Yes, I'll have a look later today.

malemburg commented 13 years ago

I like the idea and thanks for putting work into this.

Some comments:

when using macro variables, always put the variables in parens in the expansion; this avoids precedence issues, weird syntax errors, etc. - even if it may not be necessary
a function would be cleaner, but since this code is very performance sensitive, I'd opt for the macro version, unless someone can prove that a function would be just as fast in benchmarks
the macros should be documented in the unicodeobject.h header file and clearly mention that ptr and end should be side-effect free and that ptr must an lvalue
please use the faster bitmask operators for joining surrogates, i.e. ucs4 = ((((high & 0x03FF) \<\< 10) | (low & 0x03FF)) + 0x00010000);
the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()
same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

 * this version should be slightly faster and is also easier to read:

#define Py_UCS4_READ_CODE_POINT(ptr, end) \
    ((Py_UNICODE_ISHIGHSURROGATE((ptr)[0]) && \
      (ptr) < (end) && \
      Py_UNICODE_ISLOWSURROGATE((ptr)[1])) ? \
      Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
      (Py_UCS4)*(ptr)++)

I haven't tested it, but you get the idea.

BTW: You only focus on UCS2 builds. Please also make sure that these changes work on UCS4 builds, e.g. Py_UCS2_READ_CODE_POINT() will also work on UCS4 builds and join code points there.

Note that UCS4 builds currently don't join surrogates, so a high and low surrogates appear as two code points, which they are, but given the experience with UCS2 builds, may not be what the user expects. So for the purpose of consistency we should be careful with auto-joining surrogates in UCS2.

It does make sence for ord() and the various string methods, but should be done with care in other cases.

In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.

abalkin commented 13 years ago

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg \report@bugs.python.org\ wrote: .. [I'll respond to skipped when I update the patch]

In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.

It may be best to start with _Py_UCS2_READ_CODE_POINT() (BTW, I like the name because it naturally lead to Py_UCS2_WRITE_CODE_POINT() counterpart.) The leading underscore will probably not stop early adopters from using it and we may get some user feedback if they ask to make these macros public.

abalkin commented 13 years ago

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg \report@bugs.python.org\ wrote: ..

* same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()

* in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

I am not sure PyUCS4 prefix is right here. (I agree on *SURROGATE* methods.) The point of Py_UNICODE_NEXT(ptr, end) is that the pointers ptr and end are Py_UNICODE and the macro expands to \p++ on wide builds. Maybe Py_UNICODE_NEXT_USC4?

malemburg commented 13 years ago

Alexander Belopolsky wrote:

Alexander Belopolsky \belopolsky@users.sourceforge.net\ added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg \report@bugs.python.org\ wrote: .. > same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT() > > in order to make the macro easier to understand, please rename it to > Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still > a lot less than without the macro :-)

I am not sure PyUCS4 prefix is right here. (I agree on *SURROGATE* methods.) The point of Py_UNICODE_NEXT(ptr, end) is that the pointers ptr and end are Py_UNICODE and the macro expands to \p++ on wide builds. Maybe Py_UNICODE_NEXT_USC4?

The idea is that the first part refers to what the macro returns (Py_UCS4) and the "read" part of the name refers to moving a pointer across an array (any array of integers).

Note that the macro can also work on Py_UCS4 arrays (even in UCS2 builds), so it's universal in that respect.

Perhaps we should allow ord() to work on surrogates in UCS4 builds as well. That would reduce the number of surprises.

ericvsmith commented 13 years ago

The idea is that the first part refers to what the macro returns (Py_UCS4) and the "read" part of the name refers to moving a pointer across an array (any array of integers).

I thought the first part generally meant the type of the first parameter. Although I can go either way, especially if we add an underscore.

ezio-melotti commented 13 years ago

the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()

same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()

I'm not so familiar with the prefix conventions, but wouldn't that lead users to think that this macro is for wide builds and that they have to use PyUCS2* macros for narrow builds? If these macros are supposed to abstract the build type maybe they should have a "neutral" prefix. (But if the conventions we use say otherwise I guess the best we can do is to document it properly).

in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

The term code point is not entirely correct here. High and low surrogates are code points too. The right term should be 'scalar value' (but that might be confusing). The 'READ' bit sounds fine though, maybe 'READ_NEXT'?

abalkin commented 13 years ago

On Sat, Nov 27, 2010 at 5:41 PM, Ezio Melotti \report@bugs.python.org\ wrote:

Ezio Melotti \ezio.melotti@gmail.com\ added the comment:

> the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES() > same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()

I'm not so familiar with the prefix conventions, but wouldn't that lead users to think that this macro is for wide builds and that they have to use PyUCS2* macros for narrow builds? If these macros are supposed to abstract the build type maybe they should have a "neutral" prefix. (But if the conventions we use say otherwise I guess the best we can do is to document it properly).

When I was using the name, I did not think about argument type. PyUNICODE is just the namespace prefix used by all macros in unicodeobject.h. Case in point: Py_UNICODE_ISALPHA() and family that take Py_UCS4. (I know, there is a historical reason at work here, but why fight it?)

Functions use PyUnicode prefix and build specific functions use PyUnicodeUCSx prefix. As far as I can tell, there are no macros with PyUCS4 prefix. The choices I like in the order of preference are

Py_UNICODE_NEXT
Py_UNICODE_NEXT_UCS4
Py_UNICODE_READ_NEXT_UCS4

I can live with anything else, though.

rhettinger commented 13 years ago

I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.

pitrou commented 13 years ago

I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.

You can't use the iterator protocol on a non-PyObject, and PyUNICODE (as opposed to PyUnicode_) suggests the macro operates on a raw array of code points.

ezio-melotti commented 13 years ago

AFAIU the macro returns lone surrogates as they are, this means that: 1) if the string contains only surrogate pairs, Py_UNICODE_NEXT will iterate on scalar values0; 2) if the string contains only lone surrogates, it will iterate on codepoints1; 3) if it contains both it will be half and half (i.e. scalar values if the surrogates are in pair, or falling back on codepoints if they aren't); (for strings without surrogates, iterating on scalar values or codepoints is the same).

Is this semantic correct for all (or at least most of) the places where the macro will be used? Would a stricter version (that rejects lone surrogates and iterates on scalar values only) be useful in addition or in alternative to Py_UNICODE_NEXT?

abalkin commented 13 years ago

I am attaching a patch that defines Py_UNICODE_PUT_NEXT() macro (tentative name) and uses it to fix str.upper method. The implementation of surrogate-aware str.upper shows that NEXT/PUT_NEXT abstractions may lead to somewhat inefficient code for "by codepoint" processing. The issue is that once in in the process of reading the codepoint, it is determined whether the code point is BMP or non-BMP. Testing the result again in order to write it is somewhat wasteful. I don't think this would matter in practice, but would like to hear alternative opinions before moving further. (Please, don't argue over names - let's figure out the proper semantics first.)

abalkin commented 13 years ago

On Sat, Nov 27, 2010 at 6:38 PM, Raymond Hettinger \report@bugs.python.org\ wrote: ..

I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.

As a data point, ICU defines U16_NEXT() for similar purpose. I also like ICU terminology for surrogates ("lead" and "trail") better than the backward "high" and "low". The U16_APPEND() suggests Py_UNICODE_APPEND instead of PUT_NEXT (this one has a virtue of not having "next" in the name as well.) I still like NEXT better than ADVANCE because it is shorter and has an obvious PREV counterpart that we may want to add later.

Note that ICU uses U16_ prefix for these macros even when they operate on 32-bit characters.

More at

http://icu-project.org/apiref/icu4c/utf16_8h.html http://userguide.icu-project.org/strings

malemburg commented 13 years ago

Alexander Belopolsky wrote:

Alexander Belopolsky \belopolsky@users.sourceforge.net\ added the comment:

On Sat, Nov 27, 2010 at 6:38 PM, Raymond Hettinger \report@bugs.python.org\ wrote: .. > I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used. >

As a data point, ICU defines U16_NEXT() for similar purpose. I also like ICU terminology for surrogates ("lead" and "trail") better than the backward "high" and "low".

"High" and "low" are Unicode standard terms, so we should use those.

Regarding Py_UCS4_READ_CODE_POINT: you're right that surrogates are code points, so how about Py_UCS4_READ_NEXT() ?!

Regarding Py_UCS4_READ_NEXT() vs. Py_UNICODE_READ_NEXT(): the return value of the macro is a Py_UCS4 value, not a Py_UNICODE value. The first argument of the macro can be any array, not just Py_UNICODE, but also Py_UCS4 or even int*.

Py_UCS2_READ_NEXT() would be plain wrong :-) Also note that Python does have a Py_UCS4 type; it doesn't have a Py_UCS2 type.

That's why we should use *Py_UCS4*_READ_NEXT().

abalkin commented 13 years ago

Daniel,

While these macros should not affect ABI, I would appreciate your feedback in light of your work on bpo-8654.

cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 13 years ago

+1 on the general idea of abstracting out repeated code.

I will take a closer look at the details within the next few days.

cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 13 years ago

In bltinmodule.c, it looks like some of the indentation doesn't line up?

Bikeshedding aside, it looks good to me.

I agree with Eric Smith that the first part macro name usually refers to the type of the first argument (or the type the first argument points to). Examples:

Py_UNICODE_ISSPACE : Py_UNICODE -> int
Py_UNICODE_TOLOWER : Py_UNICODE -> Py_UNICODE
Py_UNICODE_strlen: Py_UNICODE * -> size_t

This is true elsewhere in the code as well:

PyList_GET_SIZE : PyListObject * -> Py_ssize_t

Yes, I know there are some unfortunate exceptions. ;-)

I agree that it would be nice if something in the name hinted that the return type was Py_UCS4 though.

Marc-Andre Lemburg wrote:

The first argument of the macro can be any array, not just Py_UNICODE, but also Py_UCS4 or even int*.

It's true that macros in C do not have any type safety. While technically passing a Py_UCS4 * will work, on a UCS2 build it would needlessly check the Py_UCS4 data for surrogates. I think we should discourage that.

You can also technically pass a PyListObject * to PyTuple_GET_SIZE, but that's also not a good idea. ;-)

Alexander Belopolsky wrote:

The issue is that once in in the process of reading the codepoint, it is determined whether the code point is BMP or non-BMP. Testing the result again in order to write it is somewhat wasteful. I don't think this would matter in practice, but would like to hear alternative opinions before moving further.

If the common pattern is:

         ch = Py_UNICODE_NEXT(rp, end);
         uc = Py_UNICODE_SOME_TRANSFORMATION(ch);
         Py_UNICODE_PUT_NEXT(wp, uc);

The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless you can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform characters \< 0x10000 into characters > 0x10000 or vice versa.

Can we prove will always be the case, for current and future versions of Unicode, for all or almost-all of the transformations we care about?

Answering that question and figuring out what to do about it are probably more trouble than it's worth. If a particularly point proves to be a bottleneck, we can always specialize the code there later.

abalkin commented 13 years ago

On Fri, Dec 10, 2010 at 6:09 PM, Daniel Stutzbach \report@bugs.python.org\ wrote: ..

The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless you can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform characters \< 0x10000 into characters > 0x10000 or vice versa.

Can we prove will always be the case, for current and future versions of Unicode, for all or almost-all of the transformations we care about?

Certainly not for all, but for some important transformations, I believe Unicode Standard does promise that the transformation maps BMP to BMP and supplements to supplements. For example case folding and normalization are two important examples.

Answering that question and figuring out what to do about it are probably more trouble than it's worth. If a particularly point proves to be a bottleneck, we can always specialize the code there later.

Agree. It is even more likely that the applications that have to deal with lots of supplementary characters will be better off using a wide unicode build rather than such specialization.

abalkin commented 13 years ago

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg \report@bugs.python.org\ wrote: ..

* this version should be slightly faster and is also easier to read:

define Py_UCS4_READ_CODE_POINT(ptr, end) \

.. Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \ .. I haven't tested it, but you get the idea.

I don't think C guarantees the order of evaluation of the operands in bitwise expressions such as the expansion of the JOIN_SURROGATES macro.

abalkin commented 13 years ago

I am attaching a patch for commit review. I added an underscore prefix to all new macros. This way I am not introducing new features and we will have a full release cycle to come up with better names. i would just note that "next" terminology is consistent with PyDict_Next and _PySet_NextEntry. The latter suggests that Py_UNICODE_NEXT_UCS4 may be a better choice.

malemburg commented 13 years ago

Alexander Belopolsky wrote:

Alexander Belopolsky \belopolsky@users.sourceforge.net\ added the comment:

I am attaching a patch for commit review. I added an underscore prefix to all new macros. This way I am not introducing new features and we will have a full release cycle to come up with better names. i would just note that "next" terminology is consistent with PyDict_Next and _PySet_NextEntry. The latter suggests that Py_UNICODE_NEXT_UCS4 may be a better choice.

I don't think this should go into 3.2. The macros have the potential of subtly changing Python semantics when used in places that previously did not support auto-joining surrogates. Let's wait for 3.3 with the change.

Some comments:

The macros still need some more attention to enhance their performance.
For consistency, I'd choose names Py_UNICODE_READ_NEXT() and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and Py_UNICODE_PUT_NEXT().
Py_UNICODE_JOIN_SURROGATES() either needs to go away completely (and be integrated straight into the other macros), or be renamed to Py_UCS4_JOIN_SURROGATES(), since it doesn't return Py_UNICODE values
The macros need to be carefully documented, both in unicodeobject.h and the general docs.
Your _Py_UNICODE_PUT_NEXT() implementation is missing a few casts to turn ch into a Py_UNICODE/Py_UCS4 value.
Same for your _Py_UNICODE_NEXT() to make sure that the return value is indeed a Py_UNICODE value.
In general, we should probably be clear on the allowed input and define the output types in the documentation.

birkenfeld commented 13 years ago

Let's wait for 3.3 with the change.

Definitely.

abalkin commented 13 years ago

On Wed, Dec 29, 2010 at 10:00 AM, Georg Brandl \report@bugs.python.org\ wrote: ..

> Let's wait for 3.3 with the change.

Definitely.

Does this also mean that the numerous surrogates related bugs should wait until 3.3 as well? (See issues bpo-9200 and bpo-10521.)

This patch was just a stepping stone for the bug fixes. I deliberately kept the code changes to the minimum sufficient to demonstrate and test the new macros. I would not mind restricting the patch further by limiting it to the header file changes so that the macros can be used to fix bugs. Fixing the bugs in the old verbose style does not seem feasible.

Note that surrogate bugs are not as exotic as they seem. For example, on a wide build I can do

42

but on a narrow build,

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    𝐀 = 42
       ^
SyntaxError: invalid character in identifier

So at the moment, narrow and wide builds implement two different languages.

birkenfeld commented 13 years ago

That bug already strikes me as quite exotic.

You need to at least address Marc-Andre's remarks, and to give an overview of what else you'd like to change as well, and how this could affect semantics.

Remember that the next release is already a release candidate.

abalkin commented 13 years ago

On Wed, Dec 29, 2010 at 7:19 AM, Marc-Andre Lemburg \report@bugs.python.org\ wrote: ..

The macros still need some more attention to enhance their performance.

Although I made your suggested change from '-' to '&', I seriously doubt that this would make any difference on modern CPUs. Why do you think these macros are performance critical? Users with lots of supplementary characters in their files are probably better off with a wide build where Py_UNICODE_NEXT() is just *ptr++ and can hardly be further optimized. Higher performance algorithms are possible, but those should probably do some loop unrolling and/or process more than one character at a time. At this point, however it is too soon to worry about optimization before we even know where these macros will be used.

For consistency, I'd choose names Py_UNICODE_READ_NEXT() and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and Py_UNICODE_PUT_NEXT().

I would leave it for you and Raymond to reach a consensus. My understanding is that Raymond does not want "next" in the name, so your suggestion still conflicts with that. I would mildly prefer GET/PUT over READ/WRITE because the latter suggests multiple characters.

As discussed before, the macro prefix does not imply the return value. Compare this to Py_UNICODE_ISSPACE() and friends or pretty much any other PyUNICODE macro. Note that I added a leading underscore to Py_UNICODE_JOIN_SURROGATES and other new macros, so there is no immediate pressure to get the names perfect.

The macros need to be carefully documented, both in unicodeobject.h and the general docs.

I've added a description above _PyUNICODE*NEXT macros. I would really like to see these macros in private use for a while before they are published for general audience. I'll add a comment describing _Py_UNICODE_JOIN_SURROGATES. The remaining macros seem to be fairly self-explanatory (unlike, say Py_UNICODE_ISDIGIT or Py_UNICODE_ISTITLE which are not documented in unicodeobject.h.)

Explicit downcasts would probably make sense, for example *(ptr)++ = (Py_UNICODE)ch instead of *(ptr)++ = ch, but I don't think we need explicit casts say in Py_UCS4 code = (ch) - 0x10000; where they can mask coding errors.

I also looked for the use of casts elsewhere in unicodeobject.h and the following does not look right:

#define Py_UNICODE_ISSPACE(ch) \
    ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))

It looks like this won't work right if ch is a signed char.

Same for your _Py_UNICODE_NEXT() to make sure that the return value is indeed a Py_UNICODE value.

The return value of _Py_UNICODE_NEXT() is *not Py_UNICODE, it is Py_UCS4 and as far as I can see, every conditional branch in narrow case has an explicit cast. In the wide case, I don't think we want an explicit cast because ptr should already be Py_UCS4 and if it is not, it may be a coding error that we don't want to mask.

In general, we should probably be clear on the allowed input and define the output types in the documentation.

I agree. I'll add a note that ptr and end should be Py_UNICODE*. I am not sure what we should say about ch argument. If we add casts, the macro will accept anything, but we should probably document it as expecting Py_UCS4.

abalkin commented 13 years ago

On Sat, Nov 27, 2010 at 5:24 PM, Marc-Andre Lemburg \report@bugs.python.org\ wrote: ..

Perhaps we should allow ord() to work on surrogates in UCS4 builds as well. That would reduce the number of surprises.

This is an interesting idea, however, having surrogates in UCS4 builds will sooner or later lead to surprises such as

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

I though UCS4 (or more properly, UTF-32) did not allow encoding of surrogate code points.

It is somewhat bothersome that a valid string literal such as '\uD800\uDC00' in narrow build is subtly invalid in wide build. It would probably be better if '\uD800\uDC00' was either rejected on a wide build, or interpreted as a single character so that

True

on any build.

abalkin commented 13 years ago

The example in my previous message should have been:

>>> '\U00010000' == '\uD800\uDC00'
True

abalkin commented 13 years ago

On Wed, Dec 29, 2010 at 11:36 AM, Georg Brandl \report@bugs.python.org\ wrote: ..

That bug already strikes me as quite exotic.

Would it look as exotic if presented like this?

File "\<stdin>", line 1 𐌀 = 5 ^ SyntaxError: invalid character in identifier (works on a wide build)

Note that with few exceptions, pretty much anything you can do with supplementary characters will produce different results in wide and narrow builds. This includes all character type methods (isalpha, isdigit, etc.), transformations such as case folding or normalization, text formatting, etc, etc.

When I suggested on python-dev that supplementary character support on narrow builds is not worth violating fundamental invariants such as len(chr(i)) == 1, pretty much everyone said that Python should support full Unicode regardless of build. When it comes to fixing specific differences between builds, I hear that these differences are not important because no one is using supplementary characters.

This example is less exotic than say str.center() or str.swapcase() not because it involves less exotic characters - all non-BMP characters are exotic by definition - but because it involves the core definition of the Python language.

abalkin commented 13 years ago

I should stop using e-mail to reply to bug reports! The mangled example was

>>> 𐌀 = 5
  File "<stdin>", line 1
    𐌀 = 5
       ^
SyntaxError: invalid character in identifier

vstinner commented 13 years ago

Le mercredi 29 décembre 2010 à 19:26 +0000, Alexander Belopolsky a écrit :

Would it look as exotic if presented like this?

File "\<stdin>", line 1 𐌀 = 5 ^ SyntaxError: invalid character in identifier (works on a wide build)

Use non-ASCII identifiers is exotic. Use non-BMP identifiers is crazy :-) Seriously, it can wait 3.3.

abalkin commented 13 years ago

On Wed, Dec 29, 2010 at 3:36 PM, STINNER Victor \report@bugs.python.org\ wrote: ..

Use non-ASCII identifiers is exotic. Use non-BMP identifiers is crazy :-)

Hmm, we clearly disagree on what crosses the boundary of the mental norm. IMHO, it is crazy to require users to care which plane their characters come from or whether their programs will be run on a wide or a narrow build. I see nothing wrong with a desire to use characters from say "Mathematical Alphanumeric Symbols" block if that makes some Python expressions look more like the mathematical formulas that they represent. However, it is not about any particular usage, but about the language definition. I don't remember even a suggestion during PEP-3131 discussion that non-BMP characters should be excluded from identifiers wholesale.

In any case, can someone remind me what was the use case that motivated chr(i) returning a two-character string for i > 0xFFFF? I think we should either stop pretending that narrow builds can handle non-BMP characters and disallow them in Python strings or we should try to fix the bugs associated with them.

Seriously, it can wait 3.3.

What exactly can wait until 3.3? The presented patch introduces no user visible changes. It is only a stepping stone to restoring some sanity in a way supplementary characters are treated by narrow builds. At the moment, it is a mine field: you can easily produce surrogate pairs from string literals and codecs, but when you start using them, you have 50% chance that things will blow up, 40% chance of getting wrong result and maybe 10% chance that it will work.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

> Seriously, it can wait 3.3.

What exactly can wait until 3.3? The presented patch introduces no user visible changes. It is only a stepping stone to restoring some sanity in a way supplementary characters are treated by narrow builds. At the moment, it is a mine field: you can easily produce surrogate pairs from string literals and codecs, but when you start using them, you have 50% chance that things will blow up, 40% chance of getting wrong result and maybe 10% chance that it will work.

I think the proposal is that fixing this minefield can wait until Python 3.3 (or even 3.4, or later).

I plan to propose a complete redesign of the representation of Unicode strings, which may well make this entire set of changes obsolete.

As for language definition: I think the definition is quite clear and unambiguous. It may be that Python 3.2 doesn't fully implement it.

IOW: relax.

abalkin commented 13 years ago

On Wed, Dec 29, 2010 at 8:02 PM, Martin v. Löwis \report@bugs.python.org\ wrote: ..

I plan to propose a complete redesign of the representation of Unicode strings, which may well make this entire set of changes obsolete.

Are you serious? This sounds like a py4k idea. Can you give us a hint on what the new representation will be? Meanwhile, what it your recommendation for application developers? Should they attempt to fix the code that assumes len(chr(i)) == 1? Should text processing applications designed to run on a narrow build simply reject non-BMP text? Should application writers avoid using str.isxyz() methods?

As for language definition: I think the definition is quite clear and unambiguous. It may be that Python 3.2 doesn't fully implement it.

Given that until recently (r87433) the PEP and the reference manual disagreed on the definition, I have to ask what definition you refer to. What Python 3.2 (or rather 3.1) implements, however is important because it has been declared to be *the* definition of the Python language regardless of what PEPs docs have to say.

IOW: relax.

This is the easy part. :-)

abalkin commented 13 years ago

On Wed, Dec 29, 2010 at 9:38 PM, Alexander Belopolsky \report@bugs.python.org\ wrote: ..

Given that until recently (r87433) the PEP and the reference manual disagreed on the definition,

Actually, it looks like PEP-3131 and the Language Reference [1] still disagree. The latter says:

identifier ::= id_start id_continue*

which should probably be

identifier ::= xid_start xid_continue*

instead. [1] http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

Are you serious? This sounds like a py4k idea. Can you give us a hint on what the new representation will be?

I'm thinking about an approach of a variable representation: one, two, or four bytes, depending on the widest character that appears in the string. I think it can be arranged to make this mostly backwards-compatible with existing APIs, so it doesn't need to wait for py4k, IMO. OTOH, I'm not sure I'll make it for 3.3.

Meanwhile, what it your recommendation for application developers? Should they attempt to fix the code that assumes len(chr(i)) == 1? Should text processing applications designed to run on a narrow build simply reject non-BMP text? Should application writers avoid using str.isxyz() methods?

Given that this is vaporware: proceed as if that idea didn't exist.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

Actually, it looks like PEP-3131 and the Language Reference [1] still disagree. The latter says:

identifier ::= id_start id_continue*

which should probably be

identifier ::= xid_start xid_continue*

instead.

Interesting. XID_* is being used in the PEP since r57023, whereas the documentation was added in r57824. In any case, this is now fixed in r87575/r87576.

birkenfeld commented 13 years ago

I think the proposal is that fixing this minefield can wait until Python 3.3 (or even 3.4, or later).

That is what I was thinking. (Alex: You might not know that Martin was the main proponent of non-ASCII identifiers, so this assessment should have some weight.)

I'm thinking about an approach of a variable representation: one, two, or four bytes, depending on the widest character that appears in the string. I think it can be arranged to make this mostly backwards-compatible with existing APIs, so it doesn't need to wait for py4k, IMO. OTOH, I'm not sure I'll make it for 3.3.

That is an interesting idea. I would be interested in helping out when you'll implement it.

ezio-melotti commented 13 years ago

python / cpython

Py_UNICODE_NEXT and other macros for surrogates #54751

define Py_UCS4_READ_CODE_POINT(ptr, end) \