python / cpython

The Python programming language
https://www.python.org

\N{...} neglects formal aliases and named sequences from Unicode charnames namespace #56962

Closed 5c59cbd7-8186-4351-8391-b403f3a3a73f closed 13 years ago

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago
BPO 12753
Nosy @malemburg, @gvanrossum, @loewis, @terryjreedy, @abalkin, @ezio-melotti, @florentx
Superseder
  • bpo-4610: Unicode case mappings are incorrect
  • Files
  • nametests.py: test case to check unicodedata.lookup and \N{} against named chars AND formal alias AND named sequences
  • issue12753.diff: patch to add the aliases
  • issue12753-2.diff: patch to add the aliases and named sequences
  • issue12753-3.diff: patch to add the aliases and named sequences + tests + doc
  • issue12753-4.diff: patch to add the aliases and named sequences + tests + doc
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields: ```python assignee = 'https://github.com/ezio-melotti' closed_at = created_at = labels = ['interpreter-core', 'type-feature', 'expert-unicode'] title = '\\N{...} neglects formal aliases and named sequences from Unicode charnames namespace' updated_at = user = 'https://bugs.python.org/tchrist' ``` bugs.python.org fields: ```python activity = actor = 'belopolsky' assignee = 'ezio.melotti' closed = True closed_date = closer = 'ezio.melotti' components = ['Interpreter Core', 'Unicode'] creation = creator = 'tchrist' dependencies = [] files = ['22903', '23273', '23280', '23291', '23374'] hgrepos = [] issue_num = 12753 keywords = ['patch', 'needs review'] message_count = 37.0 messages = ['142136', '142145', '142502', '142506', '142507', '142508', '143043', '144679', '144681', '144703', '144708', '144716', '144738', '144739', '144757', '144758', '144760', '144779', '144783', '144802', '144803', '144825', '144827', '144832', '144836', '144839', '145254', '145263', '145327', '145401', '146034', '146036', '146075', '146114', '146129', '146135', '191737'] nosy_count = 10.0 nosy_names = ['lemburg', 'gvanrossum', 'loewis', 'terry.reedy', 'belopolsky', 'ezio.melotti', 'mrabarnett', 'flox', 'python-dev', 'tchrist'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = '4610' type = 'enhancement' url = 'https://bugs.python.org/issue12753' versions = ['Python 3.3'] ```

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python. (If this is construed to be an extant bug rather than an absent feature, you probably want to change this from a wish to a bug in the ticket.)

    This is a problem because aliases correct errors in the original names, and are the preferred versions. For example, ISO screwed up when they called U+01A2 LATIN CAPITAL LETTER OI. It is actually LATIN CAPITAL LETTER GHA according to the file NameAliases.txt in the Unicode Character Database. However, Python blows up when you try to use this:

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ
    
    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER GHA}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name
    Exit 1

This is unfortunate, because the formal aliases correct egregious blunders, such as the Standard reading "BRAKCET" instead of "BRACKET":

    $ uninames '^\s+%'
     Ƣ  01A2        LATIN CAPITAL LETTER OI
            % LATIN CAPITAL LETTER GHA
     ƣ  01A3        LATIN SMALL LETTER OI
            % LATIN SMALL LETTER GHA
            * Pan-Turkic Latin alphabets
     ೞ  0CDE        KANNADA LETTER FA
            % KANNADA LETTER LLLA
            * obsolete historic letter
            * name is a mistake for LLLA
     ຝ  0E9D        LAO LETTER FO TAM
            % LAO LETTER FO FON
            = fo fa
            * name is a mistake for fo sung
     ຟ  0E9F        LAO LETTER FO SUNG
            % LAO LETTER FO FAY
            * name is a mistake for fo tam
     ຣ  0EA3        LAO LETTER LO LING
            % LAO LETTER RO
            = ro rot
            * name is a mistake, lo ling is the mnemonic for 0EA5
     ລ  0EA5        LAO LETTER LO LOOT
            % LAO LETTER LO
            = lo ling
            * name is a mistake, lo loot is the mnemonic for 0EA3
     ࿐  0FD0        TIBETAN MARK BSKA- SHOG GI MGO RGYAN
            % TIBETAN MARK BKA- SHOG GI MGO RGYAN
            * used in Bhutan
     ꀕ A015        YI SYLLABLE WU
            % YI SYLLABLE ITERATION MARK
            * name is a misnomer
     ︘ FE18        PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
            % PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
            * misspelling of "BRACKET" in character name is a known defect
            # <vertical> 3017
     𝃅  1D0C5       BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
            % BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
            * misspelling of "FTHORA" in character name is a known defect

There are only 11 of these formal aliases as of Unicode 6.0.0.

    In Perl, \N{...} grants access to the single, shared, common namespace of Unicode character names, formal aliases, and named sequences without distinction:

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")'
    Ƣ
    
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'  | uniquote -x
    \x{1A2}
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")' | uniquote -x
    \x{1A2}

It is my suggestion that Python do the same thing.

The third element in this shared namespace of names, the named sequences, are multiple code points masquerading under one name. They come from the NamedSequences.txt file in the Unicode Character Database. An example entry is:

    LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300

    There are 418 of these named sequences as of Unicode 6.0.0. This shows that Perl can also access named sequences:

      $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
      Ā̀
    
      $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
      \x{100}\x{300}
    
      $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'            
      ㇷ゚
    
      $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
       \x{31F7}\x{309A}

    Since it is a single namespace, it makes sense that all members of that namespace should be accessible using \N{...} as a sort of equal-opportunity accessor mechanism, and it does not make sense that they not be.

Just make sure you take only the approved named sequences from the NamedSequences.txt file. It would be unwise to give users access to the provisional sequences located in a neighboring file I shall not name :) because those are not guaranteed never to be withdrawn the way the others are, and so you would risk introducing an incompatibility.
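Parsing the approved file is straightforward; here is a minimal sketch (not part of the original report), assuming the standard UCD line format NAME;CODEPOINT CODEPOINT ... with # comment lines:

    # Minimal sketch: load the approved named sequences from NamedSequences.txt.
    # Each data line looks like:
    #   LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300
    def load_named_sequences(path="NamedSequences.txt"):
        sequences = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop comments and blanks
                if not line:
                    continue
                name, codepoints = line.split(";")
                sequences[name.strip()] = "".join(
                    chr(int(cp, 16)) for cp in codepoints.split())
        return sequences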

    If you look at the ICU UCharacter class, you can see that they provide a more

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Here’s the right test file for the right ticket.

    terryjreedy commented 13 years ago

    I verified that the test file raises the quoted SyntaxError on 3.2 on Win7. This:

    >>> "\N{LATIN CAPITAL LETTER GHA}"
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name

    is most likely a result of this:

    >>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        unicodedata.lookup("LATIN CAPITAL LETTER GHA")
    KeyError: "undefined character name 'LATIN CAPITAL LETTER GHA'"

    Although the lookup comes first in nametests.py, it is never executed because of the later SyntaxError.

The Reference for string literals says: "\N{name} Character named name in the Unicode database"

    The doc for unicodedata says "This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 6.0.0.

    The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”." http://www.unicode.org/reports/tr44/tr44-6.html

    So the question is, what are the 'names' therein defined? All such should be valid inputs to "unicodedata.lookup(name) Look up character by name."

    The annex refers to http://www.unicode.org/Public/6.0.0/ucd/ This contains NamesList.txt, derived from UnicodeData.txt. Unicodedata must be using just the latter. The ucd directory also contains NameAliases.txt, NamedSequences.txt, and the file of provisional named sequences.

As best I can tell, the annex plus files are a bit ambiguous as to 'Unicode character name'. The following quote seems neutral: "the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names." The following: "Unicode character names constitute a special case. Formally, they are values of the Name property." points toward UnicodeData.txt, which lists the Name property along with others. However, "Unicode character name, as published in the Unicode names list," indirectly points toward including aliases. NamesList.txt says it contains the "Final Unicode 6.0 names list" (but one which "should not be parsed for machine-readable information"). It includes all 11 aliases in NameAliases.txt.

My current opinion is that adding the aliases might be done in current releases. It certainly would serve any user who does not know to misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

    Adding named sequences is definitely a feature request. The definition of .lookup(name) would be enlarged to "Look up character by name, alias, or named sequence" with reference to the specific files. The meaning of \N{} would also have to be enlarged.

    Minimal test code might be:

from unicodedata import lookup
assert lookup("LATIN CAPITAL LETTER GHA") == "\u01a2"
assert lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE") == "\u0100\u0300"
    plus a test that "\N{LATIN CAPITAL LETTER GHA}" and
    "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" compile without error (I have no idea how to write that).

    "If you look at the ICU UCharacter class, you can see that they provide a more"

More what ;-) I presume ICU = International Components for Unicode, icu-project.org/ "Offers a portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N)." [appears to be free, open source, and possibly usable within Python]

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    "Terry J. Reedy" \report@bugs.python.org\ wrote on Fri, 19 Aug 2011 22:50:58 -0000:

My current opinion is that adding the aliases might be done in current releases. It certainly would serve any user who does not know to misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Yes, I think the 11 aliases pose no problem. It's amazing the trouble you get into from having a fat-fingered amanuensis typing your laws into indelible stone tablets.

    Adding named sequences is definitely a feature request. The definition of .lookup(name) would be enlarged to "Look up character by name, alias, or named sequence" with reference to the specific files. The meaning of \N{} would also have to be enlarged.

But named sequences do. The problem is bracketed character classes.
Yes, if you got the named reference into the regex compiler as a raw string, it could in theory rewrite

    [abc\N{seq}] 

    as

    (?:[abc]|\N{seq})

    but that doesn't help if the sequence got replaced as a string escape. At which point you have different behavior in the two lookalike cases.

    If you ask how we do this in Perl, the answer is "poorly". It really only works well in strings, not charclasses, although there is a proposal to do a rewrite during compilation like I've spelled out above. Seems messy for something that might(?) not get much use. But it would be nice for \N{} to work to access the whole namespace without prejudice. I have a feeling this may be a case of trying to keep one's cake and eating it too, as the two goals seem to rule each other out.

    > "If you look at the ICU UCharacter class, you can see that they provide a more"

    More what ;-)

    More expressive set of lookup functions where it is clear which thing you are getting. I believe the ICU regexes only support one-char returns for \N{...}, not multis per the sequences. But I may not be looking at the right docs for ICU; not sure.

    I presume ICU =International Components for Unicode, icu-project.org/ "Offers a portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N)." [appears to be free, open source, and possibly usable within Python]

Well, there are some Python bindings for ICU that I was eager to try out, because I wanted to see whether I could get at full/real Unicode collation that way, but I had trouble getting the Python bindings to compile. Not sure why. The documentation for the Python bindings isn't very um wordy, and it isn't clear how tightly integrated it all is: there's talk about C++ strings that kind of scares me. :)

    Hm, and maybe they are only for Python 2 not Python 3, which I try to do all my Python stuff in because it seems like it has a better Unicode model.

    --tom

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 13 years ago

    For the "Line_Break" property, one of the possible values is "Inseparable", with 2 permitted aliases, the shorter "IN" (which is reasonable) and "Inseperable" (ouch!).

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Matthew Barnett <report@bugs.python.org> wrote on Fri, 19 Aug 2011 23:36:45 -0000:

    For the "Line_Break" property, one of the possible values is "Inseparable", with 2 permitted aliases, the shorter "IN" (which is reasonable) and "Inseperable" (ouch!).

Yeah, I've shaken my head at that one, too.

    It's one thing to make an alias for something you typo'd in the first place, but to have something that's correct which you then make a typo alias for is just encouraging bad/sloppy/wrong behavior.

        Bidi_Class=Paragraph_Separator
        Bidi_Class=Common_Separator
        Bidi_Class=European_Separator
        Bidi_Class=Segment_Separator
        General_Category=Line_Separator
        General_Category=Paragraph_Separator
        General_Category=Separator
        General_Category=Space_Separator
        Line_Break=Inseparable
        Line_Break=Inseperable

And then there's this set, which makes you wonder why they couldn't spell at least *one* of them out:

        Sentence_Break=Sep SB=SE
        Sentence_Break=Sp  SB=Sp

    You really have to look those up to realize they're two different things:

    SB ; SE        ; Sep
    SB ; SP        ; Sp

    And that none of them have something like SB=Space or SB=Separator so you know what you're talking about. Grrr.

    --tom

    gvanrossum commented 13 years ago

    +1 on the feature request.

    ezio-melotti commented 13 years ago

    The attached patch changes Tools/unicode/makeunicodedata.py to create a list of names and codepoints taken from http://www.unicode.org/Public/6.0.0/ucd/NameAliases.txt and adds it to Modules/unicodename_db.h. During the lookup the _getcode function at Modules/unicodedata.c:1055 loops over the 11 aliases and checks if any of those match. The patch also includes tests for both unicodedata.lookup and \N{}.
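Modeled in pure Python, just to illustrate the lookup order (the patch itself does this in C inside _getcode):

    # Illustration only -- not the C code from issue12753.diff.
    ALIASES = {
        "LATIN CAPITAL LETTER GHA": 0x01A2,  # formal alias for LATIN CAPITAL LETTER OI
        # ... the other entries from NameAliases.txt
    }

    def getcode(name, primary_names):
        """Return the code point for name, trying the regular UCD names first."""
        if name in primary_names:
            return primary_names[name]
        return ALIASES[name]                 # KeyError for unknown names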

    I'm not sure this is the best way to implement this, and someone will probably want to review and tweak both the approach and the C code, but it works fine:
    >>> "\N{LATIN CAPITAL LETTER GHA}"
    'Ƣ'
    >>> import unicodedata
    >>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
    'Ƣ'
    >>> "\N{LATIN CAPITAL LETTER OI}"
    'Ƣ'
    >>> unicodedata.lookup("LATIN CAPITAL LETTER OI")
    'Ƣ'

    The patch doesn't include changes for NamedSequences.txt.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    I propose to use a better lookup algorithm using binary search, and then integrate the NamedSequences into this as well. The search result could be a record

     struct {
       char *name;
       int len;
       Py_UCS4 chars[3]; /* no sequence is more than 3 chars */
     }

    You would have two tables for these: one for the aliases, and one for the named sequences.

    _getcode would continue to return a single char only, and thus not support named sequences. lookup could well return strings longer than 1, but only in 3.3.
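In Python terms, the proposed record layout and binary search would look roughly like this (an illustrative sketch, not the eventual C implementation):

    import bisect

    # Each record mirrors the struct above: (name, up to 3 code points).
    SEQUENCES = sorted([
        ("KATAKANA LETTER AINU P", (0x31F7, 0x309A)),
        ("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE", (0x0100, 0x0300)),
    ])
    SEQ_NAMES = [name for name, _ in SEQUENCES]

    def lookup_sequence(name):
        i = bisect.bisect_left(SEQ_NAMES, name)
        if i < len(SEQ_NAMES) and SEQ_NAMES[i] == name:
            return "".join(chr(cp) for cp in SEQUENCES[i][1])
        raise KeyError(name)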

    I'm not sure that \N escapes should support named sequences: people rightfully expect that each escaped element in a string literal constitutes exactly one character.

    ezio-melotti commented 13 years ago

    Leaving named sequences for unicodedata.lookup() only (and not for \N{}) makes sense.

    The list of aliases is so small (11 entries) that I'm not sure using a binary search for it would bring any advantage. Having a single lookup algorithm that looks in both tables doesn't work because the aliases lookup must be in _getcode for \N{...} to work, whereas the lookup of named sequences will happen in unicodedata_lookup (Modules/unicodedata.c:1187). I think we can leave the for loop over aliases in _getcode and implement a separate (and binary) search in unicodedata_lookup for the named sequences. Does that sound fine?

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    Leaving named sequences for unicodedata.lookup() only (and not for \N{}) makes sense.

There are certainly advantages to that strategy: you don't have to deal with [\N{sequence}] issues. If the argument to unicode.lookup() can be any of name, alias, or sequence, that seems ok. \N{} should still do aliases, though, since those don't have the complication that sequences have.

    You may wish unicode.name() to return the alias in preference, however. That's what we do. And of course, there is no issue of sequences there.

    The rest of this perhaps painfully long message is just elaboration and icing on what I've said above.

    --tom

    The list of aliases is so small (11 entries) that I'm not sure using a binary search for it would bring any advantage. Having a single lookup algorithm that looks in both tables doesn't work because the aliases lookup must be in _getcode for \N{...} to work, whereas the lookup of named sequences will happen in unicodedata_lookup (Modules/unicodedata.c:1187). I think we can leave the for loop over aliases in _getcode and implement a separate (and binary) search in unicodedata_lookup for the named sequences. Does that sound fine?

If you mean, is it ok to add just the aliases and not the named sequences to \N{}, it is certainly better than not doing so at all. Plus that way you do *not* have to figure out what in the world to do with [^a-c\N{sequence}], since that would have to be something like (?!\N{sequence})[^a-c], which is hardly obvious, especially if \N{sequence} actually starts with [a-c].

    However, because the one namespace comprises all three of names, aliases, and named sequences, it might be best to have a functional (meaning, non-regex) API that allows one to do a fetch on the whole namespace, or on each individual component.

    The ICU library supports this sort of thing. In ICU4J's Java bindings, we find this:

        static int getCharFromExtendedName(String name) 
           [icu] Find a Unicode character by either its name and return its code point value.
        static int  getCharFromName(String name) 
           [icu] Finds a Unicode code point by its most current Unicode name and return its code point value.
        static int  getCharFromName1_0(String name) 
           [icu] Find a Unicode character by its version 1.0 Unicode name and return its code point value.
        static int  getCharFromNameAlias(String name) 
           [icu] Find a Unicode character by its corrected name alias and return its code point value.

    The first one obviously has a bug in its definition, as the English doesn't scan. Looking at the full definition is even worse. Rather than dig out the src jar, I looked at ICU4C, but its own bindings are completely different. There you have only one function, with an enum to say what namespace to access:

    UChar32 u_charFromName  (       UCharNameChoice         nameChoice, 
            const char *    name, 
            UErrorCode *    pErrorCode 
        )

    The UCharNameChoice enum tells what sort of thing you want:

    U_UNICODE_CHAR_NAME,
    U_UNICODE_10_CHAR_NAME,
    U_EXTENDED_CHAR_NAME,
    U_CHAR_NAME_ALIAS,          
    U_CHAR_NAME_CHOICE_COUNT

    Looking at the src for the Java is no more immediately illuminating, but I think that "extended" may refer to a union of the old 1.0 names with the current names.

    Now I'll tell you what Perl does. I do this not to say it is "right", but just to show you one possible strategy. I also am in the middle of writing about this for the Camel, so it is in my head.

    Perl does not provide the old 1.0 names at all. We don't have a Unicode 1.0 legacy to support, which makes this cleaner. However, we do provide for the names of the C0 and C1 Control Codes, because apart from Unicode 1.0, they don't condescend to name the ASCII or Latin1 control codes.

    We also provide for certain well known aliases from the Names file: anything that says "* commonly abbreviated as ...", so things like LRO and ZWJ and such.

    Perl makes no distinction between anything in the namespace when using the \N{} form for string and regex escapes. That means when you use "\N{...}" or /\N{...}/, you don't know which it is, nor can you. (And yes, the bracketed character class issue is annoying and unsolved.)

    However, the "functional" API does make a slight distinction.

    -- charnames::vianame() takes a name or alias (as a string) and returns a single integer code point.

    eg: This therefore converts "LATIN SMALL LETTER A" into 0x61.
        It also converts both 
        BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
        and 
        BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
        into 0x1D0C5.  See below.

    -- charnames::string_vianame() takes a string name, alias, *or* sequence, and gives back a string.

    eg: This therefore converts "LATIN SMALL LETTER A" into "a".
            Since it has a string return instead of an int, it now also
            handles everything from NamedSequences file as well. (See below.)

-- charnames::viacode() takes an integer and gives back the official alias if there is one, and the official name if there is not.

    eg: This converts 0x61 into "LATIN SMALL LETTER A".
            It also converts 0x1D0C5 into "BYZANTINE MUSICAL SYMBOL FTHORA
            SKLIRON CHROMA VASIS".
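For comparison, rough Python analogues of those three functions, assuming the alias and named-sequence support discussed in this issue (a sketch; note that unicodedata.name() returns the UCD name and never prefers an alias):

    import unicodedata

    def vianame(name):           # name or alias -> single integer code point
        return ord(unicodedata.lookup(name))   # ord() rejects multi-char sequences

    def string_vianame(name):    # name, alias, or sequence -> string
        return unicodedata.lookup(name)

    def viacode(code):           # integer code point -> character name
        return unicodedata.name(chr(code))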

    Consider

    BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS

    That was an error, and there is an official alias fixing it:

    BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

    (That's FHTORA vs FTHORA.)

    You may use either as the name, and if you reverse the code point to name, you get the replacement alias.

% perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS")'
1D0C5

% perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")'
1D0C5

% perl -mcharnames -wle 'print charnames::viacode(charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"))'
BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

    So on round-tripping, I gave it the "wrong" one (the original) and it gave me back the "right" one (the replacement).

    Using the \N{} thing, it again doesn't matter:

% perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS}"'
1D0C5

% perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}"'
1D0C5

    The interesting thing is the named sequences. string_vianame() works just fine on those:

% perl -mcharnames -wle 'print length charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
2

% perl -mcharnames -wle 'printf "U+%v04X\n", charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
U+0100.0300

    And that works fine with \N{} as well (provided you don't try charclasses):

% perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
Ā̀

% perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"' | uniquote -v
\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}

% perl -mcharnames=:full -wle 'print length "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
2

% perl -mcharnames=:full -wle 'printf "U+%v04X\n", "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
U+0100.0300

It's kinda sad that for \N{} and sequences you can't just "do the right thing" with strings and say that charclass stuff just isn't supported. But my guess is that this simply won't work because you don't have first class regexes. If you pass both of these to the regex engine, they should behave the same (and would, assuming the regex compiler knows about \N{} escapes):

    "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
    r'\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}'

However, that falls apart if you do

    "[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]"
    r'[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]'

    Because the compiler will do the substitution early on the first one but not the second. This seems a problem, eh? So I guess you can't do it at all? Or could you document it? I think there is no good solution here. Perl can and does actually do something quite reasonable in the noncharclass case, but that is because we know that we are compiling a regex in virtually all scenarios.

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN SMALL LETTER A}/'
    (?^u:\N{U+61})
    
    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON}/'
    (?^u:\N{U+100})
    
    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    (?^u:\N{U+100.300})

    So you can do:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ /\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    1

And it is just fine. The issue is that there are ways for you to get yourself into trouble if you do string-string stuff:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
    1
    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "^[\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]+\$"'
    1

    That works, but only accidentally, because of course U+0100.0300 contains nothing but either U+0100 or U+0300.

    This is not a solved problem.

    I hope this helps.

    --tom

    ezio-melotti commented 13 years ago

    Attached a new patch that adds support for named sequences (still needs some test and can probably be improved).

    There are certainly advantages to that strategy: you don't have to deal with [\N{sequence}] issues.

    I assume with [] you mean a regex character class, right?

If the argument to unicode.lookup() can be any of name, alias, or sequence, that seems ok.

    With my latest patch, all 3 are supported.

    \N{} should still do aliases, though, since those don't have the complication that sequences have.

    \N{} will only support names and aliases (maybe this can go in 2.7/3.2 too).

    You may wish unicode.name() to return the alias in preference, however. That's what we do. And of course, there is no issue of sequences there.

    This can be done for 3.3, but I wonder if it might create problems. People might use unicodedata.name() to get a name and use it elsewhere, and the other side might not be aware of aliases.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    Does that sound fine?

    Yes, that's fine as well.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    You may wish unicode.name() to return the alias in preference, however.

    -1. .name() is documented (and users familiar with it expect it) as returning the name of the character from the UCD.

    It doesn't really matter much to me if it's non-sensical - it's just a label. Notice that many characters have names like "CJK UNIFIED IDEOGRAPH-4E20", which isn't very descriptive, either. What does matter is that the name returned matches the same name in many other places in the net, which (rightfully) all use the UCD name (they might provide the alias as well if they are aware of aliases, but often don't).

If you mean, is it ok to add just the aliases and not the named sequences to \N{}, it is certainly better than not doing so at all. Plus that way you do *not* have to figure out what in the world to do with [^a-c\N{sequence}],

    Python doesn't use regexes in the language parser, but does do \N escapes in the parser. So there is no way this transformation could possibly be made - except when you are talking about escapes in regexes, and not escapes in Unicode strings.

    Perl does not provide the old 1.0 names at all. We don't have a Unicode 1.0 legacy to support, which makes this cleaner. However, we do provide for the names of the C0 and C1 Control Codes, because apart from Unicode 1.0, they don't condescend to name the ASCII or Latin1 control codes.

If there were a reasonably official source for these names, and one that guarantees that there is no collision with UCD names, I could accept doing so for Python as well.

    We also provide for certain well known aliases from the Names file: anything that says "* commonly abbreviated as ...", so things like LRO and ZWJ and such.

    -1. Readability counts, writability not so much (I know this is different for Perl :-). If there is too much aliasing, people will wonder what these codes actually mean.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

> Perl does not provide the old 1.0 names at all. We don't have a Unicode 1.0 legacy to support, which makes this cleaner. However, we do provide for the names of the C0 and C1 Control Codes, because apart from Unicode 1.0, they don't condescend to name the ASCII or Latin1 control codes.

If there were a reasonably official source for these names, and one that guarantees that there is no collision with UCD names, I could accept doing so for Python as well.

The C0 and C1 control code names don't change. There is/was one stability issue where they screwed up, because they ended up having a UAX (required) and a UTS (not required) fighting because of the dumb stuff they did with the Emoji names. They neglected to prefix them with "EMOJI ..." or some such, the way things like "GREEK ... LETTER ..." or "MATHEMATICAL ..." or "MUSICAL ..." did. The problem is they stole BELL without calling it EMOJI BELL. That is the C0 name for Control-G. Dimwits.

The problem with official names is that they have things in them that are not expected in names. Do you really and truly mean to tell me you think it is somehow **good** that people are forced to write

    \N{LINE FEED (LF)}

    Rather than the more obvious pair of

    \N{LINE FEED}
    \N{LF}

    ??

    If so, then I don't understand that. Nobody in their right mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
    U+000A
    
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
    U+0085

> We also provide for certain well known aliases from the Names file: anything that says "* commonly abbreviated as ...", so things like LRO and ZWJ and such.

    -1. Readability counts, writability not so much (I know this is different for Perl :-).

I actually very strongly resent and rebuff that entire mindset in the most extreme way possible. Well-written Perl code is perfectly readable by people who speak that language. If you find Perl code that isn't readable, it is by definition not well-written.

    *PLEASE* don't start.

Yes, I just got done driving 16 hours and am overtired, but it's something I've been fighting against all of my professional career. It's a "leyenda negra".

    If there is too much aliasing, people will wonder what these codes actually mean.

There are 15 "commonly abbreviated as" aliases in the NamesList.txt file.

    * commonly abbreviated as NBSP
    * commonly abbreviated as SHY
    * commonly abbreviated as CGJ
    * commonly abbreviated ZWSP
    * commonly abbreviated ZWNJ
    * commonly abbreviated ZWJ
    * commonly abbreviated LRM
    * commonly abbreviated RLM
    * commonly abbreviated LRE
    * commonly abbreviated RLE
    * commonly abbreviated PDF
    * commonly abbreviated LRO
    * commonly abbreviated RLO
    * commonly abbreviated NNBSP
    * commonly abbreviated WJ

    All of the standards documents *talk* about things like LRO and ZWNJ. I guess the standards aren't "readable" then, right? :)

    From the charnames manpage, which shows that we really don't just make these up as we feel like (although we could; see below). They're all from this or that standard:

    ALIASES
       A few aliases have been defined for convenience: instead
       of having to use the official names
    
           LINE FEED (LF)
           FORM FEED (FF)
           CARRIAGE RETURN (CR)
           NEXT LINE (NEL)
    
       (yes, with parentheses), one can use
    
           LINE FEED
           FORM FEED
           CARRIAGE RETURN
           NEXT LINE
           LF
           FF
           CR
           NEL
    
       All the other standard abbreviations for the controls,
       such as "ACK" for "ACKNOWLEDGE" also can be used.
    
       One can also use
    
           BYTE ORDER MARK
           BOM
    
       and these abbreviations
    
           Abbreviation        Full Name
    
           CGJ                 COMBINING GRAPHEME JOINER
           FVS1                MONGOLIAN FREE VARIATION SELECTOR ONE
           FVS2                MONGOLIAN FREE VARIATION SELECTOR TWO
           FVS3                MONGOLIAN FREE VARIATION SELECTOR THREE
           LRE                 LEFT-TO-RIGHT EMBEDDING
           LRM                 LEFT-TO-RIGHT MARK
           LRO                 LEFT-TO-RIGHT OVERRIDE
           MMSP                MEDIUM MATHEMATICAL SPACE
           MVS                 MONGOLIAN VOWEL SEPARATOR
           NBSP                NO-BREAK SPACE
           NNBSP               NARROW NO-BREAK SPACE
           PDF                 POP DIRECTIONAL FORMATTING
           RLE                 RIGHT-TO-LEFT EMBEDDING
           RLM                 RIGHT-TO-LEFT MARK
           RLO                 RIGHT-TO-LEFT OVERRIDE
           SHY                 SOFT HYPHEN
           VS1                 VARIATION SELECTOR-1
           .
           .
           .
           VS256               VARIATION SELECTOR-256
           WJ                  WORD JOINER
           ZWJ                 ZERO WIDTH JOINER
           ZWNJ                ZERO WIDTH NON-JOINER
           ZWSP                ZERO WIDTH SPACE
    
       For backward compatibility one can use the old names for
       certain C0 and C1 controls
    
           old                         new
    
           FILE SEPARATOR              INFORMATION SEPARATOR FOUR
           GROUP SEPARATOR             INFORMATION SEPARATOR THREE
           HORIZONTAL TABULATION       CHARACTER TABULATION
           HORIZONTAL TABULATION SET   CHARACTER TABULATION SET
           HORIZONTAL TABULATION WITH JUSTIFICATION    CHARACTER TABULATION
                                                       WITH JUSTIFICATION
           PARTIAL LINE DOWN           PARTIAL LINE FORWARD
           PARTIAL LINE UP             PARTIAL LINE BACKWARD
           RECORD SEPARATOR            INFORMATION SEPARATOR TWO
           REVERSE INDEX               REVERSE LINE FEED
           UNIT SEPARATOR              INFORMATION SEPARATOR ONE
           VERTICAL TABULATION         LINE TABULATION
           VERTICAL TABULATION SET     LINE TABULATION SET
    
       but the old names in addition to giving the character will
       also give a warning about being deprecated.
    
       And finally, certain published variants are usable,
       including some for controls that have no Unicode names:
    
           name                                   character
    
           END OF PROTECTED AREA                  END OF GUARDED AREA, U+0097
           HIGH OCTET PRESET                      U+0081
           HOP                                    U+0081
           IND                                    U+0084
           INDEX                                  U+0084
           PAD                                    U+0080
           PADDING CHARACTER                      U+0080
           PRIVATE USE 1                          PRIVATE USE ONE, U+0091
           PRIVATE USE 2                          PRIVATE USE TWO, U+0092
           SGC                                    U+0099
           SINGLE GRAPHIC CHARACTER INTRODUCER    U+0099
           SINGLE-SHIFT 2                         SINGLE SHIFT TWO, U+008E
           SINGLE-SHIFT 3                         SINGLE SHIFT THREE, U+008F
           START OF PROTECTED AREA                START OF GUARDED AREA, U+0096
    

    Those are the defaults. They are overridable. That's because we feel that people should be able to name their character constants however they feel makes sense for them. If they get tired of typing

    \N{LATIN SMALL LETTER U WITH DIAERESIS}

    let alone

    \N{LATIN CAPITAL LETTER THORN WITH STROKE THROUGH DESCENDER}

    then they can, because there is a mechanism for making aliases:

    use charnames ":full", ":alias" =\> {
    U_uml =\> "LATIN CAPITAL LETTER U WITH DIAERESIS",
    u_uml =\> "LATIN SMALL LETTER U WITH DIAERESIS",
    };

    That way you can do

    s/\N{U_uml}/UE/;
    s/\N{u_uml}/ue/;

    This is probably not as persuasive as the private-use case described below.

It is important to remember that all charname bindings in Perl are attached to a *lexically scoped* declaration. It is completely constrained to operate only within that lexical scope. That's why the compiler replaces things like

    use charnames ":full", ":alias" =\> {
    U_uml =\> "LATIN CAPITAL LETTER U WITH DIAERESIS",
    u_uml =\> "LATIN SMALL LETTER U WITH DIAERESIS",
    };
    
    my $find_u_uml = qr/\N{u_uml}/i;
    
    print "Seach pattern is: $find_u_uml\n";

    Which dutifully prints out:

Search pattern is: (?^ui:\N{U+FC})

    So charname bindings are never "hard to read" because the effect is completely lexically constrained, and can never leak outside of the scope.

    I realize (or at least, believe) that Python has no notion of nested lexical scopes, and like many things, this sort of thing can therefore never work there because of that.

    The most persuasive use-case for user-defined names is for private-use area code points. These will never have an official name. But it is just fine to use them. Don't they deserve a better name, one that makes sense within your own program that uses them? Of course they do.

    For example, Apple has a bunch of private-use glyphs they use all the time. In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate logo/glyph thingie of an apple with a bite taken out of it. (Microsoft also has a bunch of these.) If you upgrade MacRoman to Unicode, you will find that that 0xF0 maps to code point U+F8FF using the regular converter.

    Now what are you supposed to do in your program when you want a named character there? You certainly do not want to make users put an opaque magic number as a Unicode escape. That is always really lame, because the whole reason we have \N{...} escapes is so we don't have to put mysterious unreadable magic numbers in our code!!

    So all you do is

    use charnames ":alias" => {
        "APPLE LOGO" => 0xF8FF,
    };

    and now you can use \N{APPLE LOGO} anywhere within that lexical scope. The compiler will dutifully resolve it to U+F8FF, since all name lookups happen at compile-time. And it cannot leak out of the scope.

    I assert that this facility makes your program more readable, and its absence makes your program less readable.

Private use characters are important in Asian texts, but they are also important for other things. For example, Unicode intends to get around to allocating Tengwar up in the SMP. However, lots of stupid old code can't use full Unicode, being constrained to UCS-2 only. So many Tengwar fonts start at a different base, and put it in the private use area instead of the SMP. Here are two constants:

use constant {
    TB_CONSCRIPT_UNICODE_REGISTRY    => 0x00_E000,  # private use
    TB_UNICODE_CONSORTIUM            => 0x01_6080,  # where it will really go
};

    I have an entire Tengwar module that makes heavy use of named private-use characters. All I do is this:

    use constant TENGWAR_BASE => TB_CONSCRIPT_UNICODE_REGISTRY;
    
    use charnames ":alias" => { 
      reverse (
        (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
        (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
        (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
        (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
        (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
        ....
      )
    };

    Now you can write \N{TENGWAR LETTER TINCO} etc. See how slick that is? Consider the alternative. Magic numbers. Worse, magic numbers with funny calculations in them. That is just so wrong that it completely justifies letting people name things how they want to, so long as they don't make other people do the same. What people do in the privacy of their own lexical scope is their own business.

It gets better. Perl lets you define your own character properties, too. Therefore I can write things like \p{Is_Tengwar_Decimal} and such. Right now I have these properties:

    In_Tengwar, Is_Tengwar
    In_Tengwar_Alphanumerics
    In_Tengwar_Consonants, In_Tengwar_Vowels, In_Tengwar_Alphabetics
    In_Tengwar_Numerals, Is_Tengwar_Decimal, Is_Tengwar_Duodecimal
    In_Tengwar_Punctuation
    In_Tengwar_Marks 

    So I have code in my Tengwar module that does stuff like this, using my own named characters (which again, are compile-time resolved and work only within this lexical scope):

     chr( $1 + ord("\N{TENGWAR DIGIT ZERO}") )

    Not to mention this using my own properties:

    $TENGWAR_GRAPHEME_RX = qr/(?:(?=\p{In_Tengwar})\P{In_Tengwar_Marks}\p{In_Tengwar_Marks}*)|\p{In_Tengwar_Marks}/x;

    Actually, I'm fibbing. I *never* write regexes all on one line like that: they are abhorrent to me. The pattern really looks like this in the code:

    $TENGWAR_GRAPHEME_RX = qr{
        (?:
            (?= \p{In_Tengwar} ) \P{In_Tengwar_Marks}   # Either one basechar...
            \p{In_Tengwar_Marks} *                      # ... plus 0 or more marks
        ) | 
            \p{In_Tengwar_Marks}                        # or else a naked unpaired mark.
    }x;

    People who write patterns without whitespace for cognitive chunking (plus comments for explanation) are wicked wicked wicked. Frankly I'm surprised Python doesn't require it. :)/2

    Anyway, do you see how much better that is than opaque unreadable magic numbers? Can you just imagine the sheer horror of writing that sort of code without the ability to define your own named characters *and* your own character properties? It's beautiful, simple, clean, and readable. I'll even go so far as to call it intuitive.

    No, I don't expect Python to do this sort of thing. You don't have proper scoping, so you can't ever do it cleanly the way Perl can.

    I just wanted to give a concrete example where flexibility leads to a much more readable program than inflexibility ever can.

    --tom

    "We hates magic numberses.  We hates them forevers!"
        --Sméagol the Hacker
    ezio-melotti commented 13 years ago

The problem with official names is that they have things in them that are not expected in names. Do you really and truly mean to tell me you think it is somehow **good** that people are forced to write \N{LINE FEED (LF)} rather than the more obvious pair of \N{LINE FEED} and \N{LF}??

Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely because that's a Unicode 1 name, and nowadays these codepoints are simply marked as '<control>'.

    If so, then I don't understand that. Nobody in their right mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

    They probably don't, but they just write \n anyway. I don't think we need to support any of these aliases, especially if they are not defined in the Unicode standard.

    I'm also not sure humans use \N{...}: you don't want to write 'R\N{LATIN SMALL LETTER E WITH ACUTE}sum\N{LATIN SMALL LETTER E WITH ACUTE}' and you would need to look up the exact name somewhere anyway before using it (unless you know them by heart). If 'R\xe9sum\xe9' or 'R\u00e9sum\u00e9' are too obscure and/or magic, you can always print() them and get 'Résumé' (or just write 'Résumé' directly in the source).
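To make the equivalence concrete (a quick check, not from the original comment):

    # All of these spell exactly the same string.
    assert ('R\N{LATIN SMALL LETTER E WITH ACUTE}sum\N{LATIN SMALL LETTER E WITH ACUTE}'
            == 'R\xe9sum\xe9' == 'R\u00e9sum\u00e9' == 'Résumé')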

    All of the standards documents *talk* about things like LRO and ZWNJ. I guess the standards aren't "readable" then, right? :)

    Right, I had to read down till the table with the meanings before figuring out what they were (and I already forgot it).

    The most persuasive use-case for user-defined names is for private-use area code points. These will never have an official name. But it is just fine to use them. Don't they deserve a better name, one that makes sense within your own program that uses them? Of course they do.

    For example, Apple has a bunch of private-use glyphs they use all the time. In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate logo/glyph thingie of an apple with a bite taken out of it. (Microsoft also has a bunch of these.) If you upgrade MacRoman to Unicode, you will find that that 0xF0 maps to code point U+F8FF using the regular converter.

    Now what are you supposed to do in your program when you want a named character there? You certainly do not want to make users put an opaque magic number as a Unicode escape. That is always really lame, because the whole reason we have \N{...} escapes is so we don't have to put mysterious unreadable magic numbers in our code!!

So all you do is

    use charnames ":alias" => {
        "APPLE LOGO" => 0xF8FF,
    };

    and now you can use \N{APPLE LOGO} anywhere within that lexical scope. The compiler will dutifully resolve it to U+F8FF, since all name lookups happen at compile-time. And it cannot leak out of the scope.

    This is actually a good use case for \N{..}.

One way to solve that problem is doing:

    apples = {
        'APPLE': '\uF8FF',
        'GREEN APPLE': '\U0001F34F',
        'RED APPLE': '\U0001F34E',
    }

and then:

    print('I like {GREEN APPLE} and {RED APPLE}, but not {APPLE}.'.format(**apples))

This requires the format call for each string and it's a workaround, but at least it's readable (I hope you don't have too many apples in your strings).

    I guess we could add some way to define a global list of names, and that would probably be enough for most applications. Making it per-module would be more complicated and maybe not too elegant.
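Pending such a mechanism, a run-time helper gets most of the way there (a sketch of the idea; N and CUSTOM_NAMES are made-up names, not an existing API):

    import unicodedata

    CUSTOM_NAMES = {
        'APPLE LOGO': '\uf8ff',   # private-use, so it will never have a UCD name
    }

    def N(name):
        """Resolve a custom alias first, then fall back to the UCD name."""
        try:
            return CUSTOM_NAMES[name]
        except KeyError:
            return unicodedata.lookup(name)

Of course this only works at run time; unlike Perl's charnames it cannot hook the \N{...} escape, which the compiler resolves.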

    People who write patterns without whitespace for cognitive chunking (plus comments for explanation) are wicked wicked wicked. Frankly I'm surprised Python doesn't require it. :)/2

I actually find those *less* readable. If there's something fancy in the regex, a comment *before* it is welcomed, but having to read a regex divided on several lines and remove meaningless whitespace and redundant comments just makes the parsing more difficult for me.

    ezio-melotti commented 13 years ago

    Attached a new patch with more tests and doc.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Ezio Melotti <report@bugs.python.org> wrote on Sun, 02 Oct 2011 06:46:26 -0000:

Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely because that's a Unicode 1 name, and nowadays these codepoints are simply marked as '<control>'.

Yes, but there are a lot of them, 65 of them in fact. I do not care to see people being forced to use literal control characters or inscrutable magic numbers. It really bothers me that you have all these defined code points with properties and all that have no name. People do use these. Some of them a lot. I don't mind \n and such -- and in fact, prefer them even -- but I feel I should not have to scratch my head over character \033, \0177, and brethren. The C0 and C1 standards are not just inventions, so we use them. Far better that one should write \N{ESCAPE} for \033 or \N{DELETE} for \0177, don't you think?

> If so, then I don't understand that. Nobody in their right mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

They probably don't, but they just write \n anyway. I don't think we need to support any of these aliases, especially if they are not defined in the Unicode standard.

If you look at NamesList.txt, there are significant "aliases" there for the C0/C1 stuff. My bottom line is that I don't like to be forced to use magic numbers. I prefer to name my abstractions. It is more readable and more maintainable that way.

    There are still "holes" of course. Code point 128 has no name even in C1. But something is better than nothing. Plus at least in Perl we *can* give things names if we want, per the APPLE LOGO example for U+F8FF. So nothing needs to remain nameless. Why, you can even name your Kanji if you want, using whatever Romanization you prefer. I think the private-use case example is really motivating, but I have no idea how to do this for Python because there is no lexical scope. I suppose you could attach it to the module, but that still doesn't really work because of how things get evaluated. With a Perl compile-time use, we can change the compiler's ideas about things, like adding function prototypes and even extending the base types:

    % perl -Mbigrat -le 'print 1/2 + 2/3 * 4/5'
    31/30
    
    % perl -Mbignum -le 'print 21->is_odd'
    1
    % perl -Mbignum -le 'print 18->is_odd'
    0
    
    % perl -Mbignum -le 'print substr(2**5000, -3)'
    376
    % perl -Mbignum -le 'print substr(2**5000-1, -3)'
    375
    
    % perl -Mbignum -le 'print length(2**5000)'
    1506
    % perl -Mbignum -le 'print length(10**5000)'
    5001
    
    % perl -Mbignum -le 'print ref 10**5000'
    Math::BigInt
    % perl -Mbigrat -le 'print ref 1/3'
    Math::BigRat

    I recognize that redefining what sort of object the compiler treats some of its constants as is never going to happen in Python, but we actually did manage that with charnames without having to subclass our strings: the hook for \N{...} doesn't require object games like the ones above.

    But it still has to happen at compile time, of course, so I don't know what you could do in Python. Is there any way to change how the compiler behaves even vaguely along these lines?

The run-time lookups of Python's unicodedata.lookup (like Perl's charnames::vianame) and unicodedata.name (like Perl's charnames::viacode on the ord) could be managed with a hook, but the compile-time lookups of \N{...} I don't see any way around. But I don't know anything about Python's internals, so don't even know what is or is not possible.

I do note that if you could extend \N{...} the way we do with charname aliases for private-use characters, the user could load something that did the C0 and C1 controls if they wanted to. I just don't know how to do that early enough that the Python compiler would see it. Your import happens at run-time or at compile-time? This would be some sort of compile-time binding of constants.

>> People who write patterns without whitespace for cognitive chunking (plus comments for explanation) are wicked wicked wicked. Frankly I'm surprised Python doesn't require it. :)/2

I actually find those *less* readable. If there's something fancy in the regex, a comment *before* it is welcomed, but having to read a regex divided on several lines and remove meaningless whitespace and redundant comments just makes the parsing more difficult for me.

    Really? White space makes things harder to read? I thought Pythonistas believed the opposite of that. Whitespace is very useful for cognitive chunking: you see how things logically group together.

    Inomorewantaregexwithoutwhitespacethananyothercodeortext. :)

    I do grant you that chatty comments may be a separate matter.

    White space in patterns is also good when you have successive patterns across multiple lines that have parts that are the same and parts that are different, as in most of these, which is from a function to render an English headline/book/movie/etc title into its proper casing:

    # put into lowercase if on our stop list, else titlecase
    s/  ( \pL [\pL']* )  /$stoplist{$1} ? lc($1) : ucfirst(lc($1))/xge;
    
    # capitalize a title's last word and its first word
    s/^ ( \pL [\pL']* )  /\u\L$1/x;  
    s/  ( \pL [\pL']* ) $/\u\L$1/x;  
    
    # treat parenthesized portion as a complete title
    s/ \( ( \pL [\pL']* )    /(\u\L$1/x;
    s/    ( \pL [\pL']* ) \) /\u\L$1)/x;
    
    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) ( \pL [\pL']* ) /$1\u\L$2/x;

    Now, that isn't good code for all *kinds* of reasons, but white space is not one of them. Perhaps what it is best at demonstrating is why Python goes about this the right way and that Perl does not. Oh drat, I'm about to attach this to the wrong bug. But it was the dumb code above that made me think about the following.

By virtue of having a "titlecase each word's first letter and lowercase the rest" function in Python, you can put the logic in just one place, and therefore if a bug is found, you can fix all code all at once.

But because Perl has always made it easy to grab "words" (actually, traditional programming language identifiers) and diddle their case, people write

    s/(\w+)/\u\L$1/g;

all the time, and that has all kinds of problems. If you prefer the functional approach, that is really

    s/(\w+)/ucfirst(lc($1))/ge;

    but that is still wrong.

    1. Too much code duplication. Yes, it's nice to see \pL[\pL']* stand out on each line, but shouldn't that be in a variable, like

      $word = qr/\pL[\pL']*/;
    2. What is a "word"? That code above is better than \w because it avoids numbers and underscores; however, it still uses letters only, not letters and marks, let alone number letters like Roman numerals.

    3. I see the apostrophe there, which is a good start, but what if it is a RIGHT SINGLE QUOTATION MARK, as in "Henry’s"? And what about hyphens? Those should not trigger capitalization in normal titles.

4. It turns out that all code that does a titlecase on the first character of a string it has already converted to lowercase has irreversibly lost information. Unicode casing is not reversible. Using \w for convenience, these can do different things:

      s/(\w+)/\u\L$1/g;
      s/(\w)(\w*)/\u$1\L$2/g;

      or in the functional approach,

      s/(\w+)/ucfirst(lc($1))/ge;
      s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

Now, while it is true that under Unicode 6.0 only these code points do the wrong thing with the naïve approach:

% unichars -gas 'ucfirst ne ucfirst lc'
İ  U+00130  GC=Lu  SC=Latin  LATIN CAPITAL LETTER I WITH DOT ABOVE
ϴ  U+003F4  GC=Lu  SC=Greek  GREEK CAPITAL THETA SYMBOL
ẞ  U+01E9E  GC=Lu  SC=Latin  LATIN CAPITAL LETTER SHARP S
Ω  U+02126  GC=Lu  SC=Greek  OHM SIGN
K  U+0212A  GC=Lu  SC=Latin  KELVIN SIGN
Å  U+0212B  GC=Lu  SC=Latin  ANGSTROM SIGN

      But it is still the wrong thing, and we never know what might happen in the future.
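A quick way to see that loss from Python itself (a sketch; exact results depend on the Unicode version your build ships):

    # lowercasing first and then upcasing does not round-trip these points
    for ch in '\u0130\u03f4\u1e9e\u2126\u212a\u212b':
        print('U+%05X %s: upper=%r, lower-then-upper=%r'
              % (ord(ch), ch, ch.upper(), ch.lower().upper()))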

I think Python is being smarter than Perl in simply providing people with a titlecase-each-word's-first-letter-and-lowercase-the-rest-in-the-whole-string function, because this means people won't be tempted to write

    s/(\w+)/ucfirst(lc($1))/ge;

    all the time. However, as I have written elsewhere, I question a lot of its underlying assumptions. It's clear that a "word" must in general include not just Letters but also Marks, or else you get different results in NFD and NFC, and the Unicode Standard is very against that.

    However, the problem is that what a word is cannot be considered independent of language. Words in English can contain apostrophes (whether written as an APOSTROPHE or as RIGHT SINGLE QUOTATION MARK) and hyphens (written as HYPHEN-MINUS, HYPHEN, and rarely even EN DASH).

    Each of these is a single word:

    ’tisn’t
    anti‐intellectual
    earth–moon

    The capitalization there should be

    ’Tisn’t
    Anti‐intellectual
    Earth–Moon

Notice how you can't do the same with the first apostrophe+t as with the second in "’Tisn’t". That is all challenging to code correctly (did you notice the EN DASH?), especially when you find something like red‐violet–colored. You probably want that to be Red‐violet–colored, because it is not an equal compound like earth–moon or yin–yang, which in correct orthography take an EN DASH not a HYPHEN, just as occurs when you hyphenate an already hyphenated word like red‐violet against colored, as in a red‐violet–colored flower. English titling rules only capitalize the first word in hyphenated words, which is why it's Anti‐intellectual not Anti-Intellectual.

    And of course, you can't actually create something in true English titlecase without knowing having a stop list of articles and (short) prepositions, and paying attention to whether it is the first or last word in the title, and whether it follows a colon or semicolon. Consider that phrasal verbs are construed to take adverbs not prepositions, and so "Bringing In the Sheaves" would be the correct capitalization of that song, since "to bring in" is a phrasal verb, but "A Ringing in My Ears" would be right for that. It is remarkably complicated.

    With English titlecasing, you have to respect what your publishing house considers a "short" preposition. A common cut-off is that short preps have 4 or fewer characters, but I have seen longer cutoffs. Here is one rather exhaustive list of English prepositions sorted by length:

    2: as at by in of on to up vs

    3: but for off out per pro qua via

    4: amid atop down from into like near next onto over pace past plus sans save than till upon with

    \<cutoff point for O'Reilly Media>

    5: about above after among below circa given minus round since thru times under until worth

    6: across amidst around before behind beside beside beyond during except inside toward unlike versus within

    7: against barring beneath besides between betwixt despite failing outside through thruout towards without

    10: throughout underneath

    The thing is that prepositions become adverbs in phrasal verbs, like "to go out" or "to come in", and all adverbs are capitalized. So a complete solution requires actual parsing of English!!!! Just say no -- or stronger.

    Merely getting something like this right:

    the lord of the rings: the fellowship of the ring  # Unicode lowercase
    THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING  # Unicode uppercase
    The Lord of the Rings: The Fellowship of the Ring  # English titlecase

    is going to take a bit of work. So is

    the sad tale of king henry ⅷ   and caterina de aragón  # Unicode lowercase
    THE SAD TALE OF KING HENRY Ⅷ   AND CATERINA DE ARAGÓN  # Unicode uppercase
    The Sad Tale of King Henry Ⅷ   and Caterina de Aragón  # English titlecase

    (and that must give the same answer in NFC vs NFD, of course.)
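To make the stoplist approach concrete, here is a rough Python sketch using the third-party regex module; the stoplist is a tiny stand-in of my own, and apostrophes, hyphens, and parentheses are ignored:

    import regex

    STOP = {"a", "an", "the", "and", "of", "in", "on", "to"}  # illustrative only

    def english_titlecase(s):
        def word(m):
            w = m.group(0)
            return w.lower() if w.lower() in STOP else w[:1].upper() + w[1:].lower()
        s = regex.sub(r"\pL[\pL']*", word, s)                              # stoplist pass
        s = regex.sub(r"^(\pL)", lambda m: m.group(1).upper(), s)          # first word
        s = regex.sub(r"\b(\pL)(?=[\pL']*$)",
                      lambda m: m.group(1).upper(), s)                     # last word
        s = regex.sub(r"([:;]\s+)(\pL)",
                      lambda m: m.group(1) + m.group(2).upper(), s)        # after colon
        return s

    print(english_titlecase("the lord of the rings: the fellowship of the ring"))
    # The Lord of the Rings: The Fellowship of the Ring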

Plus what to do with something like num2ascii is ill-defined in English, because having digits in the middle of a word is a very new phenomenon. Yes, Y2K gets caps, but that is for another reason. There is no agreement on what one should do with num2ascii or people42see. A function name shouldn't be capitalized at all, of course.

    And that is just English. Other languages have completely different rules. For example, per Wikipedia's entry on the colon:

    In Finnish and Swedish, the colon can appear inside words in a
    manner similar to the English apostrophe, between a word (or
    abbreviation, especially an acronym) and its grammatical (mostly
    genitive) suffixes. In Swedish, it also occurs in names, for example
    Antonia Ax:son Johnson (Ax:son for Axelson). In Finnish it is used
    in loanwords and abbreviations; e.g., USA:han for the illative case
    of "USA". For loanwords ending orthographically in a consonant but
    phonetically in a vowel, the apostrophe is used instead: e.g. show'n
    for the genitive case of the English loan "show" or Versailles'n for
    the French place name Versailles.

    Isn't that tricky! I guess that you would have to treat punctuation that has a word character immediately following it (and immediately preceding it) as being part of the word, and that it doesn't signal that a change in case is merited.
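Here is a sketch of that guess using the third-party regex module; the word-internal punctuation set (colon and both apostrophes) is my assumption, not settled behavior:

    import regex

    # treat :, ', and ’ as word-internal when flanked by word characters
    WORD = regex.compile(r"\w+ (?: [:'’] \w+ )*", regex.VERBOSE)

    def title_words(s):
        return WORD.sub(lambda m: m.group(0)[:1].upper() + m.group(0)[1:].lower(), s)

    print(title_words("antonia ax:son johnson"))   # Antonia Ax:son Johnson
    print(title_words("USA:han"))                  # Usa:han -- acronyms still need care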

I'm really not sure. It is not obvious what the right thing to do here is.

    I do believe that Python's titlecase function can and should be fixed to work correctly with Unicode. There really is no excuse for turning Aragón into AragóN, for example, or not doing the right thing with ⅷ and Ⅷ .
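The AragóN failure is easy to reproduce whenever the combining mark is treated as uncased (a quick check; behavior can vary by Python version):

    import unicodedata

    nfc = 'aragón'
    nfd = unicodedata.normalize('NFD', nfc)   # ó becomes o + COMBINING ACUTE ACCENT
    print(nfc.title())   # Aragón
    print(nfd.title())   # AragóN -- the n looks "word-initial" after the uncased mark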

I fear the only thing you can do with the confusion of Unicode titlecase and English titlecase is to explain that properly rendering English titles and headlines is a much more complicated job which you will not even attempt. (And shouldn't: English titlecase is clearly too specialized for a general function.)

    However, I'm still bothered by things with apostrophes though.

    can't 
    isn't 
wouldn't've
    Bill's
    'tisn't

    since I can't countenance the obviously wrong:

    Can'T 
    Isn'T 
Wouldn'T'Ve
    Bill'S
    'Tisn'T

with the last the hardest to get right. I do have code that correctly handles English words and code that correctly handles English titles, but it is much trickier than the titlecase() function.

    And Swedes might be upset seeing Antonia Ax:Son Johnson instead of Antonia Ax:son Johnson.

    Maybe we should just go back to the Pythonic equivalent of

    s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

    where \w is specifically per tr18's Annex C, and give up on punctuation altogether, with a footnoted caveat or something. I wouldn't complain about that. The rest is just too, too hard. Wouldn't you agree?
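For concreteness, that fallback might look like this -- a sketch assuming the third-party regex module, whose \w already tracks tr18's Annex C fairly closely:

    import regex

    def simple_title(s):
        # titlecase the first \w of each word, lowercase the rest
        return regex.sub(r'(\w)(\w*)',
                         lambda m: m.group(1).title() + m.group(2).lower(),
                         s)

    print(simple_title("the sad tale of king henry ⅷ"))
    # The Sad Tale Of King Henry Ⅷ  (every word capitalized, punctuation ignored)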

    Thank you very much for all your hard work -- and patience with me.

    --tom

    terryjreedy commented 13 years ago

    Really? White space makes things harder to read? I thought Pythonistas believed the opposite of that.

I was surprised at that too ;-). One person's opinion in a specific context. Don't generalize.

    English titling rules only capitalize the first word in hyphenated words, which is why it's Anti‐intellectual not Anti-Intellectual.

    Except that I can imagine someone using the latter as a noun to make the work more officious or something. There are no official English titling rules and as you noted, publishers vary. I agree that str.title should do something sensible based on Unicode, with the improvements you mentioned.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    > Really? White space makes things harder to read? I thought Pythonistas > believed the opposite of that.

    I was surprised at that too ;-). One person's opinion in a specific context. Don't generalize.

The example I initially showed probably wasn't the best for that. Mostly I was trying to demonstrate how useful it is to have user-defined properties, is all. But I have not asked for that (I have asked for properties, though).

    > English titling rules > only capitalize the first word in hyphenated words, which is why it's > Anti‐intellectual not Anti-Intellectual.

    Except that I can imagine someone using the latter as a noun to make the work more officious or something.

If Good-Looking looks more officious than Good-looking, I bet GOOD-LOOKING is better still. :)

    There are no official English titling rules and as you noted, publishers vary.

    If there aren't any rules, then how come all book and movie titles always look the same? :) I don't think anyone would argue with these two:

    1. Capitalize the first word, the last word, and the word right after a colon (or semicolon).

    2. Capitalize all intervening words except for articles (a, an, the) and short prepositions.

    Those are the basic rules. The main problem is that "short" isn't well defined--and indeed, there are even places where "preposition" isn't well defined either.

    English has sentence casing (only the first word) and headline casing (most of them). It's problematic that computer people call capitalizing each word titlecasing, since in English, this is never correct.

    http://www.chicagomanualofstyle.org/CMS_FAQ/CapitalizationTitles/CapitalizationTitles23.html
    
     Although Chicago style lowercases prepositions (but see CMOS 8.157
     for exceptions), some style guides uppercase them. Ask your editor
     for a style guide.

I myself usually fall back on the Chicago Manual of Style or the Oxford Guide to Style. I don't think I do anything that neither of them says to do.

    But I completely agree that this should *not* be in the titlecase() function. I think the docs for the function might perhaps say something about how it does not mean correct English headline case when it says titlecase, but that's largely just nitpicking.

    I agree that str.title should do something sensible based on Unicode, with the improvements you mentioned.

    One of the goals of Unicode is that casing not be language dependent. And they almost got there, too. The Turkic I is the most notable exception.

Did you know there is a problem with all the case stuff in Python? It was clearly put in before anyone realized that things other than Lu/Lt/Ll needed to have casing properties. That's why there is a difference between GC=Ll and the Lowercase property.

        str.islower()
    Return true if all cased characters in the string are lowercase and
    there is at least one cased character, false otherwise. Cased
    characters are those with general category property being one of
    “Lu”, “Ll”, or “Lt” and lowercase characters are those with general
    category property “Ll”.
    
    http://docs.python.org/release/3.2/library/stdtypes.html

That really isn't right. A cased character is one with the Unicode "Cased" property, and a lowercase character is one with the Unicode "Lowercase" property. The General Category is actually immaterial here.
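The distinction is easy to check with the regex module, which exposes those binary properties directly; these are exactly the code points that show up in my failing cases below:

    import regex, unicodedata

    for ch in ('\u1d9c',    # MODIFIER LETTER SMALL C        (GC=Lm)
               '\u24da',    # CIRCLED LATIN SMALL LETTER K   (GC=So)
               '\u0345'):   # COMBINING GREEK YPOGEGRAMMENI  (GC=Mn)
        print(unicodedata.name(ch),
              'GC=' + unicodedata.category(ch),
              'Cased=%s' % bool(regex.match(r'\p{Cased}', ch)),
              'Lowercase=%s' % bool(regex.match(r'\p{Lowercase}', ch)))

All three come out Cased and Lowercase, even though none of them is GC=Ll.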

    I've spent all bloody day trying to model Python's islower, isupper, and istitle functions, but I get all kinds of errors, both in the definitions and in the models of the definitions. Under both 2.7 and 3.2, I get all these bugs:

    ᶜ not islower() but has at least one cased character with all cased characters lowercase!
    ᴰ not islower() but has at least one cased character with all cased characters lowercase!
    ⓚ not islower() but has at least one cased character with all cased characters lowercase!
    ͅ not islower() but has at least one cased character with all cased characters lowercase!
    Ⅷ not isupper() but has at least one cased character with all cased characters uppercase!
    Ⅷ not istitle() but should be
    ⅷ not islower() but has at least one cased character with all cased characters lowercase!
    2ⁿᵈ not islower() but has at least one cased character with all cased characters lowercase!
    2ᴺᴰ not islower() but has at least one cased character with all cased characters lowercase!
    Ὰͅ isupper() but fails to have at least one cased character with all cased characters uppercase!
    ThisIsInTitleCaseYouKnow not istitle() but should be
    Mᶜ isupper() but fails to have at least one cased character with all cased characters uppercase!
    ᶜM isupper() but fails to have at least one cased character with all cased characters uppercase!
    ᶜM istitle() but should not be
    MᶜKINLEY isupper() but fails to have at least one cased character with all cased characters uppercase!

I really don't understand. BTW, I feel that MᶜKinley is titlecase in that lowercase always follows uppercase and uppercase never follows itself. And Python agrees with me. Yet that same definition should vet ThisIsInTitleCaseYouKnow, where Python disagrees.

    I really don't understand any of these functions. I'm very sad. I think they are wrong, but maybe I am. It is extremely confusing.

    Shall I file a separate bug report?

    --tom

    from __future__ import unicode_literals
    from __future__ import print_function
    
    import regex
    
    VERBOSE = 0 
    
    data = [

    # first test the problem cases just one at a time
    "\N{MODIFIER LETTER SMALL C}",
    "\N{SUPERSCRIPT LATIN SMALL LETTER N}",
    "\N{MODIFIER LETTER CAPITAL D}",
    "\N{CIRCLED LATIN SMALL LETTER K}",
    "\N{COMBINING GREEK YPOGEGRAMMENI}",
    "\N{ROMAN NUMERAL EIGHT}",
    "\N{SMALL ROMAN NUMERAL EIGHT}",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}",
    "\N{LATIN LETTER SMALL CAPITAL R}",

    # test superscripts
    "2\N{SUPERSCRIPT LATIN SMALL LETTER N}\N{MODIFIER LETTER SMALL D}",
    "2\N{MODIFIER LETTER CAPITAL N}\N{MODIFIER LETTER CAPITAL D}",
    "2\N{FEMININE ORDINAL INDICATOR}",  # as in "segunda"

    # test romans
    "ROMAN NUMERAL EIGHT IS \N{ROMAN NUMERAL EIGHT}",
    "roman numeral eight is \N{SMALL ROMAN NUMERAL EIGHT}",

    # test small caps
    "\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL E}",

    # test cased combining mark (this is in titlecase)
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}",
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}",

    # test cased symbols
    "circle \N{CIRCLED LATIN SMALL LETTER K}",
    "CIRCLE \N{CIRCLED LATIN CAPITAL LETTER K}",

    # test titlecased code point 3-way
    "\N{LATIN CAPITAL LETTER DZ}",
    "\N{LATIN CAPITAL LETTER DZ}UR",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}ur",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}",
    "\N{LATIN SMALL LETTER DZ}ur",
    "\N{LATIN SMALL LETTER DZ}",

    # test titlecase

    "FBI", "F B I", "F.B.I",
    "HP Company", "H.P. Company",
    "ThisIsInTitleCaseYouKnow",
    
    "M\N{MODIFIER LETTER SMALL C}",
    "\N{MODIFIER LETTER SMALL C}M",
    
    "M\N{MODIFIER LETTER SMALL C}Kinley",  # titlecase
    "M\N{MODIFIER LETTER SMALL C}KINLEY",  # uppercase
    "m\N{MODIFIER LETTER SMALL C}kinley",  # lowercase
    
    # Return true if the string is a titlecased string and there
    # is at least one character, for example uppercase characters may
    # only follow uncased characters and lowercase characters only
    # cased ones. Return false otherwise.
    

    ]

    for s in data:

    # "Return true if all cased characters in the string are lowercase # and there is at least one cased character"

        if s.islower():
            if not (        regex.search(r'\p{cased}', s) 
                    and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
                print(s+" islower() but fails to have at least one cased character with all cased characters lowercase!")
        else:
            if (        regex.search(r'\p{cased}', s) 
                and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
                print(s+" not islower() but has at least one cased character with all cased characters lowercase!")

    # "Return true if all cased characters in the string are uppercase # and there is at least one cased character"

        if s.isupper():
            if not (        regex.search(r'\p{cased}', s) 
                    and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
                print(s+" isupper() but fails to have at least one cased character with all cased characters uppercase!")
        else:
            if (        regex.search(r'\p{cased}', s) 
                and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
                print(s+" not isupper() but has at least one cased character with all cased characters uppercase!")

    # "Return true if the string is a titlecased string and there is at # least one character, for example uppercase characters may only # follow uncased characters and lowercase characters only cased ones."

        has_it  = s.istitle()
        want_it1 = (  
              # at least one title/uppercase
                    regex.search(r'[\p{Lt}\p{uppercase}]', s) 
                      and not 
              # plus no title/uppercase follows cased character
                   regex.search(r'(?<=\p{cased})[\p{Lt}\p{uppercase}]', s)
                      and not 
              # plus no lowercase follows uncased character
                   regex.search(r'(?<=\P{CASED})\p{lowercase}', s)
                  )
    
        want_it  = regex.search(r'''(?x) 
            ^ 
                (?:
                    \P{CASED} * 
                    [\p{Lt}\p{uppercase}] 
                    (?! [\p{Lt}\p{uppercase}] )
                        \p{lowercase} *
                ) +
                \P{CASED} * 
            $
        ''', s)
    
        if VERBOSE:
            if has_it and want_it:
                print( s + " istitle() and should be (OK)")
            if not has_it and not want_it:
                print( s + " not istitle() and should not be (OK)")
    
        if has_it and not want_it:
            print( s + " istitle() but should not be")
    
        if want_it and not has_it:
            print( s + " not istitle() but should be")
    ezio-melotti commented 13 years ago

    But it still has to happen at compile time, of course, so I don't know what you could do in Python. Is there any way to change how the compiler behaves even vaguely along these lines?

    I think things like "from __future__ import ..." do something similar, but I'm not sure it will work in this case (also because you will have to provide the list of aliases somehow).

    > Really? White space makes things harder to read? I thought Pythonistas > believed the opposite of that. Whitespace is very useful for cognitive > chunking: you see how things logically group together.

I was surprised at that too ;-). One person's opinion in a specific context. Don't generalize.

Also don't generalize my opinion regarding *where* whitespace makes things less readable: I was just talking about regexes. What I was trying to say here is best summarized by a quote from Paul Graham's article "Succinctness is Power":

    """
    If you're used to reading novels and newspaper articles, your first
    experience of reading a math paper can be dismaying. It could take half
    an hour to read a single page. And yet, I am pretty sure that the
    notation is not the problem, even though it may feel like it is. The
    math paper is hard to read because the ideas are hard. If you expressed
    the same ideas in prose (as mathematicians had to do before they evolved
    succinct notations), they wouldn't be any easier to read, because the
    paper would grow to the size of a book.
    """

Try replacing s/novels and newspaper articles|prose/Python code/g, s/single page/single regex/, s/math paper/regex/g.

    To provide an example, I find:

    # define a function to capitalize s
    def my_capitalize(s):
        """This function capitalizes the argument s and returns it"""
        the_first_letter = s[0]  # 0 means the first char
        the_rest_of_s = s[1:]  # 1: means from the second till the end
        the_first_letter_uppercased = the_first_letter.upper()  # upper makes the string uppercase
        the_rest_of_s_lowercased = the_rest_of_s.lower()  # lower makes the string lowercase
        s_capitalized = the_first_letter_uppercased + the_rest_of_s_lowercased  # + concatenates
        return s_capitalized

    less readable than:

    def my_capitalize(s):
        return s[0].upper() + s[1:].lower()

You could argue that the first is much more explicit and in a way clearer, but overall I think you agree with me that it is less readable. Also, this clearly depends on how well you know the notation you are reading: if you don't know it very well, you might still prefer the commented/verbose/extended/redundant version.

Another important thing to mention is that the notation of regular expressions is fairly simple (especially if you leave out look-arounds and the Unicode-related things that are not used too often), but having a similarly succinct notation for a whole programming language (like Perl) might not work as well. (I'm not picking on Perl here; as you said, you can write readable programs if you don't abuse the notation, and the succinctness offered by the language has some advantages, but with Python we prefer more readable, even if we have to be a little more verbose.) Another example of a trade-off between verbosity and succinctness is the new string-formatting mini-language.

That really isn't right. A cased character is one with the Unicode "Cased" property, and a lowercase character is one with the Unicode "Lowercase" property. The General Category is actually immaterial here.

    You might want to take a look and possibly add a comment on bpo-12204 about this.

    I've spent all bloody day trying to model Python's islower, isupper, and istitle functions, but I get all kinds of errors, both in the definitions and in the models of the definitions.

    If by "model" you mean "trying to figure out how they work", it's probably easier to look at the implementation (I assume you know enough C to understand what they do). You can find the code for str.istitle() at http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358 and the actual implementation of some macros like Py_UNICODE_ISTITLE at http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

    I really don't understand any of these functions. I'm very sad. I think they are wrong, but maybe I am. It is extremely confusing.

    Shall I file a separate bug report?

    If after reading the code and/or the documentation you still think they are broken and/or that they can be improved, then you can open another issue.

BTW, instead of writing custom scripts to test things, it might be better to use unittest (see http://docs.python.org/py3k/library/unittest.html#basic-example), or even better to write a patch for Lib/test/test_unicode.py. Using unittest has the advantage that it is then easy to integrate those tests within our test suite, but on the other hand as soon as something fails the failure is reported without evaluating the following assertions in the method. This has the advantage that
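For instance, one of the checks above as a minimal unittest sketch (the assertion is only expected to pass on builds whose islower() honors Other_Lowercase):

    import unittest

    class CasingTest(unittest.TestCase):
        def test_modifier_letter_small_c_islower(self):
            # U+1D9C is cased and lowercase via Other_Lowercase, so a
            # string containing only it should arguably be islower()
            self.assertTrue('\u1d9c'.islower())

    if __name__ == '__main__':
        unittest.main()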

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    > There are no official English titling rules and as you noted, > publishers vary.

    If there aren't any rules, then how come all book and movie titles always look the same? :)

Can we please leave the English language out of this issue? Else I will ask that Python use German text-processing rules, just so that this gets fewer comments :-)

As a point of order, please all try to stick to the issue at hand. Linguistics discussions or general Unicode discussions have better places than this bug tracker. I just had to stop reading Tom's comments as too verbose (which is more difficult for me since they're in a foreign language).

    ezio-melotti commented 13 years ago

The patch is pretty much complete; it just needs a review (I left some comments on the review page). One thing that could be added is some compression for the names of the named sequences. I'm not sure I can easily reuse the same compression used for the other names. Does the size of the db really matter? Are the new names using too much extra space?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

The patch needs to take versioning into account. It seems that NamedSequences were added in 4.1, and NameAliases in 5.0. So for the moment, when using 3.2 (i.e. when self is not NULL), it is fine to look up neither. Please put an assertion into makeunicodedata that this needs to be reviewed when an old version other than 3.2 needs to be supported.

The size of the DB does matter; there are frequent complaints about it. The named sequences take 20kB on my system; not sure whether that's too much. If you want to reduce the size (and also speed up lookup), you could use private-use characters, like so:
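Sketched in Python with names of my own invention (the real patch works on the C name database, not dicts):

    PUA_START = 0xF0000   # plane-15 private use; an arbitrary choice here

    named_sequences = {   # two entries from NamedSequences.txt
        'LATIN SMALL LETTER R WITH TILDE': 'r\u0303',
        'LATIN CAPITAL LETTER A WITH MACRON AND GRAVE': '\u0100\u0300',
    }

    # build time: give each sequence name an ordinary single code point...
    name_to_pua = {name: PUA_START + i
                   for i, name in enumerate(sorted(named_sequences))}
    # ...and keep a side table from that code point to the real sequence
    pua_to_seq = {name_to_pua[n]: s for n, s in named_sequences.items()}

    def lookup(name):
        cp = name_to_pua[name]    # plain name -> code point lookup
        return pua_to_seq[cp]     # expand private-use point to the sequence

    print(ascii(lookup('LATIN SMALL LETTER R WITH TILDE')))   # 'r\u0303'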

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Ezio Melotti \report@bugs.python.org\ wrote on Mon, 03 Oct 2011 04:15:51 -0000:

    > But it still has to happen at compile time, of course, so I don't know > what you could do in Python. Is there any way to change how the compiler > behaves even vaguely along these lines?

    I think things like "from __future__ import ..." do something similar, but I'm not sure it will work in this case (also because you will have to provide the list of aliases somehow).

    Ah yes, that's right. Hm. I bet then it *would* be possible, just perhaps a bit of a run-around to get there. Not a high priority, but interesting.

    less readable than:

    def my_capitalize(s): return s[0].upper() + s[1:].lower()

    You could argue that the first is much more explicit and in a way clearer, but overall I think you agree with me that is less readable.

    Certainly.

    It's a bit like the way bug rate per lines of code is invariant across programming languages. When you have more opcodes, it gets harder to understand because there are more interactions and things to remember.

> That really isn't right. A cased character is one with the Unicode "Cased" > property, and a lowercase character is one with the Unicode "Lowercase" > property. The General Category is actually immaterial here.

    You might want to take a look and possibly add a comment on bpo-12204 about this.

    > I've spent all bloody day trying to model Python's islower, isupper, and istitle > functions, but I get all kinds of errors, both in the definitions and in the > models of the definitions.

    If by "model" you mean "trying to figure out how they work", it's probably easier to look at the implementation (I assume you know enough C to understand what they do). You can find the code for str.istitle() at http://hg.python.org/cpython/file/default/Objects/un- icodeobject.c#l10358 and the actual implementation of some macros like Py_UNICODE_ISTITLE at http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

    Thanks, that helps immensely. I'm completely fluent in C. I've gone and built a tags file of your whole v3.2 source tree to help me navigate.

The main underlying problem is that the internal macros are defined in a way that made sense a long time ago but no longer does, ever since (for example) the Unicode Lowercase property stopped being synonymous with GC=Ll and started also including all code points with the Other_Lowercase property.

    The originating culprit is Tools/unicode/makeunicodedata.py. It builds your tables only using UnicodeData.txt, which is not enough. For example:

        if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
        flags |= ALPHA_MASK
        if category == "Ll":
        flags |= LOWER_MASK
        if 'Line_Break' in properties or bidirectional == "B":
        flags |= LINEBREAK_MASK
        linebreaks.append(char)
        if category == "Zs" or bidirectional in ("WS", "B", "S"):
        flags |= SPACE_MASK
        spaces.append(char)
        if category == "Lt":
        flags |= TITLE_MASK
        if category == "Lu":
        flags |= UPPER_MASK

    It needs to use DerivedCoreProperties.txt to figure out whether something is Other_Uppercase, Other_Lowercase, etc. In particular:

    Alphabetic := Lu+Ll+Lt+Lm+Lo + Nl + Other_Alphabetic
    Lowercase  := Ll + Other_Lowercase
Uppercase  := Lu + Other_Uppercase

    This affects a lot of things, but you should be able to just fix it in Tools/unicode/makeunicodedata.py and have all of them start working correctly.

    You will probably also want to add

        Py_UCS4 _PyUnicode_IsWord(Py_UCS4 ch)

    that uses the UTS#18 Annex C definition, so that you catch marks, too. That definition is:

    Word := Alphabetic + Mc+Me+Mn + Nd + Pc

    where Alphabetic is defined above to include Nl and Other_Alphabetic.
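Written as a character class with the third-party regex module, which exposes the needed properties, that definition is:

    import regex

    IS_WORD = regex.compile(r'[\p{Alphabetic}\p{Mc}\p{Me}\p{Mn}\p{Nd}\p{Pc}]')

    for ch in ('ǅ', '\u0303', 'Ⅷ', '_', '-'):
        print(ascii(ch), bool(IS_WORD.match(ch)))
    # Ⅷ matches via Nl (folded into Alphabetic); the hyphen does not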

Somewhat related is stuff like this:

        typedef struct {
        const Py_UCS4 upper;
        const Py_UCS4 lower;
        const Py_UCS4 title;
        const unsigned char decimal;
        const unsigned char digit;
        const unsigned short flags;
        } _PyUnicode_TypeRecord;

    There are two different bugs here. First, you are missing

        const Py_UCS4 fold;

which is the simple case folding from the UCD's CaseFolding.txt, one that is critical for doing case-insensitive matches correctly.

    Second, there's also the problem that Py_UCS4 is an int. That means you are stuck with just the character-based simple versions of upper-, title-, lower-, and foldcase. You need to have fields for the full mappings, which are now strings (well, int arrays) not single ints. I'll use ??? for the int-array type that I don't know:

    const ??? upper_full;
    const ??? lower_full;
    const ??? title_full;
    const ??? fold_full;

    You will also need to extend the API from just

        Py_UCS4 _PyUnicode_ToUppercase(Py_UCS4 ch)

    to something like

    ??? _PyUnicode_ToUppercase_Full(Py_UCS4 ch)

I don't know what the ??? return type is there, but it's whatever the upper_full field in _PyUnicode_TypeRecord would be.
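For what it's worth, the full string-valued mappings are observable at the str level once they exist; CPython 3.3 grew them after this discussion:

    print('ß'.upper())   # SS  -- U+00DF uppercases to two code points
    print('ﬁ'.upper())   # FI  -- U+FB01 LATIN SMALL LIGATURE FI
    print('İ'.lower())   # i followed by U+0307 COMBINING DOT ABOVE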

    I know that Matthew Barnett has had to cover a bunch of these for his regex module, including generating his own tables. It might be possible to piggy-back on that effort; certainly it would be desirable to try.

    I really don't understand any of these functions. I'm very sad. I think they are wrong, but maybe I am. It is extremely confusing.

    > Shall I file a separate bug report?

    If after reading the code and/or the documentation you still think they are broken and/or that they can be improved, then you can open another issue.

I hadn't actually *looked* at capitalize yet, because I stumbled over these errors in the way-underlying code that necessarily supports it. The errors in the definitions explain a lot of what I was seeing.

    Ok, more bugs. Consider this:

    static int
    fixcapitalize(PyUnicodeObject *self)
    {
        Py_ssize_t len = self->length;
        Py_UNICODE *s = self->str;
        int status = 0;

        if (len == 0)
            return 0;
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOUPPER(*s);
            status = 1;
        }
        s++;
        while (--len > 0) {
            if (Py_UNICODE_ISUPPER(*s)) {
                *s = Py_UNICODE_TOLOWER(*s);
                status = 1;
            }
            s++;
        }
        return status;
    }

    There are several bugs there. First, you have to use the TITLECASE if there is one, and only use the uppercase if there is no titlecase. Uppercase is wrong.

Second, you cannot decide to do the case change only if it starts out as a certain case. You have to do it unconditionally, especially since your tests for whether something is upper or lower are wrong. For example, Roman numerals, the iota subscript, the circled letters, and a few other things are all case-changing but are not themselves Letters in the GC=Ll/Lu/Lt sense. Also, there are cased letters in the GC=Lm category, which you miss. Unicode has properties like Cased that you should be using to determine whether something is cased. It also has properties like Changes_When_Uppercased (aka CWU) that tell you whether something will change. For example, most of the small capitals are cased code points that are considered lowercase and which do not change when uppercased. However, LATIN LETTER SMALL CAPITAL R (which is a lowercase code point) actually does have an uppercase mapping. Strange but true.
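That strange-but-true case is easy to check directly (names in the comments are from UnicodeData.txt):

    print('\u1d00'.upper())   # ᴀ  -- LATIN LETTER SMALL CAPITAL A has no mapping
    print('\u0280'.upper())   # Ʀ  -- SMALL CAPITAL R maps to U+01A6, LATIN CAPITAL LETTER YR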

    Does this help at all? I have to go to a meeting now.

    --tom

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

The main underlying problem is that the internal macros are defined in a way that made sense a long time ago but no longer does, ever since (for example) the Unicode Lowercase property stopped being synonymous with GC=Ll and started also including all code points with the Other_Lowercase property.

    Tom: PLEASE focus on one issue at a time. This is about formal aliases and named sequences, NOT about upper and lower case. If you want to have a discussion about upper and lower case, please open a separate issue. There I would explain why I think your reasoning is flawed (i.e. just because your interpretation of Unicode differs from Python's implementation doesn't already make Python's implementation incorrect - just different).

    ezio-melotti commented 13 years ago

    Here is a new patch that stores the names of aliases and named sequences in the Private Use Area.

To summarize a bit, this is what we want (A = aliases, NS = named sequences):

            | 6.0.0 | 3.2.0 |
    --------+-------+-------+
    \N{...} |   A   |   -   |
    .name   |   -   |   -   |
    .lookup | A,NS  |   -   |

    I.e., \N{...} should only support aliases, unicodedata.lookup should support aliases and named sequences, unicodedata.name doesn't support either, and when 3.2.0 is used nothing is supported.

    The function calls involved for these 3 functions are:

\N{...} and .lookup: _getcode -> _cmpname -> _getucname -> _check_alias

    .name: _getucname

    My patch adds an extra arg to _getcode and _getucname (I hope that's fine -- or are they public?).

    _getcode is called by \N{...} and .lookup; both support aliases, so _getcode now resolves aliases by default. Since only .lookup wants named sequences, _getcode now accepts an extra 'with_named_seq' arg and looks up named sequences only when its value is 1. .lookup passes 1, gets the codepoint, and converts it to a sequence. \N{...} passes 0 and doesn't get named sequences.

    _getucname is called by .name and indirectly (through _cmpname) by .lookup and \N{...}. Since _getcode takes care of deciding who gets aliases and sequences, _getucname now accepts an extra 'with_alias_and_seq' arg and looks up aliases and named sequences only when its value is 1. _cmpname passes 1, gets aliases and named sequences and then lets _getcode decide what to do with them. .name passes 0 and doesn't get aliases and named sequences.

All this happens on 6.0.0 only; when self is not NULL (i.e. we are using 3.2.0), named sequences and aliases are ignored.

    The patch doesn't include the changes to unicodename_db.h -- run makeunicodedata.py to get them. I also added more tests to make sure that the names added in the PUA don't leak, and that ucd_3_2_0 is not affected.
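From the user's side, the intended behavior on a patched build looks like this (a sketch of the expected results, not test suite output):

    import unicodedata

    # aliases resolve in lookup() and in \N{...}
    print(unicodedata.lookup('LATIN CAPITAL LETTER GHA'))    # Ƣ (U+01A2)
    print('\N{LATIN CAPITAL LETTER GHA}')                    # same character

    # named sequences resolve in lookup() only, as multi-character strings
    print(ascii(unicodedata.lookup('LATIN SMALL LETTER R WITH TILDE')))  # 'r\u0303'

    # name() still reports only the original name
    print(unicodedata.name('\u01a2'))    # LATIN CAPITAL LETTER OI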

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Ezio Melotti \report@bugs.python.org\ wrote on Sun, 09 Oct 2011 13:21:00 -0000:

    Here is a new patch that stores the names of aliases and named sequences in the Private Use Area.

    Looks good! Thanks!

    --tom

    ezio-melotti commented 13 years ago

(I had to re-upload the patch a couple of times to get the review button to work. Apparently if there are some conflicts Rietveld fails to apply the patch, whereas hg is able to merge the files without problems here. Sorry for the noise.)

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    If you don't use git-style diffs, Rietveld will much better accommodate patches that don't apply to tip cleanly. Unfortunately, hg git-style diffs don't indicate the base revision, so Rietveld guesses that the base line is tip, and then fails if it doesn't apply exactly.

    ezio-melotti commented 13 years ago

    If the latest patch is fine I'll commit it shortly.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

    Yes, it looks good. Thank you very much.

    -tom

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    LGTM

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset a985d733b3a3 by Ezio Melotti in branch 'default': bpo-12753: Add support for Unicode name aliases and named sequences. http://hg.python.org/cpython/rev/a985d733b3a3

    ezio-melotti commented 13 years ago

    I committed the patch and the buildbots seem happy. Thanks for the report and the feedback!

    Tom, about the problems you mentioned in msg144836, can you report it in a new issue or, if there are already issues about them, add a message there?

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset 329b96fe4472 by Ezio Melotti in branch 'default': bpo-12753: fix compilation on Windows. http://hg.python.org/cpython/rev/329b96fe4472

    abalkin commented 11 years ago

    about the problems you mentioned in msg144836, can you report it in a new issue or, if there are already issues about them, add a message there?

    I believe that would be bpo-4610.