python / cpython

The Python programming language
https://www.python.org

Add globbing to unicodedata.lookup #79730

Open f5d35338-ebda-4fb2-87aa-ceee75932c6e opened 5 years ago

f5d35338-ebda-4fb2-87aa-ceee75932c6e commented 5 years ago
BPO 35549
Nosy @vstinner, @ezio-melotti, @stevendaprano

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', '3.8', 'expert-unicode']
title = 'Add globbing to unicodedata.lookup'
updated_at =
user = 'https://bugs.python.org/rominf'
```
bugs.python.org fields:
```python
activity =
actor = 'rominf'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'rominf'
dependencies = []
files = []
hgrepos = []
issue_num = 35549
keywords = []
message_count = 4.0
messages = ['332283', '332317', '332318', '332325']
nosy_count = 4.0
nosy_names = ['vstinner', 'ezio.melotti', 'steven.daprano', 'rominf']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue35549'
versions = ['Python 3.8']
```

f5d35338-ebda-4fb2-87aa-ceee75932c6e commented 5 years ago

I propose adding a partial_match: bool = False argument to unicodedata.lookup so that programmers can search for Unicode characters by partial name.

stevendaprano commented 5 years ago

I love the idea, but dislike the proposed interface.

As a general rule of thumb, Guido dislikes "constant bool parameters", where you pass a literal True or False as an argument to change a function's behaviour. Obviously this is not a hard rule, and there are functions in the stdlib that do this, but like Guido I think we should avoid them in general.

Instead, I think we should allow the name to include globbing symbols such as * and ? (I think full-blown re syntax is overkill). I have an implementation that I use:

lookup(name) -> single character # the current behaviour

lookup(name_with_glob_symbols) -> list of characters

For example lookup('latin * Z') returns:

['LATIN CAPITAL LETTER Z', 'LATIN SMALL LETTER Z', 'LATIN CAPITAL LETTER D WITH SMALL LETTER Z', 'LATIN LETTER SMALL CAPITAL Z', 'LATIN CAPITAL LETTER VISIGOTHIC Z', 'LATIN SMALL LETTER VISIGOTHIC Z']

A straight substring match takes at worst twelve extra characters:

lookup('*' + name + '*')

and only two if the name is a literal:

lookup('*spam*')

This is less than partial_match=True (18 characters) and more flexible and powerful. There's no ambiguity between the two styles of call because the globbing symbols * ? and [] are never legal in Unicode names. See section 4.8 of

http://www.unicode.org/versions/Unicode11.0.0/ch04.pdf
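The claim that globbing symbols never occur in names can be checked against whatever Unicode database the running interpreter ships. The following is a minimal sketch, not part of the proposal: the helper name names_containing_glob_chars is hypothetical, and it only covers names returned by unicodedata.name, not aliases or named sequences.

```python
import unicodedata

# The metacharacters the glob syntax would reserve.
GLOB_CHARS = set("*?[]")

def names_containing_glob_chars():
    """Return every character name that contains a glob metacharacter."""
    offenders = []
    for codepoint in range(0x110000):
        # name() returns the default '' for unnamed code points.
        char_name = unicodedata.name(chr(codepoint), "")
        if GLOB_CHARS & set(char_name):
            offenders.append(char_name)
    return offenders

offenders = names_containing_glob_chars()
print(offenders)
```

An empty list means a name containing any of these symbols can safely be treated as a glob rather than an exact name.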

stevendaprano commented 5 years ago

Here's my implementation:

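from unicodedata import name
from unicodedata import lookup as _lookup
from fnmatch import translate
from re import compile, I

_NAMES = None  # lazily built cache of every character name

def getnames():
    """Return the list of all character names, building it on first use."""
    global _NAMES
    if _NAMES is None:
        _NAMES = []
        for i in range(0x110000):
            # name() returns the default '' for unnamed code points.
            s = name(chr(i), '')
            if s:
                _NAMES.append(s)
    return _NAMES

def lookup(name_or_glob):
    """Return one character for an exact name, or a list of matching names for a glob."""
    if any(c in name_or_glob for c in '*?['):
        # Translate the glob to a regex and match case-insensitively.
        match = compile(translate(name_or_glob), flags=I).match
        return [n for n in getnames() if match(n)]
    else:
        return _lookup(name_or_glob)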

The major limitation of my implementation is that it doesn't match name aliases or sequences.

http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt

http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt

For example:

lookup('TAMIL SYLLABLE TAA?')  # NamedSequence

ought to return ['தா'] but doesn't.
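One way the limitation might be lifted is to glob over the entries of NamedSequences.txt as well. The sketch below is only an illustration under stated assumptions: the helper names parse_named_sequences and glob_sequences are hypothetical, and SAMPLE is a small inline stand-in written in the file's "name;code points" layout rather than the real file.

```python
from fnmatch import translate
import re

# Inline stand-in for NamedSequences.txt (assumed "name;hex code points" layout).
SAMPLE = """\
# Name of Sequence; Code Point Sequence
TAMIL SYLLABLE TAA;0BA4 0BBE
TAMIL SYLLABLE TI;0BA4 0BBF
"""

def parse_named_sequences(text):
    """Map each sequence name to the string it denotes."""
    sequences = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        seq_name, _, codepoints = line.partition(";")
        chars = "".join(chr(int(cp, 16)) for cp in codepoints.split())
        sequences[seq_name.strip()] = chars
    return sequences

def glob_sequences(pattern, sequences):
    """Return the strings whose sequence name matches the glob, ignoring case."""
    match = re.compile(translate(pattern), flags=re.I).match
    return [chars for seq_name, chars in sequences.items() if match(seq_name)]

result = glob_sequences("TAMIL SYLLABLE TA?", parse_named_sequences(SAMPLE))
print(result)
```

Under fnmatch rules ? matches exactly one character, so this sketch uses the pattern 'TAMIL SYLLABLE TA?' to reach the name TAMIL SYLLABLE TAA.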

Parts of the Unicode documentation use the convention that canonical names are in UPPERCASE, aliases are in lowercase, and sequences are in Mixed Case, and I think that we should follow that convention:

http://www.unicode.org/charts/aboutcharindex.html

That makes it easy to see what is the canonical name and what isn't.

f5d35338-ebda-4fb2-87aa-ceee75932c6e commented 5 years ago

I like your proposal with globbing, steven.daprano.

I updated the title.