Open f5d35338-ebda-4fb2-87aa-ceee75932c6e opened 5 years ago
I propose adding a partial_match: bool = False argument to unicodedata.lookup so that programmers can search for Unicode symbols by partial name.
I love the idea, but dislike the proposed interface.
As a general rule of thumb, Guido dislikes "constant bool parameters", where you pass a literal True or False to a function parameter to change its behaviour. Obviously this is not a hard rule, and there are functions in the stdlib that do this, but like Guido I think we should avoid them in general.
Instead, I think we should allow the name to include globbing symbols * ? etc. (I think full blown re syntax is overkill.) I have an implementation which I use:
lookup(name) -> single character # the current behaviour
lookup(name_with_glob_symbols) -> list of characters
For example lookup('latin * Z') returns:
['LATIN CAPITAL LETTER Z', 'LATIN SMALL LETTER Z', 'LATIN CAPITAL LETTER D WITH SMALL LETTER Z', 'LATIN LETTER SMALL CAPITAL Z', 'LATIN CAPITAL LETTER VISIGOTHIC Z', 'LATIN SMALL LETTER VISIGOTHIC Z']
A straight substring match takes at worst twelve extra characters:
lookup('*' + name + '*')
and only two if the name is a literal:
lookup('*spam*')
This is less than partial_match=True (18 characters) and more flexible and powerful. There's no ambiguity between the two styles of call because the globbing symbols * ? and [] are never legal in Unicode names. See section 4.8 of
Here's my implementation:
from unicodedata import name
from unicodedata import lookup as _lookup
from fnmatch import translate
from re import compile, I

_NAMES = None

def getnames():
    global _NAMES
    if _NAMES is None:
        _NAMES = []
        for i in range(0x110000):
            s = name(chr(i), '')
            if s:
                _NAMES.append(s)
    return _NAMES

def lookup(name_or_glob):
    if any(c in name_or_glob for c in '*?['):
        match = compile(translate(name_or_glob), flags=I).match
        return [name for name in getnames() if match(name)]
    else:
        return _lookup(name_or_glob)
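The implementation above returns the matching names. If the characters themselves are wanted, as the "list of characters" in the interface sketch suggests, a variant might look like the following. This is only a sketch: glob_chars and _named_chars are hypothetical helper names, not part of unicodedata, and the exact set of matches depends on the Unicode version bundled with the interpreter.

```python
import re
import unicodedata
from fnmatch import translate
from functools import lru_cache


@lru_cache(maxsize=None)
def _named_chars():
    """Return (character, name) pairs for every code point that has a name."""
    pairs = []
    for cp in range(0x110000):
        ch = chr(cp)
        nm = unicodedata.name(ch, '')  # '' for unnamed code points
        if nm:
            pairs.append((ch, nm))
    return tuple(pairs)


def glob_chars(pattern):
    """Hypothetical helper: characters whose names match a glob pattern.

    Matching is case-insensitive and uses fnmatch-style wildcards.
    """
    match = re.compile(translate(pattern), flags=re.I).match
    return [ch for ch, nm in _named_chars() if match(nm)]


matches = glob_chars('latin * Z')
print(matches)
```

The first call scans the whole code space once; later calls reuse the cached pairs.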
The major limitation of my implementation is that it doesn't match name aliases or sequences.
http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt
For example:
lookup('TAMIL SYLLABLE TAA?') # NamedSequence
ought to return ['தா'] but doesn't.
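One way to close that gap for sequences would be to parse NamedSequences.txt and glob over its names as well. The sketch below assumes the file's "name;code points" line layout and works on two sample lines in that layout rather than the real file; parse_named_sequences and glob_sequences are hypothetical names used only for illustration.

```python
import re
from fnmatch import translate

# Sample lines in the "name;code points" layout of NamedSequences.txt.
# Illustrative only; a real implementation would read the full UCD file.
SAMPLE = """\
# comment lines start with '#'
TAMIL SYLLABLE KAA;0B95 0BBE
TAMIL SYLLABLE TAA;0BA4 0BBE
"""


def parse_named_sequences(text):
    """Map each sequence name to the string built from its code points."""
    sequences = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, code_points = line.split(';', 1)
        sequences[name.strip()] = ''.join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return sequences


def glob_sequences(pattern, sequences):
    """Return the sequences whose names match the glob, case-insensitively."""
    match = re.compile(translate(pattern), flags=re.I).match
    return [seq for name, seq in sequences.items() if match(name)]


SEQUENCES = parse_named_sequences(SAMPLE)
print(glob_sequences('tamil syllable t*', SEQUENCES))
```

A combined lookup could then concatenate these results with the matches found among ordinary character names.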
Parts of the Unicode documentation use the convention that canonical names are in UPPERCASE, aliases are lowercase, and sequences are in Mixed Case, and I think that we should follow that convention:
http://www.unicode.org/charts/aboutcharindex.html
That makes it easy to see what is the canonical name and what isn't.
I like your proposal with globbing, steven.daprano.
I updated the title.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', '3.8', 'expert-unicode']
title = 'Add globbing to unicodedata.lookup'
updated_at =
user = 'https://bugs.python.org/rominf'
```
bugs.python.org fields:
```python
activity =
actor = 'rominf'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'rominf'
dependencies = []
files = []
hgrepos = []
issue_num = 35549
keywords = []
message_count = 4.0
messages = ['332283', '332317', '332318', '332325']
nosy_count = 4.0
nosy_names = ['vstinner', 'ezio.melotti', 'steven.daprano', 'rominf']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue35549'
versions = ['Python 3.8']
```