python / cpython

The Python programming language
https://www.python.org

Add hangul syllables to unicodedata.decomposition #88091

Open a7112df5-dd52-4390-ba1b-9d418d47bd52 opened 3 years ago

a7112df5-dd52-4390-ba1b-9d418d47bd52 commented 3 years ago
BPO 43925
Nosy @terryjreedy, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'expert-unicode', '3.11']
title = 'Add hangul syllables to unicodedata.decomposititon'
updated_at =
user = 'https://bugs.python.org/fredericgrosshans'
```

bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'frederic.grosshans'
dependencies = []
files = []
hgrepos = []
issue_num = 43925
keywords = []
message_count = 2.0
messages = ['391715', '391830']
nosy_count = 3.0
nosy_names = ['terry.reedy', 'ezio.melotti', 'frederic.grosshans']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue43925'
versions = ['Python 3.11']
```

a7112df5-dd52-4390-ba1b-9d418d47bd52 commented 3 years ago

Currently (Python 3.8.6, unidata_version 12.1.0), unicodedata.decomposition returns an empty string for Hangul syllables (code points in the AC00..D7A3 range), although their decomposition is not empty: it is always two characters (either an LV syllable and a T jamo, or an L jamo and a V jamo). This decomposition can be deduced algorithmically (see §3.12 of the Unicode Standard). A Python version of the algorithm is below (I don’t know C, so I can’t propose a patch).

For each Hangul syllable hs, I used unicodedata.normalize to check that the NFC of the decomposition is indeed hs, that the decomposition is two code points long, and that the NFD of hs and the NFD of the decomposition coincide.

def hangulsyllabledecomposition(c):
    if not 0xAC00 <= ord(c) <= 0xD7A3:
        raise ValueError('only Hangul syllables allowed')
    dLV, T = divmod(ord(c) - 0xAC00, 28)
    if T != 0:  # it is an LVT syllable, decomposed into LV (= 0xAC00 + dLV*28) and T
        return f'{0xAC00+dLV*28:04X} {0x11A7+T:04X}'
    else:  # it is an LV syllable, decomposed into L and V
        L, V = divmod(dLV, 21)
        return f'{0x1100+L:04X} {0x1161+V:04X}'
    # Constants used:
    # ==============
    # 0xAC00 : first Hangul syllable == 1st LV syllable
    #          NB: there is one LV syllable every 28 codepoints
    # 0xD7A3 : last Hangul syllable
    # 0x1100 : first L jamo
    # 0x1161 : first V jamo
    # 0x11A7 : one before the 1st T jamo (0x11A8), since T=0 means no trailing
    #
    # (all numbers below are restricted to modern jamos, where this algorithm is relevant)
    # 19 : Number of L jamos (not used here)
    # 21 : Number of V jamos
    # 28 : Number of T jamos plus one (since no T jamo for LV syllable)
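
The three checks described above (two code points, NFC round trip, matching NFD) could be scripted roughly as follows. This is only a sketch, not the script actually used for the report: the names `hangul_decomposition` and `check_all_syllables` are made up for illustration, and the decomposition function is restated with named constants so that the block runs on its own.

```python
import unicodedata

# Constants from the algorithm above (modern jamos only).
S_BASE, S_LAST = 0xAC00, 0xD7A3
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def hangul_decomposition(c):
    """Two-code-point decomposition of one Hangul syllable, as a hex string."""
    if not S_BASE <= ord(c) <= S_LAST:
        raise ValueError('only Hangul syllables allowed')
    lv_index, t = divmod(ord(c) - S_BASE, T_COUNT)
    if t:  # LVT syllable: LV syllable followed by a T jamo
        return f'{S_BASE + lv_index * T_COUNT:04X} {T_BASE + t:04X}'
    l, v = divmod(lv_index, V_COUNT)  # LV syllable: L jamo followed by a V jamo
    return f'{L_BASE + l:04X} {V_BASE + v:04X}'


def check_all_syllables():
    """Return the syllables for which at least one of the three checks fails."""
    failures = []
    for cp in range(S_BASE, S_LAST + 1):
        hs = chr(cp)
        parts = [chr(int(h, 16)) for h in hangul_decomposition(hs).split()]
        decomposed = ''.join(parts)
        ok = (
            len(parts) == 2
            and unicodedata.normalize('NFC', decomposed) == hs
            and unicodedata.normalize('NFD', decomposed)
            == unicodedata.normalize('NFD', hs)
        )
        if not ok:
            failures.append(hs)
    return failures


print(len(check_all_syllables()))
```

An empty list from `check_all_syllables()` means every syllable in AC00..D7A3 passed all three checks.
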
terryjreedy commented 3 years ago

I verified the claim in 3.10.0a7, freshly compiled today.

>>> import unicodedata as ud
>>> ud.decomposition('\uac00')
''
>>> for cp in range(0xac00, 0xd7a4):
    if (s := ud.decomposition(chr(cp))) != '':
        print(cp, s)

>>>
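
For comparison, `unicodedata.normalize` does apply the algorithmic Hangul rules, so the full decomposition is available through NFD even where `decomposition` reports nothing. The sketch below prints both side by side; the helper name `nfd_codepoints` is hypothetical, and the value printed for `decomposition` depends on the Python version in use.

```python
import unicodedata


def nfd_codepoints(ch):
    """Full canonical decomposition (NFD) of ch, as a list of hex code points."""
    return [f'{ord(c):04X}' for c in unicodedata.normalize('NFD', ch)]


for ch in ('\uac00', '\uac01'):
    # decomposition() returns the stored mapping for the character;
    # normalize('NFD', ...) decomposes Hangul syllables fully into jamos.
    print(f'U+{ord(ch):04X}',
          repr(unicodedata.decomposition(ch)),
          nfd_codepoints(ch))
```

Note that NFD yields the full decomposition (three jamos for an LVT syllable), whereas the mapping requested in this issue is the two-code-point one.
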