A finer tokenizer than the current FINE, able to cut people's names into first and last name

tx0c commented 1 year ago

Describe the feature and the current behavior/state. The current behavior is to always treat a personal name as one word:

https://github.com/hankcs/HanLP/issues/1829#issuecomment-1653800092

2.2.1 Personal name Treat it as one word. Don’t give the internal structure unless there is a space between two names (in foreign alphabet).

However, this is not enough for application scenarios that need a finer tokenizer. For example, in a search-indexing application, the query phrase 孔明 should be able to match both 諸葛孔明 and パリピ孔明 in the pre-indexed vectors.
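To make the search scenario concrete, here is a minimal sketch using a plain inverted index instead of vectors. The token lists are hypothetical tokenizer outputs written by hand, and `build_index` is an illustrative helper, not part of HanLP.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

# Hypothetical tokenizer outputs for the same two documents.
coarse = {"a": ["諸葛孔明"], "b": ["パリピ", "孔明"]}          # name kept whole
fine = {"a": ["諸葛", "孔明"], "b": ["パリピ", "孔明"]}        # name split

coarse_hits = sorted(build_index(coarse).get("孔明", set()))
fine_hits = sorted(build_index(fine).get("孔明", set()))
print(coarse_hits)  # ['b']
print(fine_hits)    # ['a', 'b']
```

With the coarse tokens, document "a" is never indexed under 孔明, so an exact-token lookup misses it.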

A configuration option is needed so that users of the tokenizer can decide whether names are always treated as one word or cut into first and last name, especially for longer names.

A second option should set the minimum length at which this applies. Short names such as 張飛 and 岳飛 may be fine as one word, but there should be a way to cut longer names such as 諸葛青云, 龐青云, and 刘青云 into two parts, so that end users can search for 青云 and find all text that mentions 青云.
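The requested minimum-length setting might look something like the following. This is only a sketch: `split_name`, `min_len`, and the tiny `SURNAMES` set are hypothetical names invented for illustration, and HanLP does not expose such an option.

```python
# Hypothetical surname list; a real one would come from a dictionary file.
SURNAMES = {"諸葛", "龐", "刘", "張", "岳"}

def split_name(name, min_len=3, surnames=SURNAMES):
    """Return [surname, given name] when `name` has at least `min_len`
    characters and starts with a known surname; otherwise return [name]."""
    if len(name) < min_len:
        return [name]
    # Try two-character surnames first so 諸葛 wins over a one-character prefix.
    for size in (2, 1):
        prefix, rest = name[:size], name[size:]
        if prefix in surnames and rest:
            return [prefix, rest]
    return [name]

names = ["張飛", "岳飛", "諸葛青云", "龐青云", "刘青云"]
result = [split_name(n) for n in names]
print(result)
# [['張飛'], ['岳飛'], ['諸葛', '青云'], ['龐', '青云'], ['刘', '青云']]
```

Setting `min_len=2` would also split 張飛 and 岳飛, which is the kind of choice the option would leave to the user.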

For even longer foreign names such as 米蘭昆德拉, 唐納德川普, and 米歇爾·奥巴馬, there should likewise be a way to cut them into halves.

Compare with the alternative library nodejieba, which has no problem cutting 米兰·昆德拉 into two parts:

> var nodejieba = require('nodejieba');
> nodejieba.cut('为什么余华说米兰·昆德拉是三流小说家？')
[
  '为什么', '余',
  '华',     '说',
  '米兰',   '·',
  '昆德拉', '是',
  '三流',   '小说家',
  '？'
]


In [107]: tok(["米蘭昆德拉", "米蘭·昆德拉", "为什么余华说米兰·昆德拉是三流小说家？", "米蘭昆德拉", "唐納德川普", "米歇爾·奥巴馬", "諸葛亮", "諸葛孔明", "諸葛亮先生", "パリピ孔明"])
Out[107]: 
[['米蘭昆德拉'],
 ['米蘭·昆德拉'],
 ['为什么', '余华', '说', '米兰·昆德拉', '是', '三流', '小说家', '？'],
 ['米蘭昆德拉'],
 ['唐納德川普'],
 ['米歇爾·奥巴馬'],
 ['諸葛亮'],
 ['諸葛孔明'],
 ['諸葛亮', '先生'],
 ['パリピ', '孔明']]

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')  # leave auth empty for anonymous access; zh = Chinese, mul = multilingual
HanLP(["米蘭昆德拉", "米蘭·昆德拉", "为什么余华说米兰·昆德拉是三流小说家？", "米蘭昆德拉", "唐納德川普", "米歇爾·奥巴馬", "諸葛亮", "諸葛孔明", "諸葛亮先生", "周恩来", "米蘭市", "洛杉磯縣", "南京市"]).pretty_print()

Results by input position (input 3, the full sentence, follows separately). In two-token rows the first token attaches to the second, which is the root.

#   Tok            Relation     PoS  NER Type  Constituency
1   米蘭           name         NR   LOCATION  NR──┐
    昆德拉         root         NR   PERSON    NR──┴►TOP
2   米蘭·昆德拉    root         NR   PERSON    NR───►NP
4   米蘭           name         NR   LOCATION  NR──┐
    昆德拉         root         NR   PERSON    NR──┴►TOP
5   唐納德川普     root         NR   PERSON    NR───►NP
6   米歇爾·奥巴馬  root         NR   PERSON    NR
7   諸葛亮         root         NR   PERSON    NR───►NP
8   諸葛孔明       root         NR   PERSON    NR───►NP
9   諸葛亮         compound:nn  NR   PERSON    NR──┐
    先生           root         NN             NN──┴►NP
10  周恩来         root         NR   PERSON    NR───►NP
11  米蘭市         root         NR   LOCATION  NR───►NP───►FRAG
12  洛杉磯縣       root         NR   LOCATION  NR───►NP───►FRAG
13  南京市         root         NR   LOCATION  NR───►NP───►FRAG

Input 3: 为什么余华说米兰·昆德拉是三流小说家？

Dep Tree   Token        Relation  PoS  NER Type  SRL PA1   SRL PA2
     ┌──►  为什么       advmod    AD             ARGM-ADV
     │┌─►  余华         nsubj     NR   PERSON    ARG0
┌┬───┴┴──  说           root      VV             PRED
││  ┌───►  米兰·昆德拉  nsubj     NR   PERSON    ARG1      ARG0
││  │┌──►  是           cop       VC             ARG1      PRED
││  ││┌─►  三流         amod      JJ             ARG1      ARG1
│└─►└┴┴──  小说家       ccomp     NN             ARG1      ARG1
└───────►  ？           punct     PU

PoS   3       4       5       6       7       8
────────────────────────────────────────────────
AD───────────────────────────────────►ADVP──┐
NR───────────────────────────────────►NP────┤
VV──────────────────────────────────┐       │
NR───────────────────►NP ───┐       ├►VP────┤
VC──────────────────┐       ├►IP ───┘       ├►IP
JJ───►ADJP──┐       ├►VP ───┘               │
NN───►NP ───┴►NP ───┘                       │
PU──────────────────────────────────────────┘

Will this change the current api? How? No.

Who will benefit with this feature? Anyone who needs a finer tokenizer.

Are you willing to contribute it (Yes/No): Yes, let me know how

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Python version: 3.10
HanLP version: 2.1.0-beta.50

Any other info

[x] I've carefully completed this form.

tx0c commented 1 year ago

Could you please comment on how this behavior could be implemented? Is it even possible with HanLP? Please don't close this right away; leave it open for some time for discussion.

hankcs commented 1 year ago

Yes, it is very easy to implement your tokenization standard on top of HanLP's.

import hanlp
from hanlp_trie.dictionary import TrieDict

tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)

sent = '为什么余华说米兰·昆德拉是三流小说家？唐納德川普'
print(tok(sent))

names_dict = ['米兰', '昆德拉', '唐納德']
trie = TrieDict(dict((name, name) for name in names_dict))

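def further_tokenize(words):
    # Re-split every token from the base tokenizer using the names dictionary.
    fine_grained = []
    for word in words:
        matches = trie.split(word)
        # Unmatched text comes back as plain strings; dictionary hits come back
        # as tuples whose last element is the stored value (here, the name itself).
        fine_grained.extend(piece if isinstance(piece, str) else piece[-1] for piece in matches)
    return fine_grained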

fine_grained = hanlp.pipeline(tok, further_tokenize)
print(fine_grained(sent))
# ['为什么', '余华', '说', '米兰', '·', '昆德拉', '是', '三流', '小说家', '？', '唐納德', '川普']

You can get a names dictionary from https://github.com/hankcs/HanLP/blob/1.x/data/dictionary/person/nrf.txt or many places on GitHub.
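For readers without HanLP installed, the dictionary-splitting step can be approximated in plain Python. This is a rough sketch only: `split_by_dictionary` and the hand-written `NAME_PARTS` set are hypothetical, and the greedy longest-match loop is a stand-in, not the `TrieDict` implementation.

```python
# Hypothetical dictionary of name parts; a real one would be loaded from a file.
NAME_PARTS = {"米兰", "米蘭", "昆德拉", "唐納德"}

def split_by_dictionary(word, parts=NAME_PARTS):
    """Cut `word` at every dictionary entry (longest match first) and keep
    the text between matches as its own piece."""
    pieces, buffer, i = [], "", 0
    while i < len(word):
        # Longest dictionary entry starting at position i, if any.
        match = next(
            (word[i:j] for j in range(len(word), i, -1) if word[i:j] in parts),
            None,
        )
        if match:
            if buffer:  # flush unmatched text seen before this entry
                pieces.append(buffer)
                buffer = ""
            pieces.append(match)
            i += len(match)
        else:
            buffer += word[i]
            i += 1
    if buffer:
        pieces.append(buffer)
    return pieces

print(split_by_dictionary("米兰·昆德拉"))  # ['米兰', '·', '昆德拉']
print(split_by_dictionary("唐納德川普"))    # ['唐納德', '川普']
print(split_by_dictionary("余华"))          # ['余华']
```

Tokens with no dictionary hit pass through unchanged, so the function can be applied to every token from the base tokenizer.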

On the contrary, I don't think any alternative library could offer such freedom, a multi-granularity standard, or such high accuracy.

Close whenever it's appropriate for you.

hankcs / HanLP

a finer tokenizer than current FINE, to be able cut people names to first and last name #1831