hankcs / HanLP

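Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion, natural language processing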
https://hanlp.hankcs.com/
Apache License 2.0
33.99k stars 10.18k forks

A finer tokenizer than the current FINE, able to cut personal names into first and last name #1831

Closed tx0c closed 1 year ago

tx0c commented 1 year ago

Describe the feature and the current behavior/state. The current behavior is to always treat a personal name as one word:

https://github.com/hankcs/HanLP/issues/1829#issuecomment-1653800092

2.2.1 Personal name Treat it as one word. Don’t give the internal structure unless there is a space between two names (in foreign alphabet).

However, this is not enough for application scenarios that need a finer tokenizer. For example, in a search-indexing application, the query phrase 孔明 should be able to match both 諸葛孔明 and パリピ孔明 in the pre-indexed vectors.
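
To make the search scenario concrete, here is a minimal sketch that uses an exact-match inverted index as a simplified stand-in for the pre-indexed vectors. The `build_index` helper, the document ids, and the "fine" tokenization of 諸葛孔明 are assumptions for illustration, not HanLP output.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

# Tokenization with the Chinese name kept whole, as reported in this issue.
coarse = {
    "doc1": ["諸葛孔明"],
    "doc2": ["パリピ", "孔明"],
}
# Hypothetical finer tokenization that also splits 諸葛孔明.
fine = {
    "doc1": ["諸葛", "孔明"],
    "doc2": ["パリピ", "孔明"],
}

coarse_hits = build_index(coarse).get("孔明", set())
fine_hits = build_index(fine).get("孔明", set())
print(sorted(coarse_hits))  # ['doc2']
print(sorted(fine_hits))    # ['doc1', 'doc2']
```

With the name indexed as a single token, the query 孔明 only finds the document where the tokenizer happened to split it.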

There should be a configuration option that lets the user of the tokenizer decide whether names are always treated as one word or are cut into first and last name, especially for longer names.

A second option should set the minimum length at which this cutting applies. Short names such as 張飛 and 岳飛 might be fine as one word, but there should be a way to cut longer names such as 諸葛青云, 龐青云 and 刘青云 into two parts, so that an end user can search for 青云 and find all text that mentions 青云.

For even longer foreign names such as 米蘭昆德拉, 唐納德川普 and 米歇爾·奥巴馬, there should likewise be a way to cut them into halves.
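
The requested behavior for Chinese names could look roughly like the following sketch. The `split_name` function, its `min_len` parameter, and the tiny `SURNAMES` set are all hypothetical and only illustrate the idea. Transliterated foreign names such as 米蘭昆德拉 are not covered, since they need a name dictionary rather than a surname list.

```python
# Hypothetical surname list for illustration; a real one would be far larger.
SURNAMES = {"諸葛", "龐", "刘", "張", "岳"}

def split_name(name, surnames=SURNAMES, min_len=3):
    """Split a person name into [surname, given name] when it is long enough.

    Names shorter than min_len, or without a known surname prefix,
    are returned unchanged as a single token.
    """
    if len(name) < min_len:
        return [name]
    # Try longer surnames first so that a two-character surname such as 諸葛
    # is preferred over any one-character prefix.
    for surname in sorted(surnames, key=len, reverse=True):
        if name.startswith(surname) and len(name) > len(surname):
            return [surname, name[len(surname):]]
    return [name]

for name in ["張飛", "岳飛", "諸葛青云", "龐青云", "刘青云"]:
    print(name, split_name(name))
```

Setting `min_len` above the length of the longest expected name keeps every name whole, which corresponds to the current behavior.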

Compare this with the alternative library nodejieba, which has no problem cutting 米兰·昆德拉 into two parts:

> var nodejieba = require('nodejieba');
> nodejieba.cut('为什么余华说米兰·昆德拉是三流小说家?')
[
  '为什么', '余',
  '华',     '说',
  '米兰',   '·',
  '昆德拉', '是',
  '三流',   '小说家',
  '?'
]

In [107]: tok(["米蘭昆德拉", "米蘭·昆德拉", "为什么余华说米兰·昆德拉是三流小说家?", "米蘭昆德拉", "唐納德川普", "米歇爾·奥巴馬", "諸葛亮", "諸葛孔明", "諸葛亮先生", "パリピ孔明"])
Out[107]: 
[['米蘭昆德拉'],
 ['米蘭·昆德拉'],
 ['为什么', '余华', '说', '米兰·昆德拉', '是', '三流', '小说家', '?'],
 ['米蘭昆德拉'],
 ['唐納德川普'],
 ['米歇爾·奥巴馬'],
 ['諸葛亮'],
 ['諸葛孔明'],
 ['諸葛亮', '先生'],
 ['パリピ', '孔明']]

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')  # leave auth empty for anonymous access; zh for Chinese, mul for multilingual
HanLP(["米蘭昆德拉", "米蘭·昆德拉", "为什么余华说米兰·昆德拉是三流小说家?", "米蘭昆德拉", "唐納德川普", "米歇爾·奥巴馬", "諸葛亮", "諸葛孔明", "諸葛亮先生", "周恩来", "米蘭市", "洛杉磯縣", "南京市"]).pretty_print()

Dep  Token   Relation  PoS  NER Type      PoS    3
───  ──────  ────────  ───  ────────────  ─────────
┌─►  米蘭    name      NR   ───►LOCATION  NR──┐
└──  昆德拉  root      NR   ───►PERSON    NR──┴►TOP

Token        Relation  PoS  NER Type    PoS    3
───────────  ────────  ───  ──────────  ────────
米蘭·昆德拉  root      NR   ───►PERSON  NR───►NP

Dep Tree   Token        Relation  PoS  NER Type    SRL PA1       SRL PA2
─────────  ───────────  ────────  ───  ──────────  ────────────  ────────
     ┌──►  为什么       advmod    AD               ───►ARGM-ADV
     │┌─►  余华         nsubj     NR   ───►PERSON  ───►ARG0
┌┬───┴┴──  说           root      VV               ╟──►PRED
││  ┌───►  米兰·昆德拉  nsubj     NR   ───►PERSON  ◄─┐           ───►ARG0
││  │┌──►  是           cop       VC                 │           ╟──►PRED
││  ││┌─►  三流         amod      JJ                 ├►ARG1      ◄─┐
│└─►└┴┴──  小说家       ccomp     NN               ◄─┘           ◄─┴►ARG1
└───────►  ?            punct     PU

Token        PoS    3       4       5       6       7       8
───────────  ────────────────────────────────────────────────
为什么       AD───────────────────────────────────►ADVP──┐
余华         NR───────────────────────────────────►NP────┤
说           VV──────────────────────────────────┐       │
米兰·昆德拉  NR───────────────────►NP ───┐       ├►VP────┤
是           VC──────────────────┐       ├►IP ───┘       ├►IP
三流         JJ───►ADJP──┐       ├►VP ───┘               │
小说家       NN───►NP ───┴►NP ───┘                       │
?            PU──────────────────────────────────────────┘

Dep  Token   Relation  PoS  NER Type      PoS    3
───  ──────  ────────  ───  ────────────  ─────────
┌─►  米蘭    name      NR   ───►LOCATION  NR──┐
└──  昆德拉  root      NR   ───►PERSON    NR──┴►TOP

Token       Relation  PoS  NER Type    PoS    3
──────────  ────────  ───  ──────────  ────────
唐納德川普  root      NR   ───►PERSON  NR───►NP

Token          Relation  PoS  NER Type    PoS
─────────────  ────────  ───  ──────────  ───
米歇爾·奥巴馬  root      NR   ───►PERSON  NR

Token   Relation  PoS  NER Type    PoS    3
──────  ────────  ───  ──────────  ────────
諸葛亮  root      NR   ───►PERSON  NR───►NP

Token     Relation  PoS  NER Type    PoS    3
────────  ────────  ───  ──────────  ────────
諸葛孔明  root      NR   ───►PERSON  NR───►NP

Dep  Token   Relation     PoS  NER Type    PoS    3
───  ──────  ───────────  ───  ──────────  ────────
┌─►  諸葛亮  compound:nn  NR   ───►PERSON  NR──┐
└──  先生    root         NN               NN──┴►NP

Token   Relation  PoS  NER Type    PoS    3
──────  ────────  ───  ──────────  ────────
周恩来  root      NR   ───►PERSON  NR───►NP

Token   Relation  PoS  NER Type      PoS    3     4
──────  ────────  ───  ────────────  ────────────────
米蘭市  root      NR   ───►LOCATION  NR───►NP───►FRAG

Token     Relation  PoS  NER Type      PoS    3     4
────────  ────────  ───  ────────────  ────────────────
洛杉磯縣  root      NR   ───►LOCATION  NR───►NP───►FRAG

Token   Relation  PoS  NER Type      PoS    3     4
──────  ────────  ───  ────────────  ────────────────
南京市  root      NR   ───►LOCATION  NR───►NP───►FRAG

Will this change the current api? How? No.

Who will benefit from this feature? Anyone who needs a finer tokenizer.

Are you willing to contribute it (Yes/No): Yes, let me know how.

System information

Any other info

tx0c commented 1 year ago

Please comment on how this behavior could be implemented. Is it even possible with HanLP? Please don't close the issue right away; leave some time for discussion.

hankcs commented 1 year ago

Yes, it is very easy to implement your tokenization standard on top of HanLP's.

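import hanlp
from hanlp_trie.dictionary import TrieDict

# Load the pretrained fine-grained Chinese tokenizer.
tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)

sent = '为什么余华说米兰·昆德拉是三流小说家?唐納德川普'
print(tok(sent))

# Name parts that should become tokens of their own.
names_dict = ['米兰', '昆德拉', '唐納德']
trie = TrieDict(dict((name, name) for name in names_dict))

def further_tokenize(words):
    """Split each word again wherever it contains a dictionary entry."""
    fine_grained = []
    for word in words:
        matches = trie.split(word)
        # Unmatched text comes back as str; for a match, take the last element of the piece.
        fine_grained.extend(piece if isinstance(piece, str) else piece[-1] for piece in matches)
    return fine_grained

# Chain the tokenizer and the post-processing step into one callable.
fine_grained = hanlp.pipeline(tok, further_tokenize)
print(fine_grained(sent))
# ['为什么', '余华', '说', '米兰', '·', '昆德拉', '是', '三流', '小说家', '?', '唐納德', '川普']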
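
The minimum-length setting requested in the issue could be layered on the same post-processing step. The sketch below is an assumption-laden variant, not part of HanLP: `make_further_tokenize` and `min_len` are hypothetical names, and `fake_split` is a stand-in with the same return shape the snippet above relies on (plain strings, or tuples whose last element is the dictionary value) so that the example runs without HanLP installed.

```python
def make_further_tokenize(split, min_len=3):
    """Return a pipeline step that only re-splits words of at least min_len characters.

    `split` is any callable that returns a list whose items are either plain
    strings or tuples whose last element is the matched dictionary value.
    """
    def further_tokenize(words):
        fine_grained = []
        for word in words:
            if len(word) < min_len:
                # Short words, including short names, stay as one token.
                fine_grained.append(word)
                continue
            for piece in split(word):
                fine_grained.append(piece if isinstance(piece, str) else piece[-1])
        return fine_grained
    return further_tokenize

def fake_split(word):
    """Hypothetical stand-in for a dictionary-based splitter."""
    table = {
        "諸葛青云": [(0, 2, "諸葛"), "青云"],
        "張飛": [(0, 1, "張"), "飛"],
    }
    return table.get(word, [word])

step = make_further_tokenize(fake_split, min_len=3)
result = step(["張飛", "和", "諸葛青云"])
print(result)  # ['張飛', '和', '諸葛', '青云']
```

In a real setup, `trie.split` would presumably be passed in place of `fake_split`, and the returned function used as the second stage of `hanlp.pipeline`.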

You can get a names dictionary from https://github.com/hankcs/HanLP/blob/1.x/data/dictionary/person/nrf.txt or many places on GitHub.
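
A downloaded dictionary file could be turned into the mapping that `TrieDict` receives with a small loader like the one below. This is a sketch under a stated assumption: the name is taken to be the first whitespace-separated field on each line, which should be checked against the actual file. `load_names` and the sample lines are hypothetical.

```python
def load_names(lines):
    """Build a {name: name} mapping from dictionary lines.

    Assumes the name is the first whitespace-separated field on each line;
    blank lines are skipped.
    """
    names = {}
    for line in lines:
        fields = line.split()
        if fields:
            names[fields[0]] = fields[0]
    return names

# In practice the lines would come from a file, for example:
#   with open('nrf.txt', encoding='utf-8') as f:
#       names = load_names(f)
sample = ["米兰\n", "昆德拉 A 1\n", "\n", "唐納德\n"]
names = load_names(sample)
print(names)
```

The resulting dict has the same shape as the one built from `names_dict` in the snippet above, so it can be passed to `TrieDict` in the same way.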

On the contrary, I don't think any alternative library could offer such freedom, a multi-granularity standard, or such high accuracy.

Close whenever it's appropriate for you.