Closed tx0c closed 1 year ago
Please comment on how this behavior can be implemented. Is it even possible with HanLP? Don't close right away; let it have some time for debate.
Yes, it is very easy to implement your tokenization standard on top of HanLP's.
```python
import hanlp
from hanlp_trie.dictionary import TrieDict

tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
sent = '为什么余华说米兰·昆德拉是三流小说家?唐納德川普'
print(tok(sent))

# Build a trie of the fine-grained name pieces you want to cut out.
names_dict = ['米兰', '昆德拉', '唐納德']
trie = TrieDict(dict((name, name) for name in names_dict))

def further_tokenize(words):
    # Re-split each coarse token on dictionary matches; trie.split returns
    # plain strings for unmatched spans and tuples for matched ones.
    fine_grained = []
    for word in words:
        matches = trie.split(word)
        fine_grained.extend(piece if isinstance(piece, str) else piece[-1] for piece in matches)
    return fine_grained

fine_grained = hanlp.pipeline(tok, further_tokenize)
print(fine_grained(sent))
# ['为什么', '余华', '说', '米兰', '·', '昆德拉', '是', '三流', '小说家', '?', '唐納德', '川普']
```
You can get a names dictionary from https://github.com/hankcs/HanLP/blob/1.x/data/dictionary/person/nrf.txt or from many other places on GitHub.
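If you load names from a dictionary file like the one above, a small parser can feed the `TrieDict`. This is a hedged sketch: it assumes each line starts with the name, optionally followed by whitespace-separated fields (such as a tag or frequency) that are discarded; adjust to the actual file format you download.

```python
# Sketch of parsing a plain-text names dictionary. Assumption: each line
# begins with the name; any extra whitespace-separated fields (tag,
# frequency) are ignored, and blank lines are skipped.
def parse_names(lines):
    names = set()
    for line in lines:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names

# Usage (path is a placeholder for wherever you saved the file):
# with open('nrf.txt', encoding='utf-8') as f:
#     trie = TrieDict({name: name for name in parse_names(f)})
```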
On the contrary, I don't think any alternative library offers such freedom, a multi-granularity standard, or comparable accuracy.
Close whenever it's appropriate for you.
Describe the feature and the current behavior/state. The current behavior is to always treat a personal name as one word
(https://github.com/hankcs/HanLP/issues/1829#issuecomment-1653800092),
but this isn't enough for application scenarios that need a finer tokenizer. For example, in a search-indexing application, the search query phrase 孔明 should be able to match 諸葛孔明 and also パリピ孔明 in pre-indexed vectors. There should be a config option that lets the user of the tokenizer decide whether to always treat names as one word or to cut the first/last part off, and a config to set the minimum length for this behavior. Short names such as 張飛 and 岳飛 might be okay to treat as one word, but there should be a way to cut longer names such as 諸葛青云, 龐青云, and 刘青云 into two parts, so the end user can search by 青云 and find all text talking about 青云. The same applies to even longer foreign names such as 米蘭昆德拉, 唐納德川普, and 米歇爾·奥巴馬: there should be a way to cut them into halves. Compare with the alternative library nodejieba, which has no problem cutting 米兰·昆德拉 into two parts.
Will this change the current api? How? No.
Who will benefit with this feature? Anyone who needs a finer tokenizer.
Are you willing to contribute it (Yes/No): Yes, let me know how.
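As a rough illustration of the requested behavior (not HanLP's API), the following sketch splits a recognized name into surname and given name only when the name reaches a configurable minimum length. The surname table and the prefix-cut rule are simplifying assumptions made up for this example.

```python
# Hypothetical sketch of the requested config: keep short names whole,
# cut longer ones into surname + given name. SURNAMES maps each surname
# to its character length; both the table and min_len are illustrative.
SURNAMES = {'諸葛': 2, '龐': 1, '刘': 1, '張': 1, '岳': 1}

def split_name(name, min_len=3):
    if len(name) < min_len:  # short names stay as one word
        return [name]
    for surname, length in SURNAMES.items():
        if name.startswith(surname) and len(name) > length:
            return [surname, name[length:]]
    return [name]  # unknown surname: leave the name untouched
```

With `min_len=3`, 張飛 and 岳飛 stay whole while 諸葛青云 becomes 諸葛 + 青云, so a query for 青云 can match.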
System information
Any other info
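To illustrate the search scenario above: in a token-based index, a query token only matches documents that contain that exact token, so whole-name tokenization hides 孔明 inside 諸葛孔明. A minimal toy sketch (the document contents are invented for the example):

```python
# Toy token index: a query token matches a document only if the document's
# token list contains it exactly. Example documents are made up.
docs = {
    1: ['諸葛孔明', '出山'],          # name kept as one coarse token
    2: ['パリピ', '孔明', '第一集'],  # name split into finer tokens
}

def matches(query_token, doc_tokens):
    return query_token in doc_tokens

# Querying 孔明 misses doc 1 but hits doc 2.
```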