tsroten / pynlpir

A Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.
MIT License
566 stars 135 forks

the format of user dict #41

Closed dengjl closed 8 years ago

dengjl commented 8 years ago

Thanks for your work. What is the format of the user dictionary? I can't get mine to import correctly.

tsroten commented 8 years ago

@dengjl Thanks!

NLPIR has an example you can look at: https://github.com/NLPIR-team/NLPIR/blob/a632f29b2452195d338e8e5e69a49be31dd69604/NLPIR-ICTCLAS/Data/UserDefinedDict.lst

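The NLPIR manual says:

Other points to note about the user dictionary:
1. If a user word contains spaces, it must be enclosed in square brackets, e.g.: [Bill Clinton] nrf
2. If the user word should be output as a keyword of the article, its part-of-speech tag must be key, e.g.: 科学发展观 key
3. If a word is a person's name and should also be output as a keyword, tag it key_nr, e.g.: 钟南山 key_nr
4. If a word is a place name and should also be output as a keyword, tag it key_ns, e.g.: 钓鱼岛 key_ns
5. If a word is an organization name and should also be output as a keyword, tag it key_nt, e.g.: 国安委 key_nt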
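
The entry rules quoted above can be sketched in Python as follows. This is only an illustration, not part of pynlpir or NLPIR: `format_entry` and `write_user_dict` are hypothetical helper names, and writing the file as UTF-8 is an assumption that matches pynlpir's default encoding.

```python
def format_entry(word, tag):
    """Return one user-dictionary line: the word, a space, then its tag."""
    # Rule 1: a word that contains spaces is wrapped in square brackets.
    if " " in word:
        word = "[" + word + "]"
    return word + " " + tag


def write_user_dict(path, entries):
    """Write (word, tag) pairs to `path`, one entry per line.

    Assumption: the file is written as UTF-8, pynlpir's default encoding.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, tag in entries:
            f.write(format_entry(word, tag) + "\n")


# Sample entries taken from the manual excerpt above.
EXAMPLE_ENTRIES = [
    ("Bill Clinton", "nrf"),   # contains a space, so it gets brackets
    ("科学发展观", "key"),      # keyword
    ("钟南山", "key_nr"),       # person name + keyword
    ("钓鱼岛", "key_ns"),       # place name + keyword
]
```

Calling `write_user_dict("userdict.txt", EXAMPLE_ENTRIES)` would produce a four-line file; the filename is arbitrary.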

The manual also has some more information.

You could also check out this: https://github.com/NLPIR-team/NLPIR/tree/a632f29b2452195d338e8e5e69a49be31dd69604/NLPIR-ICTCLAS/importuserdict

dengjl commented 8 years ago

Thank you very much!

dengjl commented 8 years ago

By the way, does pynlpir.segment use the user dictionary automatically? I think nlpir.ParagraphProcess does, but pynlpir.segment doesn't seem to work with it.

dengjl commented 8 years ago

Oh, I found my mistake: the dictionary file was not UTF-8 encoded.
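
For anyone hitting the same encoding problem, a rough sketch of checking and re-encoding a dictionary file is shown below. It assumes the original file was saved as GBK, which is common for Chinese text on Windows but is not stated in this thread; `is_utf8` and `convert_to_utf8` are hypothetical helper names, not pynlpir functions.

```python
def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file as UTF-8.

    Assumption: the source encoding is GBK unless the caller says otherwise.
    """
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

Running `is_utf8` on the dictionary before importing it is a cheap way to rule out this cause; if it returns False, `convert_to_utf8` with the correct source encoding should give a file that matches pynlpir's default.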