hankcs / HanLP

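Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion, natural language processing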
https://hanlp.hankcs.com/
Apache License 2.0
33.84k stars, 10.12k forks

Add a custom dictionary type that supports spaces #1866

Closed: wallezhang closed this issue 10 months ago

wallezhang commented 10 months ago

Describe the feature and the current behavior/state.

Currently, entries in a custom dictionary can't contain spaces, because each dictionary line is split into fields on spaces. I suggest adding a custom dictionary format that uses commas to separate the word, its part-of-speech tag, and its frequency.

Will this change the current api? How?

No.

Who will benefit with this feature?

Users who need dictionary entries that contain spaces.

Are you willing to contribute it (Yes/No):

Yes.

System information

Any other info

None.

wallezhang commented 10 months ago

I think this should be added as a new custom dictionary type, so that users of older versions can keep using their existing space-separated custom dictionaries.

hankcs commented 10 months ago

Hi, support for .csv and .tsv dictionary files has already been implemented:

https://github.com/hankcs/HanLP/blob/1323221c38e9188b19cef3f770eec40148a459ac/src/main/java/com/hankcs/hanlp/dictionary/DynamicCustomDictionary.java#L235

You can just rename your dictionary file so that it uses one of these extensions.
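To illustrate why the file extension matters, here is a minimal standalone sketch. It is not HanLP's loader code: the class `DictLineDemo` and its helpers are hypothetical, and the mapping of `.csv` to commas and `.tsv` to tabs is an assumption based on the usual meaning of those extensions. It only shows how the choice of delimiter decides whether a word containing a space survives as one field.

```java
import java.util.Arrays;

public class DictLineDemo {
    // Hypothetical helper: pick a field delimiter (as a regex) from the file extension.
    // Assumption: .csv means comma, .tsv means tab, anything else means whitespace.
    static String delimiterFor(String path) {
        if (path.endsWith(".csv")) return ",";
        if (path.endsWith(".tsv")) return "\t";
        return "\\s";
    }

    // Split one dictionary line into fields: the word, then tag and frequency.
    static String[] splitLine(String line, String path) {
        return line.split(delimiterFor(path));
    }

    public static void main(String[] args) {
        // Whitespace-delimited: the word "New York" is broken into two fields.
        System.out.println(Arrays.toString(splitLine("New York ns 100", "custom.txt")));
        // Comma-delimited: the word stays intact as the first field.
        System.out.println(Arrays.toString(splitLine("New York,ns,100", "custom.csv")));
    }
}
```

With a whitespace delimiter the first line yields four fields, so the tag and frequency no longer sit where a loader would expect them. With a comma delimiter the second line yields three fields, with `New York` as the word. The tag `ns` and the frequency `100` are sample values only.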

wallezhang commented 10 months ago

Ok, thanks for your reply. I'll try it.