shibing624 / pycorrector

pycorrector is a toolkit for text error correction. It applies Kenlm, T5, MacBERT, ChatGLM3, Qwen2.5 and other models to error-correction scenarios and works out of the box.
https://www.mulanai.com/product/corrector/
Apache License 2.0
5.57k stars 1.1k forks

Two bugs in the confusion-set methods #470

Closed treya-lin closed 9 months ago

treya-lin commented 9 months ago

1. kenlm

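Problem: when the same wrong character appears more than once, only the first occurrence is corrected.

I found that if an entry from the confusion set occurs several times in a sentence, only its first occurrence gets changed.

For example, with this confusion set: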

莪 我
祢 你

Example sentence

from pycorrector import Corrector
s = "莪想说莪爱祢"
m_custom = Corrector(custom_confusion_path_or_dict="./my_custom_confusion.txt")
m_custom.correct(s)

Result

{'source': '莪想说莪爱祢', 'target': '我想说莪爱你', 'errors': [('莪', '我', 0), ('祢', '你', 5)]}

The second "莪" was not replaced.
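To make the expected detection behavior concrete, here is a minimal standalone sketch that reports every occurrence of every confusion-set key. It does not use pycorrector and is not the library's code; `find_confusions` is a hypothetical helper written only for illustration.

```python
import re


def find_confusions(sentence, confusion):
    """Return (wrong, right, start_index) for every occurrence of every key.

    Hypothetical helper for illustration; not part of pycorrector.
    """
    hits = []
    for wrong, right in confusion.items():
        # re.finditer yields each non-overlapping match in turn, whereas a
        # single str.find call would only report the first one.
        for match in re.finditer(re.escape(wrong), sentence):
            hits.append((wrong, right, match.start()))
    # Order the hits by position so they read left to right.
    return sorted(hits, key=lambda hit: hit[2])


confusion = {"莪": "我", "祢": "你"}
print(find_confusions("莪想说莪爱祢", confusion))
# [('莪', '我', 0), ('莪', '我', 3), ('祢', '你', 5)]
```

With this kind of scan, the second "莪" at index 3 shows up in the error list alongside the first.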

2. confusion pipeline

With the confusion pipeline and the same example as above, neither of the two "莪" characters is replaced.

from pycorrector import ConfusionCorrector
confusion_dict = {"莪": "我", "祢": "你"}
model_confusion = ConfusionCorrector(custom_confusion_path_or_dict=confusion_dict)
model_confusion.correct("莪想说莪爱祢")

Result

{'source': '莪想说莪爱祢',
 'target': '莪想说莪爱你',
 'errors': [('莪', '我', 0), ('祢', '你', 5)]}

The first '莪' is detected, but neither '莪' is replaced in the target.
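For reference, the output I would expect can be reproduced with a small standalone sketch. It assumes a plain dict confusion set with non-overlapping keys and mirrors the `source` / `target` / `errors` result shape shown above. `correct_with_confusion` is a hypothetical name, and this is neither pycorrector's implementation nor the fix that was later submitted.

```python
def correct_with_confusion(sentence, confusion):
    """Replace every occurrence of each confusion key and report each one.

    Hypothetical helper for illustration; assumes keys do not overlap.
    """
    errors = []
    for wrong, right in confusion.items():
        start = sentence.find(wrong)
        # Keep searching past each hit so repeated errors are all recorded.
        while start != -1:
            errors.append((wrong, right, start))
            start = sentence.find(wrong, start + len(wrong))
    errors.sort(key=lambda err: err[2])

    target = sentence
    # Apply replacements right to left so the recorded indices stay valid
    # even when a replacement differs in length from the original.
    for wrong, right, start in reversed(errors):
        target = target[:start] + right + target[start + len(wrong):]
    return {"source": sentence, "target": target, "errors": errors}


result = correct_with_confusion("莪想说莪爱祢", {"莪": "我", "祢": "你"})
print(result["target"])  # 我想说我爱你
print(result["errors"])  # [('莪', '我', 0), ('莪', '我', 3), ('祢', '你', 5)]
```

The point of the sketch is that `target` and `errors` should agree: every position listed in `errors` is actually rewritten in `target`.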

treya-lin commented 9 months ago

I have fixed both issues and will submit a PR later.

shibing624 commented 9 months ago

done