Two bugs in the confusion-set methods

treya-lin commented 9 months ago

1. kenlm

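Problem: when the same typo appears more than once, only the first occurrence is corrected.

I found that if a confusion-set entry occurs several times in a sentence, only the first occurrence is replaced.

For example, with this confusion set: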

莪 我
祢 你

Example sentence:

from pycorrector import Corrector
s = "莪想说莪爱祢"
m_custom = Corrector(custom_confusion_path_or_dict="./my_custom_confusion.txt")
m_custom.correct(s)

Result:

{'source': '莪想说莪爱祢', 'target': '我想说莪爱你', 'errors': [('莪', '我', 0), ('祢', '你', 5)]}

The second "莪" (index 3) is neither replaced in 'target' nor listed in 'errors'.
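The detection behaviour I expected can be sketched without pycorrector: scan for every occurrence of each confusion-set key rather than stopping at the first. This is only an illustration under that assumption; `find_all_confusions` is a hypothetical helper and is not part of the pycorrector API or its actual implementation.

```python
import re


def find_all_confusions(sentence, confusion):
    """Return (wrong, right, start_index) for every occurrence of every key.

    Hypothetical helper for illustration only, not pycorrector code.
    """
    errors = []
    for wrong, right in confusion.items():
        # re.finditer yields every non-overlapping match, whereas
        # str.find returns only the index of the first one.
        for match in re.finditer(re.escape(wrong), sentence):
            errors.append((wrong, right, match.start()))
    # Sort by position so the list reads left to right.
    return sorted(errors, key=lambda e: e[2])


confusion = {"莪": "我", "祢": "你"}
print(find_all_confusions("莪想说莪爱祢", confusion))
# [('莪', '我', 0), ('莪', '我', 3), ('祢', '你', 5)]
```

With this approach the second "莪" at index 3 shows up in the error list alongside the two entries reported above.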

2. confusion pipeline

With the confusion pipeline and the same example, neither occurrence of "莪" is corrected:

from pycorrector import ConfusionCorrector
confusion_dict = {"莪": "我", "祢": "你"}
model_confusion = ConfusionCorrector(custom_confusion_path_or_dict=confusion_dict)
model_confusion.correct("莪想说莪爱祢")

Result:

{'source': '莪想说莪爱祢',
 'target': '莪想说莪爱你',
 'errors': [('莪', '我', 0), ('祢', '你', 5)]}

The first "莪" is detected and listed in 'errors', but neither occurrence is replaced in 'target'.
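To make the expected 'target' concrete, here is a minimal sketch of applying a list of detected errors to the source string by position. It assumes the errors are (wrong, right, start_index) tuples as in the output above; `apply_errors` is a hypothetical helper, not the pycorrector implementation.

```python
def apply_errors(sentence, errors):
    """Build the corrected sentence from (wrong, right, start_index) tuples.

    Hypothetical helper for illustration only, not pycorrector code.
    """
    pieces = []
    pos = 0
    for wrong, right, start in sorted(errors, key=lambda e: e[2]):
        # Skip entries that overlap an earlier fix or no longer match the text.
        if start < pos or sentence[start:start + len(wrong)] != wrong:
            continue
        pieces.append(sentence[pos:start])  # untouched text before the error
        pieces.append(right)                # the replacement
        pos = start + len(wrong)
    pieces.append(sentence[pos:])           # remaining tail
    return "".join(pieces)


errors = [("莪", "我", 0), ("莪", "我", 3), ("祢", "你", 5)]
print(apply_errors("莪想说莪爱祢", errors))
# 我想说我爱你
```

Replacing by index rather than by a global string substitution keeps 'target' consistent with the positions listed in 'errors'.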

treya-lin commented 9 months ago

I have fixed both issues and will submit a PR later.

shibing624 commented 9 months ago

done

shibing624 / pycorrector

Two bugs in the confusion-set methods #470