shibing624 / pycorrector

pycorrector is a toolkit for text error correction. It applies models such as Kenlm, T5, MacBERT, ChatGLM3, and LLaMA to error-correction scenarios and works out of the box.
https://www.mulanai.com/product/corrector/
Apache License 2.0
5.51k stars 1.09k forks

Logic problem in the ConfusionCorrector implementation in the new version of macbert4csc #494

Closed yongzhuo closed 4 months ago

yongzhuo commented 4 months ago

Describe the Question

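In the new version of macbert4csc, the ConfusionCorrector implementation has a logic problem: it iterates over every entry in the suspected-error dictionary and runs a regex search (re) for each one, so it becomes very slow when the confusion dictionary is large. I suggest switching to a prefix tree (trie) or a similar structure.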

def correct(self, sentence: str):
    """
    Correct errors using the confusion set.
    :param sentence: str, the text to correct
    :return: dict, {'source': 'src', 'target': 'trg', 'errors': [(error_word, correct_word, position), ...]}
    """
    corrected_sentence = sentence
    details = []
    # Add the custom confusion set to the suspected-error dictionary.
    # One regex scan of the sentence per dictionary entry: cost grows linearly with dictionary size.
    for err, truth in self.custom_confusion.items():
        # Note: err is interpreted as a regex pattern here, not as a literal string.
        for i in re.finditer(err, sentence):
            start, end = i.span()
            corrected_sentence = corrected_sentence[:start] + truth + corrected_sentence[end:]
            details.append((err, truth, start))
    return {'source': sentence, 'target': corrected_sentence, 'errors': details}

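In my tests, with a confusion dictionary of 10,000 entries, ConfusionCorrector takes 200-300 ms per sentence, whereas macbert4csc inference on a sentence takes only a few milliseconds to a few tens of milliseconds.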
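A minimal sketch of the prefix-tree idea suggested above, using only the standard library. The names `build_trie` and `correct_with_trie` are hypothetical and are not part of pycorrector. The sketch assumes literal (non-regex) dictionary keys and takes the longest match at each position, so its output can differ from the loop above when entries overlap or when a replacement changes the sentence length.

```python
from typing import Dict, List, Tuple

_END = object()  # hypothetical sentinel key; can never collide with a character


def build_trie(confusion: Dict[str, str]) -> dict:
    """Build a nested-dict prefix tree from {wrong_word: right_word}."""
    root: dict = {}
    for err, truth in confusion.items():
        if not err:
            continue  # skip empty keys, which would match everywhere
        node = root
        for ch in err:
            node = node.setdefault(ch, {})
        node[_END] = (err, truth)
    return root


def correct_with_trie(sentence: str, trie: dict) -> dict:
    """Scan left to right, replacing the longest dictionary match at each position."""
    out: List[str] = []
    details: List[Tuple[str, str, int]] = []
    i, n = 0, len(sentence)
    while i < n:
        node, j, best = trie, i, None
        # Walk the trie as far as the sentence allows, remembering the longest hit.
        while j < n and sentence[j] in node:
            node = node[sentence[j]]
            j += 1
            if _END in node:
                best = (node[_END], j)
        if best is None:
            out.append(sentence[i])
            i += 1
        else:
            (err, truth), end = best
            out.append(truth)
            details.append((err, truth, i))  # position is an offset into the source
            i = end
    return {"source": sentence, "target": "".join(out), "errors": details}


# Usage with made-up entries: build once, reuse for every sentence.
trie = build_trie({"abc": "X", "ab": "Y"})
result = correct_with_trie("zabcab", trie)
```

With this layout the per-sentence cost depends on the sentence length and the longest key, not on the number of dictionary entries, which is the property the 10,000-entry case needs.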

shibing624 commented 4 months ago

Fixed; it now uses ahocorasick.
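The thread does not show the fixed code, so the following is only a standard-library sketch of the Aho-Corasick technique the reply names. It is not the maintainer's implementation and does not use the ahocorasick package; `build_automaton` and `find_all` are hypothetical names. It finds every occurrence of every dictionary key in one pass over the sentence.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(words: List[str]):
    """Build goto / failure / output tables for Aho-Corasick matching."""
    goto: List[Dict[str, int]] = [{}]  # goto[state][char] -> next state
    fail: List[int] = [0]              # failure link for each state
    out: List[List[str]] = [[]]        # words that end at each state
    for word in words:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep failure link 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(sentence: str, automaton) -> List[Tuple[int, str]]:
    """Return (start_index, word) for every dictionary word found in sentence."""
    goto, fail, out = automaton
    state = 0
    hits: List[Tuple[int, str]] = []
    for i, ch in enumerate(sentence):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


# Usage with made-up keys: build once per dictionary, then scan each sentence.
automaton = build_automaton(["he", "she", "his", "hers"])
hits = find_all("ushers", automaton)
```

A caller would then look up each matched word in the confusion dictionary and decide how to resolve overlapping hits (for example, preferring the longest match) before building the corrected sentence.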

yongzhuo commented 4 months ago

Got it.