Logic problem in the ConfusionCorrector implementation in the new version of macbert4csc

yongzhuo commented 4 months ago

Describe the Question

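In the new version of macbert4csc, the ConfusionCorrector implementation has a logic problem: it iterates over the entire suspected-error dictionary and runs a regex search for every entry, so it becomes very slow when the confusion dictionary is large. I suggest switching to a prefix tree (trie) or a similar structure.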

def correct(self, sentence: str):
    """
    Correct a sentence using the confusion set.
    :param sentence: str, the text to correct
    :return: dict, {'source': 'src', 'target': 'trg', 'errors': [(error_word, correct_word, position), ...]}
    """
    corrected_sentence = sentence
    details = []
    # Add the custom confusion set to the suspected-error dictionary
    for err, truth in self.custom_confusion.items():
        for i in re.finditer(err, sentence):
            start, end = i.span()
            corrected_sentence = corrected_sentence[:start] + truth + corrected_sentence[end:]
            details.append((err, truth, start))
    return {'source': sentence, 'target': corrected_sentence, 'errors': details}

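In my tests, with a confusion dictionary of 10,000 entries, ConfusionCorrector takes 200-300 ms per sentence, whereas macbert4csc inference on one sentence takes only a few milliseconds to a few tens of milliseconds.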
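
For reference, here is a minimal sketch of the prefix-tree idea. It is not pycorrector's actual code: the class name `TrieConfusionCorrector` is hypothetical, and it assumes confusion entries are literal strings rather than regex patterns. At each position it takes the longest matching entry, and it builds the output in a single pass, so reported positions always refer to the source sentence. The cost per sentence depends on the sentence length and the longest entry, not on the number of entries in the dictionary.

```python
from typing import Dict, List, Tuple


class TrieConfusionCorrector:
    """Hypothetical trie-backed variant of ConfusionCorrector (illustration only)."""

    # Non-string sentinel key marking the end of an entry; it can never
    # collide with a character taken from the sentence.
    _END = object()

    def __init__(self, custom_confusion: Dict[str, str]):
        self.root: dict = {}
        for err, truth in custom_confusion.items():
            if not err:
                continue  # skip empty keys, which would match everywhere
            node = self.root
            for ch in err:
                node = node.setdefault(ch, {})
            node[self._END] = truth

    def correct(self, sentence: str) -> dict:
        pieces: List[str] = []
        details: List[Tuple[str, str, int]] = []
        i, n = 0, len(sentence)
        while i < n:
            # Walk the trie from position i, remembering the longest entry seen.
            node = self.root
            match_end, match_truth = -1, None
            j = i
            while j < n and sentence[j] in node:
                node = node[sentence[j]]
                j += 1
                if self._END in node:
                    match_end, match_truth = j, node[self._END]
            if match_truth is not None:
                pieces.append(match_truth)
                details.append((sentence[i:match_end], match_truth, i))
                i = match_end  # continue after the replaced span
            else:
                pieces.append(sentence[i])
                i += 1
        return {'source': sentence, 'target': ''.join(pieces), 'errors': details}
```

Usage would look like `TrieConfusionCorrector({'莪': '我', '祢': '你'}).correct('莪想祢')`. Unlike the loop above, this sketch does not support regex patterns as dictionary keys, and overlapping entries are resolved leftmost-longest.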

shibing624 commented 4 months ago

fixed, use ahocorasick

yongzhuo commented 4 months ago

get

shibing624 / pycorrector

Logic problem in the ConfusionCorrector implementation in the new version of macbert4csc #494
