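There is a problem with the implementation logic of ConfusionCorrector in the new version of macbert4csc: it iterates over the entire suspected-error dictionary and runs a regular-expression search over the sentence for every entry. When the confusion dictionary is large, this is very slow. I suggest switching to a prefix tree (trie) or a similar structure.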

    def correct(self, sentence: str):
        """
        Confusion-set based correction.
        :param sentence: str, the text to correct
        :return: dict, {'source': 'src', 'target': 'trg', 'errors': [(error_word, correct_word, position), ...]}
        """
        corrected_sentence = sentence
        details = []
        # Add the custom confusion set to the suspected-error dictionary.
        # Note: this runs one regex scan of the sentence per dictionary entry.
        for err, truth in self.custom_confusion.items():
            for i in re.finditer(err, sentence):
                start, end = i.span()
                corrected_sentence = corrected_sentence[:start] + truth + corrected_sentence[end:]
                details.append((err, truth, start))
        return {'source': sentence, 'target': corrected_sentence, 'errors': details}

Describe the Question
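In my tests with a confusion dictionary of 10,000 entries, ConfusionCorrector takes 200-300 ms per sentence, whereas macbert4csc inference on a single sentence takes only a few milliseconds to a few tens of milliseconds.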
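
To make the prefix-tree suggestion concrete, here is a minimal sketch of what a trie-based lookup could look like. It is not the library's code: `TrieNode`, `build_trie`, and `correct_with_trie` are hypothetical names, and the sketch assumes the confusion keys are plain strings rather than regex patterns. It also picks the longest match at each position and does not allow overlapping matches, which may differ from the current behavior. The returned positions refer to the source sentence.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of the prefix tree (hypothetical helper, not part of the library)."""
    __slots__ = ("children", "truth")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.truth: Optional[str] = None  # set only where an error word ends


def build_trie(confusion: Dict[str, str]) -> TrieNode:
    """Build the trie once from {error_word: correct_word}."""
    root = TrieNode()
    for err, truth in confusion.items():
        if not err:
            continue  # ignore empty keys, they would match everywhere
        node = root
        for ch in err:
            node = node.children.setdefault(ch, TrieNode())
        node.truth = truth
    return root


def correct_with_trie(sentence: str, root: TrieNode) -> dict:
    """Scan the sentence once, replacing the longest error word found at each position."""
    pieces: List[str] = []
    details: List[Tuple[str, str, int]] = []
    i, n = 0, len(sentence)
    while i < n:
        node, j = root, i
        match_end, match_truth = -1, None
        # Walk the trie as far as the sentence allows, remembering the longest hit.
        while j < n and sentence[j] in node.children:
            node = node.children[sentence[j]]
            j += 1
            if node.truth is not None:
                match_end, match_truth = j, node.truth
        if match_truth is not None:
            pieces.append(match_truth)
            details.append((sentence[i:match_end], match_truth, i))
            i = match_end  # continue after the replaced span
        else:
            pieces.append(sentence[i])
            i += 1
    return {"source": sentence, "target": "".join(pieces), "errors": details}


if __name__ == "__main__":
    # Illustrative data only; the trie would be built once, e.g. when the dictionary is loaded.
    trie = build_trie({"因该": "应该", "少先队员因该": "少先队员应该"})
    print(correct_with_trie("少先队员因该为老人让坐", trie))
```

With this layout the per-sentence cost depends on the sentence length and the longest key, not on the number of dictionary entries, so a 10,000-entry dictionary should no longer dominate the runtime. An Aho-Corasick automaton would be another option if overlapping matches need to be reported.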