shibing624 / pycorrector

pycorrector is a toolkit for text error correction. It implements Kenlm, T5, MacBERT, ChatGLM3, Qwen2.5 and other models for correction scenarios, ready to use out of the box.
https://www.mulanai.com/product/corrector/
Apache License 2.0

Three small suggestions for kenlm rule-based correction #295

Closed wangdabee closed 1 year ago

wangdabee commented 2 years ago

1. The source code only replaces the first occurrence of a confusion-set entry or proper noun in a sentence, because `sentence.find()` returns only the index of the first match. Please consider replacing every occurrence instead.
2. The source code only supports replacements of equal length. After a span is replaced by a string of a different length, later replacements become misaligned: the sentence has already changed, so the candidate-error indices found earlier by detect no longer point at the right positions, and correcting by the old indices shifts everything. Please support confusion-set or proper-noun replacements of unequal length.
3. Confusion-set replacement is currently a simple exact-match search-and-replace; fuzzy-matching support would be welcome.
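Points 1 and 2 above can be sketched together (a hypothetical illustration, not pycorrector's actual implementation): scan the sentence left to right and rebuild it, so every occurrence of an entry is replaced, unequal-length entries cannot shift later replacements, and the recorded error positions still refer to the original sentence.

```python
def apply_confusion_set(sentence, confusion):
    """Replace all occurrences of confusion-set entries, including
    entries of unequal length, by rebuilding the sentence.

    Hypothetical sketch; `confusion` maps wrong spans to corrections.
    Returned detail tuples hold (wrong, right, start, end) with
    indices into the ORIGINAL sentence, so nothing gets misaligned.
    """
    details = []
    out = []
    i = 0
    while i < len(sentence):
        for wrong, right in confusion.items():
            if sentence.startswith(wrong, i):
                details.append((wrong, right, i, i + len(wrong)))
                out.append(right)
                i += len(wrong)  # skip past the matched span
                break
        else:
            out.append(sentence[i])  # no entry matched here; keep char
            i += 1
    return ''.join(out), details

corrected, details = apply_confusion_set(
    '少先队员因该为老人让坐,因该让坐',
    {'因该': '应该', '让坐': '让座'})
print(corrected)  # -> 少先队员应该为老人让座,应该让座
print(details)    # all four occurrences, with original-sentence offsets
```

Because the output is built left to right instead of edited in place, a longer or shorter correction never invalidates the indices of matches found later in the scan.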

KnightLancelot commented 2 years ago

I have already fixed the second problem and submitted the change; I don't know when it will be merged. The fix is simple, though, so you can apply it yourself: just multiply the inference result by the attention mask. Taking README.md as an example, replace the code under "Correct with the native transformers library:" with the following.

import operator
import torch
from transformers import BertTokenizer, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作,我也很高心。"]

with torch.no_grad():
    text_tokens = tokenizer(texts, padding=True, return_tensors='pt')
    outputs = model(**text_tokens.to(device))

def get_errors(corrected_text, origin_text):
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        # restore characters the tokenizer drops or maps to [UNK]
        if ori_char in [' ', '“', '”', '‘', '’', '\n', '…', '—', '擤']:
            # re-insert the original character at this position
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            break
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # pass english upper char
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details

result = []
for i, (ids, text) in enumerate(zip(outputs.logits, texts)):
    # multiply the predicted token ids by the attention mask so padding
    # positions become id 0 ([PAD]) and are dropped by skip_special_tokens
    _text = tokenizer.decode(torch.argmax(ids, dim=-1) * text_tokens.attention_mask[i],
                             skip_special_tokens=True).replace(' ', '')
    corrected_text, details = get_errors(_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)
shibing624 commented 2 years ago

1. Doing. 2. Done. 3. Fuzzy matching has too high an error rate, so we won't add it; adding entries to the confusion set is more controllable and is the recommended approach.
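The controllability argument in point 3 can be shown with a toy confusion dict (a hypothetical example, not pycorrector's actual data or API): exact-match entries fire only on the listed surface forms, so every correction is explicitly sanctioned by the dictionary, whereas fuzzy matching would also fire on near-misses the list never intended.

```python
# Toy confusion set: exact surface forms only (hypothetical entries).
confusion = {
    '呕涂': '呕吐',
    '交通先行': '交通限行',
}

def correct_exact(sentence, confusion):
    # str.replace substitutes every occurrence of each exact entry;
    # anything not listed is left untouched, so the behavior is fully
    # determined by the dictionary contents.
    for wrong, right in confusion.items():
        sentence = sentence.replace(wrong, right)
    return sentence

print(correct_exact('今天交通先行', confusion))  # -> 今天交通限行
```

The error rate of this scheme is bounded by the list itself: a bad entry can be found and removed, while a fuzzy matcher's false positives depend on a similarity threshold and are harder to audit.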

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.