modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Punctuation restoration model splits words apart when adding English punctuation during inference #2194

Open bigcash opened 1 week ago

bigcash commented 1 week ago

As the title says. Model used: iic/punc_ct-transformer_cn-en-common-vocab471067-large. Docker image: modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.1.0-py310-torch2.3.0-tf2.16.1-1.18.0. Inside the container, I upgraded funasr from the original 1.1.6 to the latest version, funasr==1.1.14, and the problem is still there.

🐛 Bug

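from funasr import AutoModel

# Placeholder: local path to the downloaded punctuation model
punc_model='local-model-path'
model = AutoModel(model=punc_model, model_revision="v2.0.4")

# Restore punctuation for a single unpunctuated English sentence
punc_results = model.generate(input=['when is Interview: The Documentary playing in Loews Cineplex'])
print(punc_results)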

The code outputs: [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': ' When is I nt er view : The Documentary playing in Loews Cineplex.', 'punc_array': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2])}]

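The word "Interview" is split apart in the output ("I nt er view"). If the ":" is removed from the original text, the error does not occur, so the colon appears to be the cause.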
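Since removing the colon avoids the split, one possible stopgap is to strip existing punctuation from the text before passing it to the model. The sketch below is only a pre-processing helper: `strip_existing_punctuation` is a hypothetical name, not part of funasr, and it works around the symptom rather than fixing the cause.

```python
import re

# Hypothetical helper (not part of funasr): the set of characters to drop
# is an assumption and can be adjusted as needed.
_PUNCT = re.compile(r"[:;,.!?]")

def strip_existing_punctuation(text: str) -> str:
    """Replace common ASCII punctuation with spaces, then collapse whitespace."""
    without_punct = _PUNCT.sub(" ", text)
    return " ".join(without_punct.split())

cleaned = strip_existing_punctuation(
    "when is Interview: The Documentary playing in Loews Cineplex"
)
print(cleaned)
```

The cleaned string can then be passed to `model.generate(input=[cleaned])`. Note that this discards the colon in the title and would also break abbreviations such as "U.S.", so it is not a substitute for a real fix.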

Additional context

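"Interview: The Documentary" in the input text is presumably the name of a TV program. Could it be that including the colon changes the token sequence produced by tokenization, which then causes the word to be split apart when the tokens are converted back to text?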
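The reported output can be checked against this hypothesis without running the model. The sketch below is a small diagnostic, using a hypothetical helper `find_split_words`, that lists input words no longer present as whole tokens in the restored text and counts the whitespace-separated pieces in the output.

```python
import re

def find_split_words(original: str, restored: str) -> list[str]:
    """Return words from `original` that do not appear as whole tokens in `restored`.

    Hypothetical diagnostic helper; comparison is case-insensitive and
    ignores punctuation.
    """
    def words(text: str) -> list[str]:
        return re.findall(r"[A-Za-z0-9']+", text.lower())

    restored_words = set(words(restored))
    return [w for w in words(original) if w not in restored_words]

original = "when is Interview: The Documentary playing in Loews Cineplex"
restored = " When is I nt er view : The Documentary playing in Loews Cineplex."

broken = find_split_words(original, restored)
n_input_words = len(original.split())
n_output_pieces = len(restored.split())
print(broken, n_input_words, n_output_pieces)
```

For the output reported above, the only broken word is "interview", and the restored text has 13 whitespace-separated pieces against 9 input words. That matches the 13 entries in `punc_array`, which is consistent with "Interview:" having been tokenized into five pieces ("I", "nt", "er", "view", ":") that were then joined with spaces.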