modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

The model's output text and timestamps have different lengths; how do I match them up? #1795

Closed kirayomato closed 3 months ago

kirayomato commented 4 months ago

Notice: To help us resolve issues more efficiently, please follow the template when raising an issue and fill in the details.

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

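I am trying to use FunASR to generate subtitles for my video, but the recognized text and the timestamp list have different lengths. How can I match the text to the timestamps?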

Code

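from funasr import AutoModel

video_path = "video.mp4"  # placeholder; point this at your own file

model = AutoModel(model="paraformer-zh",
                  vad_model="fsmn-vad",
                  punc_model="ct-punc",
                  # spk_model="cam++"
                  )
res = model.generate(input=video_path,
                     batch_size_s=300,
                     # hotword='魔搭'
                     )
text = res[0]['text']       # recognized text, punctuation included
ts = res[0]['timestamp']    # list of timestamps
print(len(text), len(ts))   # the two lengths differ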

What have you tried?

What's your environment?

LauraGPT commented 3 months ago

sentence_timestamp=True
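
A minimal sketch of how this flag might be used for subtitles. It assumes that `sentence_timestamp=True` is passed as a keyword argument to `model.generate(...)` and that each result then carries a `sentence_info` list whose entries have `text`, `start`, and `end` fields in milliseconds; none of this is spelled out in the thread, so check it against your FunASR version. The helper names `ms_to_srt_time` and `sentences_to_srt` are hypothetical, and the sample data is hand-written, not real model output.

```python
def ms_to_srt_time(ms):
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sentences_to_srt(sentence_info):
    """Build SRT text from entries shaped like {"text", "start", "end"}."""
    blocks = []
    for index, sent in enumerate(sentence_info, start=1):
        start = ms_to_srt_time(sent["start"])
        end = ms_to_srt_time(sent["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{sent['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Hand-written sample standing in for res[0]["sentence_info"] (assumed shape).
sample = [
    {"text": "你好,", "start": 380, "end": 1120},
    {"text": "欢迎使用 FunASR。", "start": 1120, "end": 3650},
]
srt_text = sentences_to_srt(sample)
print(srt_text)
```

With real output, `sample` would be replaced by `res[0]["sentence_info"]`, and `srt_text` could be written to a `.srt` file next to the video.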

zhangakun commented 2 months ago

But why are the lengths different? I want to add punctuation with another model, so I need the word-level timestamps. If the lengths differ, that won't work.

lixinjie97 commented 2 months ago

Punctuation does not count.

zhangakun commented 2 months ago

I knew that. I strip the punctuation before comparing:

import re
import string
from zhon.hanzi import punctuation

punctuation_zh = punctuation
punctuation_en = string.punctuation
punctuation_str = punctuation_zh + punctuation_en

res = funasr()  # shorthand for the recognition call
text = res[0]['text']
timestamp = res[0]['timestamp']
# re.escape keeps characters such as ], ^ and - literal inside the character class
raw_text = re.sub('[' + re.escape(punctuation_str) + ']', '', text)

But len(raw_text) still does not equal len(timestamp).
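
The thread does not say why the lengths still differ after the punctuation is removed. One plausible cause, offered here as an assumption and not confirmed by the maintainers, is that `timestamp` holds one `[start_ms, end_ms]` pair per recognized token, not per character: a Chinese character would be one token, but an English word or a number would also be a single token while contributing several characters (plus spaces) to `len(raw_text)`. Under that assumption, a rough sketch of the alignment is to split the text into tokens the same way and pair them with the timestamps. The names `split_tokens` and `align` are hypothetical, the token rule is a simplification, and the sample data is hand-written.

```python
import re

# One token per CJK character, or per run of ASCII letters/digits/apostrophes.
# Punctuation and whitespace match neither alternative and are skipped.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def split_tokens(text):
    """Split mixed Chinese/English text into timestamp-sized tokens."""
    return TOKEN_RE.findall(text)


def align(text, timestamps):
    """Pair each token with its [start_ms, end_ms] entry, or fail loudly."""
    tokens = split_tokens(text)
    if len(tokens) != len(timestamps):
        raise ValueError(
            f"{len(tokens)} tokens but {len(timestamps)} timestamps; "
            "the token rule does not match the model's tokenization"
        )
    return [(tok, start, end) for tok, (start, end) in zip(tokens, timestamps)]


# Hand-written sample standing in for res[0]['text'] and res[0]['timestamp'].
sample_text = "你好,欢迎使用 FunASR。"
sample_ts = [[380, 560], [560, 800], [800, 1000], [1000, 1240],
             [1240, 1480], [1480, 1700], [1700, 2400]]

aligned = align(sample_text, sample_ts)
for token, start, end in aligned:
    print(token, start, end)
```

Raising on a length mismatch is deliberate: if the model's tokenization differs from this rule, a silent `zip` would shift every later token onto the wrong timestamp.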