The model's output text and timestamps have different lengths. How do I align them?

modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.

https://www.funasr.com

The model's output text and timestamps have different lengths. How do I align them? #1795

Closed kirayomato closed 3 months ago

kirayomato commented 4 months ago

❓ Questions and Help

What is your question?

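I am trying to use FunASR to generate subtitles for my videos, but the recognized text and the timestamp list have different lengths. How do I map the text to the timestamps?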

Code

from funasr import AutoModel

model = AutoModel(model="paraformer-zh",
                  vad_model="fsmn-vad",
                  punc_model="ct-punc",
                  # spk_model="cam++"
                  )
res = model.generate(input=video_path,
                     batch_size_s=300,
                     # hotword='魔搭'
                     )
text = res[0]['text']
ts = res[0]['timestamp']
print(len(text), len(ts))

What have you tried?

What's your environment?

OS (e.g., Linux): Windows 11
FunASR Version (e.g., 1.0.0): 1.0.27
ModelScope Version (e.g., 1.11.0): 1.12.0
PyTorch Version (e.g., 2.0.0): 2.2.1
How you installed funasr (pip, source): pip
Python version: 3.11.0
GPU (e.g., V100M32): RTX 4060
CUDA/cuDNN version (e.g., cuda11.7): cuda 12.1
Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
Any other relevant information:

LauraGPT commented 3 months ago

sentence_timestamp=True
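
For the subtitle use case, here is a minimal sketch of turning sentence-level results into SRT text. It assumes, without confirmation from this thread, that passing `sentence_timestamp=True` to `model.generate` adds a `sentence_info` list to `res[0]`, and that each entry is a dict with `text`, `start`, and `end`, with times in milliseconds. The helper names `ms_to_srt` and `sentences_to_srt` are hypothetical and are not part of FunASR.

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sentences_to_srt(sentences):
    """Build SRT text from a list of sentence dicts.

    Assumed shape (not confirmed by the thread):
    {"text": str, "start": int, "end": int}, times in milliseconds.
    """
    blocks = []
    for index, sentence in enumerate(sentences, start=1):
        start = ms_to_srt(sentence["start"])
        end = ms_to_srt(sentence["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{sentence['text'].strip()}\n")
    # Each block already ends with a newline, so joining with "\n"
    # leaves the blank line that separates SRT entries.
    return "\n".join(blocks)
```

If the assumption about the result shape holds, something like `sentences_to_srt(res[0]["sentence_info"])` would give text that can be written to a `.srt` file. Check the keys of `res[0]` in your FunASR version first.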

zhangakun commented 2 months ago

But why are the lengths different? I want to add punctuation with another model, and for that I need the word-level timestamps. If the lengths differ, that won't work.

lixinjie97 commented 2 months ago

Punctuation does not count: punctuation marks have no entries in the timestamp list.

zhangakun commented 2 months ago

I know that. I already strip the punctuation:

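import re
import string

from zhon.hanzi import punctuation

# Combine Chinese and English punctuation into one set of characters to strip.
punctuation_zh = punctuation
punctuation_en = string.punctuation
punctuation_str = punctuation_zh + punctuation_en

res = funasr()  # placeholder for the recognition call
text = res[0]['text']
timestamp = res[0]['timestamp']
# re.escape keeps characters such as ']' and '\' literal inside the character class.
raw_text = re.sub('[' + re.escape(punctuation_str) + ']', '', text)

But len(raw_text) != len(timestamp).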
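
One possible reason for the remaining mismatch, which nobody in this thread confirms, is that `len(raw_text)` counts characters while the timestamp list has one entry per recognized token. Under that assumption, an English word or a number in the output adds several characters but only one timestamp, and spaces add characters with no timestamp at all. The sketch below counts tokens instead of characters. The tokenization rule (one token per CJK ideograph, one per run of ASCII letters, digits, or apostrophes) is a guess at how the model splits its output, and `split_tokens` and `align` are hypothetical helpers, not FunASR functions.

```python
import re

# Assumed tokenization: each CJK ideograph is one token, and each run of
# ASCII letters, digits, or apostrophes is one token. Punctuation and
# whitespace match neither alternative, so findall() skips them.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def split_tokens(text):
    """Split recognized text into the tokens assumed to carry timestamps."""
    return TOKEN_RE.findall(text)


def align(text, timestamps):
    """Pair each token with its [start, end] timestamp.

    Raises ValueError when the counts differ, which would mean the
    tokenization assumption does not hold for this output.
    """
    tokens = split_tokens(text)
    if len(tokens) != len(timestamps):
        raise ValueError(
            f"{len(tokens)} tokens but {len(timestamps)} timestamps"
        )
    return [(token, start, end) for token, (start, end) in zip(tokens, timestamps)]
```

With this rule, `"你好，world。今天"` splits into five tokens, whereas stripping punctuation alone leaves nine characters. If `align` raises on real output, compare `split_tokens(text)` with the timestamp list by hand to see where the model's tokenization differs from the guess.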