modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
https://www.funasr.com

Cantonese recognition outputs subwords #1445

Closed LRY1994 closed 4 months ago

LRY1994 commented 7 months ago

🐛 Bug

The recognition output contains subword tokens.

Maoming accent. gt: 好 啲 呢 我 觉 得 pred: ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a

2-28-2_00751262_00752898.zip

To Reproduce

from funasr import AutoModel

model = AutoModel(model="dengcunqin/speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online", model_revision="master")

encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention
chunk_size = [0, 10, 5]

# path (input audio) and local_path (output directory) are set elsewhere by the reporter
model.generate(input=path,
               chunk_size=chunk_size,
               encoder_chunk_look_back=encoder_chunk_look_back,
               decoder_chunk_look_back=decoder_chunk_look_back,
               is_final=True,
               output_dir=local_path)

LauraGPT commented 7 months ago

@@ means the token is a subword piece that continues into the next token. You could concatenate them via: replace('@@ ', '')
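For anyone applying this by hand, here is a minimal sketch of that replacement run on the prediction from the report. The helper name join_subwords and the handling of a trailing marker are assumptions for illustration, not part of FunASR.

```python
def join_subwords(text: str, marker: str = "@@") -> str:
    """Merge pieces such as 'qu@@ ar@@ ter' into 'quarter' (hypothetical helper)."""
    # A piece ending in the marker continues into the next token,
    # so removing "marker + space" glues the two pieces together.
    joined = text.replace(marker + " ", "")
    # Assumption: a marker may be left at the very end if the output stops mid-word.
    if joined.endswith(marker):
        joined = joined[: -len(marker)]
    return joined


pred = "ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a"
print(join_subwords(pred))  # hoalding neunquarter a
```

This only removes the markers. It does not fix the underlying misrecognition (the ground truth here is Cantonese, while the prediction is English subwords).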

tramphero commented 6 months ago

> @@ means the token is subword. You could concat them via: replace('@@ ', '')

Can we perhaps add the post-processing statements for handling subwords to the pipelines for all languages? @LauraGPT
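Until something like this exists in the pipelines, a caller-side step could look roughly like the sketch below. It assumes the recognition results arrive as a list of dicts with a "text" field. The function name postprocess_results and the sample record are hypothetical, and nothing here calls FunASR itself.

```python
from typing import Any, Dict, List


def postprocess_results(results: List[Dict[str, Any]], marker: str = "@@") -> List[Dict[str, Any]]:
    """Return copies of the result records with subword markers removed from 'text'."""
    cleaned = []
    for item in results:
        item = dict(item)  # shallow copy so the caller's records stay untouched
        text = item.get("text")
        if isinstance(text, str):
            # Join "xx@@ yy" into "xxyy", then drop a marker left at the very end.
            text = text.replace(marker + " ", "")
            if text.endswith(marker):
                text = text[: -len(marker)]
            item["text"] = text
        cleaned.append(item)
    return cleaned


# Hypothetical record shaped like the assumed output format.
sample = [{"key": "utt1", "text": "ne@@ un@@ qu@@ ar@@ ter a"}]
print(postprocess_results(sample))
```

Records without a string "text" field pass through unchanged, so the step is safe to apply to output that contains no subword markers.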