Cantonese recognition outputs subwords - Githubissues

LRY1994 commented 7 months ago

🐛 Bug

The recognition output contains subword tokens.

Maoming accent. gt: 好啲呢我觉得  pred: ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a

To Reproduce

from funasr import AutoModel

model = AutoModel(model="dengcunqin/speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online", model_revision="master")

encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention
chunk_size = [0, 10, 5]

# path (input audio) and local_path (output directory) are not defined in the original report
model.generate(input=path,
               chunk_size=chunk_size,
               encoder_chunk_look_back=encoder_chunk_look_back,
               decoder_chunk_look_back=decoder_chunk_look_back,
               is_final=True,
               output_dir=local_path)

LauraGPT commented 7 months ago

@@ means the token is a subword piece. You can join the pieces with: replace('@@ ', '')
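A minimal sketch of that replacement, applied to the prediction from the report. The helper names `merge_subwords` and `merge_results` are hypothetical and not part of FunASR, and the result shape (a list of dicts with a "text" field) is an assumption about what the caller holds:

```python
from __future__ import annotations


def merge_subwords(text: str, marker: str = "@@") -> str:
    """Join subword pieces: a token ending in `marker` continues into the next token."""
    merged = text.replace(marker + " ", "")
    # A marker left dangling at the very end has no following piece; drop it.
    if merged.endswith(marker):
        merged = merged[: -len(marker)]
    return merged.strip()


def merge_results(results: list[dict]) -> list[dict]:
    """Hypothetical helper: assumes each result is a dict that may carry a "text" field."""
    cleaned = []
    for result in results:
        item = dict(result)  # copy so the caller's dicts are left untouched
        if "text" in item:
            item["text"] = merge_subwords(item["text"])
        cleaned.append(item)
    return cleaned


pred = "ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a"
print(merge_subwords(pred))  # hoalding neunquarter a
```

This only joins the pieces; it does not change which tokens the model chose.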

tramphero commented 6 months ago

> @@ means the token is a subword piece. You can join the pieces with: replace('@@ ', '')

Can we perhaps add the post-processing statements for handling subwords to the pipelines for all languages? @LauraGPT

modelscope / FunASR

Cantonese recognition outputs subwords #1445