speechbrain / speechbrain

A PyTorch-based Speech Toolkit
http://speechbrain.github.io
Apache License 2.0

Transcription seems shortened #565

Closed arnaudmiribel closed 3 years ago

arnaudmiribel commented 3 years ago

Good evening here,

This looks awesome! I'm trying to transcribe a 53-second .wav file with the pre-trained French model. Here's my code:

from speechbrain.pretrained import EncoderDecoderASR
import torch

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-commonvoice-fr",
    savedir="pretrained_models/asr-crdnn-commonvoice-fr",
)

# The lines below are equivalent to `asr_model.transcribe_file("my_file.wav")`
waveform = asr_model.load_audio("my_file.wav")
batch = waveform.unsqueeze(0)
rel_length = torch.tensor([1.0])
predicted_words, predicted_tokens = asr_model.transcribe_batch(batch, rel_length)
print(predicted_words)

But the predicted words only account for roughly the first 4 or 5 seconds of the audio. Am I missing something? A sampling issue? (I'm an absolute beginner in audio processing.) Thanks a lot for your help!

TParcollet commented 3 years ago

Huuum. Can you check the size of your waveform, to see if it corresponds to the duration of the signal? This model was trained on audio sampled at 16 kHz; is that also the case here?
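
In case it helps, a minimal way to check both from Python (a sketch; my_file.wav is the file from the original post, and this uses torchaudio's metadata API):

import torchaudio

# Read the file header without loading the whole signal.
info = torchaudio.info("my_file.wav")
print("sample rate:", info.sample_rate)
print("frames:", info.num_frames)
print("duration (s):", info.num_frames / info.sample_rate)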

mravanelli commented 3 years ago

53 seconds is definitely too long, right? The system is trained on CommonVoice, whose sentences run about 4-8 seconds. Normally, long recordings should be split into smaller chunks with a VAD.
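
In the absence of a proper VAD, a minimal sketch of the chunking idea (assumptions: asr_model and a mono 16 kHz waveform as in the snippet above; the 10-second window is arbitrary, and a real VAD would cut at silences instead):

import torch

# Split the waveform into fixed-length windows and transcribe each one.
chunk_len = 10 * 16000  # 10-second chunks at 16 kHz
transcripts = []
for start in range(0, waveform.shape[0], chunk_len):
    chunk = waveform[start:start + chunk_len].unsqueeze(0)  # shape [1, time]
    words, _ = asr_model.transcribe_batch(chunk, torch.tensor([1.0]))
    transcripts.append(words[0])
print(" ".join(transcripts))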


TParcollet commented 3 years ago

Yes, ideally it would be better to remove silences or split the waveform. However, it should still work, right? The model should transcribe whatever it is given as input. What could happen, however, is that it starts outputting only blanks or end-of-sentence tokens.

arnaudmiribel commented 3 years ago

Wow, thanks for answering :-) Here's what I can tell about the sampling rate:

>>> import soundfile as sf
>>> f = sf.SoundFile("my_file.wav")
>>> print('samples = {}'.format(len(f)))
samples = 2564013
>>> print('sample rate = {}'.format(f.samplerate))
sample rate = 48000
>>> print('seconds = {}'.format(len(f) / f.samplerate))
seconds = 53.4169375

Within speechbrain, I get:

waveform = asr_model.load_audio("my_file.wav")
print(waveform.shape)
#  torch.Size([854671])

So I guess there are multiple issues here: at least the sampling rate of the file (48 kHz) is not what the model requires (16 kHz), and the audio should also be split to match the lengths seen in training. (Although 854671 = 2564013 × 16000 / 48000, so it looks like load_audio may already be resampling to 16 kHz?)

TParcollet commented 3 years ago

Try resampling your audio, as is done on this line: https://github.com/speechbrain/speechbrain/blob/34bcf9d0783cf72a952674032834383194018b7b/recipes/CommonVoice/ASR/seq2seq/train.py#L250

Then try to transcribe again :D
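
For reference, a hedged sketch of that resampling step applied to this thread's setup (assumes asr_model from the first snippet; keyword arguments make the source and target rates explicit):

import torch
import torchaudio

# Load at the file's native rate, downmix to mono, then resample to 16 kHz.
sig, fs = torchaudio.load("my_file.wav")  # sig has shape [channels, time]
sig = sig.mean(dim=0)  # mono
sig = torchaudio.transforms.Resample(orig_freq=fs, new_freq=16000)(sig)
words, _ = asr_model.transcribe_batch(sig.unsqueeze(0), torch.tensor([1.0]))
print(words)

Note that Resample's first positional argument is orig_freq and new_freq defaults to 16000, so Resample(16000) on its own is effectively a no-op.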

arnaudmiribel commented 3 years ago

So I tried with the following:

waveform = torchaudio.transforms.Resample(16000)(waveform)

and ran it again, but the output is the same :)

TParcollet commented 3 years ago

Let me try then :p

TParcollet commented 3 years ago

Ok, so the same happens for me: I get around 20 s of transcription out of the 53 s. Let me investigate this! @JianyuanZhong @30stomercury, any ideas on what could cause the transcription to be truncated? I suppose the EOS token is emitted too soon?

TParcollet commented 3 years ago

Right, some updates: it's a design problem. For now, only models with CTC+attention decoding (or pure CTC) can transcribe long audio, and at the cost of a very high decoding time (except for CTC-only). We will add online ASR and local attention to our short-term to-do list to facilitate such transcriptions...

arnaudmiribel commented 3 years ago

Thanks for the response!