Huuum. Can you check the size of your waveform, to see if it corresponds to the duration of the signal? This model has been trained on audio sampled at 16 kHz; is that also the case here?
53 seconds is definitely too long, right? The system is trained on CommonVoice, which has sentences of 4-8 seconds. Normally, long recordings should be split into smaller chunks with a VAD (voice activity detection).
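For illustration, here is a rough sketch of that kind of chunked transcription. It uses a plain fixed-length split rather than a real VAD, and it assumes asr_model is an already-loaded pretrained EncoderDecoderASR whose transcribe_batch method accepts a batch of waveforms plus relative lengths; the file name is just a placeholder.
import torch
import torchaudio

# Split a long recording into ~8-second chunks (after resampling to the
# 16 kHz expected by the model) and transcribe each chunk separately.
waveform, sample_rate = torchaudio.load("my_file.wav")   # shape: [channels, samples]
waveform = waveform.mean(dim=0)                          # downmix to mono
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
chunk_size = 8 * 16000                                   # 8 seconds at 16 kHz

transcripts = []
for start in range(0, waveform.shape[0], chunk_size):
    chunk = waveform[start:start + chunk_size].unsqueeze(0)  # [1, samples]
    rel_length = torch.tensor([1.0])                         # each chunk is used in full
    words, _ = asr_model.transcribe_batch(chunk, rel_length)
    transcripts.append(words[0])
print(" ".join(transcripts))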
Yes, ideally it would be better to remove silences or split the waveform. However, it should still work, right? The model should transcribe whatever is given as input. What could happen, however, is that it starts just outputting blanks or end-of-sentence tokens.
Wow, thanks for answering :-) Here's what I can tell about the sampling rate:
>>> import soundfile as sf
>>> f = sf.SoundFile("my_file.wav")
>>> print('samples = {}'.format(len(f)))
samples = 2564013
>>> print('sample rate = {}'.format(f.samplerate))
sample rate = 48000
>>> print('seconds = {}'.format(len(f) / f.samplerate))
seconds = 53.4169375
Within SpeechBrain, I get:
waveform = asr_model.load_audio("my_file.wav")
print(waveform.shape)
# torch.Size([854671])
So I guess there are multiple issues here: at least the sampling rate is not what the model requires, and the audio should also be split to match the lengths seen in the training data.
Try resampling your audio, for instance with this line: https://github.com/speechbrain/speechbrain/blob/34bcf9d0783cf72a952674032834383194018b7b/recipes/CommonVoice/ASR/seq2seq/train.py#L250
Then try to transcribe again :D
So I tried with the following:
waveform = torchaudio.transforms.Resample(16000)(waveform)
and ran it again, but the output is the same :)
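A side note on this attempt: the first positional argument of torchaudio.transforms.Resample is orig_freq (new_freq defaults to 16000), so Resample(16000) applied to a 48 kHz signal is effectively a no-op, which could explain why the output does not change. (Note also that the 854671-sample tensor returned by load_audio above already corresponds to about 53.4 s at 16 kHz, which suggests load_audio had resampled on its own.) A minimal sketch of the intended call, assuming waveform holds the original 48 kHz audio:
import torchaudio

# Resample from the file's 48 kHz to the 16 kHz expected by the model.
# 'waveform' is assumed here to hold the original 48 kHz audio tensor.
resampler = torchaudio.transforms.Resample(orig_freq=48000, new_freq=16000)
waveform_16k = resampler(waveform)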
Let me try then :p
OK, so the same happens for me: I get around 20 s out of 53 s. Let me investigate this! @JianyuanZhong @30stomercury, any ideas on what could cause the transcription to be truncated? I suppose that EOS is emitted too soon?
Right, some updates: it's a design problem. For now, only models with CTC+attention decoding (or pure CTC) can transcribe long audio, but at the cost of a very high decoding time (except for CTC-only). We will add online ASR and local attention to our short-term to-do list to facilitate such transcriptions ...
Thanks for the response!
Good evening here,
This looks awesome! I'm trying to get a transcription from the pre-trained French model for a 53-second .wav file. Here's my code, but the predicted words only account for roughly the first 4 or 5 seconds of the audio. Am I missing something? A sampling issue? (Absolute beginner in sound processing here.) Thanks a lot for your help!
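For reference, here is a minimal, hypothetical sketch of what such a transcription call could look like with a SpeechBrain pretrained French model; the model source name and the code below are assumptions for illustration, not the original poster's snippet.
from speechbrain.pretrained import EncoderDecoderASR

# Hypothetical example: load a pretrained French ASR model and transcribe a file.
# The source name is an assumption and is not taken from the original post.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-commonvoice-fr",
    savedir="pretrained_models/asr-crdnn-commonvoice-fr",
)
print(asr_model.transcribe_file("my_file.wav"))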