huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

speech recognition with speecht5 #27703

Closed poojitharamachandra closed 9 months ago

poojitharamachandra commented 10 months ago

System Info

processor = SpeechT5Processor.from_pretrained("microsoftt5_tts")
model = SpeechT5ForSpeechToText.from_pretrained("microsoftt5_tts")

duration = 10
sampling_rate = 16000
audio = sd.rec(int(sampling_rate * duration), samplerate=sampling_rate, channels=1)
input_features = processor(audio=audio,sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
        output = model(**input_features)
decoded_text = processor.decode(output, skip_special_tokens=True)

---------------> output = model(**input_features)
RuntimeError: Calculated padded input size per channel: (1). Kernel size: (10). Kernel size can't be greater than actual input size

How can I solve this error?

@sanchit-gandhi

Who can help?

No response

Information

Tasks

Reproduction

run the above code snippet

Expected behavior

expected to convert speech to text

ArthurZucker commented 10 months ago

Hey! It seems you are not using an officially shared snippet, and you are calling an external library (sd.rec). To make sure we can help you, would you mind sharing the full snippet?

poojitharamachandra commented 10 months ago
import sounddevice as sd
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoftt5_tts")
model = SpeechT5ForSpeechToText.from_pretrained("microsoftt5_tts")

duration = 10
sampling_rate = 16000
audio = sd.rec(int(sampling_rate * duration), samplerate=sampling_rate, channels=1)
input_features = processor(audio=audio,sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
        output = model(**input_features)
decoded_text = processor.decode(output, skip_special_tokens=True)
ArthurZucker commented 10 months ago

Thanks, but this does not run.

OSError: microsoftt5_tts is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

I could of course check online to see what the closest names are, but the point of a reproducer is that I can reproduce.

I would recommend making sure the shape you are feeding to the processor is correct. Here is what the docs mention:

The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of lists of float values. This outputs waveform features. Must be mono channel audio, not stereo, i.e. a single float per timestep.

Here is an example of a working snippet: https://github.com/huggingface/transformers/blob/d8e1ed17ee7e640a1d5ba999345c71d4039a5a34/tests/models/speecht5/test_modeling_speecht5.py#L766
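To illustrate the shape point above: sd.rec(..., channels=1) returns a 2-D array of shape (frames, channels), while the processor wants a 1-D mono waveform. A minimal sketch with a zero-filled stand-in buffer (the model call itself needs a valid checkpoint, so it is omitted here):

```python
import numpy as np

# sd.rec(..., channels=1) returns shape (frames, channels), e.g. (16000, 1).
# The SpeechT5 processor expects a 1-D mono waveform: one float per timestep.
recording = np.zeros((16000, 1), dtype=np.float32)  # stand-in for a 1 s sd.rec buffer
waveform = recording.squeeze(-1)                    # shape (16000,)
```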

poojitharamachandra commented 10 months ago

Do you have any suggestions on how to convert a .wav file to a numpy array suitable for the model?

ArthurZucker commented 10 months ago

The automatic speech recognition pipeline supports passing wav files (as a path to the file) and uses ffmpeg, see here. A snippet is available here.
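If you want to load the file yourself instead of going through the pipeline, the standard-library wave module is enough for plain PCM files. A minimal sketch, assuming 16-bit PCM input (the helper name is illustrative, not a transformers API; for other sample rates you would still need to resample to 16 kHz, e.g. with librosa or torchaudio):

```python
import wave

import numpy as np


def load_wav_as_mono_float(path):
    """Read a 16-bit PCM .wav file into a 1-D float32 array in [-1, 1]."""
    with wave.open(path, "rb") as f:
        assert f.getsampwidth() == 2, "this sketch assumes 16-bit PCM"
        rate = f.getframerate()
        channels = f.getnchannels()
        raw = f.readframes(f.getnframes())
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    if channels > 1:
        audio = audio.reshape(-1, channels).mean(axis=1)  # downmix to mono
    return audio, rate
```

The returned 1-D array can be passed directly as the audio argument of the processor, together with the returned rate as sampling_rate.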

sanchit-gandhi commented 10 months ago

It looks like you are loading the TTS model, but trying to perform ASR. Here's a code snippet for running inference with the ASR model: https://huggingface.co/microsoft/speecht5_asr#how-to-get-started-with-the-model

Or with the pipeline:

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="microsoft/speecht5_asr")
pipe("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")  # replace input with the path to your audio
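One more thing worth double-checking with sounddevice recordings: sd.rec() is non-blocking and returns before the recording has finished, so without a following sd.wait() the buffer may be partly unfilled. A minimal post-processing sketch for the recorded buffer (the helper name is illustrative; the peak normalization is optional):

```python
import numpy as np


def prepare_recording(audio):
    """Turn an sd.rec buffer of shape (frames, channels) into the 1-D
    float32 mono waveform the processor expects.

    Note: sd.rec() returns immediately; call sd.wait() after it so the
    buffer is actually filled before you transcribe it.
    """
    mono = audio.mean(axis=1) if audio.ndim == 2 else audio
    mono = mono.astype(np.float32)
    peak = float(np.max(np.abs(mono)))
    if peak > 0:
        mono = mono / peak  # peak-normalize to [-1, 1]
    return mono
```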
poojitharamachandra commented 10 months ago

> It looks like you are loading the TTS model, but trying to perform ASR. Here's a code snippet for running inference with the ASR model: https://huggingface.co/microsoft/speecht5_asr#how-to-get-started-with-the-model
>
> Or with the pipeline:
>
> from transformers import pipeline
>
> pipe = pipeline("automatic-speech-recognition", model="microsoft/speecht5_asr")
> pipe("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")  # replace input with the path to your audio

This creates too much noise in the generated text.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.