poojitharamachandra closed this issue 9 months ago
Hey! Seems like you are not using an officially shared snippet but an external library (call to sd.rec). To make sure we can help you, would you mind sharing the full snippet?
import sounddevice as sd
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5ForSpeechToText
processor = SpeechT5Processor.from_pretrained("microsoftt5_tts")
model = SpeechT5ForSpeechToText.from_pretrained("microsoftt5_tts")
duration = 10
sampling_rate = 16000
audio = sd.rec(int(sampling_rate * duration), samplerate=sampling_rate, channels=1)
input_features = processor(audio=audio,sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    output = model(**input_features)
decoded_text = processor.decode(output, skip_special_tokens=True)
Thanks, but this does not run.
OSError: microsoftt5_tts is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
I could of course check online to see what the closest names are, but the point of a reproducer is that I can reproduce.
I would recommend making sure the shape you are feeding to the processor is correct. Here is what the doc mentions:
The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of list of float values. This outputs waveform features. Must be mono channel audio, not stereo, i.e. single float per timestep.
Here is an example of a working snippet: https://github.com/huggingface/transformers/blob/d8e1ed17ee7e640a1d5ba999345c71d4039a5a34/tests/models/speecht5/test_modeling_speecht5.py#L766
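To illustrate the shape point: sd.rec returns an array of shape (frames, channels) and returns immediately, so the recording has to be finished (for example with sd.wait()) and flattened before it matches the "single float per timestep" description above. A minimal sketch, using a zero-filled stand-in array instead of a real recording; the helper name to_mono_1d is hypothetical and not part of any library:

```python
import numpy as np


def to_mono_1d(recording):
    """Collapse a (frames, channels) recording into a 1-D float32 waveform."""
    audio = np.asarray(recording, dtype=np.float32)
    if audio.ndim == 2:
        # Average over the channel axis; with channels=1 this just drops the axis.
        audio = audio.mean(axis=1)
    return audio


# Stand-in for a 10 s, 16 kHz, single-channel recording of shape (160000, 1).
fake_recording = np.zeros((160000, 1), dtype=np.float32)
waveform = to_mono_1d(fake_recording)
print(waveform.shape)  # (160000,)
```

The resulting 1-D array is the kind of input the processor documentation describes; whether this alone resolves the original error is not confirmed in this thread.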
Do you have any suggestions on how to convert a .wav file to a numpy array suitable for the model?
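One possible way to do this, sketched with only the standard-library wave module and numpy. It assumes a 16-bit PCM file; the function name load_wav_as_float32 is hypothetical, and no resampling is done, so the file should already be at the sampling rate the model expects (the snippet in this thread uses 16000):

```python
import wave

import numpy as np


def load_wav_as_float32(path):
    """Read a 16-bit PCM WAV file into a 1-D float32 array scaled to [-1, 1)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        n_channels = wav_file.getnchannels()
        sampling_rate = wav_file.getframerate()
        raw = wav_file.readframes(wav_file.getnframes())

    # Interleaved little-endian int16 samples -> float32 in [-1, 1).
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if n_channels > 1:
        # De-interleave and average the channels to get mono audio.
        samples = samples.reshape(-1, n_channels).mean(axis=1)
    return samples, sampling_rate


# Usage (the file name is a placeholder):
# audio, sampling_rate = load_wav_as_float32("my_recording.wav")
```

The returned array can then be passed as the audio argument of the processor together with the returned sampling rate. Libraries such as soundfile or librosa handle more formats and are a reasonable alternative.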
It looks like you are loading the TTS model, but trying to perform ASR. Here's a code snippet for running inference with the ASR model: https://huggingface.co/microsoft/speecht5_asr#how-to-get-started-with-the-model
Or with the pipeline:
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="microsoft/speecht5_asr")
pipe("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac") # replace input with the path to your audio
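For audio that is already in memory (for example a microphone recording) rather than a file path, the pipeline also accepts a dict with "raw" and "sampling_rate" keys. A minimal sketch, assuming a single-channel recording; the helper names as_pipeline_input and transcribe are hypothetical:

```python
import numpy as np


def as_pipeline_input(waveform, sampling_rate):
    """Package raw single-channel samples in the dict form the ASR pipeline accepts."""
    # reshape(-1) turns a (frames, 1) recording into (frames,); this assumes
    # one channel, since stereo data would be interleaved rather than mixed.
    audio = np.asarray(waveform, dtype=np.float32).reshape(-1)
    return {"raw": audio, "sampling_rate": int(sampling_rate)}


def transcribe(waveform, sampling_rate, model_id="microsoft/speecht5_asr"):
    """Run the ASR pipeline on in-memory audio and return the recognized text."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model=model_id)
    return pipe(as_pipeline_input(waveform, sampling_rate))["text"]


# Usage, with `audio` being a finished (frames, 1) recording at 16 kHz:
# print(transcribe(audio, 16000))
```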
This creates too much noise in the generated text.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
how can i solve this error?
@sanchit-gandhi
Who can help?
No response
Reproduction
run the above code snippet
Expected behavior
expected to convert speech to text