huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Help using Speech2Text #10631

Closed · xjdeng closed this issue 3 years ago

xjdeng commented 3 years ago

Hey @patil-suraj (and anyone who can help),

I'm still a beginner compared to the rest of the folks here, so sorry if my question is a little basic.

But I'm trying to build a pipeline to transcribe YouTube videos that aren't transcribed correctly by Google, and I was considering using your model for it.

Here's my unfinished code on Google Colab; the last line throws an error:

!pip install git+https://github.com/huggingface/transformers
!pip install youtube-dl path.py soundfile librosa sentencepiece torchaudio

import youtube_dl
from path import Path
import tempfile
import textwrap
import librosa
import soundfile as sf
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
wrapper = textwrap.TextWrapper(width=70)

mydir = tempfile.TemporaryDirectory()
dirname = mydir.name + "/tmp.wav"

!youtube-dl -o $dirname -ci -f 'bestvideo[ext=mp4]+bestaudio' -x --audio-format wav https://www.youtube.com/watch?v=d5yfUuHYWho

filename = dirname + ".wav"

speech, rate = sf.read(filename)
speech = librosa.resample(speech.T, rate, 16000)

features = processor(speech, sampling_rate=16000, padding=True, return_tensors="pt")

And here's the error produced:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-8-8fc3e2d943e0> in <module>()
----> 1 features = processor(speech, sampling_rate=16000, padding=True, return_tensors="pt")

5 frames
/usr/local/lib/python3.7/dist-packages/torchaudio/compliance/kaldi.py in _get_waveform_and_window_properties(waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient)
    147     assert 2 <= window_size <= len(
    148         waveform), ('choose a window size {} that is [2, {}]'
--> 149                     .format(window_size, len(waveform)))
    150     assert 0 < window_shift, '`window_shift` must be greater than 0'
    151     assert padded_window_size % 2 == 0, 'the padded `window_size` must be divisible by two.' \

AssertionError: choose a window size 400 that is [2, 2]

Can anyone point me in the right direction? Thanks.

elgeish commented 3 years ago

Your speech loading code is incorrect: for stereo audio, sf.read returns a (num_frames, num_channels) array, so after the transpose the processor sees a 2xN array whose length along the first axis is 2 (that's the [2, 2] in the assertion). Instead, try the following:

from IPython.display import Audio

# librosa.load downmixes to mono and resamples to 16 kHz in one step
speech, rate = librosa.load(filename, sr=16000)

# Sanity check: play the clip back in the notebook
Audio(speech, rate=rate)
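
From there, a minimal sketch of the rest of the pipeline (untested; it assumes the model and processor already loaded in your original snippet):

# Extract filterbank features from the mono 16 kHz waveform
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Generate token ids and decode them to text
generated_ids = model.generate(inputs["input_features"],
                               attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
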
rodrigoheck commented 3 years ago

When I run this line

processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

I am getting the following error: "AttributeError: type object 'Speech2TextProcessor' has no attribute 'from_pretrained'". Was this part recently changed in the repository?

EDIT: sorry, my mistake. The previous installation was causing trouble; after uninstalling everything and reinstalling, it works fine.

patil-suraj commented 3 years ago

As @elgeish said, the speech loading code was causing the issue. Glad to know that you resolved it!

xjdeng commented 3 years ago

Success! Thanks


yadgire7 commented 1 year ago

I am using the MAESTRO dataset (audio files transformed to PyTorch tensors). Code:

import torch
from torch.utils.data import DataLoader
from transformers import AutoProcessor, ASTModel

# LoadDataset is my custom Dataset wrapping the MAESTRO audio files

if __name__ == '__main__':
    METADATA = "data/processed.csv"
    AUDIO_DIR = "data"
    SAMPLES = 16000
    SR = 16000
    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

    ds = LoadDataset(metadata_file=METADATA,
                     audio_dir=AUDIO_DIR,
                     sample_rate=SR,
                     num_samples=SAMPLES,
                     device=device)
    dataloader = DataLoader(ds)
    features = next(iter(dataloader))

    processor = AutoProcessor.from_pretrained(
        "MIT/ast-finetuned-audioset-10-10-0.4593")
    model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

    # audio file is decoded on the fly
    inputs = processor(features, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    last_hidden_states = outputs.last_hidden_state
    print(last_hidden_states)

Error message:

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.

Some weights of the model checkpoint at MIT/ast-finetuned-audioset-10-10-0.4593 were not used when initializing ASTModel: ['classifier.layernorm.bias', 'classifier.layernorm.weight', 'classifier.dense.weight', 'classifier.dense.bias']

xjdeng commented 1 year ago

@yadgire7
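
By the way, what you pasted are warnings rather than errors: AutoProcessor falls back to the checkpoint's feature extractor because it has no image processor config, and the classifier.* weights are skipped because ASTModel is the bare encoder without the classification head. If you actually want the AudioSet predictions, here's a rough sketch (untested; the random waveform is just a stand-in for one of your 1-D 16 kHz audio arrays):

import numpy as np
import torch
from transformers import AutoProcessor, ASTForAudioClassification

processor = AutoProcessor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Stand-in for a real 1-second clip at 16 kHz
waveform = np.random.randn(16000).astype(np.float32)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top predicted AudioSet label
print(model.config.id2label[logits.argmax(-1).item()])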

I no longer use this model for speech-to-text; use Whisper instead.
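
For example, a rough sketch with the transformers ASR pipeline (untested; openai/whisper-small is just one checkpoint choice, and the filename is a placeholder path):

from transformers import pipeline

filename = "audio.wav"  # placeholder: path to the audio you want transcribed

# chunk_length_s lets the pipeline handle audio longer than Whisper's
# native 30-second window
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small",
               chunk_length_s=30)
print(asr(filename)["text"])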

For music genre classification, try converting the audio into spectrograms and training an image classifier on the spectrograms.
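
Something like this for the spectrogram step (a sketch with librosa and matplotlib; the file paths are placeholders):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the clip and compute a 128-band mel spectrogram on a log (dB) scale
y, sr = librosa.load("clip.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save the spectrogram as a borderless image for the image classifier
fig, ax = plt.subplots()
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("clip.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)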

You might also be able to build a multimodal model that takes both the spectrogram and the transcribed text and classifies the music from the two inputs together. I think the fastai library could do this by combining a text block with an image block, though I haven't tried it.