huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Help using Speech2Text #10631

Closed · xjdeng closed this issue 3 years ago

xjdeng commented 3 years ago

Hey @patil-suraj (and anyone who can help),

I'm still a beginner compared to the rest of the folks here, so sorry if my question is a little basic.

But I'm trying to build a pipeline to transcribe YouTube videos that aren't transcribed correctly by Google, and I was considering using your model for it.

Here's my unfinished code on Google Colab; the last line throws an error:

!pip install git+https://github.com/huggingface/transformers
!pip install youtube-dl path.py soundfile librosa sentencepiece torchaudio

import youtube_dl
from path import Path
import tempfile
import textwrap
import librosa
import soundfile as sf
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
wrapper = textwrap.TextWrapper(width=70)

mydir = tempfile.TemporaryDirectory()
dirname = mydir.name + "/tmp.wav"

!youtube-dl -o $dirname -ci -f 'bestvideo[ext=mp4]+bestaudio' -x --audio-format wav https://www.youtube.com/watch?v=d5yfUuHYWho

filename = dirname + ".wav"

speech, rate = sf.read(filename)
speech = librosa.resample(speech.T, rate, 16000)

features = processor(speech, sampling_rate=16000, padding=True, return_tensors="pt")

And here's the error produced:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-8-8fc3e2d943e0> in <module>()
----> 1 features = processor(speech, sampling_rate=16000, padding=True, return_tensors="pt")

5 frames
/usr/local/lib/python3.7/dist-packages/torchaudio/compliance/kaldi.py in _get_waveform_and_window_properties(waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient)
    147     assert 2 <= window_size <= len(
    148         waveform), ('choose a window size {} that is [2, {}]'
--> 149                     .format(window_size, len(waveform)))
    150     assert 0 < window_shift, '`window_shift` must be greater than 0'
    151     assert padded_window_size % 2 == 0, 'the padded `window_size` must be divisible by two.' \

AssertionError: choose a window size 400 that is [2, 2]

Can anyone point me in the right direction? Thanks.

elgeish commented 3 years ago

Your speech loading code is incorrect: for stereo audio, sf.read returns a (num_frames, num_channels) array, so after the transpose the processor sees a 2xN array whose length along the first axis is 2 (that's the [2, 2] in the assertion). Instead, try the following:

from IPython.display import Audio

# librosa.load downmixes to mono and resamples to 16 kHz in one step
speech, rate = librosa.load(filename, sr=16000)

# Sanity check: play the clip back in the notebook
Audio(speech, rate=rate)
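
From there, a minimal sketch of the rest of the pipeline (untested; it assumes the model and processor already loaded in your original snippet):

# Extract filterbank features from the mono 16 kHz waveform
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Generate token ids and decode them to text
generated_ids = model.generate(inputs["input_features"],
                               attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
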
rodrigoheck commented 3 years ago

When I run this line

processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

I am getting the following error: "AttributeError: type object 'Speech2TextProcessor' has no attribute 'from_pretrained'". Was this part recently changed in the repository?

EDIT: sorry, my mistake. The previous installation was causing trouble; after uninstalling everything and reinstalling, it works fine.

patil-suraj commented 3 years ago

As @elgeish said, the speech loading code was causing the issue. Glad to know that you resolved it!

xjdeng commented 3 years ago

Success! Thanks


yadgire7 commented 1 year ago

I am using the MAESTRO dataset (audio files transformed to PyTorch tensors). Code:

import torch
from torch.utils.data import DataLoader
from transformers import AutoProcessor, ASTModel

# LoadDataset is my custom Dataset wrapping the MAESTRO audio files

if __name__ == '__main__':
    METADATA = "data/processed.csv"
    AUDIO_DIR = "data"
    SAMPLES = 16000
    SR = 16000
    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

    ds = LoadDataset(metadata_file=METADATA,
                     audio_dir=AUDIO_DIR,
                     sample_rate=SR,
                     num_samples=SAMPLES,
                     device=device)
    dataloader = DataLoader(ds)
    features = next(iter(dataloader))

    processor = AutoProcessor.from_pretrained(
        "MIT/ast-finetuned-audioset-10-10-0.4593")
    model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

    # audio file is decoded on the fly
    inputs = processor(features, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    last_hidden_states = outputs.last_hidden_state
    print(last_hidden_states)

Error message:

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.

Some weights of the model checkpoint at MIT/ast-finetuned-audioset-10-10-0.4593 were not used when initializing ASTModel: ['classifier.layernorm.bias', 'classifier.layernorm.weight', 'classifier.dense.weight', 'classifier.dense.bias']

xjdeng commented 1 year ago

@yadgire7
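
By the way, what you pasted are warnings rather than errors: AutoProcessor falls back to the checkpoint's feature extractor because it has no image processor config, and the classifier.* weights are skipped because ASTModel is the bare encoder without the classification head. If you actually want the AudioSet predictions, here's a rough sketch (untested; the random waveform is just a stand-in for one of your 1-D 16 kHz audio arrays):

import numpy as np
import torch
from transformers import AutoProcessor, ASTForAudioClassification

processor = AutoProcessor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Stand-in for a real 1-second clip at 16 kHz
waveform = np.random.randn(16000).astype(np.float32)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top predicted AudioSet label
print(model.config.id2label[logits.argmax(-1).item()])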

I no longer use this model for speech-to-text; use Whisper instead.
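
For example, a rough sketch with the transformers ASR pipeline (untested; openai/whisper-small is just one checkpoint choice, and the filename is a placeholder path):

from transformers import pipeline

filename = "audio.wav"  # placeholder: path to the audio you want transcribed

# chunk_length_s lets the pipeline handle audio longer than Whisper's
# native 30-second window
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small",
               chunk_length_s=30)
print(asr(filename)["text"])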

For music genre classification, try converting the audio into spectrograms and training an image classifier on the spectrograms.
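
Something like this for the spectrogram step (a sketch with librosa and matplotlib; the file paths are placeholders):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the clip and compute a 128-band mel spectrogram on a log (dB) scale
y, sr = librosa.load("clip.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save the spectrogram as a borderless image for the image classifier
fig, ax = plt.subplots()
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("clip.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)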

You might also be able to build a multimodal model that takes both the spectrogram and the transcribed text and classifies the music from the two inputs together. I think the fastai library could do this by combining a text block with an image block, though I haven't tried it.