Closed xjdeng closed 3 years ago
Your speech loading code is incorrect; instead try the following:
import librosa
from IPython.display import Audio

speech, rate = librosa.load(filename, sr=16000)
Audio(speech, rate=rate)
When I run this line
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
I am getting the following error: "AttributeError: type object 'Speech2TextProcessor' has no attribute 'from_pretrained'". Was this part recently changed in the repository?
EDIT: sorry, my mistake. The previous installation was causing trouble. After uninstalling everything and installing again it is working fine.
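For anyone hitting the same AttributeError, a stale or conflicting install is a likely culprit. A minimal sketch of the reinstall, assuming a pip-based environment (the package name is the only thing taken from this thread; version pinning is up to you):

```shell
# Remove the existing install, then pull a fresh copy of transformers
pip uninstall -y transformers
pip install --upgrade transformers
```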
As @elgeish said, the speech loading code was causing the issue. Glad to know that you resolved it!
Success! Thanks
I am using the MAESTRO dataset (audio files transformed to PyTorch tensors). Code:

import torch
from torch.utils.data import DataLoader
from transformers import AutoProcessor, ASTModel

if __name__ == '__main__':
    METADATA = "data/processed.csv"
    AUDIO_DIR = "data"
    SAMPLES = 16000
    SR = 16000
    device = "cuda" if torch.cuda.is_available() else "cpu"

    ds = LoadDataset(metadata_file=METADATA,
                     audio_dir=AUDIO_DIR,
                     sample_rate=SR,
                     num_samples=SAMPLES,
                     device=device)
    dataloader = DataLoader(ds)
    features = next(iter(dataloader))

    processor = AutoProcessor.from_pretrained(
        "MIT/ast-finetuned-audioset-10-10-0.4593")
    model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

    # audio file is decoded on the fly
    inputs = processor(features, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
    print(last_hidden_states)
Error message:
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Some weights of the model checkpoint at MIT/ast-finetuned-audioset-10-10-0.4593 were not used when initializing ASTModel: ['classifier.layernorm.bias', 'classifier.layernorm.weight', 'classifier.dense.weight', 'classifier.dense.bias']
@yadgire7
I no longer use this model for speech to text; use Whisper instead.
For music genre classification, try converting the audio into spectrograms and training an image classifier on the spectrograms.
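The spectrogram route can be sketched as follows. This is a toy magnitude-spectrogram function using only NumPy; in practice you'd more likely use something like librosa.feature.melspectrogram, and the frame and window parameters here are purely illustrative:

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=256):
    """Magnitude spectrogram via a framed, Hann-windowed FFT.
    A simplified stand-in for a mel spectrogram; the 2-D output can
    be treated as a grayscale image for an image classifier."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    # shape: (freq_bins, time_frames)
    return np.stack(frames, axis=1)

# Example: one second of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (257, 61)
```

From here each clip becomes an image-like array, so any off-the-shelf image classifier (e.g. a Fastai or torchvision CNN) can be trained on the genre labels.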
You might also be able to build a multimodal model that takes both the spectrogram and the transcribed text and classifies the music from both inputs. With the Fastai library, at least, you could combine a text block with an image block, though I haven't tried it myself.
Hey @patil-suraj (and anyone who can help),
Sorry if my question is a little basic; I'm still a beginner compared to the rest of the folks here.
I'm trying to build a pipeline to manually transcribe YouTube videos (the ones that aren't transcribed correctly by Google), and I was considering using your model for it.
Here's my unfinished code on Google Colab; the last line throws an error:
And here's the error produced:
Can anyone point me in the right direction? Thanks.