Closed: uriii3 closed this issue 2 years ago
I think I didn't mention that this issue is about this blog post: https://huggingface.co/blog/asr-chunking, by @Narsil. Thank you to anyone who might help!
1- you pass the processor you desire to the pipeline function and it takes it into account (didn't work for me, I don't know if I am doing something wrong).
It should work, but do you have example code?
By language models, I assume you're referring to the n-grams that are used with kenlm and pyctcdecode, right?
The logic to determine whether to use this or not is brittle, and most importantly, if either is missing, it will be skipped entirely (with a warning).
We do that because both dependencies are optional.
If you had a code sample, we could try to reproduce and see what's going wrong.
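As a quick sanity check, something like this (a minimal sketch) shows whether the two optional dependencies are importable in your environment:

# Minimal sketch: check that the optional LM-decoding dependencies are importable.
# If either is missing, the pipeline skips LM decoding entirely (with a warning).
import importlib.util

for dep in ("kenlm", "pyctcdecode"):
    found = importlib.util.find_spec(dep) is not None
    print(dep, "available" if found else "missing")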
2- look at the function "pipeline" and how it cuts the audio to reproduce the same procedure into my own code (where I am using the model and processor that I desire).
Cutting the audio and keeping track of it is tricky business, as there are multiple possible units: time in seconds, time in number of samples (which depends on the sampling rate), and time in logits space, at the very least.
In addition, striding information needs to be kept, and various bounds checks have to manage the odd cases. All the code is in the actual pipeline code: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/automatic_speech_recognition.py
This might help too: https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/pipelines#pipeline-chunk-batching
(Though it doesn't go into the actual audio chunking/striding part; for that, the blog is the best doc we have.)
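To give a concrete idea of the units involved, here is a rough sketch (assuming a wav2vec2-style CTC model; the variable names are mine and the exact frame counts depend on the model's convolutional feature extractor):

# Rough sketch of the unit conversions involved in chunking/striding.
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-100h")
sampling_rate = 16000

# time in seconds
chunk_s, stride_left_s, stride_right_s = 30, 4, 2
# time in number of samples (depends on the sampling rate)
chunk_samples = chunk_s * sampling_rate
stride_left_samples = stride_left_s * sampling_rate
stride_right_samples = stride_right_s * sampling_rate
# time in logits space: how many CTC frames the model emits for that many samples
chunk_frames = int(model._get_feat_extract_output_lengths(chunk_samples))
stride_left_frames = int(model._get_feat_extract_output_lengths(stride_left_samples))
stride_right_frames = int(model._get_feat_extract_output_lengths(stride_right_samples))
print(chunk_frames, stride_left_frames, stride_right_frames)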
Hey, thank you for your quick response!
The code that doesn't work with the processor (with the language model) is this one:
# Import necessary libraries
import librosa
from transformers import pipeline

audio, rate = librosa.load("../108.wav", sr=16000)

pipe = pipeline(model="facebook/wav2vec2-base-100h", processor="patrickvonplaten/wav2vec2-base-100h-with-lm")

# stride_length_s is a tuple of the left and right stride lengths.
# With only 1 number, both sides get the same stride; by default
# the stride_length_s on one side is 1/6th of the chunk_length_s.
output = pipe(audio, chunk_length_s=30, stride_length_s=(4, 2))
print(output)
I was trying to mix the code from your blog (https://huggingface.co/blog/asr-chunking) with the code that appears in the blog you cited (https://huggingface.co/blog/wav2vec2-with-ngram), all while trying to process some files from a "usa" folder and saving the .txt files:
from transformers import AutoProcessor, AutoModelForCTC
import time
import torch
import librosa
import os

print(torch.__version__)
start = time.time()

processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

country = "usa"
recordings_path = "../archive/recordings/" + country + "/"
count = 0

for file in os.listdir(recordings_path):
    if file.endswith(".wav"):
        text_path = './archive_' + country + '_lm_v2/' + file[:-4] + '.txt'
        if not os.path.exists(text_path):
            print(file)
            count = count + 1

            # Loading the audio file
            audio, rate = librosa.load(recordings_path + file, sr=16000)
            print(rate)

            inputs = processor(audio, sampling_rate=rate, return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits

            # Looks like no "argmax" is used; it goes directly to the decoder (which makes sense)!
            transcription = processor.batch_decode(logits.numpy()).text
            final_transcript = transcription[0].lower()

            # Saving the transcription
            print("saving txt")
            with open(text_path, 'w') as f:
                f.write(final_transcript)
This code works fine, but I now want to extend it to a folder with larger audio files, and I would like to be able to mix the two approaches "easily" if that could work. I understand that the whole part from inputs to final_transcript could be done with the "chunking method" you explained using the pipeline.
What makes sense to me is that the decoder (with the language model, as I understand it) doesn't need the usual "argmax", as it does that by itself later on.
I don't know if I explained myself clearly enough; I'm new to reporting this kind of issue, sorry if that makes the work harder for you :(.
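For reference, the contrast I mean looks roughly like this (a sketch that reuses the processor and logits from the snippet above):

import torch

# plain CTC decoding: greedy argmax over the vocabulary, then the tokenizer maps ids to text
predicted_ids = torch.argmax(logits, dim=-1)
greedy_text = processor.tokenizer.batch_decode(predicted_ids)[0]

# LM decoding: the decoder needs the full logit distribution, so no argmax beforehand
lm_text = processor.batch_decode(logits.numpy()).text[0]
print(greedy_text.lower())
print(lm_text.lower())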
And actually, this morning, what seems to be working is this code, which I found here (https://github.com/huggingface/transformers/issues/14162) by @anton-l and reworked a little bit:
import torch
import librosa
from transformers import AutoModelForCTC, AutoProcessor

sample_rate = 16000
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

audio, _ = librosa.load(path_to_recording, sr=sample_rate)

chunk_duration = 10   # sec
padding_duration = 0  # sec

chunk_len = chunk_duration * sample_rate
input_padding_len = int(padding_duration * sample_rate)
output_padding_len = model._get_feat_extract_output_lengths(input_padding_len)

all_preds = []
for start in range(input_padding_len, len(audio) - input_padding_len, chunk_len):
    chunk = audio[start - input_padding_len:start + chunk_len + input_padding_len]
    input_values = processor(chunk, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**input_values).logits
    logits = logits[output_padding_len:len(logits) - output_padding_len]
    # predicted_ids = torch.argmax(logits, dim=-1)  # I need to take this out because the decoder needs all 32 possibilities, as I understand it
    all_preds.append(logits)

a0 = torch.cat(all_preds, dim=1)
transcription = processor.batch_decode(a0.numpy()).text
print(transcription[0].lower())
And the code, with the changes made, works really well for chunking, but when the padding is different from 0 it doesn't work. Also, it is strange how the padding just adds more columns and doesn't actually overlap the chunks. Anyway, that seems to be working now (with padding_duration=0) to chunk the files and be able to process them.
Hi @uriii3,
Can you try this?
from transformers import pipeline, AutoFeatureExtractor, AutoTokenizer

pipe = pipeline(
    model="facebook/wav2vec2-base-100h",
    feature_extractor=AutoFeatureExtractor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm"),
    tokenizer=AutoTokenizer.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm"),
)

# Or more simply (not sure why you use a different model):
pipe = pipeline(model="patrickvonplaten/wav2vec2-base-100h-with-lm")

output = pipe("../108.wav", chunk_length_s=30, stride_length_s=(4, 2))
print(output)
processor is NOT used by pipelines, only the raw tokenizer and feature_extractor (which processor is only a thin wrapper on).
Does that work?
That seems to work perfectly!
In my head, the language model was supposed to be in the processor (the model was only for the audio-to-token part). Thank you very much for the help!
tokenizer = text <-> input_ids (so the n-gram model is in here, for the reverse arrow).
feature_extractor = audio <-> tensor
processor = wrapper on both (makes for better shareable snippets when not using pipelines)
The pipeline also handles going from a raw audio file to actual audio (through ffmpeg, which saves you the librosa calls, is generally faster at decompressing audio, and can do it in a streaming fashion, when plugging into a mic for instance).
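In code, the split looks roughly like this (a sketch using the -with-lm checkpoint from above; the print just shows the wrapped objects):

from transformers import AutoFeatureExtractor, AutoProcessor, AutoTokenizer

name = "patrickvonplaten/wav2vec2-base-100h-with-lm"
tokenizer = AutoTokenizer.from_pretrained(name)                 # text <-> input_ids
feature_extractor = AutoFeatureExtractor.from_pretrained(name)  # audio <-> tensor
processor = AutoProcessor.from_pretrained(name)                 # wrapper on both
print(type(processor.tokenizer).__name__, type(processor.feature_extractor).__name__)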
That being said, you should have received a better error message saying that processor wasn't being used.
Thank you so much for the explanation!
But then, if the feature_extractor is the one responsible for audio <-> tensor, what does the actual "model" do?
And, if you don't mind, is there any pre-trained model that I can also try? I've been looking and haven't been able to find any.
Thanks for all the help!
model does tensor <-> tensor logic.
Usually it takes some kind of tensor representation of your data and outputs something like logits.
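Put end to end, it looks roughly like this (a sketch assuming the -with-lm checkpoint and a 16 kHz audio array already loaded elsewhere):

import torch
from transformers import AutoModelForCTC, AutoProcessor

name = "patrickvonplaten/wav2vec2-base-100h-with-lm"
processor = AutoProcessor.from_pretrained(name)
model = AutoModelForCTC.from_pretrained(name)

# feature_extractor: audio (a float array at 16 kHz, loaded elsewhere) -> tensor
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# model: tensor -> tensor (per-frame logits over the vocabulary)
with torch.no_grad():
    logits = model(**inputs).logits
# decoder (with the n-gram LM): logits -> text
text = processor.batch_decode(logits.numpy()).text[0]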
And, if you don't mind, is there any pre-trained model that I can also try? I've been looking and haven't been able to find any.
Can't say anything better than this: https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads
Hey, I know this was closed a while ago, but I'm still wondering about your comment on how the tokenizer was using the n-gram model:
tokenizer = text <-> input_ids (so the n-gram model is in here, for the reverse arrow).
feature_extractor = audio <-> tensor
processor = wrapper on both (makes for better shareable snippets when not using pipelines)
From the looks of it here https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline.decoder and here https://huggingface.co/blog/wav2vec2-with-ngram (in the 4th part), the actual language model is in the decoder (which I assume is different from the tokenizer). Is that right?
Thanks for all your help in resolving the issue!
You are entirely correct! The decoder is different from the tokenizer.
My bad.
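A quick way to see that they are separate objects (a small sketch with the -with-lm checkpoint; the decoder should come from pyctcdecode):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
print(type(processor.tokenizer).__name__)  # the CTC tokenizer: text <-> input_ids
print(type(processor.decoder).__name__)    # the beam-search decoder that applies the n-gram LM to the logits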
Good morning, and thank you very much for the blog, it is much appreciated and useful!
I wanted to ask a question, but there is no option to do so on the blog, and I hoped this would be good enough.
I was wondering how you could apply the chunking to a model where I am using a processor that uses a language model (as in the blog cited in that section).
2 methods occur to me: 1- you pass the processor you desire to the pipeline function and it takes it into account (didn't work for me, I don't know if I am doing something wrong).
2- look at the function "pipeline" and how it cuts the audio to reproduce the same procedure into my own code (where I am using the model and processor that I desire).
The first one looks the easiest but doesn't work; with the second one, it looks like I won't be the first to do it, so I was hoping there might be some documentation. I don't know if you could point me in the direction of such a resolution.
I don't know if I'm expressing myself correctly, I hope it makes sense.
Thank you for the article and the work!:)