Closed: uriii3 closed this issue 2 years ago
I think I didn't mention that this issue is about this blog post: https://huggingface.co/blog/asr-chunking, by @Narsil. Thank you to anyone who might help!
1- you pass the processor you desire to the pipeline function and it takes it into account (didn't work for me, I don't know if I am doing something wrong).
It should work, but do you have example code?
By language models, I assume you're referring to the n-grams that are used with kenlm and pyctcdecode, right?
The logic to determine whether to use this or not is brittle, and most importantly, if either is missing, it will be skipped entirely (with a warning).
We do that because both dependencies are optional.
If you had a code sample, we could try to reproduce and see what's going wrong.
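As a quick sanity check, something like this (a minimal sketch) shows whether the two optional dependencies are importable in your environment:

# Minimal sketch: check that the optional LM-decoding dependencies are importable.
# If either is missing, the pipeline skips LM decoding entirely (with a warning).
import importlib.util

for dep in ("kenlm", "pyctcdecode"):
    found = importlib.util.find_spec(dep) is not None
    print(dep, "available" if found else "missing")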
2- look at the function "pipeline" and how it cuts the audio to reproduce the same procedure into my own code (where I am using the model and processor that I desire).
Cutting the audio and keeping track of it is tricky business, as there are multiple possible units: time in seconds, time in number of samples (which depends on the sampling rate), and time in logits space, at the very least.
In addition, striding information needs to be kept, and various bounds checks have to manage the odd cases. All the code is in the actual pipeline code: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/automatic_speech_recognition.py
This might help too: https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/pipelines#pipeline-chunk-batching
(Though it doesn't go into the actual audio chunking/striding part; for that, the blog is the best doc we have.)
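To give a concrete idea of the units involved, here is a rough sketch (assuming a wav2vec2-style CTC model; the variable names are mine and the exact frame counts depend on the model's convolutional feature extractor):

# Rough sketch of the unit conversions involved in chunking/striding.
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-100h")
sampling_rate = 16000

# time in seconds
chunk_s, stride_left_s, stride_right_s = 30, 4, 2
# time in number of samples (depends on the sampling rate)
chunk_samples = chunk_s * sampling_rate
stride_left_samples = stride_left_s * sampling_rate
stride_right_samples = stride_right_s * sampling_rate
# time in logits space: how many CTC frames the model emits for that many samples
chunk_frames = int(model._get_feat_extract_output_lengths(chunk_samples))
stride_left_frames = int(model._get_feat_extract_output_lengths(stride_left_samples))
stride_right_frames = int(model._get_feat_extract_output_lengths(stride_right_samples))
print(chunk_frames, stride_left_frames, stride_right_frames)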
Hey, thank you for your quick response!
The code that doesn't work with the processor (with the language model) is this one:
# Import necessary libraries
import librosa
from transformers import pipeline

audio, rate = librosa.load("../108.wav", sr=16000)

pipe = pipeline(model="facebook/wav2vec2-base-100h", processor="patrickvonplaten/wav2vec2-base-100h-with-lm")

# stride_length_s is a tuple of the left and right stride lengths.
# With only 1 number, both sides get the same stride; by default
# the stride_length_s on one side is 1/6th of the chunk_length_s.
output = pipe(audio, chunk_length_s=30, stride_length_s=(4, 2))
print(output)
I was trying to mix the code from your blog (https://huggingface.co/blog/asr-chunking) with the code that appears in the blog you cited (https://huggingface.co/blog/wav2vec2-with-ngram), all while trying to process some files from a "usa" folder and saving the .txt files:
from transformers import AutoProcessor, AutoModelForCTC
import time
import torch
import librosa
import os

print(torch.__version__)
start = time.time()

processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

country = "usa"
recordings_path = "../archive/recordings/" + country + "/"
count = 0

for file in os.listdir(recordings_path):
    if file.endswith(".wav"):
        text_path = './archive_' + country + '_lm_v2/' + file[:-4] + '.txt'
        if not os.path.exists(text_path):
            print(file)
            count = count + 1

            # Loading the audio file
            audio, rate = librosa.load(recordings_path + file, sr=16000)
            print(rate)

            inputs = processor(audio, sampling_rate=rate, return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits

            # Looks like no "argmax" is used; it goes directly to the decoder (which makes sense)!
            transcription = processor.batch_decode(logits.numpy()).text
            final_transcript = transcription[0].lower()

            # Saving the transcription
            print("saving txt")
            with open(text_path, 'w') as f:
                f.write(final_transcript)
This code works fine, but I now want to extend it to a folder with larger audio files, and I would like to be able to mix the two approaches "easily" if that could work. I understand that the whole part from inputs to final_transcript could be done with the "chunking method" you explained using the pipeline.
What makes sense to me is that the decoder (with the language model, as I understand it) doesn't need the usual "argmax", as it does that by itself later on.
I don't know if I explained myself clearly enough; I'm new to reporting this kind of issue, sorry if that makes the work harder for you :(.
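For reference, the contrast I mean looks roughly like this (a sketch that reuses the processor and logits from the snippet above):

import torch

# plain CTC decoding: greedy argmax over the vocabulary, then the tokenizer maps ids to text
predicted_ids = torch.argmax(logits, dim=-1)
greedy_text = processor.tokenizer.batch_decode(predicted_ids)[0]

# LM decoding: the decoder needs the full logit distribution, so no argmax beforehand
lm_text = processor.batch_decode(logits.numpy()).text[0]
print(greedy_text.lower())
print(lm_text.lower())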
And actually, this morning, what seems to be working is this code, which I found here (https://github.com/huggingface/transformers/issues/14162) by @anton-l and reworked a little bit:
import torch
import librosa
from transformers import AutoModelForCTC, AutoProcessor

sample_rate = 16000
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

audio, _ = librosa.load(path_to_recording, sr=sample_rate)

chunk_duration = 10   # sec
padding_duration = 0  # sec

chunk_len = chunk_duration * sample_rate
input_padding_len = int(padding_duration * sample_rate)
output_padding_len = model._get_feat_extract_output_lengths(input_padding_len)

all_preds = []
for start in range(input_padding_len, len(audio) - input_padding_len, chunk_len):
    chunk = audio[start - input_padding_len:start + chunk_len + input_padding_len]
    input_values = processor(chunk, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**input_values).logits
    logits = logits[output_padding_len:len(logits) - output_padding_len]
    # predicted_ids = torch.argmax(logits, dim=-1)  # I need to take this out because the decoder needs all 32 possibilities, as I understand it
    all_preds.append(logits)

a0 = torch.cat(all_preds, dim=1)
transcription = processor.batch_decode(a0.numpy()).text
print(transcription[0].lower())
And the code, with the changes made, works really well for chunking, but when the padding is different from 0 it doesn't work. Also, it is strange how the padding just adds more columns and doesn't actually overlap the chunks. Anyway, that seems to be working now (with padding_duration=0) to chunk the files and be able to process them.
Hi @uriii3,
Can you try this?
from transformers import pipeline, AutoFeatureExtractor, AutoTokenizer

pipe = pipeline(
    model="facebook/wav2vec2-base-100h",
    feature_extractor=AutoFeatureExtractor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm"),
    tokenizer=AutoTokenizer.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm"),
)

# Or more simply (not sure why you use a different model):
pipe = pipeline(model="patrickvonplaten/wav2vec2-base-100h-with-lm")

output = pipe("../108.wav", chunk_length_s=30, stride_length_s=(4, 2))
print(output)
processor is NOT used by pipelines, only the raw tokenizer and feature_extractor (which processor is only a thin wrapper on).
Does that work?
That seems to work perfectly!
In my head, the language model was supposed to be in the processor (the model was only for the audio-to-token part). Thank you very much for the help!
tokenizer = text <-> input_ids (so the n-gram model is in here, for the reverse arrow).
feature_extractor = audio <-> tensor
processor = wrapper on both (makes for better shareable snippets when not using pipelines)
The pipeline also handles going from a raw audio file to actual audio (through ffmpeg, which saves you the librosa calls, is generally faster at decompressing audio, and can do it in a streaming fashion, when plugging into a mic for instance).
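In code, the split looks roughly like this (a sketch using the -with-lm checkpoint from above; the print just shows the wrapped objects):

from transformers import AutoFeatureExtractor, AutoProcessor, AutoTokenizer

name = "patrickvonplaten/wav2vec2-base-100h-with-lm"
tokenizer = AutoTokenizer.from_pretrained(name)                 # text <-> input_ids
feature_extractor = AutoFeatureExtractor.from_pretrained(name)  # audio <-> tensor
processor = AutoProcessor.from_pretrained(name)                 # wrapper on both
print(type(processor.tokenizer).__name__, type(processor.feature_extractor).__name__)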
That being said, you should have received a better error message saying that processor wasn't being used.
Thank you so much for the explanation!
But then, if the feature_extractor is the one responsible for audio <-> tensor, what does the actual "model" do?
And, if you don't mind, is there any pre-trained model that I can also try? I've been looking and haven't been able to find any.
Thanks for all the help!
model does tensor <-> tensor logic.
Usually it takes some kind of tensor representation of your data and outputs something like logits.
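Put end to end, it looks roughly like this (a sketch assuming the -with-lm checkpoint and a 16 kHz audio array already loaded elsewhere):

import torch
from transformers import AutoModelForCTC, AutoProcessor

name = "patrickvonplaten/wav2vec2-base-100h-with-lm"
processor = AutoProcessor.from_pretrained(name)
model = AutoModelForCTC.from_pretrained(name)

# feature_extractor: audio (a float array at 16 kHz, loaded elsewhere) -> tensor
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# model: tensor -> tensor (per-frame logits over the vocabulary)
with torch.no_grad():
    logits = model(**inputs).logits
# decoder (with the n-gram LM): logits -> text
text = processor.batch_decode(logits.numpy()).text[0]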
And, if you don't mind, is there any pre-trained model that I can also try? I've been looking and haven't been able to find any.
Can't say anything better than this: https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads
Hey, I know this was closed a while ago, but I'm still wondering about your comment on how the tokenizer was using the n-gram model:
tokenizer = text <-> input_ids (so the n-gram model is in here, for the reverse arrow).
feature_extractor = audio <-> tensor
processor = wrapper on both (makes for better shareable snippets when not using pipelines)
From the looks of it here https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline.decoder and here https://huggingface.co/blog/wav2vec2-with-ngram (in the 4th part), the actual language model is in the decoder (which I assume is different from the tokenizer). Is that right?
Thanks for all your help in resolving the issue!
You are entirely correct! The decoder is different from the tokenizer.
My bad.
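A quick way to see that they are separate objects (a small sketch with the -with-lm checkpoint; the decoder should come from pyctcdecode):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
print(type(processor.tokenizer).__name__)  # the CTC tokenizer: text <-> input_ids
print(type(processor.decoder).__name__)    # the beam-search decoder that applies the n-gram LM to the logits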
Good morning, and thank you very much for the blog, it is much appreciated and useful!
I wanted to ask a question, but there is no option to do so on the blog, and I hoped this would be good enough.
I was wondering how you could apply the chunking to a model where I am using a processor that uses a language model (as in the blog cited in that section).
2 methods occur to me: 1- you pass the processor you desire to the pipeline function and it takes it into account (didn't work for me, I don't know if I am doing something wrong).
2- look at the function "pipeline" and how it cuts the audio to reproduce the same procedure into my own code (where I am using the model and processor that I desire).
The first one looks the easiest but doesn't work; with the second one, it looks like I won't be the first to do it, so I was hoping there might be some documentation. I don't know if you could point me in the direction of such a resolution.
I don't know if I'm expressing myself correctly, I hope it makes sense.
Thank you for the article and the work!:)