kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Unexpected spacing with Huggingface wav2vec library #25

Closed rhamnett closed 3 years ago

rhamnett commented 3 years ago

Hi, when using Hugging Face and FB wav2vec, I'm getting missing spaces with various language models I have created, including a simple LM with just a few phrases. Can you help me figure out what is wrong?

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor
# from datasets import load_dataset
import soundfile as sf
import torch

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")

# !wget https://dldata-public.s3.us-east-2.amazonaws.com/1919-142785-0028.wav

filename = "1919-142785-0028.wav"

audio, sampling_rate = sf.read(filename)

input_values = processor(audio, return_tensors="pt", padding="longest", sampling_rate=sampling_rate).input_values  # Batch size 1

# retrieve logits
logits2 = model(input_values).logits.cpu().detach().numpy()[0]

from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel
import kenlm

kenlm_model = kenlm.Model('/root/DeepSpeech/data/lm/lm.binary')
lm = LanguageModel(kenlm_model, alpha=0.169, beta=0.055)
# make alphabet
vocab_list = list(processor.tokenizer.get_vocab().keys())
# convert ctc blank character representation
vocab_list[0] = ""
# replace special characters
vocab_list[1] = "⁇"
vocab_list[2] = "⁇"
vocab_list[3] = "⁇"
# convert space character representation
vocab_list[4] = " "
# specify the ctc blank char index; here it is entry 0, although conventionally it is the last entry of the logit matrix
alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=0)

hotwords = ["ringing", "up"]

# build the decoder and decode the logits
decoder = BeamSearchDecoderCTC(alphabet, lm)
decoder.decode(logits2, hotwords=hotwords)
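For reference, the index-by-index vocabulary edits above can be wrapped in a small helper. This is a sketch that assumes the standard facebook/wav2vec2 vocabulary layout (index 0 is `<pad>`/CTC blank, indices 1-3 are the `<s>`/`</s>`/`<unk>` special tokens, index 4 is the `|` word delimiter); verify against `processor.tokenizer.get_vocab()` for other checkpoints:

```python
def clean_wav2vec2_vocab(vocab_list):
    """Map a wav2vec2 tokenizer vocabulary to pyctcdecode-friendly labels."""
    cleaned = list(vocab_list)
    cleaned[0] = ""           # <pad> acts as the CTC blank
    for i in (1, 2, 3):
        cleaned[i] = "⁇"      # placeholder for <s>, </s>, <unk>
    cleaned[4] = " "          # "|" word delimiter becomes a space
    return cleaned

# example with a truncated vocabulary in the assumed layout
vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", "E", "T"]
print(clean_wav2vec2_vocab(vocab))
```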

Results:

No LM: ["OH HELLO IT'S PAKER ID HER I'M BRINGING UP THIS PAPER OL ODAY HELLO MY SISTER'S BEEN BUSY"]

With LM: "OH HELLO IT'S PAKERIDHER I'MSBRINGINGUPTHIS PAPERAALTODAY HELLOMY SISTER'SBEEN BUSY"

No LM:

"WANTED CHIEF JUSTICE OF THE MASSACHUSETTS SUPREME COURT IN APRIL THE S J C 'S CURRENT LEADER EDWARD HENNISE REACHES THE MANDATORY RETIREMENT AGE OF SEVENTY AND THE SUCCESSOR IS EXP"

With LM:

"WANTED CHIEF JUSTICE OF THE MASSACHUSETTS SUPREME COURT IN APRIL THE S C'S CURRENT LEADER "EDWARDHENNISE" REACHES THE MANDATORY RETIREMENT AGE OF SEVENTY AND THE SUCCESSOR IS EXP"

No LM: 'BOIL THEM BEFORE THEYARE PUT INTO THE SOUP OR OTHER DISH THEY MAYBE INTENDED FOR'

With LM: 'BOIL THEM BEFORE THEY ARE PUT INTO THE SOUP OR OTHER DISH THEY MAY BE INTENDED FOR'

gkucsko commented 3 years ago

Hey, two quick thoughts on things to try:

1. Case information: the vocabulary looks all-uppercase. Is the language model also upper case, or is it lower case?
2. Missing unigrams: can you explicitly pass all known unigrams into the language model? That way the decoder can build a trie under the hood to efficiently decode partial words.
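To illustrate the second point, one way to obtain the unigram set is to collect every word from the text corpus the KenLM model was trained on. This is a plain-Python sketch (not pyctcdecode API); the uppercasing is an assumption that matches the all-uppercase wav2vec2 vocabulary, which also addresses the first point:

```python
def collect_unigrams(corpus_lines):
    """Collect the sorted set of unique words from the LM training text."""
    unigrams = set()
    for line in corpus_lines:
        # uppercase so the unigrams match the uppercase acoustic vocabulary
        unigrams.update(word.upper() for word in line.split())
    return sorted(unigrams)

# hypothetical training lines, for illustration only
corpus = ["oh hello it's parker", "ringing up this paper"]
print(collect_unigrams(corpus))
```

The resulting list can then be passed to `LanguageModel` via its `unigrams` argument, e.g. `LanguageModel(kenlm_model, unigrams=unigrams, alpha=0.169, beta=0.055)`.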

rhamnett commented 3 years ago

Ah, great suggestions, thanks! Will report back.

gkucsko commented 3 years ago

closing for now, unless issues are persisting

rhamnett commented 3 years ago

Thanks, your suggestion about case helped, much appreciated. I haven't tried the second option yet.