kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Alphabet conversion from Hugging Face does not work #3

Closed flariut closed 3 years ago

flariut commented 3 years ago

Following the tutorial:

from pyctcdecode import Alphabet, BeamSearchDecoderCTC

vocab_dict = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, '|': 4, 'E': 5, 'T': 6, 'A': 7, 'O': 8, 'N': 9, 'I': 10, 'H': 11, 'S': 12, 'R': 13, 'D': 14, 'L': 15, 'U': 16, 'M': 17, 'W': 18, 'C': 19, 'F': 20, 'G': 21, 'Y': 22, 'P': 23, 'B': 24, 'V': 25, 'K': 26, "'": 27, 'X': 28, 'J': 29, 'Q': 30, 'Z': 31}

# make alphabet
vocab_list = list(vocab_dict.keys())
# convert ctc blank character representation
vocab_list[0] = ""
# replace special characters
vocab_list[1] = "⁇"
vocab_list[2] = "⁇"
vocab_list[3] = "⁇"
# convert space character representation
vocab_list[4] = " "
# specify ctc blank char index, since conventionally it is the last entry of the logit matrix
alphabet = Alphabet.build_bpe_alphabet(vocab_list, ctc_token_idx=0)

Results in:

ValueError: Unknown BPE format for vocabulary. Supported formats are 1) ▁ for indicating a space and 2) ## for continuation of a word.

I'm trying to use a Hugging Face model with KenLM decoding, but I can't get past this point. Thanks in advance.

poneill commented 3 years ago

Can reproduce

gkucsko commented 3 years ago

Thanks for reaching out. Looking at your vocabulary, it doesn't look like it is BPE style, just regular characters. You should be able to instead use Alphabet.build_alphabet. Does that work for you?
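
For readers following along, a rough way to tell the two vocabulary styles apart is to look for the markers named in the error message above (▁ for a space, ## for a word continuation). The helper below is a hypothetical illustration written for this thread, not pyctcdecode's own detection logic, and the sample vocabularies are made up.

```python
# Hypothetical helper; not part of pyctcdecode.
SPECIAL_TOKENS = {"<pad>", "<s>", "</s>", "<unk>"}


def looks_like_bpe(tokens):
    """Return True if any non-special token carries a BPE marker."""
    for token in tokens:
        if token in SPECIAL_TOKENS:
            continue
        # "\u2581" is the "▁" character that marks a leading space.
        if token.startswith("\u2581") or token.startswith("##"):
            return True
    return False


# Character-level vocabulary in the style of the one posted above.
char_vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", "E", "T", "A", "'"]
# Made-up subword vocabulary for contrast.
bpe_vocab = ["<unk>", "\u2581the", "\u2581a", "ing", "s"]

print(looks_like_bpe(char_vocab))
print(looks_like_bpe(bpe_vocab))
```

A vocabulary for which this returns False is probably a plain character vocabulary, which is the case suggested here.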

flariut commented 3 years ago

The vocabulary presented in the example is exactly the one that shows up in the tutorial. In my code, I tried following similar steps, since my vocabulary (a Spanish one) has the same structure. I did make that example work as a BPE alphabet by changing the line:

vocab_list[0] = ""
# for this:
vocab_list[0] = '▁⁇▁'

With that change I could get to the decoding stage (using build_ctcdecoder and a kenlm model, setting ctc_token_idx=0), but sadly my decoded text is almost the same as what the greedy decoder from the Transformers library produces without any language model. As you suggested, I tried passing it as a regular vocab, but I get the error:

ValueError: For non-bpe alphabet only length 1 entries and blank token are allowed.

So I changed the conversion to:

vocab_list[0] = ""
vocab_list[1] = "|"
vocab_list[2] = "|"
vocab_list[3] = "|"
vocab_list[4] = " "

Then I could get it to decode, only to get the same results as before, but this time even without spaces (even if [0] is " ").

Here's some more info on my specific code. It's literally the example code from Hugging Face with a 30-second custom audio clip, plus your "quick start" example for decoding, but if you wish I can share my full code.

# My vocab
{'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, '|': 4, 'E': 5, 'A': 6, 'O': 7, 'S': 8, 'N': 9, 'R': 10, 'L': 11, 'I': 12, 'D': 13, 'U': 14, 'T': 15, 'C': 16, 'M': 17, 'P': 18, 'B': 19, 'Q': 20, 'Y': 21, 'H': 22, 'G': 23, 'V': 24, 'Í': 25, 'Á': 26, 'F': 27, 'Ó': 28, 'J': 29, 'É': 30, 'Z': 31, 'Ñ': 32, 'X': 33, 'Ú': 34, "'": 35, 'K': 36, 'W': 37, 'Ü': 38, '-': 39}

# My Acoustic model
'https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish'

# My language model (mls_lm_spanish.tar.gz)
'http://www.openslr.org/94/'

# Output from transformers
predicted_sentences = processor.batch_decode(predicted_ids)
'NO SO FUE PANAMÁ YO LUJER MUARIÓ EL SODECI LE FUECIL YO NUNCA E DISTO LUMIR TODOPORTUÑA BELO NESE CHOUTOMA Y ES VERDAD QUE ESTATOONRATO QUE B HABLA L ALTEADO NO LE PERMANADIYO PERO AELLATLLOMA QUETOLES TODO ECERTÓ DESAGIRAS FÁCIL AENMAJONTI MIYNAUNO QUIL IEN DIRECTO SOLAS ORAS ARECUARENTYSIETE BAJO CON EL ÚLTIMO QUE DICE ASÍ ESTAS TRISTEZAS O PENAS DE LUI MIVEES VERDAD QUE LA MUJER QUE DI EN ARGENTINA ES REALMENTE LA MADRE LUIS MIGUEL Y LO TIENE MUY'

# Output from pyctcdecode build_ctcdecoder, with "bpe" vocab created as we discussed
'NO ESO FUE PANAMÁ YO LUJER MUARIÓ EL SODECHI LE FUECIL YO NUNCA E DISTO LUMIR TODOPORTUÑA BELO NESE CHOUTOMA Y ES VERDAD QUE ESTATOONRATO QUE B HABLA L ALTEADO NO LE PERMANADIYO PERO AELLALOMA QUE TOLES TODO ECERTÓ DESAGIRAS FÁCIL AENMAJONTI MIYNAUNO QUIL IEN DIRECTO SOBLAS ORAS A RECUARENTYSIETE BAJO CON EL ÚLTIMO QUE DICE ASÍ ESTAS TRISTEZAS O PENAS DE LUI MIVELES VERDAD QUE LA MUJER QUE DI EN ARGENTINA ES REALMENTE LA MADRE LUIS MIGUEL Y LO TIENE MUY'

# Output from pyctcdecode build_ctcdecoder, with "regular" vocab created as we discussed
'NOESOFUENPANAMÁYOLUJERMUARIÓELSODECHILEFUECILSYONUNCAHEDISTOLUMIRTODOPORTUÑABELONESECHOUTOMAYESVERDADQUESTATOONRATOQUEBHABLALALTEADONOLEPERMANADIYOPEROAELLATLLOMAQUETOLESTODOECERTÓDESAGIRASFÁCILAENMAJONTIMIYNAUNOQUILIENDIRECTOSASOLASORASARECUARENTYSIETEBAJOCONEL ÚLTIMOQUEDICEASÍESTASTRISTEZASOPENASDELUIMIVEESVERDADQUELAMUJERQUEDIENA ARGENTINAES REALMENTEEALAMADREDALUISMIGUEL YLOTIENEMUY'

# I tried changing alpha, beta values without any luck

I understand this is under development, but I stumbled upon your code a few days ago and thought it was very cool and has the features I need, so I hope I can at least contribute by sharing my experience trying to use it. Many thanks in advance!

gkucsko commented 3 years ago

Thanks for catching this, I'll work on updating the BPE alphabet parsing logic. It's a little tricky to make it bulletproof because different tokenizers follow different conventions, so this type of feedback is very helpful. As for your particular example, I'll have a look at the models and see what I can do. We've mostly used it for English so far, so I'll look in more detail to see what's going on.

gkucsko commented 3 years ago

In the meantime, let's try to get the "regular" vocabulary to work since this should be the right thing to use here. Also, if you want good results then you should pass a list of unigrams to the language model (this is important to make good beam proposals before scoring them with kenlm). Can you try to follow the below snippet and see what you get?

import kenlm
from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel

# make alphabet
vocab_list = list(asr_processor.tokenizer.get_vocab().keys())
# convert ctc blank character representation
vocab_list[0] = ""
# replace special characters
vocab_list[1] = "⁇"
vocab_list[2] = "⁇"
vocab_list[3] = "⁇"
# convert space character representation
vocab_list[4] = " "
# specify ctc blank char index, since conventionally it is the last entry of the logit matrix
alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=0)

kenlm_model = kenlm.Model("my_model.bin")
with open("my_word_list.txt") as f:
    unigrams = set(f.read().split())

# build the language model
language_model = LanguageModel(kenlm_model, unigrams)

# build the decoder and decode the logits
decoder = BeamSearchDecoderCTC(alphabet, language_model)
decoder.decode(logits)
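
One thing to watch in the snippet above: taking list(...get_vocab().keys()) follows the dictionary's insertion order, which may not match the token ids. A minimal sketch of ordering the labels by id first is shown below, using a small made-up vocabulary. labels_from_vocab is a hypothetical helper, not a pyctcdecode or transformers function, and it assumes a character-level vocabulary with "<pad>" as the CTC blank and "|" as the word delimiter.

```python
def labels_from_vocab(vocab_dict, blank="<pad>", word_delim="|",
                      specials=("<s>", "</s>", "<unk>")):
    """Build a label list ordered by token id for a character-level vocab."""
    # Sort tokens by id so that label i lines up with logit column i.
    ordered = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda kv: kv[1])]
    labels = []
    for tok in ordered:
        if tok == blank:
            labels.append("")    # CTC blank
        elif tok == word_delim:
            labels.append(" ")   # word delimiter becomes a space
        elif tok in specials:
            labels.append("⁇")   # placeholder for special tokens
        else:
            labels.append(tok)
    return labels


# Deliberately out of order to show that sorting matters.
vocab = {"A": 5, "<pad>": 0, "|": 4, "<unk>": 3, "</s>": 2, "<s>": 1}
print(labels_from_vocab(vocab))
```

The resulting list can then be passed wherever vocab_list is used in the snippet.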

flariut commented 3 years ago

Hi Georg, first of all thanks for your support. I tried exactly what you told me, using the "vocab_counts.txt" file supplied with my language model (I'm guessing that by unigram list you mean a list of the unique words used in the model, or am I misunderstanding?), but the results are still far from legible:

NOESOFUENPANAMÁ YOLUJEMÓELSODEC FUECILSYONUNCA ISTOLUMIRTOOPORTUÑA BELONESECHOTOM YESVERDADQUE ESTATORTOQUEHABLA LALTEADONOLE PERMANADIOPERO QUETOLESTODOCERTÓ ESAGIRASFÁCIL DIRECTOASOLAORAS RECUARENTYSIETE BAJOCONEL ÚLTIMOQUEDICE ASÍESTASTRISTEZAS OPENASDELUIMIVEE VERDADQUELAMUJER QUEDIEN ARGENTINAES REALMENTE LAMADRE LUISMIGUEL YLOTIENEMUY

I know we are off the original issue now, but I can't find what I'm missing to improve on the acoustic model's prediction. Here's the full code I'm using:

import torch
import torch.nn.functional as F
import librosa
import os
import kenlm
import faulthandler
from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Wav2Vec2CTCTokenizer

faulthandler.enable()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

acoustic_model = "grosman-wav2vec2-large-xlsr-53-spanish"
language_model = "spanish-lm-compiled.bin"

processor = Wav2Vec2Processor.from_pretrained(acoustic_model)
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(acoustic_model)
model = Wav2Vec2ForCTC.from_pretrained(acoustic_model)

def path_to_rosa_audio(path):
    files = os.listdir(path)
    audio_lst = []
    for f in files[70:71]:
        audio, sampling_rate = librosa.load(f"{path}/{f}", sr=16_000)
        audio_lst.append(audio)
    return audio_lst

path = "prueba_seg_2"
audio_list = path_to_rosa_audio(path)

print("Por setear input")
inputs = processor(audio_list,
                   sampling_rate=16_000,
                   return_tensors="pt",
                   padding=True)

print("Por setear modelo acustico")
with torch.no_grad():
    logits = model(inputs.input_values,
                   attention_mask=inputs.attention_mask).logits

print(logits)
print(logits[0][0].sum())
print(logits.shape)

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
print(predicted_sentences)

vocab_dict = tokenizer.get_vocab()
print(vocab_dict)

sort_vocab = sorted((value, key) for (key,value) in vocab_dict.items())
vocab = [x[1] for x in sort_vocab]
print(vocab)

print("Por llamar a decode")

kenlm_model = kenlm.Model(language_model)

vocab_list = vocab
# convert ctc blank character representation
vocab_list[0] = ""
# replace special characters
vocab_list[1] = "⁇"
vocab_list[2] = "⁇"
vocab_list[3] = "⁇"
# convert space character representation
vocab_list[4] = " "
# specify ctc blank char index, since conventionally it is the last entry of the logit matrix
alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=0)

with open("mls_lm_spanish/vocab_counts.txt") as f:
    unigrams = set([l.split('\t')[0] for l in f.read().split('\n')])

# build the language model
lm = LanguageModel(kenlm_model, unigrams)

# build the decoder and decode the logits
decoder = BeamSearchDecoderCTC(alphabet, lm)
text = decoder.decode(logits.numpy()[0])
print(text)

Thanks!

gkucsko commented 3 years ago

Hmm, that's odd. Let me try and help figure this out, it's nice to test this use case. Could you maybe share a Google Drive link with a list of unigrams and a single example audio file? You can try sharing it with pyctcdecode-maintainer [at] kensho.com or, if that doesn't work, you can use georg [at] kensho...

poneill commented 3 years ago

@flariut for reference, what range of values for alpha and beta did you try? At first glance this looks a lot like a tuning issue: the n-gram character statistics look correct, but the decoder appears to be falling into the trap of creating long OOVs, suggesting that it might be penalizing the word count of the label too strongly. What do you get by setting beta=0?
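
To make the tuning question concrete: in the usual shallow-fusion formulation, a beam's score combines the acoustic score with a weighted language-model score and a per-word term. The sketch below uses that common textbook form with invented numbers. It is an assumption for illustration and not a copy of pyctcdecode's internals.

```python
def combined_score(acoustic, lm, n_words, alpha, beta):
    """Common shallow-fusion form: acoustic + alpha * lm + beta * word count."""
    return acoustic + alpha * lm + beta * n_words


# Two made-up hypotheses for the same audio (log-domain scores).
# "split" contains real words; "merged" is one long out-of-vocabulary string.
split = {"acoustic": -10.0, "lm": -6.0, "n_words": 3}
merged = {"acoustic": -9.5, "lm": -8.0, "n_words": 1}

for beta in (-2.0, 0.0, 1.0):
    s = combined_score(alpha=0.5, beta=beta, **split)
    m = combined_score(alpha=0.5, beta=beta, **merged)
    winner = "split" if s > m else "merged"
    print(f"beta={beta:+.1f}  split={s:.2f}  merged={m:.2f}  -> {winner}")
```

With these toy numbers, a strongly negative per-word term tips the balance toward the single long string, which is the failure pattern described here.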

flariut commented 3 years ago

Patrick, I've tried the values shown in the quick start (alpha=0.5, beta=1); the code above uses the default values, since none are assigned. As you suggest, I'm now trying a range of values from 0 up to extremes like 10. No setting seems to produce correct Spanish words or correct spacing.

For another reference, here's the same audio transcribed by flashlight's wav2letter with the same language model and lm_weight (I assume that's alpha) = 1.5:

eso panamá y si le fué quien yo no he dicho el vito eso tomaría verdad que te toda la de que habla no le perdona pero ahí está espera no mal dos dos estoica tea tenia alguno que las une vamos por el último que dice así esta tristeza pena de no la verdad y la mujer que die argentina es realmente la madre no miguel y no tiene

As you can see, the phrase doesn't make sense either, but at least the words are correct Spanish. Sorry for being so pragmatic, but I'm working with unlabeled data, and scoring isn't really an issue for me. I just need the model to output correct Spanish words with a certain degree of legibility.

Georg, let me run some more tests to see if I can sort this out, and I'll email you. Thank you very much for your help.

gkucsko commented 3 years ago

btw, make sure that the alphabet, acoustic model vocab, unigrams, and LM all use the same casing. I noticed that some of your posts are all upper case but the most recent example is lowercase. Maybe double-check that they are all the same when doing the decoding.
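
A quick way to check for this kind of mismatch before decoding is sketched below. casing_of is a hypothetical helper and the sample data is made up; it only compares the alphabet labels with the unigram list, so the language model file itself would still need to be checked separately.

```python
def casing_of(words):
    """Classify a collection of strings as 'upper', 'lower', or 'mixed'."""
    # Ignore entries without letters (blank, space, punctuation, placeholders).
    cased = [w for w in words if any(ch.isalpha() for ch in w)]
    if cased and all(w == w.upper() for w in cased):
        return "upper"
    if cased and all(w == w.lower() for w in cased):
        return "lower"
    return "mixed"


labels = ["", "⁇", " ", "E", "A", "Ñ", "'"]   # acoustic model alphabet
unigrams = {"panamá", "verdad", "argentina"}   # word list for the LM

print(casing_of(labels), casing_of(unigrams))
if casing_of(labels) != casing_of(unigrams):
    # One possible fix: normalise the word list to the alphabet's casing.
    unigrams = {w.upper() for w in unigrams}
print(sorted(unigrams))
```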

flariut commented 3 years ago

Georg, I think that was the culprit. After some tests I can now confirm the language model is being applied and that the alpha and beta settings change the results a lot, so it's up to me to test and tune for my particular use case. Thanks a lot for your time; I hope my little inconvenience at least serves to refine the tutorials :)

gkucsko commented 3 years ago

no worries, glad it's working for you now. I'll make sure to add some warnings to detect if there is a mismatch!

gkucsko commented 3 years ago

adding extra protection in #4