kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Decode Result is Bad or My Model is Bad? #397

Closed ridhoalattas closed 2 years ago

ridhoalattas commented 2 years ago

Hi guys, I am using a KenLM language model to decode wav2vec output. In some cases the result is great, but in other cases the words are all merged into one string. I have already tried 3-gram, 4-gram, and 5-gram models and the result is the same.

Here is the result

apaan apa ngesdengklokanjrguakepikiran lagi rengasrengasringasapaapayagedelanjut

My expected result is

apaan apa ngesdeng klok anjr gua kepikiran lagi rengas rengas ringas apa apa ya gedelanjut

This is my code:

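import kenlm
# Assumed source of these classes: the third-party pyctcdecode package
from pyctcdecode.alphabet import Alphabet
from pyctcdecode.decoder import BeamSearchDecoderCTC
from pyctcdecode.language_model import LanguageModel

def get_decoder_ngram_model(tokenizer, ngram_lm_path):
    # Order tokens by id so that the list index equals the token id
    vocab_dict = tokenizer.get_vocab()
    sort_vocab = sorted((value, key) for (key, value) in vocab_dict.items())

    # Drop the last two tokens of the vocabulary
    vocab_list = [x[1] for x in sort_vocab][:-2]

    # Blank out pad/unk and map the word delimiter to a plain space
    vocab_list[tokenizer.pad_token_id] = ""
    vocab_list[tokenizer.unk_token_id] = ""
    vocab_list[tokenizer.word_delimiter_token_id] = " "

    # The pad token doubles as the CTC blank
    alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=tokenizer.pad_token_id)
    lm_model = kenlm.Model(ngram_lm_path)
    decoder = BeamSearchDecoderCTC(alphabet, language_model=LanguageModel(lm_model))

    return decoder

# ngram_lm_model is the decoder returned by the function above
ngram_lm_model = get_decoder_ngram_model(tokenizer, ngram_lm_path)
beam_search_output = ngram_lm_model.decode(logits=logits[0][0])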

Is there any solution for this case? @kpu @patrickvonplaten

Thank you, guys.

kpu commented 2 years ago

BeamSearchDecoderCTC is third-party software. It's up to you to debug things down to a minimal example: https://en.wikipedia.org/wiki/Minimal_reproducible_example . That exercise will also inform you which repository to consult. (Returning a probability you don't like is not a bug.) There is no need to @ every contributor.
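
One way to start on such a minimal example is to take the decoder out of the picture and ask the language model directly which of the two strings from the report it prefers. The sketch below assumes the kenlm Python module is installed. `rank_by_lm_score` and `load_and_compare` are hypothetical helper names, not part of KenLM, and the model path is whatever file was passed as `ngram_lm_path`. `kenlm.Model.score` returns a log10 probability, so a higher (less negative) value means the model likes the string more.

```python
def rank_by_lm_score(model, candidates):
    """Score each candidate with a kenlm-style model; return (score, text) pairs, best first."""
    scored = [(model.score(text, bos=True, eos=True), text) for text in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def load_and_compare(lm_path, candidates):
    """Load a KenLM model from lm_path (ARPA or binary) and print the ranked candidates."""
    import kenlm  # imported here so rank_by_lm_score can be tried without kenlm installed

    model = kenlm.Model(lm_path)
    for score, text in rank_by_lm_score(model, candidates):
        print(f"{score:10.3f}  {text}")
```

Calling `load_and_compare(ngram_lm_path, [merged, spaced])` with the merged output and the expected output compares the two under the same model. If the language model already prefers the spaced version, the merging probably comes from how the decoder or its alphabet is set up rather than from KenLM, which tells you which repository to ask.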

ridhoalattas commented 2 years ago

I see, the problem is in the third-party software. I'll try to find a way to run inference without BeamSearchDecoderCTC. Sorry for mentioning all the contributors; I have been stuck on this problem for a few days.
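
For reference, inference without BeamSearchDecoderCTC can be as simple as greedy (best-path) CTC decoding, which uses no language model at all. This is a minimal sketch, not the decoder's own algorithm. It assumes `logits` is a list of per-frame score lists (for a tensor, call `.tolist()` first), that `labels` is the id-ordered vocabulary list from the code above, and that the blank id is the tokenizer's pad token id. `greedy_ctc_decode` is a hypothetical helper name.

```python
def greedy_ctc_decode(logits, labels, blank_id):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    pieces = []
    previous = None
    for frame in logits:
        # Index of the highest-scoring label in this frame
        best = max(range(len(frame)), key=lambda i: frame[i])
        # Emit only when the label changes and is not the CTC blank
        if best != previous and best != blank_id:
            pieces.append(labels[best])
        previous = best
    return "".join(pieces)
```

If the greedy output already has spaces in the right places, the acoustic model is emitting the word delimiter correctly and the merging is introduced later in the beam search setup.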