kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How does kenlm actually work in deepspeech? #303

Closed: murtuzamdahod closed this issue 4 years ago

murtuzamdahod commented 4 years ago

I have read many theoretical blog posts that explain how a language model is used, but I have seen very few implementations beyond plain text generation.

Are there any resources or snippets where I can look at actual code to understand how kenlm works with DeepSpeech?

My end goal is to build a language model like kenlm, but not to use it with DeepSpeech. I want to use it to correct my own vocabulary.

kpu commented 4 years ago

DeepSpeech appears to use it as a feature in a beam search decoder:

https://github.com/mozilla/DeepSpeech/blob/master/native_client/ctcdecode/ctc_beam_search_decoder.cpp

And I think there's some documentation here: https://github.com/mozilla/DeepSpeech/blob/becc3d9745b6b3b21bb2843922b4f0c0252ed7df/doc/Decoder.rst .

But isn't this really more of a question for deepspeech?

murtuzamdahod commented 4 years ago

As I said, my end goal is to build a language model like kenlm, but not to use it with DeepSpeech. I want to use it to correct my own vocabulary.

I don't see any documentation on how to implement this.

kpu commented 4 years ago

When you say "build a language model like kenlm" does that mean:

  1. You want to write your own software to implement a language model, as an alternative to kenlm, in which case go read https://dash.harvard.edu/bitstream/handle/1/25104739/tr-10-98.pdf?sequence=1 or any of the more neural work these days

  2. You want to create a file with probabilities using kenlm, in which case go to https://neural.mt/code/kenlm/estimation/

murtuzamdahod commented 4 years ago

I want to use kenlm on my dataset. I have 20 lakh (2 million) food item names, and I want to train a kenlm model on them so that I get the correct names in the output.

For example:

IN : "Cheeessseee Pijja"

OUT: "Cheese Pizza"

I believe it works in a similar manner even in DeepSpeech. My dataset contains the correct vocabulary, as shown in the output above.

Will https://neural.mt/code/kenlm/estimation/ work for this case?

kpu commented 4 years ago

You are of course welcome to build a language model on your data using kenlm. I provide the probabilities of strings. It's up to you to find or write a tool to do the task you want, possibly using these probabilities. And it might use beam search. Have fun.
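To make the division of labour concrete, here is a minimal sketch of "a tool that uses these probabilities": it proposes candidate spellings for each word and lets a scoring function pick the most probable phrase. Everything in it is an assumption for illustration. `make_toy_scorer`, `normalize`, and `correct` are hypothetical helpers, the add-one bigram scorer is only a stand-in so the example runs without a trained model, and the exhaustive search over combinations would need to become a beam search for longer inputs.

```python
import difflib
import itertools
import math
import re
from collections import Counter


def make_toy_scorer(corpus):
    """Return score(sentence) -> log10 probability under an add-one bigram model.

    Stand-in for a real language model, used only so this sketch is runnable.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for line in corpus:
        tokens = ["<s>"] + line.split() + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    # Predictable symbols: every word seen in the corpus plus </s>.
    vocab_size = len(set(context_counts) - {"<s>"}) + 1

    def score(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = bigram_counts[(prev, word)] + 1
            denominator = context_counts[prev] + vocab_size
            total += math.log10(numerator / denominator)
        return total

    return score


def normalize(word):
    """Lowercase and collapse runs of three or more identical characters."""
    return re.sub(r"(.)\1{2,}", r"\1", word.lower())


def correct(text, vocabulary, score, per_word=3, cutoff=0.5):
    """Pick the candidate phrase that the scoring function rates highest."""
    options = []
    for word in text.split():
        cleaned = normalize(word)
        matches = difflib.get_close_matches(
            cleaned, vocabulary, n=per_word, cutoff=cutoff
        )
        options.append(matches or [cleaned])  # keep the word if nothing is close
    candidates = [" ".join(combo) for combo in itertools.product(*options)]
    return max(candidates, key=score)


corpus = ["cheese pizza", "cheese pizza", "paneer pizza", "pita bread", "cheesy fries"]
vocabulary = sorted({word for line in corpus for word in line.split()})
score = make_toy_scorer(corpus)
# Assumption: with the kenlm Python module and a trained model file you would
# instead pass something like kenlm.Model("food.arpa").score (path hypothetical).
print(correct("Cheeessseee Pijja", vocabulary, score))
```

On this toy data, string similarity alone rates "pita" slightly closer to "pijja" than "pizza"; it is the phrase-level score that favours "cheese pizza". The language model does not find similar words by itself, it only ranks the candidates something else proposes.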

murtuzamdahod commented 4 years ago

What does model.score() return? How do I find similar words using these scores?

murtuzamdahod commented 3 years ago

Hello @kpu, thanks for the previous help. I now understand the concept of using a language model with a CTC beam search decoder. Are you aware of any good implementations where I can use a kenlm language model with CTC beam search in Python? I don't work much with C or Java.
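For readers with the same question, the idea can be sketched in plain Python. This is a simplified CTC prefix beam search, not DeepSpeech's decoder: the function name `ctc_beam_search`, its parameters, and the `lm_score` callback are all hypothetical, and the language model is applied to the whole prefix at every frame purely for ranking, whereas production decoders typically score at word boundaries and add a length bonus.

```python
import math
from collections import defaultdict


def ctc_beam_search(probs, alphabet, beam_width=8, lm_score=None, alpha=0.5):
    """Decode per-frame CTC probabilities into a string.

    probs:    list of frames; frame[0] is the blank probability and
              frame[i] is the probability of alphabet[i - 1].
    lm_score: optional callable, prefix -> log10 probability (hypothetical hook).
    alpha:    weight of the language model score in the ranking.
    """
    # Each prefix tracks (probability ending in blank, probability ending in non-blank).
    beam = {"": (1.0, 0.0)}

    def rank(item):
        prefix, (p_blank, p_non_blank) = item
        total = p_blank + p_non_blank
        acoustic = math.log10(total) if total > 0 else float("-inf")
        lm = lm_score(prefix) if (lm_score and prefix) else 0.0
        return acoustic + alpha * lm

    for frame in probs:
        next_beam = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beam.items():
            # Emitting blank leaves the prefix unchanged.
            next_beam[prefix][0] += (p_blank + p_non_blank) * frame[0]
            for index, char in enumerate(alphabet, start=1):
                p_char = frame[index]
                if prefix and prefix[-1] == char:
                    # A repeat without a blank in between collapses into the prefix.
                    next_beam[prefix][1] += p_non_blank * p_char
                    # A repeat after a blank starts a genuinely new character.
                    next_beam[prefix + char][1] += p_blank * p_char
                else:
                    next_beam[prefix + char][1] += (p_blank + p_non_blank) * p_char
        best = sorted(next_beam.items(), key=rank, reverse=True)[:beam_width]
        beam = {prefix: tuple(p) for prefix, p in best}

    # The beam is kept in ranked order, so the first key is the best prefix.
    return next(iter(beam))


# Usage with a two-letter alphabet; frames are [blank, "a", "b"].
frames = [
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
    [0.90, 0.05, 0.05],
    [0.05, 0.05, 0.90],
]
print(ctc_beam_search(frames, "ab"))
```

Passing a function built on a real model as `lm_score` (for instance one that wraps a kenlm model's `score` method) lets the language model override acoustically likely but implausible prefixes; how strongly it does so is controlled by `alpha`.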