pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Need more detail and a tutorial on how to use the language model to decrease the word error rate. #2526

Open · JonathanSum opened this issue 2 years ago

JonathanSum commented 2 years ago

📚 The doc issue

  1. How do we build our own language model and plug it into an acoustic model such as wav2vec2? Many of the solutions in the docs require using another library.

  2. If 1 requires training the language model again, then it looks like we can instead supply our own text file of possible words for the beam search, as in https://github.com/facebookresearch/fairseq/issues/3157

I was working on a project to provide subtitles for deaf students. It is known that running wav2vec2 output through a language model, such as an n-gram, drops the word error rate (WER). So I was thinking of feeding lecture notes or a textbook to the language model to decrease the WER for college class subtitling. But a lot of the language model support for PyTorch audio models requires other libraries, such as KenLM. If it is just an n-gram model, it shouldn't be difficult to implement it in PyTorch. And if we want to deploy in another language, such as JavaScript, it will require ONNX export from PyTorch, so we may need to write the language model in PyTorch rather than in KenLM.

First, this has been asked before, and it looks like we do not need to train the language model (such as the n-gram) again; we just need to supply a text file that has all the possible words that we want the n-gram model to use in the beam search. But you can see the doc only gives you one line of code that "shortcuts" everything, without telling the user how to use their own text file.

Again, if we look at the doc, we see "Builds CTC beam search decoder from Flashlight". So how do we use our own language model? My point in using my own language model is not that I have some powerful transformer model; it is that I need a clear picture of how the wav2vec2 output is turned into text with the language model. This was asked in the issue below, and I feel it was not explained in detail: https://github.com/facebookresearch/fairseq/issues/3157

Suggestion: I prefer HuBERT since it is smaller than Wav2vec2.

carolineechen commented 2 years ago

Hi @JonathanSum, thanks for the question.

For beam search decoding, the language model is trained externally and plays a part in the decoder scoring function, affecting the WER. We currently only support n-gram language models, which you can train from scratch using KenLM. We plan to provide support for neural network LMs and build-your-own LMs in a future release.
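(Not part of the docs, just a minimal sketch of the KenLM training step, assuming KenLM is built and its `lmplz` and `build_binary` binaries are on your PATH; `corpus.txt` is a hypothetical plain-text corpus with one sentence per line.)

```python
import subprocess

# Train a 4-gram LM: lmplz reads the corpus from stdin and writes ARPA to stdout.
subprocess.run("lmplz -o 4 < corpus.txt > 4gram.arpa", shell=True, check=True)

# Optionally convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "4gram.arpa", "4gram.bin"], check=True)
```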

The lexicon file consists of the possible words and should correspond to the words found in the language model. This file is used to restrict the possible outputs of the decoder and is not a replacement for the language model. It is formatted as follows, with a word and its corresponding spelling (in the form of the acoustic model tokens) on each line:

able a b l e |
about a b o u t |
above a b o v e |
...
hello h e l l o |
...
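(A minimal sketch, not from the docs: given a word list, a lexicon file in this format can be generated in a few lines of Python, assuming a character-level acoustic model with `|` as the word boundary token.)

```python
# Hypothetical word list; in practice use the vocabulary the LM was trained on.
words = ["able", "about", "above", "hello"]

with open("lexicon.txt", "w") as f:
    for word in words:
        spelling = " ".join(word)  # "hello" -> "h e l l o"
        f.write(f"{word} {spelling} |\n")
```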

Once you have the language model (arpa or bin file) and the corresponding lexicon files, you can pass them in to construct the decoder.
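(For concreteness, a minimal sketch using torchaudio's `ctc_decoder`; the file paths and hyperparameter values here are placeholders, not recommendations.)

```python
from torchaudio.models.decoder import ctc_decoder

decoder = ctc_decoder(
    lexicon="lexicon.txt",   # word -> token spelling file, as above
    tokens="tokens.txt",     # the acoustic model's token set
    lm="4gram.bin",          # KenLM ARPA or binary file
    nbest=1,
    beam_size=50,            # hypotheses kept per step
    lm_weight=2.0,           # how strongly LM scores steer the search
    word_score=0.0,          # score added each time a word is completed
)

# emissions: CTC log-probabilities from the acoustic model,
# with shape (batch, time, num_tokens).
# hypotheses = decoder(emissions)
# transcript = " ".join(hypotheses[0][0].words)
```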

Some additional sources/documentation for how to use the decoder

taylorchu commented 2 years ago

@carolineechen do you know why it does not use n-grams where n >= 2 in KenLM? It seems to only use unigrams.

thanks!

carolineechen commented 2 years ago

@taylorchu Where are you seeing this? We do support KenLM n-grams where n >= 2 (the one in the tutorial linked above uses a 4-gram LM). You can download or train an n-gram KenLM and feed the path in to the decoder.
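(For example, a minimal sketch assuming the pretrained LibriSpeech decoder files that torchaudio can download, as in the tutorial.)

```python
from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

# Pretrained LibriSpeech 4-gram KenLM plus matching lexicon and token files.
files = download_pretrained_files("librispeech-4-gram")

decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,
)
```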

taylorchu commented 2 years ago

Let me link the issue here too: https://github.com/flashlight/text/issues/6

JonathanSum commented 2 years ago

@carolineechen

I didn't mean using a lexicon file to replace the language model. I was asking: if the lexicon consists of the possible words, can we add additional possible words without training the LM again?

In addition, I hope everything will be pure PyTorch. I don't understand why it has to use KenLM for the LM, or PyTorch Lightning for HuBERT. It is just like what the function doc says: it is "wrapped for LM". So I have no idea how to make it work in PyTorch with KenLM inside as the language model.

In the end, everything just becomes very complicated to work with. It is super unclear how to use our own lexicon file or how to use those extra libraries. Do I have to learn KenLM to make them work? Then it is no longer pure PyTorch.

carolineechen commented 2 years ago

@JonathanSum TorchAudio unfortunately does not support n-gram language model training, so if you want to use one, you can train it with KenLM. KenLM will output an ARPA or binary file that you can then pass in to the TorchAudio decoder to decode with a language model. The lexicon file should contain the same set of words that the LM is trained on; if you add a word to the lexicon that is not in the LM, the LM would return a score of 0 when it sees that word, and I'm not fully sure of the performance consequences in that scenario.
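(To see how the LM scores unseen words, a minimal sketch with the kenlm Python bindings; the model path is a placeholder, and the made-up word stands in for anything outside the LM's training vocabulary.)

```python
import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip

model = kenlm.Model("4gram.bin")  # placeholder KenLM ARPA/binary path

# score() returns the total log10 probability of a sentence.
print(model.score("hello about above"))  # words the LM has seen
print(model.score("hello zqxwv above"))  # contains an out-of-vocabulary word
```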

taylorchu commented 2 years ago

I looked into the Flashlight decoder this weekend. Models like HuBERT or wav2vec mostly make spelling mistakes (i.e. they output similar-sounding words). The simple ones can be fixed by lexicon files, while the harder ones are not going to be easily fixed by old-school n-gram models (unless you have a huge one, like 5-10 GB KenLM tries).