@kpu I want to train a character n-gram model for the Bangla language. I have preprocessed my corpus so that it looks like this; here is a small demo:
| অ প ে ক ্ ষ া | ক র ত ে ন | উ প ভ ো গ | ক র ত ে ন | ত া র | উ জ ্ জ ্ ব ল | উ প স ্ থ ি ত ি | এ ই | স র ক া র | ল ু ট ে র া | ত ো ষ ণ ক া র ী ঃ | র ু ম ি ন | ফ া র হ া ন া | হ ্ য া ঁ | আ প ন া র | র ে জ ি স ্ ট ্ র ে শ ন | ফ র ্ ম | স ম ্ প ন ্ ন | ক র া র | প র | প র ি ব র ্ ত ন | ক র া | স ম ্ ভ ব | স া ধ া র ণ ত | ন ি শ ্ চ ি ত ক র ণ | এ ব ং | চ া ল া ন |......
Here I have appended all the meaningful sentences in my dataset, one after another, on a single line of a .txt file. All the characters in each word of each sentence have been space-separated, and word boundaries are represented with a |. The training data size is around 7 GB, which is quite large for text.
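For reference, this is roughly how the preprocessing can be done (a minimal sketch; raw_corpus.txt is just a placeholder name, and the script assumes one sentence per line in the raw file):

```python
# Minimal preprocessing sketch: space-separate every character and mark
# word boundaries with "|", writing everything onto a single output line.
# "raw_corpus.txt" is a placeholder name, assumed to hold one sentence per line.
with open("raw_corpus.txt", encoding="utf-8") as fin, \
     open("path_to_my_preprocessed_text_corpus.txt", "w", encoding="utf-8") as fout:
    first = True
    for line in fin:
        for word in line.split():
            if not first:
                fout.write(" ")
            # "|" marks the word boundary, then each character separated by a space
            fout.write("| " + " ".join(word))
            first = False
    fout.write("\n")
```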
I want to train a 6-gram model using the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
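As a side note, the resulting .arpa file can later be sanity-checked with the kenlm Python bindings; a rough sketch (the test word is just an example and has to be tokenised the same way as the training data):

```python
import kenlm

# Load the trained character-level model (same path as in the command above).
model = kenlm.Model("./saved_lm/6gram_model.arpa")

# Queries must use the same tokenisation as the training data:
# space-separated characters with "|" as the word-boundary marker.
test = "| " + " ".join("আপনার")
print(model.score(test, bos=True, eos=True))  # total log10 probability
```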
A demo sample of how the path_to_my_preprocessed_text_corpus.txt file looks is shown above. Running the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
gives the following error:

=== 1/5 Counting and sorting n-grams ===
Reading /home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/our_data/train_processed_char_level_git_data_proper_nouns_ai4bharat.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 2832827518 types 64
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:768 2:6642523648 3:12454731776 4:19927572480 5:29061042176 6:39855144960
/home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback
Aborted (core dumped)
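(For context, as far as I understand it, lmplz estimates the modified Kneser-Ney discounts from the count-of-counts using Chen and Goodman's closed-form formula, which is undefined when some of those counts are zero; a rough illustration, with made-up numbers:)

```python
# Rough illustration of the closed-form modified Kneser-Ney discount
# estimate (Chen & Goodman). n1..n4 = number of n-grams of a given order
# whose adjusted count is exactly 1, 2, 3, 4.
def kn_discounts(n1, n2, n3, n4):
    if n1 == 0 or n2 == 0 or n3 == 0:
        # This is the situation the BadDiscountException reports.
        raise ValueError("some count-of-counts are zero, discounts are undefined")
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1
    d2 = 2.0 - 3.0 * y * n3 / n2
    d3plus = 3.0 - 4.0 * y * n4 / n3
    return d1, d2, d3plus

# With only 64 character types over ~2.8 billion tokens, every unigram
# occurs far more than once, so n1 = 0 for the 1-gram order (n2..n4 made up):
kn_discounts(n1=0, n2=5, n3=3, n4=2)  # raises ValueError
```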
But when I run the training using the same command with --discount_fallback, the error does not occur anymore and training starts. The command with --discount_fallback is:
./kenlm/build/bin/lmplz -o 6 --discount_fallback --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
My question is: why does this happen? And when I run training using --discount_fallback, will there be anything wrong with the model?

Originally posted by @amitbcp in https://github.com/kpu/kenlm/issues/302#issuecomment-698249425