@kpu I want to train a character n-gram model for the Bangla language. I have preprocessed my corpus so that it looks like this; here is a small demo:
| অ প ে ক ্ ষ া | ক র ত ে ন | উ প ভ ো গ | ক র ত ে ন | ত া র | উ জ ্ জ ্ ব ল | উ প স ্ থ ি ত ি | এ ই | স র ক া র | ল ু ট ে র া | ত ো ষ ণ ক া র ী ঃ | র ু ম ি ন | ফ া র হ া ন া | হ ্ য া ঁ | আ প ন া র | র ে জ ি স ্ ট ্ র ে শ ন | ফ র ্ ম | স ম ্ প ন ্ ন | ক র া র | প র | প র ি ব র ্ ত ন | ক র া | স ম ্ ভ ব | স া ধ া র ণ ত | ন ি শ ্ চ ি ত ক র ণ | এ ব ং | চ া ল া ন |......
Here I have appended all the meaningful sentences in my dataset, one after another, on a single line of a .txt file. All the characters in each word of each sentence have been space-separated, and word boundaries are represented with a |. The training data size is around 7 GB, which is quite large for text.
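For reference, this is roughly how the preprocessing can be done (a minimal sketch; raw_corpus.txt is just a placeholder name, and the script assumes one sentence per line in the raw file):

```python
# Minimal preprocessing sketch: space-separate every character and mark
# word boundaries with "|", writing everything onto a single output line.
# "raw_corpus.txt" is a placeholder name, assumed to hold one sentence per line.
with open("raw_corpus.txt", encoding="utf-8") as fin, \
     open("path_to_my_preprocessed_text_corpus.txt", "w", encoding="utf-8") as fout:
    first = True
    for line in fin:
        for word in line.split():
            if not first:
                fout.write(" ")
            # "|" marks the word boundary, then each character separated by a space
            fout.write("| " + " ".join(word))
            first = False
    fout.write("\n")
```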
I want to train a 6-gram model using the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
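As a side note, the resulting .arpa file can later be sanity-checked with the kenlm Python bindings; a rough sketch (the test word is just an example and has to be tokenised the same way as the training data):

```python
import kenlm

# Load the trained character-level model (same path as in the command above).
model = kenlm.Model("./saved_lm/6gram_model.arpa")

# Queries must use the same tokenisation as the training data:
# space-separated characters with "|" as the word-boundary marker.
test = "| " + " ".join("আপনার")
print(model.score(test, bos=True, eos=True))  # total log10 probability
```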
A demo sample of how the path_to_my_preprocessed_text_corpus.txt file looks is shown above. Running the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
gives the following error:

=== 1/5 Counting and sorting n-grams ===
Reading /home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/our_data/train_processed_char_level_git_data_proper_nouns_ai4bharat.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 2832827518 types 64
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:768 2:6642523648 3:12454731776 4:19927572480 5:29061042176 6:39855144960
/home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback
Aborted (core dumped)
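(For context, as far as I understand it, lmplz estimates the modified Kneser-Ney discounts from the count-of-counts using Chen and Goodman's closed-form formula, which is undefined when some of those counts are zero; a rough illustration, with made-up numbers:)

```python
# Rough illustration of the closed-form modified Kneser-Ney discount
# estimate (Chen & Goodman). n1..n4 = number of n-grams of a given order
# whose adjusted count is exactly 1, 2, 3, 4.
def kn_discounts(n1, n2, n3, n4):
    if n1 == 0 or n2 == 0 or n3 == 0:
        # This is the situation the BadDiscountException reports.
        raise ValueError("some count-of-counts are zero, discounts are undefined")
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1
    d2 = 2.0 - 3.0 * y * n3 / n2
    d3plus = 3.0 - 4.0 * y * n4 / n3
    return d1, d2, d3plus

# With only 64 character types over ~2.8 billion tokens, every unigram
# occurs far more than once, so n1 = 0 for the 1-gram order (n2..n4 made up):
kn_discounts(n1=0, n2=5, n3=3, n4=2)  # raises ValueError
```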
But when I run the training using the same command with --discount_fallback, the error does not occur anymore and training starts. The command with --discount_fallback is:
./kenlm/build/bin/lmplz -o 6 --discount_fallback --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
My question is: why does this happen? And when I run training using --discount_fallback, will there be anything wrong with the model?

Originally posted by @amitbcp in https://github.com/kpu/kenlm/issues/302#issuecomment-698249425