kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Facing error while training a character ngram model using kenlm #435

Open fkhan98 opened 1 year ago

fkhan98 commented 1 year ago

I want to train a character n-gram model for the Bangla language. I have preprocessed my corpus so that it looks like this; here is a small sample:

| অ প ে ক ্ ষ া | ক র ত ে ন | উ প ভ ো গ | ক র ত ে ন | ত া র | উ জ ্ জ ্ ব ল | উ প স ্ থ ি ত ি | এ ই | স র ক া র | ল ু ট ে র া | ত ো ষ ণ ক া র ী ঃ | র ু ম ি ন | ফ া র হ া ন া | হ ্ য া ঁ | আ প ন া র | র ে জ ি স ্ ট ্ র ে শ ন | ফ র ্ ম | স ম ্ প ন ্ ন | ক র া র | প র | প র ি ব র ্ ত ন | ক র া | স ম ্ ভ ব | স া ধ া র ণ ত | ন ি শ ্ চ ি ত ক র ণ | এ ব ং | চ া ল া ন |......

Here I have appended all the meaningful sentences in my dataset one after another, in a single line of a .txt file. All the characters in each word of each sentence are space-separated, and word boundaries are represented with a |. The training data is around 7 GB, which is quite large for text.
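Roughly, the preprocessing looks like this (a minimal sketch of my own script; raw_sentences.txt is a placeholder for my raw corpus, only the output format matters):

```python
# Minimal sketch of the preprocessing described above; file names are placeholders.
# Each word is split into space-separated characters, word boundaries are marked
# with "|", and the whole corpus ends up on a single line.
with open("raw_sentences.txt", encoding="utf-8") as fin, \
     open("path_to_my_preprocessed_text_corpus.txt", "w", encoding="utf-8") as fout:
    words = []
    for line in fin:
        words.extend(line.split())
    # "অপেক্ষা" -> "অ প ে ক ্ ষ া"
    fout.write("| " + " | ".join(" ".join(w) for w in words) + " |\n")
```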

I want to train a 6-gram model using the command: ./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa" (a demo sample of how the path_to_my_preprocessed_text_corpus.txt file looks is shown above).

Running this command gives the following error:

=== 1/5 Counting and sorting n-grams ===
Reading /home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/our_data/train_processed_char_level_git_data_proper_nouns_ai4bharat.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 2832827518 types 64
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:768 2:6642523648 3:12454731776 4:19927572480 5:29061042176 6:39855144960
/home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data? Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback

Aborted (core dumped)

But when I run training with the same command plus --discount_fallback, the error no longer occurs and training starts. The command with --discount_fallback is: ./kenlm/build/bin/lmplz -o 6 --discount_fallback --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa". My questions are: why does this happen, and if I train with --discount_fallback, will anything be wrong with the model?
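For context, my understanding is that the discounts lmplz fails to estimate here are the modified Kneser-Ney discounts of Chen and Goodman, computed per n-gram order from the counts-of-counts n[j] (the number of n-grams with adjusted count j). A rough sketch of that estimate (not KenLM's actual code), which shows why a zero n[j] makes it break down:

```python
# Rough sketch of modified Kneser-Ney discount estimation (Chen & Goodman),
# not KenLM's actual implementation. n[j] = number of n-grams of this order
# whose adjusted count is exactly j.
def kneser_ney_discounts(n):
    for j in (1, 2, 3, 4):
        if n[j] == 0:
            # The condition behind BadDiscountException: the formulas below
            # would divide by zero or produce useless discounts.
            raise ValueError(f"no n-grams with adjusted count {j}")
    Y = n[1] / (n[1] + 2 * n[2])
    # Discounts D_1, D_2, D_3+ for this order
    return [j - (j + 1) * Y * n[j + 1] / n[j] for j in (1, 2, 3)]

# Example: with only 64 character types over ~2.8 billion tokens, no unigram is
# likely to have adjusted count 1, so n[1] == 0 for 1-grams and the discounts
# cannot be estimated.
print(kneser_ney_discounts({1: 0, 2: 5, 3: 12, 4: 20}))  # raises ValueError
```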