kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

BadDiscount with larger corpus #375

Open maxwellzh opened 2 years ago

maxwellzh commented 2 years ago

Hi @kpu, I ran into a weird issue: training the n-gram model on a relatively small corpus works fine, but training on a larger corpus raises a BadDiscount error.

  1. Training the n-gram model on a corpus of 20 million lines
    === 1/5 Counting and sorting n-grams ===
    Reading corpus-20mil
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100                                                              
    ****************************************************************************************************                                                              
    Unigram tokens 1466966264 types 5148                                                                                                                              
    === 2/5 Calculating and sorting adjusted counts ===                                                                                                               
    Chain sizes: 1:61776 2:21082781696 3:39530217472 4:63248347136 5:92237176832                                                                                      
    Statistics:                                                                                                                                                       
    1 5148 D1=0.333333 D2=0.5 D3+=0.777778                                                                                                                            
    2 5545233 D1=0.556011 D2=1.03531 D3+=1.50558                                                                                                                      
    3 94854713 D1=0.696653 D2=1.07888 D3+=1.42399                                                                                                                     
    4 365940949 D1=0.796763 D2=1.13498 D3+=1.38969                                                                                                                    
    5 692032280 D1=0.801519 D2=1.17756 D3+=1.37174                                                                                                                    
    Memory estimate for binary LM:                                                                                                                                    
    type       MB                                                                                                                                                     
    probing 22553 assuming -p 1.5                                                                                                                                     
    probing 25221 assuming -r models -p 1.5                                                                                                                           
    trie     9509 without quantization                                                                                                                                
    trie     4999 assuming -q 8 -b 8 quantization                                                                                                                     
    trie     8332 assuming -a 22 array pointer compression                                                                                                            
    trie     3822 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:61776 2:88723728 3:1897094260 4:8782582776 5:19376903840
    ....
  2. Training the n-gram model on the same 20 million lines as in 1. plus 5 million more
    
    === 1/5 Counting and sorting n-grams ===
    Reading corpus-25mil
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Unigram tokens 1763707292 types 5148
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:61776 2:21082781696 3:39530217472 4:63248347136 5:92237176832
    tools/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
    Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data?
    Try deduplicating the input.  To override this error for e.g. a class-based model, rerun with --discount_fallback

    Aborted (core dumped)

Is this expected? I know `--discount_fallback` can suppress the issue, but I wonder why a larger corpus causes the error. Thanks.
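
For context on that question, here is a minimal sketch (not kenlm's actual source) of the Chen & Goodman closed-form discount estimate that the error message refers to; `n[k]` is the count-of-counts, i.e. how many n-grams of a given order have adjusted count exactly `k`:

```python
# Minimal sketch of the closed-form modified Kneser-Ney discount
# estimate (Chen & Goodman); not kenlm's actual code.
def kn_discounts(n):
    # lmplz's check corresponds to this: the estimate needs n-grams
    # with adjusted counts 1 through 4 to all be present.
    if any(n[k] == 0 for k in (1, 2, 3, 4)):
        raise ValueError("BadDiscount: no n-grams with adjusted count "
                         "1, 2, 3, or 4 at this order")
    y = n[1] / (n[1] + 2.0 * n[2])
    return {
        "D1": 1.0 - 2.0 * y * n[2] / n[1],
        "D2": 2.0 - 3.0 * y * n[3] / n[2],
        "D3+": 3.0 - 4.0 * y * n[4] / n[3],
    }

# Hypothetical counts-of-counts. With a closed ~5k-type vocabulary and
# ~1.8B tokens, the singleton bucket n[1] for unigrams can reach zero,
# and the estimate then has no solution.
print(kn_discounts({1: 100, 2: 50, 3: 30, 4: 20}))
```

So, plausibly, the larger corpus causes the error in a benign way: with only 5148 unigram types, adding data pushes every unigram's adjusted count above 1, emptying `n[1]` for order 1 and making the closed-form estimate undefined, while the smaller corpus still had a few singletons left.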
maxwellzh commented 2 years ago

Edited: even with the bug in my text processing fixed, the issue remains.

wwfcnu commented 8 months ago

I also encountered this problem. Adding `--discount_fallback` works around it, but I want to know whether the model will have any problems after adding this option.
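
For what it's worth: `--discount_fallback` replaces the failed closed-form estimates with fixed discounts (the lmplz help lists defaults of 0.5, 1, and 1.5), so the output is still a well-formed ARPA model; the discounts just aren't fitted to your data anymore. A rough sketch of how any discount, estimated or fixed, enters the smoothed probability (illustrative numbers, not kenlm's code):

```python
# Illustrative sketch of the Kneser-Ney core: the discount is
# subtracted from the raw count and the freed mass goes to backoff.
def discounted_prob(count, context_total, discount):
    return max(count - discount, 0.0) / context_total

# Same n-gram with an estimated discount (like D1=0.801519 in the log
# above) versus the fixed fallback default of 0.5: the probability
# shifts a little, but the model remains valid either way.
print(discounted_prob(7, 100, 0.801519))  # ~0.0620
print(discounted_prob(7, 100, 0.5))       # 0.0650
```

On data where the estimate fails because singletons are genuinely absent (small vocabulary, class-based, or heavily duplicated input), fixed discounts are usually an acceptable trade-off, which is why the error message itself suggests the flag for class-based models.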