kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

BadDiscount with larger corpus #375

Open maxwellzh opened 2 years ago

maxwellzh commented 2 years ago

Hi @kpu, I ran into a weird issue: training the n-gram model on a relatively small corpus was fine, but it raised a BadDiscount error once I added more data:

  1. Training the n-gram model on a corpus of 20 million lines
    === 1/5 Counting and sorting n-grams ===
    Reading corpus-20mil
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100                                                              
    ****************************************************************************************************                                                              
    Unigram tokens 1466966264 types 5148                                                                                                                              
    === 2/5 Calculating and sorting adjusted counts ===                                                                                                               
    Chain sizes: 1:61776 2:21082781696 3:39530217472 4:63248347136 5:92237176832                                                                                      
    Statistics:                                                                                                                                                       
    1 5148 D1=0.333333 D2=0.5 D3+=0.777778                                                                                                                            
    2 5545233 D1=0.556011 D2=1.03531 D3+=1.50558                                                                                                                      
    3 94854713 D1=0.696653 D2=1.07888 D3+=1.42399                                                                                                                     
    4 365940949 D1=0.796763 D2=1.13498 D3+=1.38969                                                                                                                    
    5 692032280 D1=0.801519 D2=1.17756 D3+=1.37174                                                                                                                    
    Memory estimate for binary LM:                                                                                                                                    
    type       MB                                                                                                                                                     
    probing 22553 assuming -p 1.5                                                                                                                                     
    probing 25221 assuming -r models -p 1.5                                                                                                                           
    trie     9509 without quantization                                                                                                                                
    trie     4999 assuming -q 8 -b 8 quantization                                                                                                                     
    trie     8332 assuming -a 22 array pointer compression                                                                                                            
    trie     3822 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:61776 2:88723728 3:1897094260 4:8782582776 5:19376903840
    ....
  2. Training the n-gram model on a corpus of 25 million lines (the same 20 million lines as in 1. plus 5 million more)
    
    === 1/5 Counting and sorting n-grams ===
    Reading corpus-25mil
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Unigram tokens 1763707292 types 5148
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:61776 2:21082781696 3:39530217472 4:63248347136 5:92237176832
    tools/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
    Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data?
    Try deduplicating the input.  To override this error for e.g. a class-based model, rerun with --discount_fallback

Aborted (core dumped)



Is this expected? I know `--discount_fallback` can suppress the issue, but I just wonder why a larger corpus causes the error. Thanks.
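
For context on what the check in `adjust_counts.cc` is doing: `lmplz` estimates the three modified Kneser-Ney discounts for each order in closed form (Chen & Goodman) from the counts-of-counts `n[j]`, the number of n-grams of that order whose adjusted count is exactly `j`, and the formula is undefined whenever any of `n[1]`..`n[4]` is zero. Below is a minimal Python sketch of the estimate; the counts-of-counts are invented, but the first set happens to reproduce the unigram discounts of the 20-million-line run:

```python
def kn_discounts(n):
    """Closed-form modified Kneser-Ney discount estimate used by lmplz.

    n[j] = number of n-grams of one order with adjusted count exactly j.
    For the non-top orders, the adjusted count is the number of distinct
    left contexts, not the raw frequency.
    """
    for j in range(1, 5):
        if n[j] == 0:  # the `s.n[j] == 0' condition that threw above
            raise ValueError(f"no n-grams with adjusted count {j}")
    y = n[1] / (n[1] + 2 * n[2])
    # D1, D2, D3+ as printed per line of the "Statistics:" block
    return [j - (j + 1) * y * n[j + 1] / n[j] for j in (1, 2, 3)]

# Reproduces "1 5148 D1=0.333333 D2=0.5 D3+=0.777778" from the first run:
print(kn_discounts({1: 2, 2: 2, 3: 3, 4: 5}))

# With ~1.8 billion tokens spread over only 5148 types, every type can
# plausibly end up with adjusted count >= 2, i.e. n[1] == 0, and the
# estimate becomes undefined, matching the 25-million-line failure:
print(kn_discounts({1: 0, 2: 1, 3: 4, 4: 5143}))  # raises ValueError
```

So more data makes the failure more likely, not less: as the corpus grows, the last unigrams with adjusted count 1 pick up a second context and `n[1]` drops to zero. A vocabulary of 5148 types over a billion-plus tokens looks like a character- or class-level setup, which is exactly the case the error message calls out.
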
maxwellzh commented 2 years ago

Edit: even with a bug in my text processing fixed, the issue remains.
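
The error message's other suggestion, deduplicating the input, is also cheap to try. A minimal one-pass filter; the file names are placeholders, and hashing each line keeps the seen-set small for a 25-million-line corpus:

```python
import hashlib

# Drop exact duplicate lines before piping the corpus to lmplz.
seen = set()
with open("corpus-25mil", "rb") as fin, \
     open("corpus-25mil.dedup", "wb") as fout:
    for line in fin:
        digest = hashlib.sha1(line).digest()
        if digest not in seen:
            seen.add(digest)
            fout.write(line)
```
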

wwfcnu commented 1 year ago

I also encountered this problem. Adding `--discount_fallback` solves it, but I want to know whether the resulting model will have any problems because of this option.
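
As far as I understand the flag, `--discount_fallback` just makes `lmplz` substitute fixed per-order discounts instead of aborting when the closed-form estimate is undefined; rerunning with something like `lmplz -o 5 --discount_fallback <corpus-25mil >lm.arpa` should go through. Schematically, assuming the flag's documented defaults of 0.5, 1 and 1.5:

```python
def discounts(n, fallback=(0.5, 1.0, 1.5)):
    """Schematic view of --discount_fallback: keep the closed-form
    Chen-Goodman estimate where it is defined, substitute the fixed
    fallback discounts where it is not. n[j] is the count-of-counts
    as above; 0.5/1/1.5 mirror the flag's defaults.
    """
    if any(n[j] == 0 for j in range(1, 5)):
        return list(fallback)  # what the flag enables instead of aborting
    y = n[1] / (n[1] + 2 * n[2])
    return [j - (j + 1) * y * n[j + 1] / n[j] for j in (1, 2, 3)]

print(discounts({1: 0, 2: 1, 3: 4, 4: 5143}))  # -> [0.5, 1.0, 1.5]
```

The resulting ARPA file is still well-formed; the difference is that the discounts for the affected orders are fixed constants rather than fit to the data, which, as far as I can tell, matters little for the small-vocabulary or class-based setups where the estimate fails in the first place.
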