kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Is deduplicating samples a mandatory step during training? #179

Closed manishbansal-fk closed 6 years ago

manishbansal-fk commented 6 years ago

I am using KenLM for a spell-correction task over search queries. But when training an LM on raw user queries (queries repeat over time), it fails with BadDiscountException. Falling back to --discount_fallback handles this situation (when the closed-form Kneser-Ney discount estimate fails) by overriding the discount values (D1, D2, D3+), but that comes with the warning "you should deduplicate your corpus instead".
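For reference, a minimal sketch of the training step under discussion, assuming lmplz is on the PATH and a file queries.txt with one query per line (both names are illustrative):

```python
# Train a 4-gram model with lmplz. Plain modified Kneser-Ney estimation
# aborts with BadDiscountException when the closed-form discount
# estimate fails; --discount_fallback substitutes default discounts
# (D1, D2, D3+) instead.
import subprocess

with open("queries.txt") as text, open("queries.arpa", "w") as arpa:
    subprocess.run(
        ["lmplz", "-o", "4", "--discount_fallback"],
        stdin=text, stdout=arpa, check=True,
    )
```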

Ideally a language model is all about capturing how probable a sequence is given the training corpus. So if we dedup the training data, we are saying wrong spellings are just as likely as correct spellings.
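For concreteness, the deduplication the warning suggests amounts to keeping only the first occurrence of each line, roughly (file names illustrative):

```python
# Line-level deduplication: write each distinct query exactly once.
seen = set()
with open("queries.txt") as src, open("queries.dedup.txt", "w") as dst:
    for line in src:
        if line not in seen:
            seen.add(line)
            dst.write(line)
```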

Is this a limitation of statistical LMs? Is there any gain in trying out neural language models?

@kpu Please share your thoughts.

kpu commented 6 years ago

Are you doing this at the character level or word level? If it's the character level, don't worry, it's just complaining that there are no singleton letters.
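For context, KenLM treats whitespace-separated tokens as words, so a character-level corpus is usually prepared by spacing out the characters; the `<sp>` placeholder for original spaces below is an illustrative convention, not something KenLM prescribes:

```python
def to_char_level(query):
    # Separate every character with spaces so each becomes a "word";
    # mark the original spaces with a placeholder token.
    return " ".join("<sp>" if ch == " " else ch for ch in query)

print(to_char_level("red shoes"))  # r e d <sp> s h o e s
```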

Search queries are short enough that there will be duplication, though you should also consider whether there are canned queries in the mix, like somebody linking to a query page.

We often see the problem with web crawling where the boilerplate is repeated over and over again.

Yes, there is a gain in neural language models.

manishbansal-fk commented 6 years ago

The LM is trained at the word level (up to order 4). The overall score is LM score * error-model score (weighted edit distance), used to rank all query-level candidates generated by various means. The training corpus is all user search queries from the last 6 months, left untouched, so it includes both incorrect and correct words.
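A hedged sketch of that combination in the log domain (kenlm's score() is a log10 probability, so the product becomes a sum); the candidate list, the weight alpha, and the plain Levenshtein distance standing in for the weighted edit distance are all illustrative assumptions:

```python
import kenlm

model = kenlm.Model("queries.arpa")  # path is illustrative

def edit_distance(a, b):
    # Plain Levenshtein distance via dynamic programming; the issue
    # describes a weighted variant.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def rank(candidates, query, alpha=0.3):
    # model.score() returns log10 P(candidate); subtract a weighted
    # edit-distance penalty and sort best-first.
    return sorted(
        candidates,
        key=lambda c: model.score(c) - alpha * edit_distance(c, query),
        reverse=True,
    )
```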

Could you please add details regarding "though you should also consider if there are canned queries in the mix like somebody linking to a query page"?

Given the above explanation, what setup would you suggest for training a neural LM? E.g. breaking each line down into a series of next-word examples that build up: sample query X Y Z becomes (i) X -> Y, (ii) X Y -> Z.
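For concreteness, a minimal sketch of that framing as (context, target) next-word prediction pairs (the function name is illustrative):

```python
def to_examples(query):
    # "X Y Z" -> [(['X'], 'Y'), (['X', 'Y'], 'Z')]
    words = query.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

print(to_examples("X Y Z"))
```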

manishbansal-fk commented 6 years ago

@kpu please share your valuable comments.

kpu commented 6 years ago

Canned queries are repeated because somebody linked to them, suggested them etc. Like this: http://lmgtfy.com/?q=canned+queries . Consider whether the use of language is natural in your application.

This isn't a bug with the software. I'm available for consulting.

manishbansal-fk commented 6 years ago

Thanks 👍