Crash where suffix of an ngram not present in count file

What steps will reproduce the problem?

Create a large counts file containing an ngram (e.g. "foo bar baz") whose 
suffix ngram ("bar baz") doesn't appear earlier in the file.

Run `estimate-ngram -wl lm.arpa -counts counts` on it.

Note that this doesn't reproduce reliably for me with smaller counts files, 
but it reproduces fairly consistently with larger (or at least 
middle-sized) files.

What is the expected output? What do you see instead?

Ideally, I'd expect it to allow a language model to be built in this case, 
even if that means removing/skipping over the ngram in question, or making some 
assumption about the count for the missing suffix (e.g. same as the 
higher-order ngram).
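
To illustrate the "same count as the higher-order ngram" option, here is a minimal Python sketch of a pre-processing step that could be run on the counts before handing them to `estimate-ngram`. It is not MITLM code. The function name `fill_missing_suffixes` is hypothetical, and it assumes the counts have already been parsed into (ngram, count) pairs. When several higher-order ngrams share a missing suffix, it arbitrarily keeps the count of the first one it encounters.

```python
from typing import Dict, Iterable, List, Tuple

Ngram = Tuple[str, ...]


def fill_missing_suffixes(counts: Iterable[Tuple[Ngram, int]]) -> List[Tuple[Ngram, int]]:
    """Add every missing suffix ngram, reusing the count of the ngram that needs it.

    Hypothetical helper: the policy (copy the higher-order count, first one wins)
    is an assumption, not something MITLM does.
    """
    table: Dict[Ngram, int] = dict(counts)
    # Iterate over a snapshot, because the table grows while we fill it in.
    for ngram, count in list(table.items()):
        # ngram[1:], ngram[2:], ... are the suffixes "bar baz", "baz", ...
        for start in range(1, len(ngram)):
            table.setdefault(ngram[start:], count)
    # Stable sort by order, so each suffix is emitted before the ngrams that need it.
    return sorted(table.items(), key=lambda item: len(item[0]))


if __name__ == "__main__":
    repaired = fill_missing_suffixes([(("bar",), 5), (("foo", "bar", "baz"), 2)])
    for ngram, count in repaired:
        print(" ".join(ngram), count, sep="\t")
```

Skipping the offending ngram instead would be a one-line change (drop it rather than calling `setdefault`). Which policy is less harmful to the resulting model probably depends on the data.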

I realise that these missing suffixes won't occur if I use MITLM itself to 
compute the counts from a corpus. However, when dealing with large amounts of 
count-based source data from other tools/sources, these kinds of constraints 
can be violated accidentally due to data corruption or bugs beyond your 
control, so it would be convenient if MITLM could cope gracefully with these 
cases.

Alternatively, if this is a WONTFIX, it would be good to at least document 
the constraints on acceptable input for counts files, and to give a friendlier 
error message when a constraint is violated, so people know how to fix up 
their input files in order to get MITLM to work.
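
As a stopgap on the user side, the constraint can be checked before running `estimate-ngram`. The following is a minimal Python sketch, not part of MITLM. The function name `find_missing_suffixes` is hypothetical, and it assumes each line of the counts file holds a whitespace-separated ngram followed by its count as the last field. It reports every ngram whose suffix has not appeared earlier in the file.

```python
from typing import Iterable, List, Set, Tuple


def find_missing_suffixes(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, ngram, missing suffix) for each violation.

    Assumed line format (not confirmed against MITLM): "w1 w2 ... wN <count>".
    """
    seen: Set[Tuple[str, ...]] = set()
    problems: List[Tuple[int, str, str]] = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rsplit(None, 1)  # split the trailing count off the ngram
        if len(fields) != 2:
            continue  # blank or malformed line; ignored in this sketch
        ngram = tuple(fields[0].split())
        suffix = ngram[1:]  # "foo bar baz" -> "bar baz"
        if suffix and suffix not in seen:
            problems.append((lineno, " ".join(ngram), " ".join(suffix)))
        seen.add(ngram)
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        for lineno, ngram, suffix in find_missing_suffixes(handle):
            print(f'line {lineno}: "{ngram}" appears before its suffix "{suffix}"')
```

Saved under a name of your choice and run with the counts file as its only argument, this prints one line per offending ngram, which is roughly the kind of message that would have made the assertion failure easier to diagnose.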

Currently what you see is:

estimate-ngram: src/NgramModel.cpp:811: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

What version of the product are you using? On what operating system?

Built from latest github master, Ubuntu 14.04.1

Cheers!

Original issue reported on code.google.com by matt...@swiftkey.com on 11 Feb 2015 at 12:08

jayurbain / mitlm

Crash where suffix of an ngram not present in count file #45