The crash only happens if the n-gram order is higher than 1, and only if the #
occurs at the start of a token.
I'm guessing this is because the counts loader interprets a # at the beginning
of a line in a text counts file as a comment and skips that line, so a unigram
beginning with a # is missing from the term dictionary when it's later
encountered in a bigram.
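To illustrate the guess above, here is a minimal sketch of a counts reader that drops lines starting with #. It is not mitlm's actual loader: `load_counts`, `missing_words`, and the `skip_comments` flag are hypothetical names, and pre-seeding the vocabulary with `<s>` and `</s>` is an assumption made so that only the effect of the comment rule shows up.

```python
# Hypothetical sketch, NOT mitlm's real code: a counts reader that
# treats lines beginning with '#' as comments.

COUNTS = """\
<s> 1
a 1
#hashtag 1
<s> a 1
a #hashtag 1
#hashtag </s> 1
<s> a #hashtag 1
a #hashtag </s> 1
"""


def load_counts(lines, skip_comments=True):
    """Parse 'w1 ... wn count' lines into a vocabulary and an n-gram list."""
    # Assumption: sentence markers are predefined in the vocabulary.
    vocab = {"<s>", "</s>"}
    ngrams = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if skip_comments and line.startswith("#"):
            # Assumed behaviour: the whole line is silently dropped.
            continue
        *words, count = line.split()
        ngrams.append((tuple(words), int(count)))
        if len(words) == 1:
            # Only unigram lines add entries to the vocabulary.
            vocab.add(words[0])
    return vocab, ngrams


def missing_words(vocab, ngrams):
    """Return words used in some n-gram but absent from the vocabulary."""
    return sorted({w for words, _ in ngrams for w in words if w not in vocab})


vocab, ngrams = load_counts(COUNTS.splitlines())
print(missing_words(vocab, ngrams))
```

Under these assumptions the unigram line `#hashtag 1` is skipped, but `a #hashtag 1` is still read because the line itself does not start with #, so the bigram refers to a word that was never registered. Calling `load_counts` with `skip_comments=False` leaves no missing words, which would also fit the observation that a model of order 1 does not crash.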
What steps will reproduce the problem?
$ estimate-ngram -wc counts -text <(echo 'a #hashtag')
0.001 Loading corpus /dev/fd/63...
0.002 Smoothing[1] = ModKN
0.002 Smoothing[2] = ModKN
0.002 Smoothing[3] = ModKN
0.002 Set smoothing algorithms...
0.002 Saving counts to counts...
$ cat counts
<s> 1
a 1
#hashtag 1
<s> a 1
a #hashtag 1
#hashtag </s> 1
<s> a #hashtag 1
a #hashtag </s> 1
$ estimate-ngram -counts counts -wl lm.arpa
0.001 Loading counts counts...
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)
What version of the product are you using? On what operating system?
Built from the latest master on GitHub. Ubuntu 14.04.1
Original issue reported on code.google.com by matt...@swiftkey.com on 10 Feb 2015 at 6:39