imim / mitlm

Automatically exported from code.google.com/p/mitlm

Tokens beginning with # cause a crash when using count files #44

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The crash only happens when the n-gram order is higher than 1, and only when
the # occurs at the start of a token.

My guess is that the counts loader treats a # at the beginning of a line in a
text counts file as the start of a comment and skips that line. A unigram
beginning with # is then missing from the term dictionary when a later bigram
refers to it.
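
To make that guess concrete, here is a minimal Python sketch of the suspected mechanism. It is not mitlm's actual code: `find_unknown_tokens` and its `skip_hash_lines` flag are hypothetical names, and the file layout (n-gram, whitespace, count) is assumed from the counts file shown further down.

```python
def find_unknown_tokens(lines, skip_hash_lines=True):
    """Return tokens used in higher-order n-grams that never appeared as unigrams.

    Hypothetical model of a counts loader, for illustration only.
    """
    # Assumption: the sentence boundary markers are always known to the loader.
    vocab = {"<s>", "</s>"}
    unknown = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Suspected behaviour: a leading '#' marks a comment, so the line is dropped.
        if skip_hash_lines and line.startswith("#"):
            continue
        # The last whitespace-separated field is the count; the rest is the n-gram.
        ngram, _count = line.rsplit(None, 1)
        tokens = ngram.split()
        if len(tokens) == 1:
            vocab.add(tokens[0])
        else:
            unknown.update(t for t in tokens if t not in vocab)
    return sorted(unknown)


COUNTS = [
    "<s>\t1",
    "a\t1",
    "#hashtag\t1",
    "<s> a\t1",
    "a #hashtag\t1",
    "#hashtag </s>\t1",
    "<s> a #hashtag\t1",
    "a #hashtag </s>\t1",
]

print(find_unknown_tokens(COUNTS, skip_hash_lines=True))
print(find_unknown_tokens(COUNTS, skip_hash_lines=False))
```

With the comment skipping enabled, the `#hashtag` unigram line is dropped while `a #hashtag` is still read, so the bigram refers to a token that was never registered. With only unigrams in the file no later line refers back to the dropped token, which would fit the crash only appearing for orders above 1.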

What steps will reproduce the problem?

$ estimate-ngram -wc counts -text <(echo 'a #hashtag')
0.001   Loading corpus /dev/fd/63...
0.002   Smoothing[1] = ModKN
0.002   Smoothing[2] = ModKN
0.002   Smoothing[3] = ModKN
0.002   Set smoothing algorithms...
0.002   Saving counts to counts...

$ cat counts
<s>     1
a       1
#hashtag        1
<s> a   1
a #hashtag      1
#hashtag </s>   1
<s> a #hashtag  1
a #hashtag </s> 1

$ estimate-ngram -counts counts -wl lm.arpa
0.001   Loading counts counts...
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

What version of the product are you using? On what operating system?

Built from latest master on github. Ubuntu 14.04.1

Original issue reported on code.google.com by matt...@swiftkey.com on 10 Feb 2015 at 6:39