Open GoogleCodeExporter opened 8 years ago
Hi, you could probably work around this by mapping every Unicode word in your
corpus to an ID and running mitlm on the mapped corpus. The resulting language
model will contain only IDs, so you also need to map them back to Unicode
strings afterwards.
If your Unicode characters can be converted to some 8-bit charset, try iconv.
For Czech I use:
'iconv -f utf8 -t iso8859-2 < corpora_unicode.txt > corpora_iso.txt'
Good luck,
Miso
Original comment by michal.f...@gmail.com
on 23 Feb 2010 at 9:22
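A minimal sketch of the ID-mapping workaround suggested in the comment above, written in Python. It is not part of mitlm; the function names and the `w0`, `w1`, ... ID scheme are made up for illustration, and it assumes words are separated by whitespace. Decoding replaces only tokens found in the map and keeps the original whitespace, so other fields in a model file (numbers, tabs, markers such as `<s>`) should pass through unchanged.

```python
import re


def build_vocab(lines):
    """Assign each distinct whitespace-separated word an ASCII-only ID (w0, w1, ...)."""
    word_to_id = {}
    for line in lines:
        for word in line.split():
            if word not in word_to_id:
                word_to_id[word] = f"w{len(word_to_id)}"
    return word_to_id


def encode_line(line, word_to_id):
    """Replace every word in a corpus line with its ID."""
    return " ".join(word_to_id[word] for word in line.split())


def decode_line(line, id_to_word):
    """Map IDs back to words; unknown tokens and all whitespace are left as-is."""
    return re.sub(r"\S+", lambda m: id_to_word.get(m.group(0), m.group(0)), line)


# Hypothetical two-line corpus, used only to exercise the functions.
corpus = ["příliš žluťoučký kůň", "kůň úpěl"]
word_to_id = build_vocab(corpus)
id_to_word = {ident: word for word, ident in word_to_id.items()}

encoded = [encode_line(line, word_to_id) for line in corpus]
decoded = [decode_line(line, id_to_word) for line in encoded]
```

The encoded lines would be written to the file passed to mitlm, and `decode_line` applied to each line of the resulting model file.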
UTF-8 without BOM seems to work fine under Windows (0.4) and under Ubuntu
(r50). Interestingly, it doesn't work under WINE (0.4).
This is the beauty of UTF-8.
Note: I haven't checked UTF-8 with BOM. It would be nice if mitlm ignored
the BOM, if it doesn't already.
Original comment by adubin...@almson.net
on 2 Nov 2012 at 7:58
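Since UTF-8 with a BOM is untested in the comment above, one cautious option is to strip the BOM before handing the corpus to mitlm. This is only a sketch in Python, not mitlm behaviour; `strip_utf8_bom` is a hypothetical helper name.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: a short UTF-8 text saved with a BOM.
sample = codecs.BOM_UTF8 + "kůň úpěl\n".encode("utf-8")
cleaned = strip_utf8_bom(sample)
```

For a real file, the bytes would be read with `open(path, "rb")`, passed through the helper, and written back out in binary mode.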
Issue 21 has been merged into this issue.
Original comment by giuliop...@gmail.com
on 3 Feb 2013 at 6:27
I'm going to open a separate ticket for this, but in case anyone else is
looking at this issue for a solution to MITLM seemingly taking a dislike to
certain characters in its input: I found that when using count files, a #
character at the start of a token causes it to crash with:
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)
I'm guessing it interprets # as a comment if it occurs at the start of a line
of text in the counts file. Not very helpful, especially since estimate-ngram
-wc will itself write out lines beginning with # if a token beginning with #
(like a hashtag) occurs in the source text.
Original comment by matt...@swiftkey.com
on 10 Feb 2015 at 6:31
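If the guess in the comment above is right and a leading # is read as a comment, it may help to check a counts file for such lines before loading it. The Python sketch below only reports them; it does not fix anything. The sample lines assume an n-gram, tab, count layout, which is an assumption rather than a documented format, and `find_hash_leading_lines` is a hypothetical helper name.

```python
def find_hash_leading_lines(lines):
    """Return (line_number, text) for every line that starts with '#'."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical counts-file content; the tab-separated layout is assumed.
sample_counts = ["the cat\t12\n", "#mitlm rocks\t3\n", "cat #tag\t1\n"]
flagged = find_hash_leading_lines(sample_counts)
```

Only the second sample line is flagged, because the # in the third line is not at the start of the line.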
Original issue reported on code.google.com by
gsrvijay...@gmail.com
on 22 Jan 2010 at 6:07