GoogleCodeExporter closed 8 years ago
Adding '&' to the PAT_NOT_WORD_BOUNDARY of UnicodeTokenizer gives better
output. Of course, it is then not really a UnicodeTokenizer but an HtmlTokenizer.
It might be better to combine the regexp of that tokenizer class and the isWord
method into a new class, HtmlWordCounter, with a public static method that
does the word counting, so that other projects can easily reuse it.
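
A rough sketch of what such a class could look like is given here, only to make the proposal concrete. Everything in it is an assumption: HtmlWordCounter does not exist in boilerpipe, and the three patterns are illustrative stand-ins, not the actual PAT_NOT_WORD_BOUNDARY or isWord logic. The idea is that '&' and ';' never split a token, and that a token counts as a word only if it still contains a letter or digit once entities are removed.

```java
import java.util.regex.Pattern;

// Hypothetical class sketching the proposal; it is not part of boilerpipe.
final class HtmlWordCounter {
    // Assumed token separator: whitespace only, so '&' and ';' never split a token.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    // Assumed entity shape: named (&amp;), decimal (&#228;) or hex (&#xE4;).
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");
    // Assumed word test: at least one Unicode letter or decimal digit.
    private static final Pattern WORD_CHAR = Pattern.compile("[\\p{L}\\p{Nd}]");

    private HtmlWordCounter() {}

    // A token is a word if a letter or digit remains after entities are stripped.
    static boolean isWord(String token) {
        String withoutEntities = ENTITY.matcher(token).replaceAll("");
        return WORD_CHAR.matcher(withoutEntities).find();
    }

    // Counts whitespace-separated tokens that pass isWord.
    public static int countWords(CharSequence text) {
        int count = 0;
        for (String token : WHITESPACE.split(text)) {
            if (isWord(token)) {
                count++;
            }
        }
        return count;
    }
}
```

With these assumptions, a standalone entity such as "&amp;" is not counted, while "caf&eacute;" counts once. A word that consists only of an entity is missed, which is one limitation of counting on escaped text.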
Original comment by massey1...@gmail.com
on 22 Jan 2012 at 10:42
The input to UnicodeTokenizer is Unicode text, not HTML-escaped text. If you
want to use UnicodeTokenizer, you have to prepare the input appropriately.
As you have pointed out, you want an HtmlTokenizer. Boilerpipe takes care of
HTML entity resolution via SAX parsing, so there is no need to replicate that
functionality here.
Marking as WontFix.
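
For callers that bypass the SAX-based path and hold HTML-escaped strings, "preparing the input" roughly means resolving entities before tokenizing. The sketch below is a minimal illustration of that step and nothing more: EntityUnescaper is a hypothetical helper, not part of boilerpipe, and it knows only numeric references and a handful of named entities. A real HTML parser handles the full entity list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of boilerpipe; covers only a few entities.
final class EntityUnescaper {
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);");

    private EntityUnescaper() {}

    // Replaces recognised entities with their characters; anything else is left as-is.
    public static String unescape(CharSequence html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = decode(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = decode(body.substring(1), 10, m.group());
            } else {
                replacement = named(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Deliberately tiny table of named entities (an assumption of this sketch).
    private static String named(String name, String fallback) {
        switch (name) {
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default:     return fallback; // unknown names stay untouched
        }
    }

    // Turns a numeric reference into its character, or keeps the original text.
    private static String decode(String digits, int radix, String fallback) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : fallback;
        } catch (NumberFormatException e) {
            return fallback; // number too large to parse
        }
    }
}
```

The unescaped string is plain Unicode text, which is the kind of input a Unicode tokenizer expects.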
Original comment by ckkohl79
on 22 Jan 2012 at 10:51
Original issue reported on code.google.com by
massey1...@gmail.com
on 22 Jan 2012 at 10:36