Word counting code does not account for & being a special HTML symbol.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Make the method de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.isWord public.
2. In UnicodeTokenizer.java, add a static import of that method.
3. Add the following main method to UnicodeTokenizer.java:

    public static void main(String[] args) {
        String html = "A few years later, in 1823, another Knickerbocker, Clement C. Moore, offered his own riff on Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s instantly popular poem &ldquo;A Visit from Saint Nicholas&rdquo; introduced the slightly cloying, but instantly and sensationally popular, symbol of the season&mdash;a &ldquo;chubby and plump...right jolly old elf.&rdquo; (There are those who contend that an author named Henry Livingston Jr. penned the poem, but that&rsquo;s another story altogether.)"; 
        final String[] tokens = UnicodeTokenizer.tokenize(html);
        for( String s : tokens ){
            if( isWord(s) ){
                System.out.println("isWord: "+s);
            } else {
                System.out.println("!isWord: "+s);
            }
        }
    }

What is the expected output? What do you see instead?

That HTML is from
http://www.smithsonianmag.com/arts-culture/A-Mischevious-St-Nick-from-the-American-Art-Museum.html

It uses &rsquo;, as in "Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s
instantly". The logic used by Boilerpipe does not account for that, and the
program above outputs:

isWord: Irving
!isWord: &
isWord: rsquo;s
isWord: version
isWord: of
isWord: St.
isWord: Nicholas.
isWord: Moore
!isWord: &
isWord: rsquo;s
isWord: instantly

which shows that it breaks "Irving's" and "Moore's" into two words each (with a
stray "&" token in between), where each should count as one word.

Original issue reported on code.google.com by massey1...@gmail.com on 22 Jan 2012 at 10:36

GoogleCodeExporter commented 9 years ago

Adding '&' to the PAT_NOT_WORD_BOUNDARY pattern of UnicodeTokenizer gives better
output. Of course, it is then no longer really a UnicodeTokenizer but an
HtmlTokenizer. It might be better to combine the regexp of that tokenizer class
and the isWord method into a new class, HtmlWordCounter, with a public static
method that does the word counting, so that other projects can easily reuse it.
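
To illustrate the effect, here is a minimal standalone sketch of the technique as I understand it: mark every word boundary, drop the marks around a set of "glue" characters, then split. The class name TokenizerSketch, the DEFAULT_GLUE set and the two-argument tokenize are my own assumptions for illustration and are not copied from boilerpipe's UnicodeTokenizer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical sketch, not boilerpipe code: a boundary-marking tokenizer
// whose set of "glue" characters can be varied.
class TokenizerSketch {
    // Invisible separator (U+2063) used as a temporary boundary mark.
    static final String MARK = "\u2063";
    // Assumed glue characters; '&' is deliberately absent. Hyphen is last
    // so it is taken literally inside the character class.
    static final String DEFAULT_GLUE = "\"'.,!@:;$?()/-";
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\b");

    public static String[] tokenize(String text, String glueChars) {
        // 1. Put a mark at every word boundary.
        String marked = WORD_BOUNDARY.matcher(text).replaceAll(MARK);
        // 2. Remove the marks on either side of a glue character, so the
        //    glue character stays attached to its neighbours.
        Pattern glue = Pattern.compile(MARK + "*([" + glueChars + "])" + MARK + "*");
        String joined = glue.matcher(marked).replaceAll("$1");
        // 3. Collapse remaining marks and spaces, then split on spaces.
        return joined.replaceAll("[ " + MARK + "]+", " ").trim().split(" +");
    }

    public static void main(String[] args) {
        String html = "Irving&rsquo;s version";
        // '&' is not glue, so it becomes a token of its own:
        // prints [Irving, &, rsquo;s, version]
        System.out.println(Arrays.toString(tokenize(html, DEFAULT_GLUE)));
        // With '&' added to the glue set the entity stays inside the word:
        // prints [Irving&rsquo;s, version]
        System.out.println(Arrays.toString(tokenize(html, "&" + DEFAULT_GLUE)));
    }
}
```

Note that the second variant keeps the raw entity text "&rsquo;" inside the token; it only fixes the count, it does not decode anything.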

Original comment by massey1...@gmail.com on 22 Jan 2012 at 10:42

GoogleCodeExporter commented 9 years ago

The input to UnicodeTokenizer is Unicode text, not HTML-escaped text. If you 
want to use UnicodeTokenizer, you have to prepare the input appropriately.

As you have pointed out, you want an HtmlTokenizer. Boilerpipe takes care of 
HTML entity resolution via SAX parsing, so there is no need to replicate that 
functionality here.
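
For callers who only have HTML-escaped text in hand, "preparing the input" could look roughly like the sketch below: resolve the entities first, then count. This is only an illustration under stated assumptions. The class EntityDecodeSketch, its small hand-written entity table and the whitespace-based countWords are hypothetical and cover just the entities from the example; this is not how Boilerpipe itself resolves entities (that happens in the SAX parser).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: decode a few named entities before counting words.
class EntityDecodeSketch {
    // Deliberately tiny table; a real HTML parser knows far more entities.
    static final Map<String, String> ENTITIES = Map.of(
            "rsquo", "\u2019",
            "lsquo", "\u2018",
            "ldquo", "\u201C",
            "rdquo", "\u201D",
            "mdash", "\u2014",
            "amp", "&");

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Replaces known named entities; unknown ones are left untouched.
    public static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = ENTITIES.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Simplified word count: whitespace-separated tokens of the decoded text.
    public static int countWords(String html) {
        String text = decode(html).trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    public static void main(String[] args) {
        // "Irving&rsquo;s" decodes to a single token, so this prints 5.
        System.out.println(countWords("Irving&rsquo;s version of St. Nicholas"));
    }
}
```

Once the entity is decoded, the right single quotation mark sits inside the word and no stray "&" token is produced.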

Marking as WontFix.

Original comment by ckkohl79 on 22 Jan 2012 at 10:51

Changed state: WontFix

matanox / boilerpipe

Word counting code does not account for & being a special HTML symbol. #35