UTF characters are not handled correctly

vietlong2110 / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

UTF characters are not handled correctly #28

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

The following test case fails:

ArticleExtractor extractor = ArticleExtractor.INSTANCE;
TextDocument textDoc = new BoilerpipeSAXInput(
    HTMLFetcher.fetch(new URL("http://de.wikipedia.org/wiki/Barack_Obama")).toInputSource()
).getTextDocument();
assertEquals("Barack Obama – Wikipedia", textDoc.getTitle());

The attached patch fixes the issue.
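
The contents of utf8.patch are not reproduced in this thread, so the following is only a sketch of the kind of failure the test case points at, not the patch itself. It assumes the page body is UTF-8 and that the title is garbled because the bytes are decoded with a different charset. The class name `CharsetSketch` and the helper `charsetOf` are hypothetical and are not part of boilerpipe.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSketch {
    // Matches the charset parameter of a Content-Type header value.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([\\w.:-]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: returns the charset named in the header, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // \u2013 is the en dash in the expected title.
        String title = "Barack Obama \u2013 Wikipedia";
        byte[] body = title.getBytes(StandardCharsets.UTF_8);

        // Decoding UTF-8 bytes as ISO-8859-1 garbles the en dash.
        String wrong = new String(body, StandardCharsets.ISO_8859_1);

        // Decoding with the charset declared in the header round-trips.
        Charset declared = charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        String right = new String(body, declared);

        System.out.println(title.equals(wrong)); // prints false
        System.out.println(title.equals(right)); // prints true
    }
}
```

If the fetcher falls back to a fixed default charset instead of honouring the declared one, an assertion like the one in the test case above would fail in the same way as the `wrong` string here.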

Original issue reported on code.google.com by florian....@gmail.com on 26 Jul 2011 at 7:13

Attachments:

utf8.patch

GoogleCodeExporter commented 9 years ago

I can't trigger the error with the trunk version of boilerpipe.

Could you please re-test?

Original comment by ckkohl79 on 22 Jan 2012 at 11:11

GoogleCodeExporter commented 9 years ago

No response.

Original comment by ckkohl79 on 21 Mar 2012 at 9:27

Changed state: Invalid
