Open GoogleCodeExporter opened 8 years ago
Thanks for reporting.
This seems to be caused by a bug in NekoHTML 1.9.13.
The corresponding stack trace points at
"org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)".
The problem seems to go away after an update to NekoHTML 1.9.15.
Could you please confirm this?
Before upgrading boilerpipe to NekoHTML 1.9.15, I will have to perform some
extra checks, especially to ensure we don't get any regressions in terms of
extraction quality.
Best,
Christian
Original comment by ckkohl79
on 14 May 2012 at 4:44
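One way the regression check mentioned above could be approached is to run the extractor over the same set of pages with both NekoHTML versions, save the extracted text, and list the pages whose output changed. The following is only a minimal sketch under that assumption: the class name, method names, and sample data are hypothetical, it uses only the Java standard library, and it does not call boilerpipe or NekoHTML itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of boilerpipe or NekoHTML.
public class ExtractionRegressionCheck {

    /** Collapses whitespace runs so that purely cosmetic differences are not reported. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns the ids of documents whose extracted text differs between the
     * baseline run (e.g. with NekoHTML 1.9.13) and the candidate run
     * (e.g. with NekoHTML 1.9.15). A document that is missing from the
     * candidate map is reported as changed as well.
     */
    static List<String> changedDocuments(Map<String, String> baseline,
                                         Map<String, String> candidate) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> entry : baseline.entrySet()) {
            String after = candidate.get(entry.getKey());
            if (after == null || !normalize(entry.getValue()).equals(normalize(after))) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the saved output of two extraction runs.
        Map<String, String> baseline = new LinkedHashMap<>();
        baseline.put("page-1", "Main article text.");
        baseline.put("page-2", "Another   article.");

        Map<String, String> candidate = new LinkedHashMap<>();
        candidate.put("page-1", "Main article text.");
        candidate.put("page-2", "Another article, plus stray footer.");

        // Prints: Changed documents: [page-2]
        System.out.println("Changed documents: " + changedDocuments(baseline, candidate));
    }
}
```

A changed page is not necessarily a regression; the list only narrows down which outputs would need a manual look.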
Thanks for the quick response.
As you've stated, the problem has gone away with NekoHTML 1.9.15.
Below is the list of changes in NekoHTML since version 1.9.13 (which was
released on 2 Sept 2009):
- Version 1.9.15 (3 Aug 2011)
Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745), change INS to inline element, change BUTTON to inline element, don't parse body of IFRAME, add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false), make detected encoding available as Locator2.getEncoding() (#3381270).
- Version 1.9.14 (2 Feb 2010)
Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697), TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796), trim encoding found in meta tag (#2904817), fix ArrayIndexOutOfBoundsException on empty attribute when using feature normalize-attrs (#2838901), recognize tags even if the > of the opening tag is missing (#2886227), only end TABLE can close a table (#2913095), fix StackOverflowError when parsing document fragment (#2911449), fix NullPointerException occurring with the insert-namespaces feature (#2942363).
I'm not entirely sure, but I guess these changes do not affect boilerpipe's
extraction quality.
Looking forward to hearing about the result of your regression tests.
Regards,
Gural
Original comment by gural.vu...@gmail.com
on 14 May 2012 at 7:16
Original issue reported on code.google.com by
gural.vu...@gmail.com
on 14 May 2012 at 2:56