janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

StackOverflowError when page includes another <body> part in <noframes> #50

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
- ArticleExtractor cannot process a web page that has two <body> parts (like the 
attached page) and fails with a "java.lang.StackOverflowError". 

What is the expected output? What do you see instead?
- The "noframes" part is meant for browsers that do not support frames, so 
boilerpipe should not take this part into consideration.
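
As a possible stopgap on the caller's side, the NOFRAMES block could be removed from the markup before it reaches the extractor. The following is only a minimal sketch using the Java standard library: the class NoframesStripper and its strip method are hypothetical names, not part of boilerpipe or NekoHTML, and the regular expression assumes each <noframes> block is properly closed.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of boilerpipe or NekoHTML. */
final class NoframesStripper {
    // Matches <noframes ...> ... </noframes>, case-insensitively and across line breaks.
    // Assumption: the block is closed; an unclosed <noframes> is left untouched.
    private static final Pattern NOFRAMES = Pattern.compile(
            "<noframes\\b[^>]*>.*?</noframes\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private NoframesStripper() {}

    /** Returns the HTML with every closed NOFRAMES block removed. */
    static String strip(String html) {
        return NOFRAMES.matcher(html).replaceAll("");
    }
}
```

The result of NoframesStripper.strip(html) would then be handed to the extractor in place of the raw page. A regular expression is not an HTML parser, so this is a workaround for well-formed input, not a general fix.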

What version of the product are you using? On what operating system?
- boilerpipe 1.2.0 on Linux/Windows

Original issue reported on code.google.com by gural.vu...@gmail.com on 14 May 2012 at 2:56

GoogleCodeExporter commented 9 years ago
Thanks for reporting.

This seems to be caused by a bug in NekoHTML 1.9.13.

The corresponding stacktrace points at 
"org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)"

The problem seems to go away after an update to NekoHTML 1.9.15.
Could you please confirm this?

Before upgrading boilerpipe to NekoHTML 1.9.15, I will have to perform some 
extra checks, especially to ensure we don't get any regressions in terms of 
extraction quality.
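
A regression check of this kind could be sketched roughly as follows. This is only an illustration, not the actual boilerpipe test procedure: the class ExtractionRegressionCheck is a hypothetical name, and the two extractors (one per NekoHTML version) are assumed to be supplied by the caller as plain String-to-String functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical regression harness; not part of boilerpipe. */
final class ExtractionRegressionCheck {
    private ExtractionRegressionCheck() {}

    /**
     * Runs both extractors over every document and returns the names of the
     * documents whose extracted text differs between baseline and candidate.
     */
    static List<String> changedDocuments(Map<String, String> htmlByName,
                                         Function<String, String> baseline,
                                         Function<String, String> candidate) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> doc : htmlByName.entrySet()) {
            String before = baseline.apply(doc.getValue());
            String after = candidate.apply(doc.getValue());
            if (!before.equals(after)) {
                changed.add(doc.getKey());
            }
        }
        return changed;
    }
}
```

Exact string equality is the strictest possible comparison. Every document it flags would still need a manual look to decide whether the new output is better, worse, or just different.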

Best,
Christian

Original comment by ckkohl79 on 14 May 2012 at 4:44

GoogleCodeExporter commented 9 years ago
Thanks for the quick response.

As you've stated, the problem has gone away with NekoHTML 1.9.15. 

Below is the list of changes in NekoHTML since version 1.9.13 (which was 
released on 2 Sept 2009):
- Version 1.9.15 (3 Aug 2011)
    - Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745)
    - Change INS to inline element
    - Change BUTTON to inline element
    - Don't parse body of IFRAME
    - Add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false)
    - Make detected encoding available as Locator2.getEncoding() (#3381270)
- Version 1.9.14 (2 Feb 2010)
    - Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697)
    - TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796)
    - Trim encoding found in meta tag (#2904817)
    - Fix ArrayIndexOutOfBoundException on empty attribute when using feature normalize-attrs (#2838901)
    - Recognize tags even if the > of the opening tag is missing (#2886227)
    - Only end TABLE can close a table (#2913095)
    - Fix StackOverflowError when parsing document fragment (#2911449)
    - Fix NullPointerException occurring with the insert-namespaces feature (#2942363)

I'm not entirely sure, but I guess these changes do not affect boilerpipe's 
extraction quality.

Looking forward to hearing about the result of your regression tests.

Regards,
Gural

Original comment by gural.vu...@gmail.com on 14 May 2012 at 7:16