google-code-export / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

boilerpipe crash #29

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Try to extract the following URL with ArticleExtractor:
http://sourceforge.net/projects/xampp/files/XAMPP%20Windows/1.7.4/xampp-win32-1.7.4-VC6-installer.exe/download

It prints this warning several times:
Warning: SAX input contains nested A elements -- You have probably hit a bug in 
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML 
externally and feed it to boilerpipe again. Trying to recover somehow...
and then crashes with an OutOfMemoryError.

I'm using version 1.2.0. I have tested on Windows and on Ubuntu as well.

Original issue reported on code.google.com by fzr...@gmail.com on 29 Jul 2011 at 1:27

GoogleCodeExporter commented 9 years ago
The input was not HTML (its content type was application/x-msdos-program), but 
boilerpipe accepted it anyway and NekoHTML choked on it.

Since then, checks have been added to boilerpipe trunk so that it only fetches 
text/html content and throws an exception for anything else. boilerpipe-web has 
additional checks (e.g., on content length).

In both cases, the NekoHTML bug message will no longer appear.
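
For anyone still on 1.2.0, a similar guard can be applied on the caller's side before the content reaches the extractor. The following is only a minimal sketch of that idea, not the code that was added to trunk: the class name HtmlInputGuard, its methods, and the 5 MB limit are hypothetical and chosen for illustration.

```java
import java.io.IOException;
import java.util.Locale;

/** Hypothetical pre-flight check; not part of boilerpipe's API. */
final class HtmlInputGuard {
    // Assumed size limit for illustration; the thread does not state one.
    static final long MAX_CONTENT_LENGTH = 5L * 1024 * 1024;

    private HtmlInputGuard() {}

    /** Returns true if the Content-Type header value denotes text/html. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    /** Throws IOException unless the response looks like reasonably sized HTML. */
    static void check(String contentType, long contentLength) throws IOException {
        if (!isHtml(contentType)) {
            throw new IOException("Unsupported content type: " + contentType);
        }
        // A negative length means the server did not report one; this sketch lets it pass.
        if (contentLength > MAX_CONTENT_LENGTH) {
            throw new IOException("Content too large: " + contentLength + " bytes");
        }
    }
}
```

With a java.net.URLConnection, the two arguments could come from getContentType() and getContentLengthLong(); the extractor would only be called if check(...) returns without throwing. For the installer URL in this report, the check would fail on the content type and the download would never be handed to NekoHTML.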

Original comment by ckkohl79 on 22 Jan 2012 at 11:03