janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

Limit the parsing depth of the HTML parser to avoid out-of-memory situations #71

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

(using version 1.2.0)
1. Parse the HTML page "http://worldwidescience.org/topicpages/s.html" with 
boilerpipe. ArticleExtractor is sufficient for demonstration purposes.

Even with 8 GB of JVM memory, this results in an out-of-memory error.

Attached is a patch that allows limiting the number of TextBlocks 
created/appended by boilerpipe. Once that limit is reached, boilerpipe 
ignores all further content from the parsed input.
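
For readers without access to the attachment, here is a minimal sketch of the idea described above: cap the number of blocks and silently drop everything after the cap. This is not the attached patch and does not use boilerpipe's classes. `BoundedBlockCollector`, `maxBlocks`, `append` and `isLimitReached` are hypothetical names chosen for illustration, and plain strings stand in for TextBlocks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical stand-in for the component that collects text blocks
 * during parsing. Not boilerpipe's actual API.
 */
final class BoundedBlockCollector {
    private final int maxBlocks;                 // assumed, caller-supplied limit
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    /**
     * Stores the block if the limit has not been reached yet.
     * Returns false (and keeps nothing) once the limit is hit, so memory
     * use stays bounded no matter how much input follows.
     */
    boolean append(String blockText) {
        if (blocks.size() >= maxBlocks) {
            limitReached = true;
            return false;
        }
        blocks.add(blockText);
        return true;
    }

    /** True if at least one block was dropped because of the limit. */
    boolean isLimitReached() {
        return limitReached;
    }

    /** Read-only view of the blocks that were kept. */
    List<String> getBlocks() {
        return Collections.unmodifiableList(blocks);
    }
}
```

A caller would create the collector with a limit suited to the available heap and could check `isLimitReached()` afterwards to detect that the extracted text is truncated.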

Original issue reported on code.google.com by mstr...@gmail.com on 25 Nov 2013 at 4:29

GoogleCodeExporter commented 9 years ago
Please change the issue type to "enhancement".

Original comment by mstr...@gmail.com on 26 Nov 2013 at 8:13