janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

BoilerplateBlockFilter ignores labelToKeep #65

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run a BoilerplateBlockFilter with, e.g., labelToKeep = "de.l3s.boilerpipe/HEADING".
2. Observe that text blocks carrying the HEADING label are not kept.

What is the expected output? What do you see instead?
Text blocks carrying the label passed as labelToKeep should be kept. Instead, they are dropped.

What version of the product are you using? On what operating system?

Please provide any additional information below.
I suggest changing lines 60-62 from:

if (!tb.isContent() && (labelToKeep == null || !tb.hasLabel(DefaultLabels.TITLE))) {

to:

if (!tb.isContent() && (labelToKeep == null || !tb.hasLabel(labelToKeep))) {
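
To show the difference between the two conditions, here is a minimal self-contained sketch. It does not use boilerpipe itself: Block is a hypothetical stand-in for TextBlock, filterReported and filterProposed are made-up names wrapping the two conditions quoted above, and the TITLE label string is assumed to follow the same pattern as the HEADING one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for boilerpipe's TextBlock; only what the condition needs.
class Block {
    final String text;
    final boolean content;
    final Set<String> labels;

    Block(String text, boolean content, String... labels) {
        this.text = text;
        this.content = content;
        this.labels = new HashSet<>(Arrays.asList(labels));
    }

    boolean isContent() { return content; }
    boolean hasLabel(String label) { return labels.contains(label); }
}

class LabelFilterSketch {
    // Assumed label values; HEADING is taken from the report, TITLE by analogy.
    static final String TITLE = "de.l3s.boilerpipe/TITLE";
    static final String HEADING = "de.l3s.boilerpipe/HEADING";

    // Condition as reported: the label check is hard-wired to TITLE.
    static List<String> filterReported(List<Block> blocks, String labelToKeep) {
        List<String> kept = new ArrayList<>();
        for (Block tb : blocks) {
            boolean remove = !tb.isContent()
                    && (labelToKeep == null || !tb.hasLabel(TITLE));
            if (!remove) kept.add(tb.text);
        }
        return kept;
    }

    // Condition as proposed: the label check uses labelToKeep.
    static List<String> filterProposed(List<Block> blocks, String labelToKeep) {
        List<String> kept = new ArrayList<>();
        for (Block tb : blocks) {
            boolean remove = !tb.isContent()
                    && (labelToKeep == null || !tb.hasLabel(labelToKeep));
            if (!remove) kept.add(tb.text);
        }
        return kept;
    }

    // Made-up sample input: three non-content blocks and one content block.
    static List<Block> sample() {
        return Arrays.asList(
                new Block("Page title", false, TITLE),
                new Block("Section heading", false, HEADING),
                new Block("Body text", true),
                new Block("Footer links", false));
    }

    public static void main(String[] args) {
        System.out.println(filterReported(sample(), HEADING)); // [Page title, Body text]
        System.out.println(filterProposed(sample(), HEADING)); // [Section heading, Body text]
    }
}
```

With labelToKeep set to the HEADING label, the reported condition keeps the TITLE block and drops the HEADING block, while the proposed condition does the opposite. With labelToKeep == null both variants remove every non-content block.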

Original issue reported on code.google.com by abbymill...@gmail.com on 3 Apr 2013 at 2:31