prnawa / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Possible improvement to TerminatingBlocksFinder #12

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
The following block of code:
final String text = tb.getText().trim();
if (text.startsWith("Comments")
  || N_COMMENTS.matcher(text).find()
  || text.contains("What you think...")
  || text.contains("add your comment")
  || text.contains("Add your comment")
  || text.contains("Add Your Comment")
  || text.contains("Add Comment")
  || text.contains("Reader views")
  || text.contains("Have your say")
  || text.contains("Have Your Say")
  || text.contains("Reader Comments")
  || text.equals("Thanks for your comments - this feedback is now closed")
  || text.startsWith("© Reuters")
  || text.startsWith("Please rate this"))

might be rewritten as:
final String text = tb.getText().trim().toLowerCase();
if (text.startsWith("comments")
  || N_COMMENTS.matcher(text).find()
  || text.contains("what you think...")
  || text.contains("add your comment")
  || text.contains("add comment")
  || text.contains("reader views")
  || text.contains("have your say")
  || text.contains("reader comments")
  || text.equals("thanks for your comments - this feedback is now closed")
  || text.startsWith("© reuters")
  || text.startsWith("please rate this"))

It would catch more cases this way and be easier to maintain.

Also, I saw the Washington Post use "Post a Comment", so it would be good to add that one as well.
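
One way the lowercased comparison could be kept maintainable is to move the phrases into arrays, so that adding "post a comment" is a one-line change. The following is only a sketch under assumptions: the class name `TerminatingPhrases` and the method `isTerminating` are hypothetical and not part of boilerpipe, the `N_COMMENTS` check is left out because its pattern is not shown here, and `Locale.ROOT` is an addition that keeps lowercasing independent of the JVM's default locale.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not boilerpipe's actual code.
final class TerminatingPhrases {
    // Phrases that may appear anywhere in the block text (all lowercase).
    private static final String[] CONTAINS = {
        "what you think...", "add your comment", "add comment",
        "post a comment", "reader views", "have your say", "reader comments"
    };
    // Phrases that must appear at the start of the block text.
    // "\u00a9" is the copyright sign, escaped to avoid source-encoding issues.
    private static final String[] STARTS_WITH = {
        "comments", "\u00a9 reuters", "please rate this"
    };
    // Phrase that must match the whole block text.
    private static final String EQUALS =
        "thanks for your comments - this feedback is now closed";

    static boolean isTerminating(String rawText) {
        // Lowercase once so every phrase needs only a single spelling.
        // Locale.ROOT avoids locale-specific rules such as the Turkish dotless i.
        final String text = rawText.trim().toLowerCase(Locale.ROOT);
        if (text.equals(EQUALS)) {
            return true;
        }
        for (String prefix : STARTS_WITH) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        for (String phrase : CONTAINS) {
            if (text.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

With this layout, a call such as `TerminatingPhrases.isTerminating("Post a Comment")` matches regardless of how a site capitalizes the phrase.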

Original issue reported on code.google.com by benjamin...@gmail.com on 21 Nov 2010 at 8:15

GoogleCodeExporter commented 9 years ago
Hi Benjamin,

Thanks for your suggestion.

I have evaluated the proposed changes on the L3S-GN1 dataset and can confirm that they slightly improve precision while minimally reducing recall (thus improving F1), without slowing down processing.

I have also added "Post a comment", and moreover replaced the Pattern matcher in the original code with a string-based comparison, which saves a few more nanoseconds ;)
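
For readers wondering what a string-based replacement for a regex check can look like, here is a rough sketch. It is not the code committed to SVN. It assumes, purely for illustration, that the pattern tested for one or more leading digits followed by a fixed phrase such as " comments"; the class `NumberPrefix` and the method `startsWithNumberThen` are hypothetical names.

```java
// Hypothetical illustration; the actual change in SVN trunk may differ.
final class NumberPrefix {
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns true if text begins with at least one ASCII digit that is
    // immediately followed by one of the given suffixes.
    static boolean startsWithNumberThen(String text, String... suffixes) {
        int i = 0;
        // Skip over the leading run of digits.
        while (i < text.length() && isDigit(text.charAt(i))) {
            i++;
        }
        if (i == 0) {
            return false; // no leading number at all
        }
        for (String suffix : suffixes) {
            // startsWith(prefix, offset) compares in place, without
            // allocating a substring or running a regex engine.
            if (text.startsWith(suffix, i)) {
                return true;
            }
        }
        return false;
    }
}
```

On lowercased text, `NumberPrefix.startsWithNumberThen(text, " comments")` would then stand in for the pattern match. Unlike a multi-line regex, this sketch only looks at the very start of the string.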

The changes are in SVN trunk and will be included in the next release.

Thanks!
Christian

Original comment by ckkohl79 on 21 Nov 2010 at 1:40