janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

DocumentTitleMatchClassifier should include the « and • characters #43

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I have run across a few news articles that use these characters.

The following articles use the « character (\u00AB):
http://philadelphia.cbslocal.com/2012/02/06/report-1-in-5-children-exposed-to-secondhand-smoke-in-cars/
http://blog.mediaglobal.org/?p=448

I haven't seen many of these, but in every case so far the first part has been the title. It might be safe to assume that parts[0] is the title after performing the split.

The following article uses the • character (\u2022):
http://ictsd.org/i/news/biores/128000/
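
The suggestion above could be sketched roughly as follows. This is a standalone illustration, not the actual DocumentTitleMatchClassifier code: the class name TitleSeparatorSketch, the method firstPart, and the separator pattern are assumptions made for the example, and the sample titles are invented. It only shows splitting on the two reported characters and taking parts[0] as the title candidate.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of boilerpipe.
final class TitleSeparatorSketch {

    // Assumed separator set: the two characters from this report,
    // each with optional surrounding spaces.
    private static final Pattern SEPARATORS =
            Pattern.compile("[ ]*[\u00AB\u2022][ ]*");

    // Returns the first part of the title as the candidate,
    // or null if no separator splits the title into several parts.
    static String firstPart(String title) {
        String[] parts = SEPARATORS.split(title);
        if (parts.length < 2) {
            return null;
        }
        String candidate = parts[0].trim();
        return candidate.isEmpty() ? null : candidate;
    }

    public static void main(String[] args) {
        // Invented sample titles, one per reported separator.
        System.out.println(firstPart("Some Headline \u00AB Example Site"));
        System.out.println(firstPart("Some Headline \u2022 Example Site"));
    }
}
```

Whether parts[0] is always the title would still need checking against more pages; the three articles listed here are the only evidence so far.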

Original issue reported on code.google.com by tucker...@gmail.com on 22 Mar 2012 at 6:05