liamzebedee / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Support HTML5 elements #31

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Now that HTML5 is becoming more pervasive on the web, it might be worth adding 
parsing support for its elements in places, one example being the recently 
added image extractor. HTML5 includes <figure> and <figcaption> for adding 
semantics to images. The figcaption element is of particular interest, since 
its text could be used to judge how relevant an image is to the extracted 
document text.
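
To make the idea concrete, here is a minimal standalone sketch of pairing each image with its caption. It does not use boilerpipe's API: the class FigureCaptions and its extract method are hypothetical names, and the regex matching is a simplifying assumption (a real implementation would hook into the HTML parser and would handle nested figures).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of boilerpipe: pairs the first <img> inside
// each <figure> with the plain text of that figure's <figcaption>.
final class FigureCaptions {

    // One image URL plus its caption text (empty if the figure has no figcaption).
    static final class Captioned {
        final String src;
        final String caption;

        Captioned(String src, String caption) {
            this.src = src;
            this.caption = caption;
        }
    }

    private static final Pattern FIGURE = Pattern.compile(
            "<figure\\b[^>]*>(.*?)</figure\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern CAPTION = Pattern.compile(
            "<figcaption\\b[^>]*>(.*?)</figcaption\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<Captioned> extract(String html) {
        List<Captioned> result = new ArrayList<>();
        Matcher figure = FIGURE.matcher(html);
        while (figure.find()) {
            String body = figure.group(1);
            Matcher img = IMG_SRC.matcher(body);
            if (!img.find()) {
                continue; // a figure without an image, e.g. a code listing
            }
            Matcher cap = CAPTION.matcher(body);
            // Drop inline markup from the caption and collapse whitespace.
            String text = cap.find()
                    ? cap.group(1).replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim()
                    : "";
            result.add(new Captioned(img.group(1), text));
        }
        return result;
    }
}
```

The caption text returned here could then be compared against the extracted document text to score image relevance. Images outside a <figure> are ignored by this sketch.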

Original issue reported on code.google.com by misja.ho...@gmail.com on 18 Oct 2011 at 9:03

GoogleCodeExporter commented 9 years ago
Recognizing the NAV, FOOTER, and HEADER elements should also help eliminate 
chunks of unwanted text.
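
As a rough illustration, a preprocessing pass could drop these elements before extraction. This is a standalone sketch, not boilerpipe code: the class Html5Chrome and its strip method are hypothetical names, and it assumes the elements are not nested inside an element of the same name.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of boilerpipe: removes <nav>, <header> and
// <footer> elements, including their content, from an HTML string.
final class Html5Chrome {

    // (?is) = case-insensitive, and '.' also matches newlines.
    // \1 forces the closing tag to have the same name as the opening tag.
    private static final Pattern CHROME = Pattern.compile(
            "(?is)<(nav|header|footer)\\b[^>]*>.*?</\\1\\s*>");

    static String strip(String html) {
        return CHROME.matcher(html).replaceAll("");
    }
}
```

Note that HTML5 also allows header and footer inside an article, where a header often carries the title, so stripping every occurrence as this sketch does may remove wanted text. Limiting the pass to page-level elements is probably the safer choice.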

Original comment by tucker...@gmail.com on 15 Mar 2012 at 8:13

GoogleCodeExporter commented 9 years ago
Sample HTML5 article with appropriate use of some of the tags mentioned above:
http://www.forbes.com/sites/forbestravelguide/2012/01/19/the-best-international-airports-for-layovers/

Original comment by tucker...@gmail.com on 22 Mar 2012 at 8:53