Description of different extractors?

GoogleCodeExporter commented 8 years ago

I see that you recently added the canola extractor. Is this extractor better 
for general web text?

Could you provide a high-level summary of the different extractors and the
types of pages each one works best on? This would be very useful documentation.

Original issue reported on code.google.com by tur...@gmail.com on 21 Feb 2011 at 8:13

GoogleCodeExporter commented 8 years ago

The CanolaExtractor was trained on documents from the Canola corpus. Its main
purpose is to demonstrate that such a simple classifier (based on word counts
and densities) is competitive in the Canola corpus evaluation. I would not
recommend it for other purposes.
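
For readers wondering what a classifier "based on number of words/densities" might look like, here is a minimal sketch. It is not the CanolaExtractor's actual rule set: the class name `WordDensitySketch`, the `Block` type, and both thresholds are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy block classifier; NOT boilerpipe's trained CanolaExtractor rules. */
final class WordDensitySketch {

    /** A text block plus the number of its words that sit inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated words in the block. */
        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of the block's words that are link text. */
        double linkDensity() {
            int n = numWords();
            return n == 0 ? 0.0 : (double) linkedWords / n;
        }
    }

    // Hypothetical thresholds, picked by hand for this example only.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not mostly links. */
    static boolean isContent(Block b) {
        return b.numWords() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the text of all blocks classified as content, one per line. */
    static String extract(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (isContent(b)) {
                kept.add(b.text.trim());
            }
        }
        return String.join("\n", kept);
    }
}
```

A short navigation block made up entirely of link words would be dropped by this rule, while a full sentence of unlinked body text would be kept.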

I'd recommend using ArticleExtractor for any type of news article and
DefaultExtractor (or maybe LargestContentExtractor) for everything else. YMMV.
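
The recommendation above can be written down as a small lookup. This is only a sketch of the advice in this comment: `ExtractorChoiceSketch` and `PageKind` are hypothetical names, and the method returns extractor class names as plain strings so that the example runs without boilerpipe on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Encodes the rule of thumb from this thread; not part of boilerpipe. */
final class ExtractorChoiceSketch {

    /** Hypothetical coarse page categories used only for this example. */
    enum PageKind { NEWS_ARTICLE, OTHER }

    /** Returns extractor class names to try, in order of preference. */
    static List<String> recommended(PageKind kind) {
        if (kind == PageKind.NEWS_ARTICLE) {
            return Collections.singletonList("ArticleExtractor");
        }
        // For everything else: DefaultExtractor first, then maybe LargestContentExtractor.
        return Arrays.asList("DefaultExtractor", "LargestContentExtractor");
    }
}
```

With boilerpipe available, each name corresponds to a class in `de.l3s.boilerpipe.extractors`, so the news case would be along the lines of `ArticleExtractor.INSTANCE.getText(html)`.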

I have provided some benchmarks
(http://code.google.com/p/boilerpipe/wiki/Benchmarks) on the L3S-GN1 news
corpus as a starting point.

Original comment by ckkohl79 on 23 Feb 2011 at 8:09

GoogleCodeExporter commented 8 years ago

Original comment by ckkohl79 on 23 Feb 2011 at 8:10

Added labels: Priority-Low, Type-Other
Removed labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 8 years ago

Original comment by ckkohl79 on 6 Jul 2011 at 2:53

Changed state: Done
