ushahidi / SwiftRiver-Core

SwiftRiver Core Applications
Readability #9

Open 69mb opened 12 years ago

69mb commented 12 years ago

Implement Readability (http://code.google.com/p/arc90labs-readability/) as a pre-processing step before the drop is posted to the DB.

ekala commented 12 years ago

I looked into Readability a couple of months back. It's not that polished, but it's not a bad start for an article cleaner. There's a thorough analysis of these kinds of tools here: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

69mb commented 12 years ago

Implement this as an HTTP service that will be used by SwiftRiver-Core/rss and swiftriver-plugin-rss specifically to clean up the content.

We could use FiveFilters' method (http://www.keyvan.net/2011/03/content-extraction/):

  1. Check for hNews microformat markup.
  2. If that fails, use Instapaper-style site patterns.
  3. Fall back to Readability.
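
A rough sketch of that fallback chain, just to make the order concrete. Everything named here is an assumption for illustration: `SITE_PATTERNS` is a hypothetical host-to-element-id table (real Instapaper-style patterns are usually XPath rules), the hNews check only looks for the `entry-content` class that hNews inherits from hAtom, and `from_readability_stub` is a placeholder that returns the `<body>` text rather than running the actual Readability algorithm. It uses only the Python standard library and assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urlparse

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _TextInside(HTMLParser):
    """Collect the text inside the first element accepted by `match`."""

    def __init__(self, match: Callable[[str, dict], bool]):
        super().__init__()
        self.match = match
        self.depth = 0       # > 0 while we are inside the matched element
        self.done = False    # True once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.match(tag, dict(attrs)):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_inside(html: str, match: Callable[[str, dict], bool]) -> Optional[str]:
    parser = _TextInside(match)
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # normalize whitespace
    return text or None


def from_hnews(url: str, html: str) -> Optional[str]:
    # Step 1: hNews builds on hAtom, which marks the body with class "entry-content".
    return text_inside(
        html, lambda tag, a: "entry-content" in (a.get("class") or "").split())


# Hypothetical per-site table: hostname -> id of the element holding the article.
SITE_PATTERNS = {"example.com": "article-body"}


def from_site_pattern(url: str, html: str) -> Optional[str]:
    # Step 2: look up a hand-written pattern for this host, if there is one.
    wanted = SITE_PATTERNS.get(urlparse(url).hostname or "")
    if wanted is None:
        return None
    return text_inside(html, lambda tag, a: a.get("id") == wanted)


def from_readability_stub(url: str, html: str) -> Optional[str]:
    # Step 3: placeholder only. A real service would call a Readability port here.
    return text_inside(html, lambda tag, a: tag == "body")


def clean(url: str, html: str) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extractor in (from_hnews, from_site_pattern, from_readability_stub):
        text = extractor(url, html)
        if text:
            return text
    return None
```

The HTTP service would then only need to accept a URL plus the fetched HTML and return whatever `clean(url, html)` produces, so the extractors can be swapped out without touching the RSS code.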

ekala commented 12 years ago

+1. Great find!