Open 69mb opened 12 years ago
I looked into Readability a couple of months back. It's not as polished, but it's not a bad start for an article cleaner. There's a thorough analysis of these kinds of tools here: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/
Implement this as an HTTP service that Swiftriver-Core/rss and swiftriver-plugin-rss will use specifically to clean up the content.
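To make the HTTP service idea concrete, here is a rough sketch using only the Python standard library. Everything in it is an assumption for illustration: the `/clean` path, the port, and the names `TextExtractor`, `clean_html`, `CleanHandler`, and `serve` are made up, and the cleaner is a naive tag-skipping placeholder, not Readability's scoring algorithm. A real version would swap `clean_html` for the actual extractor.

```python
import json
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer


class TextExtractor(HTMLParser):
    """Collects visible text, skipping elements that rarely hold article content."""

    # Hypothetical skip list; a real cleaner would score blocks instead.
    SKIP = {"script", "style", "nav", "header", "footer", "aside", "form"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def clean_html(html):
    """Return the text outside skipped elements, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class CleanHandler(BaseHTTPRequestHandler):
    """POST raw HTML to /clean (hypothetical path), get JSON with the cleaned text."""

    def do_POST(self):
        if self.path != "/clean":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        payload = json.dumps({"content": clean_html(body)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8080):
    """Run the service until interrupted; host and port are placeholder defaults."""
    HTTPServer((host, port), CleanHandler).serve_forever()
```

Calling `serve()` starts the service; the RSS components would then POST each item's HTML to `/clean` and store the returned `content` field. Keeping the cleaner behind one function means the extraction method can change without touching the callers.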
We could use FiveFilters' method (http://www.keyvan.net/2011/03/content-extraction/):
+1. Great find!
Implement this --> http://code.google.com/p/arc90labs-readability/ as a pre-processing step before the drop is posted to the DB.