1. Read pipes systems
2. Save seed URLs to the database
3. Crawl seed URLs through Nutch (also look at the recrawl job system)
4. Query all or most URLs, possibly looking for content information
5. Feed URLs and possibly content data to Amazon S3
Also use Solr for searching
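
As a rough sketch of steps 2 and 3, the snippet below stores seed URLs in a database and then writes them out as a plain-text seed list (one URL per line), which is the form Nutch reads when seeding a crawl. Everything specific here is an assumption for illustration, not something decided in this issue: SQLite as the database, the `seed_urls` table, the `seed.txt` file name, and the helper names `save_seed_urls` and `export_seed_file`.

```python
import sqlite3
from pathlib import Path


def save_seed_urls(conn, urls):
    """Store seed URLs, skipping blank entries and duplicates."""
    # Hypothetical schema: a single table keyed on the URL itself.
    conn.execute("CREATE TABLE IF NOT EXISTS seed_urls (url TEXT PRIMARY KEY)")
    cleaned = [(u.strip(),) for u in urls if u.strip()]
    conn.executemany("INSERT OR IGNORE INTO seed_urls (url) VALUES (?)", cleaned)
    conn.commit()


def export_seed_file(conn, seed_dir):
    """Write the stored URLs to <seed_dir>/seed.txt, one URL per line."""
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"  # assumed file name
    rows = conn.execute("SELECT url FROM seed_urls ORDER BY url").fetchall()
    seed_file.write_text("".join(f"{url}\n" for (url,) in rows), encoding="utf-8")
    return seed_file
```

The exported directory could then be handed to the Nutch crawl as its seed directory. Keeping the URLs in the database rather than only in the file leaves room for the later steps (querying URLs and pushing them to Amazon S3) to read from the same place.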
Original issue reported on code.google.com by berlin.b...@gmail.com on 28 Nov 2007 at 6:13