liseryang / openbotlist

Automatically exported from code.google.com/p/openbotlist

Haskell, solr, amazon s3 (search payload) #18

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
1. Read pipes systems
2. Save seed urls to database
3. Crawl seed URLs through Nutch (also look at the recrawl job system)
4. Query all or most URLs, possibly looking for content information
5. Feed URLs and possibly content data to Amazon S3
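
Steps 2 and 5 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: SQLite stands in for the unspecified database, the `seed_urls` table name and the `urls/<sha1>.json` key scheme are hypothetical, and the S3 upload itself is left out, so the sketch only prepares the key and body that would be stored.

```python
import hashlib
import json
import sqlite3


def save_seed_urls(conn, urls):
    """Step 2: store seed URLs, skipping blanks and duplicates."""
    # Table name and schema are assumptions made for this sketch.
    conn.execute("CREATE TABLE IF NOT EXISTS seed_urls (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO seed_urls (url) VALUES (?)",
        [(u.strip(),) for u in urls if u.strip()],
    )
    conn.commit()


def load_seed_urls(conn):
    """Return all stored seed URLs in sorted order."""
    rows = conn.execute("SELECT url FROM seed_urls ORDER BY url")
    return [row[0] for row in rows]


def build_payload(url, content=None):
    """Step 5: build an (object key, JSON body) pair for one URL.

    The key scheme (SHA-1 of the URL under a 'urls/' prefix) is an
    assumption; the actual upload to S3 is not shown.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    key = "urls/" + digest + ".json"
    body = json.dumps({"url": url, "content": content}, sort_keys=True)
    return key, body


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_seed_urls(conn, ["http://example.com/", "http://example.org/"])
    for url in load_seed_urls(conn):
        key, body = build_payload(url)
        print(key, body)
```

Hashing the URL gives a fixed-length key that is safe to use as an object name regardless of the characters in the URL.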

Also use Solr for searching
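
As a rough idea of what the search side might look like, the sketch below builds a query URL for Solr's standard select handler using the `q`, `rows`, and `wt` parameters. The host, port, and `botlist` core name are placeholders, not something this issue specifies.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a Solr select URL for a free-text query.

    base_url is a hypothetical Solr core address supplied by the caller.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return base_url.rstrip("/") + "/select?" + params


if __name__ == "__main__":
    # Placeholder address; adjust to the real Solr instance.
    print(solr_select_url("http://localhost:8983/solr/botlist", "haskell crawler"))
```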

Original issue reported on code.google.com by berlin.b...@gmail.com on 28 Nov 2007 at 6:13

GoogleCodeExporter commented 9 years ago

Original comment by berlin.b...@gmail.com on 2 Apr 2008 at 9:01