liseryang / openbotlist

Automatically exported from code.google.com/p/openbotlist

System for spider updates (new database design) #45

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
----------------
Spider Update System
1. No web-to-binary writes: the web front end never writes directly to the binary stores

Pre: crawl from ???

(1-2) Raw format
Purpose: move records from the web crawl into the spider system

Storage: binary_database_on_file_system / distributed_amazon / remote_file_system
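
A minimal sketch of what stage (1-2) could look like, assuming length-prefixed pickled records appended to a binary file on the local file system; the record fields and file layout here are illustrative, not a confirmed format:

    import pickle
    import struct
    from dataclasses import dataclass

    @dataclass
    class RawLink:
        """One raw record from the web crawl (fields are assumptions)."""
        url: str
        title: str
        crawled_at: float  # epoch seconds

    def append_raw(path: str, record: RawLink) -> None:
        """Append one length-prefixed, pickled record to the raw store."""
        payload = pickle.dumps(record)
        with open(path, "ab") as f:
            f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
            f.write(payload)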

(2-3) Filtered/Clean/Important format
Purpose: remove bad/duplicate links before they reach the web system (archive)
Storage: binary_database_on_file_system / distributed_amazon / remote_file_system
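
A sketch of the (2-3) filter pass, assuming records shaped like the RawLink above; the URL normalization and "bad link" rules are assumptions, not the project's actual policy:

    from urllib.parse import urlsplit

    def normalize(url: str) -> str:
        """Canonicalize a URL so trivially different duplicates collide."""
        parts = urlsplit(url.strip().lower())
        return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"

    def clean(records):
        """Yield records with bad and duplicate links removed."""
        seen = set()
        for rec in records:
            if not rec.url.lower().startswith(("http://", "https://")):
                continue  # bad link: unsupported scheme
            key = normalize(rec.url)
            if key in seen:
                continue  # duplicate link
            seen.add(key)
            yield rec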

(3-4) Purpose: move data from the binary files to the web; from the archive to the front end
(do we lose content information here?)
Storage: archive_partition_system(s)
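
One way the archive_partition_system(s) could be addressed, assuming links are hashed across a fixed number of archive tables (the count echoes FAQ 5's "Possibly 100 tables?"); the table naming scheme is an assumption:

    import zlib

    NUM_PARTITIONS = 100  # "Possibly 100 tables?" per FAQ 5 below

    def archive_table_for(url: str) -> str:
        """Route a link to one of the partitioned archive tables."""
        bucket = zlib.crc32(url.encode("utf-8")) % NUM_PARTITIONS
        return f"archive_links_{bucket:03d}"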

User/rover-submitted links:

Written straight to the most recent table (entity_links?) and to the archive
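
A sketch of that user/rover path, assuming a SQL store with an entity_links web table (the name the note itself questions) and an archive_links table (my name); the double write matches FAQ 3 below:

    import sqlite3

    def submit_link(conn: sqlite3.Connection, url: str, title: str) -> None:
        """Store a submitted link twice: web link table and archive (FAQ 3)."""
        conn.execute("INSERT INTO entity_links (url, title) VALUES (?, ?)",
                     (url, title))
        conn.execute("INSERT INTO archive_links (url, title) VALUES (?, ?)",
                     (url, title))
        conn.commit()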

(4-end) Purpose: display the popular links to the user
Storage: web_links_front_end
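
And the (4-end) read path, assuming the front end queries only the bounded web table, never the archive; the hit_count popularity column is an assumption:

    def popular_links(conn, limit: int = 50):
        """Fetch the most popular links for display on the front end."""
        return conn.execute(
            "SELECT url, title FROM entity_links "
            "ORDER BY hit_count DESC LIMIT ?", (limit,)).fetchall()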

FAQ:
1. Where do you get the raw format seed data from?
A: Ideally from the web front end?

2. How much should be stored in the web front end?
A: 100,000-200,000 links

3. Are web/rover links submitted twice?
A: Yes, once in the web link table and once in the archive table

4. Can you purge the web links table?
A: Yes, but you cannot purge the archive (see the sketch after this list)

5. How much data can you find in the archive_partition system?
A: Infinite. DMOZ has the following:
    4,830,584 sites - 75,151 editors - over 590,000 categories
    100 million article links? -- comments? linked by objectid?
    Possibly 100 tables?
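
A sketch of the purge described in FAQ 4, assuming an autoincrementing id column on the web table: entity_links is trimmed to the FAQ 2 cap while the archive tables are left untouched:

    def purge_web_links(conn, keep: int = 200_000) -> None:
        """Trim entity_links to the newest rows (FAQ 2: 100,000-200,000).
        The archive tables are never purged (FAQ 4)."""
        conn.execute(
            "DELETE FROM entity_links WHERE id NOT IN "
            "(SELECT id FROM entity_links ORDER BY id DESC LIMIT ?)",
            (keep,))
        conn.commit()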

------

Original issue reported on code.google.com by berlin.b...@gmail.com on 31 Dec 2007 at 9:39

GoogleCodeExporter commented 9 years ago

Original comment by berlin.b...@gmail.com on 2 Apr 2008 at 9:05