netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

Outline hardware and setup for different types of archives #349

Open tokee opened 1 year ago

tokee commented 1 year ago

Starting from scratch, it is very hard to guess what the requirements are for a SolrWayback setup. There should be a guide with common setups that outlines hardware, overall setup and challenges, e.g.

0-100GB of WARCs

Index workflow, search engine and frontend should be able to run using a total of 4GB of RAM on just about any current machine. In case of crash: Reindex.

100GB-1TB of WARCs

SSD highly recommended, 4 CPUs, 8GB of RAM (need to test this - might need 10-12GB), single machine setup or 2 machines for redundancy, WARC index logistics from command line

1TB-50TB of WARCs, single collection

SSD essential, RAM for caching, separation of index & search, multi machine, fully live index, WARC index logistics possible from command line but consider Hadoop/netsearch/generic workflow engine

1TB-50TB of WARCs, multi collection

Same as single collection, but consider freezing finished collections

50TB-1PB of WARCs

As above, but with an automated logistics system, freezing of finished collections highly recommended, and a focus on the practical limitations of Solr sharding

2PB-5PB of WARCs

If everything is to be searched in the same cloud, a strong focus is needed on freezing and on minimizing the shard/collection count, weighed against the single shard size maximum of ~1TB
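The trade-off between shard count and the ~1TB shard size cap can be made concrete with a rough calculation. This is only a sketch: `min_shards` is a hypothetical helper, the input is the size of the finished Solr index (not the WARC size, since the ratio between the two is not stated here and has to be measured), and the 1TB default is just the approximate cap mentioned above.

```python
import math


def min_shards(index_size_tb: float, max_shard_tb: float = 1.0) -> int:
    """Lower bound on the shard count for a given total index size.

    index_size_tb: total Solr index size in TB (measure it; do not guess
                   it from the WARC size).
    max_shard_tb:  assumed practical maximum size of a single shard.
    """
    if index_size_tb < 0 or max_shard_tb <= 0:
        raise ValueError("sizes must be non-negative and the cap positive")
    # At least one shard is always needed, even for a tiny index.
    return max(1, math.ceil(index_size_tb / max_shard_tb))
```

For example, an index measured at 120TB cannot be held in fewer than 120 shards under this cap, which is why freezing finished collections and keeping each shard close to the cap matters at this scale.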

5PB+ of WARCs

Uncharted territory. Trivial to do by using multiple separate clouds, but hard if full corpus search is needed. Can be helped by compromising on indexed text size and features.
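As a rough illustration, the size tiers above could be turned into a lookup for the planned guide. This is a sketch only: `tier_for` is a hypothetical helper, decimal units are assumed (1TB = 1000GB, 1PB = 1000TB), upper bounds are treated as inclusive, and since the list does not cover 1PB-2PB, that range is assigned to the 2PB-5PB tier here as an assumption.

```python
import bisect

# Upper bounds of each tier in TB (decimal units assumed).
_UPPER_BOUNDS_TB = [0.1, 1, 50, 1000, 5000]

# One summary per tier, condensed from the list above.
_TIERS = [
    "0-100GB: any current machine, ~4GB RAM total, reindex after a crash",
    "100GB-1TB: SSD recommended, 4 CPUs, 8GB+ RAM, command-line indexing",
    "1TB-50TB: SSD essential, separate index and search, multi machine",
    "50TB-1PB: automated logistics, freeze finished collections",
    "2PB-5PB: freeze and minimize shard/collection count (~1TB shards)",
    "5PB+: uncharted territory, consider multiple separate clouds",
]


def tier_for(warc_size_tb: float) -> str:
    """Return the tier summary for a total WARC size given in TB."""
    if warc_size_tb < 0:
        raise ValueError("size must be non-negative")
    # bisect_left keeps a size equal to a bound in the lower tier.
    return _TIERS[bisect.bisect_left(_UPPER_BOUNDS_TB, warc_size_tb)]
```

Calling `tier_for(20)` returns the 1TB-50TB summary. The single/multi collection split for that tier is not modelled; it would need a second input.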

VictorHarbo commented 1 year ago

Slowly building wiki. Here is a link to the page related to this issue: https://github.com/netarchivesuite/solrwayback/wiki/Requirements-for-different-archive-sizes