Starting from scratch, it is very hard to guess the requirements for a SolrWayback setup. There should be a guide to common setups that outlines hardware, overall architecture and challenges, e.g.:
0-100GB of WARCs
The index workflow, search engine and frontend should all be able to run in a total of 4GB of RAM on just about any current machine. In case of a crash, simply reindex.
100GB-1TB of WARCs
SSD highly recommended, 4 CPU cores, 8GB of RAM (untested; might need 10-12GB), single-machine setup or 2 machines for redundancy, WARC index logistics handled from the command line
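At this scale, "WARC index logistics from command line" can be as simple as a loop over the WARC files. A minimal dry-run sketch is below; `warc-indexer.jar` and the Solr URL are placeholders, so substitute the actual indexer and collection endpoint for your installation:

```shell
#!/bin/sh
# index_warcs: print (dry-run) one index command per WARC file in a directory.
# Drop the echo to actually execute the commands.
index_warcs() {
  warc_dir="$1"
  solr_url="$2"
  for warc in "$warc_dir"/*.warc.gz; do
    [ -e "$warc" ] || continue   # skip if the glob matched nothing
    echo java -jar warc-indexer.jar -s "$solr_url" "$warc"
  done
}

index_warcs ./warcs "http://localhost:8983/solr/netarchive"
```

At a few hundred GB this sequential approach is fine; the multi-TB tiers below are where a workflow engine starts to pay off.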
1TB-50TB of WARCs, single collection
SSD essential, ample RAM for caching, separation of indexing & search, multi-machine, fully live index; WARC index logistics still possible from the command line, but consider Hadoop/netsearch/a generic workflow engine
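The "separation of index & search" idea is to do the heavy indexing on a dedicated machine and only ship finished indexes to the search nodes. A dry-run sketch, with hypothetical host names and paths:

```shell
#!/bin/sh
# ship_shard: print (dry-run) the command to copy a finished index shard
# from the indexing machine to a search node. Drop the echo to execute.
ship_shard() {
  shard_dir="$1"     # e.g. /index/shard42 (hypothetical path)
  search_host="$2"   # e.g. searcher01 (hypothetical host)
  echo rsync -a --delete "$shard_dir/" "$search_host:$shard_dir/"
}

ship_shard /index/shard42 searcher01
```

Keeping indexing off the search hardware means query latency does not suffer while new material is being processed.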
1TB-50TB of WARCs, multi collection
Same as single collection, but consider freezing finished collections
50TB-1PB of WARCs
As above, but an automated logistics system and freezing of finished collections are highly recommended; focus on the practical limitations of Solr sharding
2PB-5PB of WARCs
If everything is to be searched in the same cloud, a strong focus is needed on freezing and on balancing shard/collection count against the practical single-shard maximum of ~1TB
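To see why shard count becomes the dominant concern at this tier, a back-of-envelope calculation helps. All numbers below are assumptions to be adjusted: the index-to-WARC size ratio in particular varies widely with content and schema, so 10% is only an illustrative guess.

```shell
#!/bin/sh
# Back-of-envelope shard count for a petabyte-scale corpus.
warc_pb=3            # assumed corpus size in PB
ratio=0.10           # assumed index size as a fraction of WARC size
shard_tb=1           # practical single-shard maximum of ~1TB
# shards = ceil(corpus_in_TB * ratio / shard_size)
shards=$(awk -v w="$warc_pb" -v r="$ratio" -v s="$shard_tb" \
  'BEGIN { printf "%d", (w * 1024 * r) / s + 0.999 }')
echo "$shards"   # prints 308 for these assumptions
```

Hundreds of ~1TB shards in one cloud is why freezing finished shards and keeping the collection count down matters so much here.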
5PB+ of WARCs
Uncharted territory. Trivial to do with multiple separate clouds, but hard if full-corpus search is needed. Can be helped by compromising on indexed text size and features.