When delivering batches of generated documents to Solr, the best throughput is achieved by minimizing the number of deliveries. Currently the batch size is controlled by warc.solr.batch_size, which states the maximum number of documents per delivery. This is problematic: if the documents are tiny, throughput will be sub-optimal; if they are huge, the JVM might run out of memory (OutOfMemoryError).

There should be an alternative limit in bytes, roughly estimated by summing the string lengths of the field values in the generated documents.
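A minimal sketch of what such a dual limit could look like, assuming SolrJ. The class name BoundedBatcher and the byte-limit parameter are illustrative only; the byte estimate counts characters via String lengths, matching the rough counting described above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;

/**
 * Buffers documents and flushes when either the document-count limit
 * (warc.solr.batch_size semantics) or a byte limit is reached.
 */
public class BoundedBatcher {
    private final SolrClient solr;
    private final int maxDocs;   // existing warc.solr.batch_size semantics
    private final long maxBytes; // proposed byte limit (name and default are assumptions)

    private final List<SolrInputDocument> buffer = new ArrayList<>();
    private long bufferedBytes = 0;

    public BoundedBatcher(SolrClient solr, int maxDocs, long maxBytes) {
        this.solr = solr;
        this.maxDocs = maxDocs;
        this.maxBytes = maxBytes;
    }

    /** Rough size estimate: sum of the String lengths of all field names and values. */
    private static long estimateBytes(SolrInputDocument doc) {
        long size = 0;
        for (SolrInputField field : doc) {
            size += field.getName().length();
            for (Object value : field.getValues()) {
                size += String.valueOf(value).length();
            }
        }
        return size;
    }

    public void add(SolrInputDocument doc) throws SolrServerException, IOException {
        buffer.add(doc);
        bufferedBytes += estimateBytes(doc);
        // Flush on whichever limit is hit first, so tiny documents still fill
        // large batches while huge documents cannot exhaust the heap.
        if (buffer.size() >= maxDocs || bufferedBytes >= maxBytes) {
            flush();
        }
    }

    public void flush() throws SolrServerException, IOException {
        if (buffer.isEmpty()) {
            return;
        }
        solr.add(buffer); // one delivery for the whole batch
        buffer.clear();
        bufferedBytes = 0;
    }
}

Note that String length counts chars rather than encoded bytes, so the estimate is approximate; that is acceptable here since the limit only needs to prevent extreme batch sizes, not enforce an exact memory budget.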