ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Batch size should have a byte limit #203

Closed (tokee closed this issue 3 years ago)

tokee commented 5 years ago

When delivering batches of generated documents to Solr, the best throughput is achieved by minimizing the number of deliveries. Currently the batch size is set with warc.solr.batch_size, which states the maximum number of documents per batch. This is problematic: if the documents are tiny, each delivery carries little data and performance will be sub-optimal; if the documents are huge, the JVM might run out of memory (OutOfMemoryError).

There should be an alternative limit in bytes, roughly counted by looking at string lengths in the generated documents.
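A minimal sketch of what such a dual limit could look like, assuming documents are represented as plain field maps rather than the project's actual Solr document class. The class name `ByteLimitedBatcher`, the `maxBytes` parameter, and the two-bytes-per-char estimate are all hypothetical and not taken from the codebase; the only existing setting is the document-count limit.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical batcher: flushes when either the document count or the
// estimated byte size of the pending batch reaches its limit.
class ByteLimitedBatcher {
    private final int maxDocs;      // plays the role of warc.solr.batch_size
    private final long maxBytes;    // hypothetical byte limit
    private final Consumer<List<Map<String, Object>>> sink; // stand-in for the Solr delivery
    private final List<Map<String, Object>> batch = new ArrayList<>();
    private long batchBytes = 0;

    ByteLimitedBatcher(int maxDocs, long maxBytes,
                       Consumer<List<Map<String, Object>>> sink) {
        this.maxDocs = maxDocs;
        this.maxBytes = maxBytes;
        this.sink = sink;
    }

    // Rough estimate: sum the string lengths of field names and values,
    // assuming up to 2 bytes per char on the heap (an assumption, not a measurement).
    static long estimateBytes(Map<String, Object> doc) {
        long chars = 0;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            chars += e.getKey().length();
            Object value = e.getValue();
            if (value instanceof Collection) {
                // Multi-valued field: count every value.
                for (Object item : (Collection<?>) value) {
                    chars += String.valueOf(item).length();
                }
            } else {
                chars += String.valueOf(value).length();
            }
        }
        return chars * 2;
    }

    void add(Map<String, Object> doc) {
        long size = estimateBytes(doc);
        // Deliver the pending batch first if this document would push it over the byte limit.
        if (!batch.isEmpty() && batchBytes + size > maxBytes) {
            flush();
        }
        batch.add(doc);
        batchBytes += size;
        // A single oversized document is still delivered, alone in its batch.
        if (batch.size() >= maxDocs || batchBytes >= maxBytes) {
            flush();
        }
    }

    void flush() {
        if (batch.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(batch)); // hand over a copy, then reset
        batch.clear();
        batchBytes = 0;
    }
}
```

With this shape, a high document limit combined with a byte limit lets tiny documents fill large batches, while huge documents trigger an early flush before the pending batch grows too large for the heap.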