Major cleanup of the whole file output/Solr/Elasticsearch mess in WARCIndexerCommand: previously, half of the code in that class was special-casing the different output possibilities. This has now been generalised to a DocumentConsumer interface, with implementations for the different outputs and a factory method for creating them.
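For reviewers, the shape of the change is roughly the following. This is only a minimal sketch: apart from the name DocumentConsumer, every class, method and parameter here (ConsumerFactorySketch, FileConsumer, SolrConsumer, create, the "file"/"solr" strings) is a hypothetical stand-in, and the in-memory lists replace real file and Solr I/O. The actual signatures in the PR may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: names and signatures are assumptions,
// not the actual code in this PR.
final class ConsumerFactorySketch {

    /** Common contract that every output target implements. */
    interface DocumentConsumer extends AutoCloseable {
        void add(String document);
        void flush();
        @Override
        void close();
    }

    /** Stand-in for file output; "written" replaces real file I/O. */
    static final class FileConsumer implements DocumentConsumer {
        final List<String> written = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        @Override public void add(String document) { pending.add(document); }
        @Override public void flush() { written.addAll(pending); pending.clear(); }
        @Override public void close() { flush(); }
    }

    /** Stand-in for direct Solr indexing; "sent" replaces real requests. */
    static final class SolrConsumer implements DocumentConsumer {
        final List<String> sent = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        @Override public void add(String document) { pending.add(document); }
        @Override public void flush() { sent.addAll(pending); pending.clear(); }
        @Override public void close() { flush(); }
    }

    /** Factory method: the caller no longer special-cases the output type. */
    static DocumentConsumer create(String target) {
        switch (target) {
            case "file": return new FileConsumer();
            case "solr": return new SolrConsumer();
            default:
                throw new IllegalArgumentException("Unknown output target: " + target);
        }
    }
}
```

With something like this in place, the command only needs to call the factory once and then use add and close, regardless of where the documents end up.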
As part of the clean up, argument parsing code has been reduced and refactored.
As part of the refactor, a maximum size in bytes has been added as a flush trigger for DocumentConsumers.
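To illustrate the byte-size flush trigger, here is a minimal sketch. The class SizeTriggeredBuffer and its fields are hypothetical. It also assumes that a document-count trigger exists alongside the byte trigger, that size is measured as UTF-8 bytes, and that a flush fires when a limit is reached rather than exceeded. The real implementation may count and compare differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a flush trigger based on buffered bytes.
// All names and the exact trigger semantics are assumptions.
final class SizeTriggeredBuffer {
    private final int maxDocs;      // assumed document-count trigger
    private final long maxBytes;    // byte-size trigger
    private final List<String> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;

    SizeTriggeredBuffer(int maxDocs, long maxBytes) {
        this.maxDocs = maxDocs;
        this.maxBytes = maxBytes;
    }

    /** Buffers a document and flushes when either limit is reached. */
    void add(String document) {
        pending.add(document);
        pendingBytes += document.getBytes(StandardCharsets.UTF_8).length;
        if (pending.size() >= maxDocs || pendingBytes >= maxBytes) {
            flush();
        }
    }

    /** A real consumer would write or send the batch here. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        pendingBytes = 0;
        flushCount++;
    }

    int getFlushCount() { return flushCount; }
    int getPendingDocs() { return pending.size(); }
}
```

The point of the byte limit is that a batch of a few very large documents is flushed early, even when the document-count limit is far from being reached.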
This has been tested for:
- Output to a single GZipped file
- Output to multiple files
- Direct indexing in Solr
Not tested in Elasticsearch as my machine (hopefully temporarily) refuses to run the Elasticsearch Docker image.