ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki
117 stars, 25 forks

Document consumer #258

Closed by tokee 3 years ago

tokee commented 3 years ago

Major cleanup of the whole fileoutput/solr/elastic mess in WARCIndexerCommand: previously, half of the code in that class was special-casing the different output possibilities. This has now been generalised to a DocumentConsumer interface, with implementations for the different outputs and a factory method for creating them.
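For readers unfamiliar with the pattern, here is a minimal sketch of the interface-plus-factory shape described above. Everything in it is an assumption for illustration: `SketchConsumer`, `BufferingConsumer`, `FileSketchConsumer`, `SolrSketchConsumer`, `SketchConsumerFactory` and the `"file"`/`"solr"` keys are hypothetical names, not the signatures used in this PR, and the backends are faked with in-memory fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DocumentConsumer-style interface (not the PR's actual API).
interface SketchConsumer {
    void add(String doc);   // accept one serialised document
    void flush();           // push buffered documents to the backend
    int flushedCount();     // exposed only so the sketch can be inspected
}

// Shared buffering logic; subclasses only decide where a batch goes.
abstract class BufferingConsumer implements SketchConsumer {
    private final List<String> buffer = new ArrayList<>();
    private int flushed = 0;

    @Override
    public void add(String doc) {
        buffer.add(doc);
    }

    @Override
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        send(new ArrayList<>(buffer));
        flushed += buffer.size();
        buffer.clear();
    }

    @Override
    public int flushedCount() {
        return flushed;
    }

    protected abstract void send(List<String> batch);
}

// Assumed file output: a list stands in for a file on disk.
class FileSketchConsumer extends BufferingConsumer {
    final List<String> written = new ArrayList<>();

    @Override
    protected void send(List<String> batch) {
        written.addAll(batch);
    }
}

// Assumed Solr output: a counter stands in for HTTP update requests.
class SolrSketchConsumer extends BufferingConsumer {
    int requests = 0;

    @Override
    protected void send(List<String> batch) {
        requests++;
    }
}

// The factory is the single place that knows which implementation to build.
final class SketchConsumerFactory {
    private SketchConsumerFactory() {}

    static SketchConsumer create(String output) {
        switch (output) {
            case "file": return new FileSketchConsumer();
            case "solr": return new SolrSketchConsumer();
            default: throw new IllegalArgumentException("Unknown output: " + output);
        }
    }
}
```

With this shape, the calling command only needs `SketchConsumerFactory.create(...)` followed by `add` and `flush`, so no output-specific branches remain in the caller.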

As part of the cleanup, the argument parsing code has been reduced and refactored.

As part of the refactor, a maximum size in bytes has been added as a flush trigger for DocumentConsumers.
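A rough sketch of what a byte-size flush trigger can look like, under stated assumptions: the class `SizeBoundedBuffer`, its constructor parameters, measuring size as UTF-8 bytes, and the companion document-count trigger are all hypothetical and not taken from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer that flushes on either of two limits (names are assumptions).
class SizeBoundedBuffer {
    private final int maxDocs;     // assumed count-based trigger
    private final long maxBytes;   // size-based trigger, measured in UTF-8 bytes
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushes = 0;

    SizeBoundedBuffer(int maxDocs, long maxBytes) {
        this.maxDocs = maxDocs;
        this.maxBytes = maxBytes;
    }

    void add(String doc) {
        buffer.add(doc);
        bufferedBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        // Flush as soon as either limit is reached.
        if (buffer.size() >= maxDocs || bufferedBytes >= maxBytes) {
            flush();
        }
    }

    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // A real consumer would send the batch to its backend here.
        buffer.clear();
        bufferedBytes = 0;
        flushes++;
    }

    int flushes() {
        return flushes;
    }

    int pending() {
        return buffer.size();
    }
}
```

A byte limit of this kind bounds memory use and request size when individual documents vary widely in size, which a count-only limit cannot do.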

This has been tested for

Not tested with Elasticsearch, as my machine (hopefully temporarily) refuses to run the Elasticsearch Docker image.