ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add option for JSONL output #299

Closed anjackson closed 1 year ago

anjackson commented 1 year ago

Ahead of #297, a first step that can run on older Hadoop is to just output a JSONL version of the records. This could then be sent to Solr or made into Parquet as needed.
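As a rough illustration of the "sent to Solr" path, here is a minimal Python sketch of how a downstream consumer might turn JSONL lines into batched JSON arrays that could be POSTed to a Solr update handler. It is not part of webarchive-discovery: the function names, the batch size, and the `id`/`url` fields in the usage note are assumptions made for the example, and the real records follow whatever the indexer's schema emits.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batches(records: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group records into lists of at most `size` items."""
    batch: List[Dict[str, Any]] = []
    for record in records:
        batch.append(record)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly short, batch.
        yield batch


def solr_update_bodies(lines: Iterable[str], batch_size: int = 500) -> Iterator[str]:
    """Yield JSON array strings, one per batch of documents.

    Each string is intended as the body of a POST (Content-Type:
    application/json) to a Solr update handler; the HTTP call itself
    is left out of this sketch.
    """
    for batch in batches(read_jsonl(lines), batch_size):
        yield json.dumps(batch)
```

For example, iterating `solr_update_bodies(open(path), batch_size=1000)` over one of the job's output files (the path is whatever the job wrote) would give one request body per thousand records. Keeping the HTTP step separate means the same reader could feed a Parquet writer instead.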

This will be added to the Hadoop version first, following a similar pattern to the hadoop3 branch where an explicit Memento bean that covers the Solr schema is provided. This then drives the JSON serialisation, but can also define the Parquet schema. Some outstanding issues are:

For further enhancements, see #307

anjackson commented 1 year ago

Verified as running fine on old Hadoop:

$ hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-3.2.0-SNAPSHOT-job.jar uk.bl.wa.hadoop.indexer.WARCIndexerRunner -i files.txt -o jsonl-test --jsonl --no-solr -s http://null

The CLI could do with cleaning up, maybe making it more consistent with the non-Hadoop version.

anjackson commented 1 year ago

I'm going to split this ticket into base JSONL functionality and more open-ended questions and options. We need to understand the overall usage/workflow and verify that this is at all useful before working on it any more.

anjackson commented 1 year ago

Added ssdeep in 3a7013ea