Add option for JSONL output

anjackson commented 1 year ago

Ahead of #297, a first step that can run on older Hadoop is to just output a JSONL version of the records. This could then be sent to Solr or converted to Parquet as needed.

This will be added to the Hadoop version first, following a similar pattern to the hadoop3 branch where an explicit Memento bean that covers the Solr schema is provided. This then drives the JSON serialisation, but can also define the Parquet schema. Some outstanding issues are:

[x] Add JSONL output to Hadoop mode.
[x] Add 'all in' default configuration for dataset extraction and use that for JSONL output.
[x] Add the JSONL option to the non-Hadoop command-line version, using the 'all in' config.
[x] Set the default main class for both CLI and Hadoop jars.
[x] Make the Hadoop CLI better (e.g. no Solr and no-solr are contradictory) and more consistent with the normal CLI.
[ ] ~~Fill in source_file_path in local CLI mode (source_file == source_file_path???)~~ See #308
[x] How to store the ssdeep hash?
[x] Store ssdeep hash in JSONL in the usual long form: blockSize:Hash:HashOf2xBlockSize:filename
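
To illustrate the idea of a bean driving the JSON serialisation, here is a minimal sketch in plain Java. It is not the actual Memento bean or the project's serialisation code: the class `JsonlSketch`, the `Record` fields, the JSON key names, and the helper methods are all hypothetical, and a real implementation would more likely use a JSON library than hand-rolled escaping. The ssdeep helper only assumes the colon-separated long form described in the list above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonlSketch {

    /** Hypothetical stand-in for the Memento bean; field names are illustrative only. */
    static class Record {
        final String url;
        final String contentType;
        final String ssdeep;

        Record(String url, String contentType, String ssdeep) {
            this.url = url;
            this.contentType = contentType;
            this.ssdeep = ssdeep;
        }

        /** Ordered view of the fields, so the output key order is stable. */
        Map<String, String> toFields() {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("url", url);
            fields.put("content_type", contentType);
            fields.put("ssdeep", ssdeep);
            return fields;
        }
    }

    /** Builds the long form described above: blockSize:Hash:HashOf2xBlockSize:filename. */
    static String ssdeepLongForm(int blockSize, String hash, String hash2x, String filename) {
        return blockSize + ":" + hash + ":" + hash2x + ":" + filename;
    }

    /** Minimal JSON string escaping: quotes, backslashes and control characters. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Serialises one record as a single-line JSON object, skipping null fields. */
    static String toJsonLine(Record r) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : r.toFields().entrySet()) {
            if (e.getValue() != null) {
                parts.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
            }
        }
        return "{" + String.join(",", parts) + "}";
    }

    /** JSONL is one JSON object per line, each terminated by a newline. */
    static String toJsonl(List<Record> records) {
        StringBuilder sb = new StringBuilder();
        for (Record r : records) {
            sb.append(toJsonLine(r)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        // All values below are made-up sample data.
        records.add(new Record("http://example.org/", "text/html",
                ssdeepLongForm(3, "abc", "de", "example.warc.gz")));
        records.add(new Record("http://example.org/a.png", "image/png", null));
        System.out.print(toJsonl(records));
    }
}
```

Because every record lands on its own line, the output can be split, concatenated, or streamed line by line, which is what should make it easy to post to Solr or convert to Parquet later.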

For further enhancements, see #307

anjackson commented 1 year ago

Verified as running fine on old Hadoop

$ hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-3.2.0-SNAPSHOT-job.jar uk.bl.wa.hadoop.indexer.WARCIndexerRunner -i files.txt -o jsonl-test --jsonl --no-solr -s http://null

The CLI could do with cleaning up, maybe making it more consistent with the non-Hadoop version.

anjackson commented 1 year ago

I'm going to split this ticket into base JSONL functionality and more open-ended questions and options. We need to understand the overall usage/workflow and verify that this is at all useful before working on it any more.

anjackson commented 1 year ago

Added ssdeep in 3a7013ea

ukwa / webarchive-discovery

Add option for JSONL output #299