Closed anjackson closed 1 year ago
Verified as running fine on old Hadoop
$ hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-3.2.0-SNAPSHOT-job.jar uk.bl.wa.hadoop.indexer.WARCIndexerRunner -i files.txt -o jsonl-test --jsonl --no-solr -s http://null
The CLI could do with some cleaning up, maybe making it more consistent with the non-Hadoop version.
I'm going to split this ticket into base JSONL functionality and more open-ended questions and options. We need to understand the overall usage/workflow and verify that this is at all useful before working on it any more.
Added ssdeep in 3a7013ea
Ahead of #297 a first step that can run on older Hadoop is to just output a JSONL version of the records. This could then be sent to Solr or made into Parquet as needed.
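To make the intended output concrete, here is a minimal sketch of emitting one JSON object per line (JSONL) per record, using only the JDK. The field names and the `toJsonLine` helper are illustrative, not the project's actual `Memento` bean or serialiser:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonlSketch {

    // Minimal JSON string escaping (enough for this sketch).
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Serialise one flat record as a single JSON line.
    static String toJsonLine(Map<String, String> record) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(esc(e.getKey())).append("\":\"")
              .append(esc(e.getValue())).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // Hypothetical record fields, loosely following the Solr schema idea.
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("url", "http://example.org/");
        rec.put("content_type", "text/html");
        // prints {"url":"http://example.org/","content_type":"text/html"}
        System.out.println(toJsonLine(rec));
    }
}
```

Each line of the output file is then independently parseable, which is what makes it straightforward to feed into Solr or convert to Parquet later.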
This will be added to the Hadoop version first, following a similar pattern to the `hadoop3` branch, where an explicit `Memento` bean that covers the Solr schema is provided. This then drives the JSON serialisation, but can also define the Parquet schema. Some outstanding issues are:

- Clean up the CLI options (e.g. `--jsonl` and `--no-solr` are contradictory) and make them more consistent with the normal CLI.
- Fill in `source_file_path` (see #308).
- How to handle `source_file_path` in local CLI mode (`source_file` == `source_file_path`???).
- How to store the ssdeep hash? The format is `blockSize:Hash:HashOf2xBlockSize:filename`.
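On that last point, a sketch of splitting the colon-delimited digest into its parts, assuming the three-field `blockSize:Hash:HashOf2xBlockSize` shape described above (filename handling left aside; the digest value below is illustrative, not a real hash):

```java
public class SsdeepParts {

    // Split an ssdeep-style digest of the form blockSize:hash:hashOf2xBlockSize.
    // Hypothetical helper; field order follows the issue's description.
    static String[] split(String digest) {
        String[] parts = digest.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not an ssdeep digest: " + digest);
        }
        return parts; // [blockSize, hash, hashOf2xBlockSize]
    }

    public static void main(String[] args) {
        String[] p = split("3:abcdef:ghijkl"); // illustrative value only
        System.out.println(p[0]); // block size, e.g. "3"
    }
}
```

Storing the parts separately would allow filtering by block size before comparing hashes, which is how ssdeep matching is usually narrowed down.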
For further enhancements, see #307