Closed by anjackson 2 years ago
You can use the DOCUMENTS_FOUND_DB_URI environment variable to change the location of the database.
The CLI has also been modified so the DB URI can be set from there too.
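For anyone wiring this up, here is a minimal sketch of how the two settings might be resolved together. Only the DOCUMENTS_FOUND_DB_URI variable name comes from this thread; the `--db-uri` flag name, the default URI, the `resolve_db_uri` helper, and the rule that the CLI value wins over the environment are all assumptions made for illustration, not the tool's confirmed behaviour.

```python
import argparse
import os

# Hypothetical fallback: the real default is not stated in this thread.
DEFAULT_DB_URI = "sqlite:///documents_found.db"


def resolve_db_uri(argv=None, environ=None):
    """Pick the database URI: CLI flag first, then env var, then default.

    The precedence order is an assumption for this sketch.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(description="Document harvester (sketch)")
    # '--db-uri' is a hypothetical flag name, not taken from the real CLI.
    parser.add_argument("--db-uri", default=None,
                        help="Database URI (overrides DOCUMENTS_FOUND_DB_URI)")
    args = parser.parse_args(argv)

    if args.db_uri:
        return args.db_uri
    return environ.get("DOCUMENTS_FOUND_DB_URI", DEFAULT_DB_URI)


if __name__ == "__main__":
    # Explicit empty argument list so the example ignores the real command line.
    print(resolve_db_uri([]))
```

Passing `argv` and `environ` explicitly keeps the lookup easy to test without touching the real process environment.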
Okay, new Airflow version addresses these issues.
The core document harvester is working as before, via the docharv process command. But some improvements would make it easier:

- warc/revisit entries, which behave like new records. They need to be classified as revisits so they don't keep getting re-checked.
- duplicate:digest.
- TODAY - X DAYS, add a configurable job name.
- import-jsonl to pull document JSONL records in from plain files.

Some example documents