It seems the 2018 and 2019 domain crawls may not have been CDX indexed. We need to design a suitable Airflow DAG to perform this backfill.
The idiomatic Airflow version would be a proper backfilling task with a start date of e.g. 2010, using the last-modified dates of the files on HDFS, where each run works through everything available to be indexed for its interval. For example, an @monthly task that lists all WARCs from the previous month and then indexes them in chunks of e.g. 2000 WARCs.
This would mean changing the windex utility to (a) be able to filter on a date window instead of X years back, and (b) be able to loop over all matching WARCs rather than just running one batch.
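The core of both windex changes can be sketched in plain Python: a date-window filter over (path, last-modified) pairs, and a loop that yields fixed-size batches until everything in the window has been covered. This is a minimal sketch only; the paths, the in-memory WARC list, and the helper names are hypothetical stand-ins for whatever the real HDFS listing and windex internals provide.

```python
from datetime import datetime

# Hypothetical WARC records as (hdfs_path, last_modified) pairs.
# In the real task these would come from an HDFS file listing.
warcs = [
    ("/warcs/a.warc.gz", datetime(2018, 3, 5)),
    ("/warcs/b.warc.gz", datetime(2018, 3, 20)),
    ("/warcs/c.warc.gz", datetime(2018, 4, 1)),
]

def warcs_in_window(warcs, start, end):
    """Keep WARCs whose last-modified date falls in [start, end)."""
    return [path for path, mtime in warcs if start <= mtime < end]

def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# One @monthly run: filter to the previous month, then index in batches.
march = warcs_in_window(warcs, datetime(2018, 3, 1), datetime(2018, 4, 1))
for batch in chunked(march, 2000):
    # Each batch would be handed to windex as a single indexing job.
    print(len(batch))
```

In the Airflow DAG, the window bounds would come from the task's scheduled interval rather than hard-coded dates, so a catchup run from 2010 naturally partitions the backlog into monthly slices.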