ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Make CDX index backfill workflow for DC2018 DC2019 #85

anjackson closed 1 year ago

anjackson commented 2 years ago

It seems the 2018 and 2019 domain crawls may not have been CDX indexed. We need to design a suitable Airflow DAG that will be able to perform these backfill tasks.

The idiomatic Airflow version would be a proper backfilling task with a start date in e.g. 2010, keyed on the last-modified date of the files on HDFS, where each run loops over everything available to be indexed for its interval. For example, an @monthly task would list all WARCs from the previous month and then index them in chunks of e.g. 2000 WARCs.

This would mean changing the windex utility so that it can (a) filter on a date block instead of X years back, and (b) loop over all matching WARCs rather than just running one batch.
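A rough sketch of what (a) and (b) might look like, in plain Python rather than the actual windex code. Everything named here (`WarcFile`, `month_window`, `backfill_month`, the `index_batch` callback) is hypothetical and only illustrates the shape of the change: select WARCs whose last-modified time falls in one calendar month, then loop over all of them in fixed-size batches.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class WarcFile:
    # Hypothetical record for one WARC on HDFS; field names are assumptions.
    path: str
    modified: datetime


def month_window(year: int, month: int) -> Tuple[datetime, datetime]:
    """Return the half-open [start, end) interval covering one calendar month."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return start, end


def warcs_in_window(
    warcs: Iterable[WarcFile], start: datetime, end: datetime
) -> List[WarcFile]:
    """(a) Filter on a date block: keep WARCs modified within [start, end)."""
    selected = [w for w in warcs if start <= w.modified < end]
    # Sort by path so repeated runs batch the files in the same order.
    return sorted(selected, key=lambda w: w.path)


def batches(items: List[WarcFile], size: int) -> Iterator[List[WarcFile]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def backfill_month(
    warcs: Iterable[WarcFile],
    year: int,
    month: int,
    index_batch: Callable[[List[WarcFile]], None],
    batch_size: int = 2000,
) -> int:
    """(b) Loop over all matching WARCs, not just one batch.

    `index_batch` stands in for whatever actually submits a CDX indexing job.
    Returns the number of batches processed.
    """
    start, end = month_window(year, month)
    count = 0
    for batch in batches(warcs_in_window(warcs, start, end), batch_size):
        index_batch(batch)
        count += 1
    return count
```

In an @monthly DAG, the year and month would come from the run's schedule interval, so Airflow's own backfill mechanism would drive one `backfill_month` call per month since the start date.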

anjackson commented 1 year ago

I've used a much simpler backfill operation. Ongoing.