We have been missing documents from the document harvester, and after some analysis it became clear that there were recent crawl logs that had not been processed.
The Airflow task that processes logs selects which files to analyse based on last-modified dates, so today's task processes yesterday's log file(s). However, the upload of the crawl logs now happens close to midnight, so TrackDB has not yet been updated when the log analysis job runs. This means the job completes successfully, but skips some or all of the logs, because they only appear in TrackDB later.
In the short term, this has been dealt with by re-running all recent, relevant DAGs via the Airflow UI, and moving the DAG's schedule to 4am going forward (d3f5c4564c64b9828e5399aed281521558c77c61).
However, this will still fail if there is an extended TrackDB outage. A more robust solution would ensure that TrackDB has been updated first, perhaps using a daily Airflow Dataset to keep tabs on when TrackDB is up to date.
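A sketch of what that Dataset-based approach could look like, using Airflow's data-aware scheduling (2.4+). The DAG IDs, the Dataset URI, and the `check_trackdb` task are hypothetical placeholders, not the actual pipeline code:

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical Dataset URI meaning "TrackDB has ingested the latest crawl logs".
trackdb_updated = Dataset("trackdb://crawl-logs/daily")


@dag(schedule="0 4 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def trackdb_update_check():
    # Producer: verify TrackDB has yesterday's crawl logs, then emit the
    # Dataset event. If the check raises, no event is emitted, so the
    # downstream DAG simply waits instead of running against stale data.
    @task(outlets=[trackdb_updated])
    def check_trackdb():
        ...  # e.g. query TrackDB for yesterday's log files and fail if absent

    check_trackdb()


@dag(schedule=[trackdb_updated], start_date=datetime(2024, 1, 1), catchup=False)
def log_analysis():
    # Consumer: scheduled on the Dataset rather than a clock time, so it
    # only runs once TrackDB is confirmed up to date.
    @task
    def process_logs():
        ...  # existing log-analysis logic

    process_logs()


trackdb_update_check()
log_analysis()
```

With this arrangement an extended TrackDB outage just delays the analysis run until the update event fires, rather than letting the job run "successfully" and silently skip logs.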