I've noticed some occasional unclassified reports that were missed by the broken_site_report_ml job.

Currently we select reports for classification based on a condition where reported_at is after the last successful classification run: https://github.com/mozilla/docker-etl/blob/625d82e8c2102a1e1078e9ca0869401122e6d3ca/jobs/broken-site-report-ml/broken_site_report_ml/main.py#L146-L157

I think this could have happened because, at the time the job ran, the missed reports were not yet in the BQ table and were inserted with a delay. To fix this, we could also look for reports that don't have a label and have reported_at before the last successful classification run.
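
To illustrate the idea, here is a rough sketch in plain Python. It is not the actual job code (the real selection happens in a BigQuery query); `Report`, `select_current`, `select_with_backfill`, the `label` field, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Report:
    # Hypothetical, simplified stand-in for a row in the reports table.
    uuid: str
    reported_at: datetime
    label: Optional[str] = None  # None means the report was never classified


def select_current(reports: List[Report], last_run: datetime) -> List[Report]:
    """Selection as described above: only reports newer than the last run."""
    return [r for r in reports if r.reported_at > last_run]


def select_with_backfill(reports: List[Report], last_run: datetime) -> List[Report]:
    """Proposed selection: new reports plus older ones that still lack a label."""
    return [
        r
        for r in reports
        if r.reported_at > last_run
        # Late arrivals: reported at or before the last run but never labelled.
        or (r.label is None and r.reported_at <= last_run)
    ]


last_run = datetime(2024, 1, 10, 12, 0)
reports = [
    # Classified by an earlier run.
    Report("a", datetime(2024, 1, 10, 11, 0), label="some_label"),
    # Reported before the last run, but inserted into the table after it ran.
    Report("b", datetime(2024, 1, 10, 11, 30)),
    # Reported after the last run.
    Report("c", datetime(2024, 1, 10, 13, 0)),
]

current_ids = [r.uuid for r in select_current(reports, last_run)]
backfill_ids = [r.uuid for r in select_with_backfill(reports, last_run)]
```

With this sample data the current condition never picks up report "b", while the proposed condition returns it together with "c". In the real query this would presumably mean adding a check for a missing label alongside the existing reported_at filter.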