mozilla / docker-etl

Collection of dockerized ETL jobs managed by data engineering.
Mozilla Public License 2.0

Process unclassified reports that were missed in broken_site_report_ml job #166

ksy36 closed 10 months ago

ksy36 commented 10 months ago

I've noticed some occasional unclassified reports that were missed by the broken_site_report_ml job.

Currently we select reports for classification based on a condition where reported_at is after the last successful classification run: https://github.com/mozilla/docker-etl/blob/625d82e8c2102a1e1078e9ca0869401122e6d3ca/jobs/broken-site-report-ml/broken_site_report_ml/main.py#L146-L157

I think this happened because the missed reports were not yet in the BQ table when the job ran, and were inserted with a delay. To fix this, we could also look for reports that don't have a label and whose reported_at is before the last successful classification run.
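As a rough sketch of what that extra selection could look like (not the job's actual code): the helper below builds a BigQuery query that picks up reports with no label and a reported_at earlier than the last successful run. The function name, the table arguments, the `uuid` / `report_uuid` columns, and the idea that labels live in a separate table are all assumptions for illustration and would need to be matched to the real schema.

```python
from datetime import datetime, timezone


def build_missed_reports_query(
    reports_table: str, labels_table: str, last_run: datetime
) -> str:
    """Build a query for reports that were never labeled.

    Hypothetical helper: table and column names are placeholders,
    not the real schema used by broken_site_report_ml.
    `last_run` must be a timezone-aware datetime.
    """
    # Normalize to UTC and make the offset explicit in the literal.
    last_run_utc = last_run.astimezone(timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S+00:00"
    )
    # The LEFT JOIN plus IS NULL keeps only reports with no matching label row.
    return (
        "SELECT r.uuid, r.reported_at\n"
        f"FROM `{reports_table}` AS r\n"
        f"LEFT JOIN `{labels_table}` AS l ON r.uuid = l.report_uuid\n"
        "WHERE l.report_uuid IS NULL\n"
        f'  AND r.reported_at < TIMESTAMP("{last_run_utc}")'
    )
```

The result of this query could be combined with the existing "reported_at after the last run" selection, so that late-inserted reports get classified on the next run.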