m-lab / etl

M-Lab ingestion pipeline

Reporting transient or permanent failures, and job retries #1100

Open stephen-soltesz opened 1 year ago

stephen-soltesz commented 1 year ago

Recently, the ParserFailureRateTooHighOrMissing alert fired (https://github.com/m-lab/dev-tracker/issues/727) because of a real spike in task errors, where each task is an individual archive.

Investigation showed the cause was ETLSourceError, which can result from transient connectivity problems between the parser and the GCS API servers. We cannot control this directly. The alert resolved on its own once connectivity was restored.

Ideally:

Currently: