mozilla / ActiveData-ETL

The ETL process responsible for filling ActiveData
Mozilla Public License 2.0

Monitor ETL Pipelines #41

Open klahnakoski opened 6 years ago

klahnakoski commented 6 years ago

The ETL pipeline is fed by Amazon's SQS: blocks of ETL work are put on the queue for the workers to grab and process. ETL machines are spot instances, so they are often killed partway through their work. When this happens, the work item pulled off the queue is neither returned to the queue nor confirmed; it stays in a pending state, waiting 6 hours before SQS puts it back on the queue for re-processing. This happens for processing failures too, including out-of-memory failures, where Python usually aborts without notification.

SQS will eventually put all pending items back on the queue; until that happens, a pending work item is invisible, and there is no way to know whether the work was done or will be redone later.
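SQS does not report the status of an individual pending item, but it does expose approximate aggregate counts of visible and in-flight messages. As a rough sketch only: the function below assumes a boto3-style SQS client is passed in, and `queue_depth` and its parameter names are hypothetical, not part of the existing ETL code.

```python
def queue_depth(sqs_client, queue_url):
    """Return approximate (visible, in_flight) message counts for a queue.

    Assumption: sqs_client behaves like a boto3 SQS client, e.g.
    boto3.client("sqs"). The counts are approximate by SQS's own definition.
    """
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # waiting on the queue
            "ApproximateNumberOfMessagesNotVisible",  # pulled, not yet confirmed
        ],
    )
    attrs = response["Attributes"]  # SQS returns the values as strings
    return (
        int(attrs["ApproximateNumberOfMessages"]),
        int(attrs["ApproximateNumberOfMessagesNotVisible"]),
    )
```

An in-flight count that stays high while no workers are running would suggest work items stranded in the pending state, though it cannot say which ones.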

There are a few classes of solution I see:

  1. Return the work to the queue upon a termination signal (or processing failure), as best we can
  2. Find an SQS solution?
  3. Set up our own task queue that can be queried, so it can be monitored.
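Option 1 could look something like the sketch below. It assumes a boto3-style SQS client and that the worker actually receives SIGTERM before the spot instance goes away; `WorkReturner` and `install_sigterm_handler` are hypothetical names for illustration, not existing ETL code. Setting the visibility timeout to 0 makes the message visible again right away instead of after the 6-hour wait.

```python
import signal


class WorkReturner:
    """Remembers the in-flight SQS message so it can be handed back early."""

    def __init__(self, sqs_client, queue_url):
        # Assumption: sqs_client behaves like boto3.client("sqs")
        self.sqs_client = sqs_client
        self.queue_url = queue_url
        self.receipt_handle = None

    def start(self, receipt_handle):
        """Call after receiving a message, before processing it."""
        self.receipt_handle = receipt_handle

    def done(self):
        """Call after the message has been processed and deleted."""
        self.receipt_handle = None

    def release(self):
        """Make the in-flight message visible again; True if one was released."""
        if self.receipt_handle is None:
            return False
        self.sqs_client.change_message_visibility(
            QueueUrl=self.queue_url,
            ReceiptHandle=self.receipt_handle,
            VisibilityTimeout=0,  # 0 seconds: back on the queue immediately
        )
        self.receipt_handle = None
        return True


def install_sigterm_handler(returner):
    """Release the current work item when the process is asked to stop."""

    def handler(signum, frame):
        returner.release()
        raise SystemExit(1)

    signal.signal(signal.SIGTERM, handler)
```

The same `release()` call could go in an `except` block around the processing step to cover ordinary failures. It would not help when the process is killed outright (for example by the kernel's out-of-memory killer), since no handler gets a chance to run.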