mitodl / ol-data-platform

Pipeline definitions for managing data flows to power analytics at MIT Open Learning
BSD 3-Clause "New" or "Revised" License

Improve dagster monitoring on airbyte_asset_sync #964

Closed rachellougee closed 9 months ago

rachellougee commented 10 months ago

Description/Context

In our Dagster pipeline, the airbyte_asset_sync job syncs databases from various sources in Airbyte and then runs the dbt models. Currently, there is no notification when the job fails, whether because one of the source syncs failed or because of errors in the dbt models. A failure can go unnoticed for several days until someone logs in to https://pipelines.odl.mit.edu/locations/lakehouse-assets-graph/jobs/airbyte_asset_sync/runs to read the error logs. This affects the freshness of our data models. If the failure is in a source sync, dbt build is not triggered at all.

We should improve the monitoring and alerting around this job. Mike suggested a Slack notification, so maybe we can send a summary to the data-platform-alerts Slack channel with the following details in case of job failure:

Plan/Design

TBD
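
Until the design is settled, here is a minimal sketch of what a failure summary for Slack could look like. It is only an illustration, not the implemented solution: the `RunFailure` fields, `build_slack_payload`, and `post_to_slack` are hypothetical names, the set of details in the message is an assumption, and it assumes a Slack incoming webhook already configured for the data-platform-alerts channel. In practice the summary would probably be built inside a Dagster run failure sensor rather than standalone code.

```python
import json
import urllib.request
from dataclasses import dataclass

# Runs page quoted in the issue description.
RUNS_URL = (
    "https://pipelines.odl.mit.edu/locations/lakehouse-assets-graph"
    "/jobs/airbyte_asset_sync/runs"
)


@dataclass
class RunFailure:
    """Hypothetical summary of a failed run; the fields are assumptions."""

    job_name: str
    run_id: str
    failed_step: str
    error: str


def build_slack_payload(failure: RunFailure, max_error_chars: int = 500) -> dict:
    """Format a failure as a Slack incoming-webhook payload ({"text": ...})."""
    error = failure.error.strip()
    # Truncate long tracebacks so the alert stays readable in the channel.
    if len(error) > max_error_chars:
        error = error[:max_error_chars] + "..."
    text = "\n".join(
        [
            f"Job failed: {failure.job_name}",
            f"Run ID: {failure.run_id}",
            f"Failed step: {failure.failed_step}",
            f"Error: {error}",
            f"Runs: {RUNS_URL}",
        ]
    )
    return {"text": text}


def post_to_slack(payload: dict, webhook_url: str) -> int:
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the message formatting separate from the HTTP call means the summary can be unit tested without touching Slack, and the same payload builder could be reused for both source sync failures and dbt failures.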

pdpinch commented 10 months ago

Why do we have a Slack channel for alerts if it only gets QA errors?

https://mitodl.slack.com/archives/C056G5XMBL4/p1705385653025719

blarghmatey commented 10 months ago

It has both QA and production errors routed to it; it's just that only QA is currently generating errors from Airbyte. It does not yet alert on dbt failures because those happen in Dagster. That's the part that I'm working on addressing now.

pdpinch commented 10 months ago

I tried to clarify the title of this issue. Please fix if I'm wrong.