mitodl / ol-data-platform

Pipeline definitions for managing data flows to power analytics at MIT Open Learning
BSD 3-Clause "New" or "Revised" License
37 stars 6 forks source link

Production dbt failures #685

Closed blarghmatey closed 1 year ago

blarghmatey commented 1 year ago

Description/Context

The dbt runs in production are producing a handful of errors that should be cleaned up. The error log is:

1468 of 1870 ERROR creating sql table model ol_warehouse_production_intermediate.int__mitxpro__b2becommerce_b2border  [ERROR in 2.39s]
1756 of 1870 FAIL 12597563 not_null_int__combined__courserun_enrollments_courserun_id  [FAIL 12597563 in 2.93s]
1757 of 1870 FAIL 85667 not_null_int__combined__courserun_enrollments_courserun_title  [FAIL 85667 in 3.40s]
1850 of 1870 ERROR creating sql table model ol_warehouse_production_mart.marts__micromasters_summary  [ERROR in 0.61s]
1864 of 1870 FAIL 5 not_null_int__micromasters__program_certificates_user_edxorg_username  [FAIL 5 in 0.94s]
Database Error in model int__mitxpro__b2becommerce_b2border (models/intermediate/mitxpro/int__mitxpro__b2becommerce_b2border.sql)
  TrinoUserError(type=USER_ERROR, name=TABLE_NOT_FOUND, message="line 20:19: Table 'ol_data_lake_production.ol_warehouse_production_intermediate.int__salesforce__opportunity' does not exist", query_id=20230505_200853_81206_u66wp)
  compiled Code at target/run/open_learning/models/intermediate/mitxpro/int__mitxpro__b2becommerce_b2border.sql
Failure in test not_null_int__combined__courserun_enrollments_courserun_id (models/intermediate/combined/_combined_models.yml)
  Got 12597563 results, configured to fail if != 0
Failure in test not_null_int__combined__courserun_enrollments_courserun_title (models/intermediate/combined/_combined_models.yml)
  Got 85667 results, configured to fail if != 0
Database Error in model marts__micromasters_summary (models/marts/micromasters/marts__micromasters_summary.sql)
  TrinoUserError(type=USER_ERROR, name=PERMISSION_DENIED, message="Access Denied: Cannot create table ol_data_lake_production.ol_warehouse_production_mart.marts__micromasters_summary__dbt_tmp: Role accountadmin does not have the privilege CREATE_TABLE on the schema ol_data_lake_production.ol_warehouse_production_mart", query_id=20230505_202141_89994_u66wp)
  compiled Code at target/run/open_learning/models/marts/micromasters/marts__micromasters_summary.sql
Failure in test not_null_int__micromasters__program_certificates_user_edxorg_username (models/intermediate/micromasters/_int_micromasters__models.yml)
  Got 5 results, configured to fail if != 0

Acceptance Criteria

blarghmatey commented 1 year ago

The issue with the create table permissions was due to a child role having a deny statement for creating tables which propagated upward. For now that permission setting has been removed to unblock the dbt runs.

blarghmatey commented 1 year ago

The most recent dbt build output is:

Finished running 209 table models, 1651 tests, 10 view models in 0 hours 41 minutes and 52.53 seconds (2512.53s).

Completed with 4 errors and 0 warnings:

Database Error in model stg__micromasters__app__postgres__grades_combinedcoursegrade (models/staging/micromasters/stg__micromasters__app__postgres__grades_combinedcoursegrade.sql)
  TrinoQueryError(type=INTERNAL_ERROR, name=GENERIC_INTERNAL_ERROR, message="Failed communicating with server: https://mitol.galaxy.starburst.io/api/v1/galaxy/security/trino/entity/table/c-4004614063/ol_warehouse_production_staging/stg__micromasters__app__postgres__grades_combinedcoursegrade__dbt_tmp/:create", query_id=20230508_172400_77476_6xy4u)
  compiled Code at target/run/open_learning/models/staging/micromasters/stg__micromasters__app__postgres__grades_combinedcoursegrade.sql

Database Error in model int__mitxpro__b2becommerce_b2border (models/intermediate/mitxpro/int__mitxpro__b2becommerce_b2border.sql)
  TrinoUserError(type=USER_ERROR, name=TABLE_NOT_FOUND, message="line 20:19: Table 'ol_data_lake_production.ol_warehouse_production_intermediate.int__salesforce__opportunity' does not exist", query_id=20230508_174508_00203_6xy4u)
  compiled Code at target/run/open_learning/models/intermediate/mitxpro/int__mitxpro__b2becommerce_b2border.sql

Failure in test not_null_int__combined__courserun_enrollments_courserun_id (models/intermediate/combined/_combined_models.yml)
  Got 12600270 results, configured to fail if != 0

  compiled Code at target/compiled/open_learning/models/intermediate/combined/_combined_models.yml/not_null_int__combined__courserun_enrollments_courserun_id.sql

Failure in test not_null_int__combined__courserun_enrollments_courserun_title (models/intermediate/combined/_combined_models.yml)
  Got 85666 results, configured to fail if != 0

  compiled Code at target/compiled/open_learning/models/intermediate/combined/_combined_models.yml/not_null_int__combined__courserun_enrollments_courserun_title.sql

Done. PASS=1769 WARN=0 ERROR=4 SKIP=97 TOTAL=1870
blarghmatey commented 1 year ago

The most recent production deploy of Dagster/dbt code is May 4th. Deploying the latest as of now and will re-run to see what the error output is afterwards.

rachellougee commented 1 year ago

https://github.com/mitodl/ol-data-platform/pull/688 is merged, which should fix the first 3 errors in https://pipelines.odl.mit.edu/runs/c822b251-e5c7-4243-a561-7aff967f60ea?logFileKey=xslqhdla

rachellougee commented 1 year ago

airbyte_asset_sync job ran successfully last night, dbt models test failures are all resolved. I will continue reviewing the job log in dagster, but this can be closed

rachellougee commented 1 year ago

Job has been running successfully 4 days in a row, but there are Trino Internal errors in today's run

Database Error in model __micromasters_course_certificates_dedp_from_micromasters (models/intermediate/micromasters/subqueries/__micromasters_course_certificates_dedp_from_micromasters.sql)
TrinoQueryError(type=INTERNAL_ERROR, name=GENERIC_INTERNAL_ERROR, message="Failed communicating with server: https://mitol.galaxy.starburst.io/api/v1/galaxy/security/trino/entity/table/c-4004614063/ol_warehouse_production_intermediate/__micromasters_course_certificates_dedp_from_micromasters__dbt_tmp/:create", query_id=20230516_090255_15910_3thya)
compiled Code at target/run/open_learning/models/intermediate/micromasters/subqueries/__micromasters_course_certificates_dedp_from_micromasters.sql

I also encountered a similar error when testing locally

13:32:03  Runtime Error in model stg__micromasters__app__postgres__courses_program (models/staging/micromasters/stg__micromasters__app__postgres__courses_program.sql)
13:32:03    Runtime Error
13:32:03      TrinoQueryError(type=INTERNAL_ERROR, name=GENERIC_INTERNAL_ERROR, message="Unexpected response status (Bad Gateway) performing operation: entity renamed
13:32:03      502 Bad Gateway
13:32:03      Unable to reach the origin service. The service may be down or it may not be responding to traffic from cloudflared
13:32:03      ", query_id=20230516_131344_85991_3thya)
13:32:03  
13:32:03  Done. PASS=32 WARN=0 ERROR=1 SKIP=8 TOTAL=41

It doesn't seem like data or dbt error, @blarghmatey @quazi-h do you have any idea what it might be?

rachellougee commented 1 year ago

The job ran fine and the models materialized this morning. @blarghmatey do we need to investigate the intermittent GENERIC_INTERNAL_ERROR?

blarghmatey commented 1 year ago

I upgraded the production cluster to be a paid tier, and running in fault-tolerant mode which seems to have resolved these errors.