spotify / styx

"The path to execution", Styx is a service that schedules batch data processing jobs in Docker containers on Kubernetes.
Apache License 2.0

🐛 BUGFIX: Handle Flyte error codes that come from Dynamic Workflows #1083

Closed brandon-segal closed 1 year ago

brandon-segal commented 1 year ago

Description

While deploying a dynamic workflow, we found that the workflow returned the error code RetriesExhausted|USER:NotReady when a dependency was missing, instead of USER:NotReady, which is the typical error code for a missing dependency. Styx uses the error codes returned by Flyte to determine the status of the Styx workflow instance; if the code is USER:NotReady, Styx returns error code 20 for a missing dependency. (relevant code) With the Flyte team's help, the issue was traced to a set of locations in the Flyte propeller code.

Ideal Behavior

Styx returns a Missing Dependencies error code when the Flyte error code contains USER:NotReady

Current Behavior

Styx returns an unknown-error code when the Flyte error code is not exactly USER:NotReady
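
The mismatch can be illustrated with a small sketch. It is written in Python for brevity and is not Styx's actual code: the lookup table, the `classify_exact` helper, and the use of `None` for "unknown error" are all hypothetical. Only the error code strings (as they appear in the Flyte execution output) and error code 20 come from this issue.

```python
# Hypothetical table of Flyte error codes that map to a known outcome.
# Error code 20 (missing dependency) is the value named in this issue.
KNOWN_CODES = {"USER:NotReady": 20}

# Hypothetical stand-in for "unknown error".
UNKNOWN_ERROR = None


def classify_exact(flyte_code):
    """Exact-match lookup, mirroring the current behavior described above."""
    return KNOWN_CODES.get(flyte_code, UNKNOWN_ERROR)


# A regular workflow reports the bare code and is recognized.
plain_result = classify_exact("USER:NotReady")

# A dynamic workflow reports the prefixed code and falls through to unknown.
dynamic_result = classify_exact("RetriesExhausted|USER:NotReady")
```

Under these assumptions the bare code resolves to 20, while the prefixed code falls through to the unknown case even though it carries the same underlying error.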

Possible Cause

Within the Flyte propeller code base, we found that a dynamic workflow raises a RetryableFailure status if any of its dynamically generated nodes fail (relevant code). Once this status is raised for the dynamic workflow, Flyte propeller prepends RetriesExhausted| to the dynamic node's original error code (relevant code).

The impact is that a dynamic workflow cannot raise USER:NotReady in a way Styx can identify. Workflows are therefore erroneously labeled as having unknown errors: the team may be raising error codes known to the Styx service, but Styx does not recognize them because of the prepended RetriesExhausted| string.

Suggested Remediation

A possible remediation, which would allow dynamic workflows to raise specific Styx errors, is to remove the RetriesExhausted| string from the error code before matching it against the known error codes.
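
A minimal sketch of that remediation follows, in Python for brevity rather than as a patch to Styx itself. The function names, the `KNOWN_EXIT_CODES` table, and the `None` default are hypothetical; the prefix and code strings follow the Flyte execution output in this thread, and 20 is the missing-dependency code mentioned in the description.

```python
# Prefix that Flyte propeller prepends for dynamic workflows (per this issue).
RETRIES_EXHAUSTED_PREFIX = "RetriesExhausted|"

# Hypothetical table of error codes the scheduler knows how to handle.
KNOWN_EXIT_CODES = {"USER:NotReady": 20}


def normalize_error_code(code):
    """Drop a single leading 'RetriesExhausted|' prefix, if present."""
    if code.startswith(RETRIES_EXHAUSTED_PREFIX):
        return code[len(RETRIES_EXHAUSTED_PREFIX):]
    return code


def exit_code_for(code, default=None):
    """Look up the normalized code; fall back to `default` when unknown."""
    return KNOWN_EXIT_CODES.get(normalize_error_code(code), default)


# Both the bare and the prefixed code now resolve to the same result.
bare = exit_code_for("USER:NotReady")
prefixed = exit_code_for("RetriesExhausted|USER:NotReady")
```

Stripping only a leading prefix, rather than checking whether the code contains USER:NotReady anywhere, keeps the match strict for every other code. Whether the prefix can ever be nested more than once is not covered in this issue, so the sketch strips it only once.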

andresgomezfrr commented 1 year ago

Hey @brandon-segal! Could you share a Flyte execution with this error so we can check the exact Flyte error?

brandon-segal commented 1 year ago

@andresgomezfrr

flytectl get execution -p dataplatform-insights-pipelines -d production xxxxxxxxxxxxxxx -o yaml
closure:
  createdAt: "2023-05-28T02:32:45.280650787Z"
  duration: 293.582359335s
  error:
    code: RetriesExhausted|USER:NotReady
    kind: USER
    message: |-
      [1/1] currentAttempt done. Last Error: USER::Traceback (most recent call last):

            File "/usr/src/app/.venv/lib/python3.8/site-packages/flytekit/exceptions/scopes.py", line 203, in user_entry_point
              return wrapped(*args, **kwargs)
            File "/usr/src/app/.venv/lib/python3.8/site-packages/spotify_dbt_flytekit/tasks/dbt_task.py", line 42, in wrapper
              handle_dbt_flyte_errors(out)
            File "/usr/src/app/.venv/lib/python3.8/site-packages/spotify_dbt_flytekit/clients/dbt_cli/handle_errors.py", line 22, in handle_dbt_flyte_errors
              raise FlyteMissingDependencyException(

      Message:

          ('Missing Dependency in DBT Script', 'model.dataplatform_insights.stg_cp__ui_components', 'error', 'Compilation Error in model stg_cp__ui_components (models/staging/client-platform/stg_cp__ui_components.sql)\n  404 Error: partition not found for:`client-platform-insights-1`.`ui`.`ui_components` for partition 2023-05-26 00:00:00\n  \n  > in macro check_dependencies (macros/dependencies/check_dependencies.sql)\n  > called by macro run_hooks (macros/materializations/hooks.sql)\n  > called by macro create_or_replace_view (macros/materializations/models/view/create_or_replace_view.sql)\n  > called by macro materialization_view_bigquery (macros/materializations/view.sql)\n  > called by model stg_cp__ui_components (models/staging/client-platform/stg_cp__ui_components.sql)')

      User error.
  phase: FAILED
  startedAt: "2023-05-28T02:32:50.396389232Z"
  stateChangeDetails:
    occurredAt: "2023-05-28T02:32:45.280650787Z"
  updatedAt: "2023-05-28T02:37:43.978748335Z"
  workflowId:
    domain: production
    name: dataplatform_insights_pipelines.workflows.dynamic_dbt_build.dynamic_dbt_build
    project: dataplatform-insights-pipelines
    resourceType: WORKFLOW
    version: 30286103-da38-417b-b774-2498293066a8
id:
  domain: production
  name: kgdjf7mhot5bglibbpwr
  project: dataplatform-insights-pipelines
spec:
  annotations:
    values:
      STYX_COMPONENT_ID: dataplatform-insights-pipelines
      STYX_EXECUTION_ID: styx-run-30d25fb0-ee9f-4265-a021-7da312b45a59
      STYX_PARAMETER: "2023-05-26"
      STYX_TRIGGER_ID: natural-trigger
      STYX_TRIGGER_TYPE: natural
      STYX_WORKFLOW_ID: dataplatform-insights-pipelines.production.dataplatform_insights_dbt
      styx-execution-id: styx-run-30d25fb0-ee9f-4265-a021-7da312b45a59
      styx-workflow-instance: dataplatform-insights-pipelines#dataplatform-insights-pipelines.production.dataplatform_insights_dbt#2023-05-26
  labels:
    values:
      STYX_COMPONENT_ID: dataplatform-insights-pipelines
      STYX_EXECUTION_ID: styx-run-30d25fb0-ee9f-4265-a021-7da312b45a59
      STYX_PARAMETER: "2023-05-26"
      STYX_TRIGGER_ID: natural-trigger
      STYX_TRIGGER_TYPE: natural
      STYX_WORKFLOW_ID: dataplatform-insights-pipelinesproductiondataplatform_in6f5f0c2
      declarative-project-namespace: dataplatform-insights
      ghe-org: dataplatform-insights
      ghe-repo: dataplatform-insights-pipelines
  launchPlan:
    domain: production
    name: dataplatform_insights_dbt_lp
    project: dataplatform-insights-pipelines
    resourceType: LAUNCH_PLAN
    version: 30286103-da38-417b-b774-2498293066a8
  metadata:
    mode: SCHEDULED
    systemMetadata:
      executionCluster: flyte-production-regional
  securityContext:
    runAs: {}