kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] `set_retry` for pipelines does not work #11288

Open ianbenlolo opened 2 weeks ago

ianbenlolo commented 2 weeks ago

Environment

Steps to reproduce

In the docs here it says "Pipelines can themselves be used as components in other pipelines, just as you would use any other single-step component in a pipeline".

I was testing this to see whether a pipeline within a pipeline can be retried, but I can't get it to work. Here is what I've tried (based on this):

from kfp import compiler
from kfp import dsl

@dsl.component
def print_op1(msg: str) -> str:
    print(msg)
    return msg

@dsl.container_component
def print_op2(msg: str):
    return dsl.ContainerSpec(
        image='alpine',
        command=['echo', msg],
    )

@dsl.component
def fail_job():
    raise ValueError('This job failed')

@dsl.pipeline
def inner_pipeline(msg: str):
    task = print_op1(msg=msg)

    fail_job().after(task).set_retry(num_retries=2)

    print_op2(msg=task.output)

@dsl.pipeline(name='pipeline-in-pipeline')
def my_pipeline():
    op1_out = print_op1(msg='Hello')
    inner_out = inner_pipeline(msg='world').set_retry(num_retries=10).after(op1_out)
    print_op1(msg='bye').after(inner_out)

if __name__ == '__main__':
    compiler.Compiler().compile(
        pipeline_func=my_pipeline,
        package_path=__file__.replace('.py', '.yaml'))

The fail_job task will retry, but the pipeline-in-pipeline (inner_pipeline) does not. Am I wrong in my thinking?

Expected result

The pipeline-in-pipeline should retry as well.

This is related to my discussion here, but I'm opening an issue for visibility.



Faakhir30 commented 2 weeks ago

@ianbenlolo set_retry does not work even without nested pipelines; see #9950. I also tried it without nested pipelines, and it doesn't retry even once with this spec:

      tasks:
        fail-job:
          cachingOptions:
            enableCache: true
          componentRef:
            name: comp-fail-job
          dependentTasks:
          - print-op1
          retryPolicy:
            backoffDuration: 0s
            backoffFactor: 2.0
            backoffMaxDuration: 3600s
            maxRetryCount: 2
          taskInfo:
            name: fail-job
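
For anyone comparing compiled specs, here is a rough sketch of how one might list which tasks carry a retryPolicy, so it is easy to see whether the nested pipeline task got one. It assumes the spec has already been parsed into a dict and that tasks live under root.dag.tasks and components.<name>.dag.tasks, as in the fragment above. The helper name tasks_with_retry and the example_spec dict are hypothetical; the dict is hand-written for illustration, not real compiler output.

```python
from typing import Any, Dict, List, Tuple


def tasks_with_retry(spec: Dict[str, Any]) -> List[Tuple[str, str, int]]:
    """Return (dag_owner, task_name, maxRetryCount) for each task with a retryPolicy."""
    found = []
    # Look at the root DAG plus every component that is itself a DAG
    # (a nested pipeline shows up as a component with its own dag.tasks).
    dags = {"root": spec.get("root", {})}
    dags.update(spec.get("components", {}))
    for owner, body in dags.items():
        tasks = body.get("dag", {}).get("tasks", {})
        for name, task in tasks.items():
            policy = task.get("retryPolicy")
            if policy is not None:
                found.append((owner, name, policy.get("maxRetryCount", 0)))
    return found


# Hypothetical, hand-written stand-in shaped like the fragment above.
example_spec = {
    "root": {"dag": {"tasks": {
        "inner-pipeline": {"componentRef": {"name": "comp-inner-pipeline"},
                           "retryPolicy": {"maxRetryCount": 10}},
        "print-op1": {"componentRef": {"name": "comp-print-op1"}},
    }}},
    "components": {
        "comp-inner-pipeline": {"dag": {"tasks": {
            "fail-job": {"componentRef": {"name": "comp-fail-job"},
                         "retryPolicy": {"maxRetryCount": 2}},
        }}},
        # Leaf components have no dag key and are skipped.
        "comp-print-op1": {"executorLabel": "exec-print-op1"},
    },
}

for row in tasks_with_retry(example_spec):
    print(row)
```

If the nested pipeline task is missing from this list for a real compiled spec, the retry setting was dropped at compile time; if it is present, the backend is the more likely place where it is being ignored.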

I didn't try with Vertex AI, but I guess set_retry is also not supported by Vertex AI yet; see https://issuetracker.google.com/issues/226569351

ianbenlolo commented 2 weeks ago

@Faakhir30 Please see my comment in the original thread. It works for me on Vertex AI. The issue is that pipelines-in-pipelines do not retry.