kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Retry not working in 2.3.0 #11329

Open asaff1 opened 4 weeks ago

asaff1 commented 4 weeks ago

Environment

Steps to reproduce

Create any pipeline that fails; I used the simple pipeline below. Upload it through the UI ("Upload pipeline"), then start a run of this pipeline with fail = true. The run fails as expected. Click the "Retry" button in the UI: the workflow stays stuck in Pending and no retry is performed.

import os
from kfp import dsl
from kfp.dsl import Dataset, Model, Input, Output
from typing import Optional, List

@dsl.component(base_image="python:3.9")
def say_hello(name: str, number: int, opt_int: Optional[int]) -> List[str]:
    hello_text = f'Hello, {name}! number={number} opt_int={opt_int}'
    print(hello_text)
    return ["hello", name, str(number), str(opt_int)]
    #return hello_text

@dsl.component(base_image="python:3.9")
def say_hello_list(list_hello: List[str]) -> str:
    hello_text = f'Hello list_hello={list_hello}'
    print(hello_text)
    return hello_text

@dsl.component(base_image="python:3.9", packages_to_install=['pandas==1.3.5', "numpy==1.*"])
def create_dataset(iris_dataset: Output[Dataset], fail: bool):
    import pandas as pd
    import random
    r = random.random()
    # Disabled alternative: fail at random instead of via the `fail` flag.
    # print("Failing at random < 0.5. r=", r)
    # if r < 0.5:
    #     raise ValueError(f"Failing at random! r={r}")
    if fail:
        raise ValueError("fail=true, failing!")
    df = pd.DataFrame({"name": ["a", "b", "c"], "age": [11, 12, 13]})
    with open(iris_dataset.path, 'w') as f:
        df.to_csv(f)

@dsl.component(base_image="python:3.9", packages_to_install=['pandas==1.3.5', "numpy==1.*"])
def print_dataset(ds: Input[Dataset], a: int):
    print(f"hello a={a}")
    print(ds)

@dsl.pipeline
def hello_pipeline(recipient: str, number: int, opt_int: Optional[int], fail: bool) -> str:
    hello_task = say_hello(name=recipient, number=number, opt_int=opt_int)
    hello_list_task = say_hello_list(list_hello=hello_task.output)
    create_dataset_task = create_dataset(fail=fail)
    create_dataset_task.set_caching_options(False)
    print_dataset(ds=create_dataset_task.output, a=44)
    return hello_list_task.output

from kfp import compiler
# Compile to a YAML file named after this script (e.g. foo.py -> foo.yaml).
compiler.Compiler().compile(hello_pipeline, os.path.basename(__file__).replace(".py", ".yaml"))
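
To check whether the problem is in the UI or in the backend, the retry can also be triggered directly against the API server. The following is a minimal sketch, not a confirmed part of this report: it assumes the v2beta1 REST routes (`POST /apis/v2beta1/runs/{run_id}:retry` and `GET /apis/v2beta1/runs/{run_id}`), an API server reachable without authentication, and hypothetical helper names `build_retry_request` and `get_run_state`.

```python
import json
import urllib.request


def build_retry_request(host: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that asks the backend to retry a run.

    Assumption: the deployment exposes the v2beta1 REST API at `host`.
    """
    url = f"{host.rstrip('/')}/apis/v2beta1/runs/{run_id}:retry"
    return urllib.request.Request(url, data=b"", method="POST")


def get_run_state(host: str, run_id: str, opener=urllib.request.urlopen):
    """Fetch the run and return its `state` field, or None if the field is absent.

    `opener` is injectable so the function can be exercised without a cluster.
    """
    url = f"{host.rstrip('/')}/apis/v2beta1/runs/{run_id}"
    with opener(url) as resp:
        return json.load(resp).get("state")
```

With a port-forward to the API server in place, sending the request with `urllib.request.urlopen(build_retry_request(host, run_id))` and then polling `get_run_state(host, run_id)` would show whether the run ever leaves the pending state after the retry. The host and run ID are placeholders to fill in for the affected deployment.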

Expected result

Retry should work. I previously deployed Kubeflow Pipelines 2.0.3, and retry worked as expected there, so this looks like a regression introduced in a more recent release.

Materials and Reference

