kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Feature request - Support Argo suspend and resume on KFP #3101

Closed amitripshtos closed 4 years ago

amitripshtos commented 4 years ago

Hey, sometimes we have pipelines that need to be suspended for some time in order to allow human interactions such as DB writes, using 3rd-party apps, and so on.

Argo supports suspending a workflow and resuming it, and I thought it would be a good feature to have a SuspendOp in the DSL that suspends the workflow, plus a resume API endpoint that resumes the pipeline when it reaches a SuspendOp.

I'm willing to help write the feature if I get some guidance on where to do it. Thanks, Amit

nrchakradhar commented 4 years ago

@amitripshtos Can you share the Argo references for suspension and resumption?

amitripshtos commented 4 years ago

Sure,

An example template: https://github.com/argoproj/argo/blob/master/examples/suspend-template.yaml

As you can see, you can mark a step as a suspend step (which can be timed or not).

In order to resume, you use the `argo resume workflow-id` CLI command.

The command updates the Workflow resource: https://argoproj.github.io/docs/argo/examples/readme.html#suspending

Under the hood: ResumeWorkflow resumes a workflow by setting spec.suspend to nil and any suspended nodes to Successful.

The Argo server also has an API endpoint for this.
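For illustration (not part of the original thread): a minimal sketch of replicating that resume behavior directly with the Kubernetes Python client, under the assumption of a workflow named `my-suspended-workflow` in the `kubeflow` namespace. Note it only clears `spec.suspend`; the real `ResumeWorkflow` also marks suspended nodes as Successful.

```python
# Hedged sketch: resume a suspended Argo Workflow by clearing spec.suspend.
# Workflow name and namespace are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# With a JSON merge patch, setting a field to None removes it, so this
# clears spec.suspend and lets the Argo controller continue the workflow.
api.patch_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="kubeflow",
    plural="workflows",
    name="my-suspended-workflow",
    body={"spec": {"suspend": None}},
)
```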

NikeNano commented 4 years ago

Cool, this would be a nice feature. I would be happy to help out :)

amitripshtos commented 4 years ago

I agree! Is there any core contributor who has about 10-15 minutes to speak with me about implementation details on Zoom/Slack? That would be amazing.

Ark-kun commented 4 years ago

Can you please describe more end user scenarios where this feature can be used?

rmgogogo commented 4 years ago

+1 to Ark-kun's reply. It's better if all tasks are built into the pipeline.

amitripshtos commented 4 years ago

We have a variety of use cases / stories that would benefit from this feature:

1. A pipeline we created that trains a model that does batch prediction. The last component of the pipeline outputs a CSV file with a batch-prediction sample as an artifact. We want a human eye to verify the predictions, making sure the model is valid and fits the business questions asked.

How the feature may help: if we have a component that suspends the pipeline, we can review the artifact, and if we are happy with it we can resume the pipeline; the rest of the pipeline then runs on all of our data and predicts all of it.

2. A pipeline that runs a model on our data and clusters the data into groups. We want human intervention to label these groups in an external application we use, and once that is complete we want to resume the pipeline. The rest of the pipeline will get the data from our application and continue the processing.

How the feature may help: instead of having 2 different pipelines in KFP (one for grouping the data and a second that fetches the group labels and keeps processing), we can have 1 pipeline with a suspend step. It really helps us, since we don't need to automate pipeline runs.

3. We have a pipeline that trains a model and eventually deploys a prediction REST API from it. We want to automatically test the model with sample data and manually check the quality of the predictions before rolling out a new version of the API. This feature would really ease that (especially if KFP gets a REST API to resume pipelines).

4. We have a pipeline that trains a model and, after training, deploys a prediction endpoint API. We want to suspend the rest of the pipeline for 1 hour and then query this API for statistics. This can be automatic with the suspend feature (Argo supports suspending for a set duration and resuming automatically).

How I imagine this feature:
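(The sketch below is a hypothetical illustration, not an actual KFP API: `dsl.SuspendOp` and its arguments are invented names showing how the DSL side of the proposal might look.)

```python
# Purely hypothetical sketch: dsl.SuspendOp does not exist in the KFP DSL;
# it illustrates how the proposed feature might be written.
from kfp import dsl

@dsl.pipeline(name="batch-predict-with-review")
def pipeline():
    sample = dsl.ContainerOp(
        name="sample-predictions",
        image="my-registry/predict:latest",  # placeholder image
        arguments=["--mode", "sample"],
    )
    # The pipeline would pause here until resumed through the proposed
    # resume API endpoint (or automatically after the optional timeout,
    # which Argo's suspend template supports).
    review = dsl.SuspendOp(name="human-review", timeout="24h")  # hypothetical
    review.after(sample)
    full = dsl.ContainerOp(
        name="full-predictions",
        image="my-registry/predict:latest",
        arguments=["--mode", "full"],
    )
    full.after(review)
```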

I feel like this feature could be very beneficial for a lot of data scientists and engineers, since there are a lot of cases that involve a human eye. What do you think? Thanks, Amit.

ido-trumer commented 4 years ago

This can be really helpful, +1

Oranst commented 4 years ago

+1! I think it's a great idea. Right now we use a lot of workarounds in order to achieve such functionality, and it's actually one of the main things that blocks us from moving to production with KFP.

NikeNano commented 4 years ago

I would be happy to help you implement this, @amitripshtos, if you'd like some help on it.

amitripshtos commented 4 years ago

Sounds great! Please send me a mail at amit@noogata.com and we will take it from there, @NikeNano.

NikeNano commented 4 years ago

> Sounds great! Please send me a mail at amit@noogata.com and we will take it from there, @NikeNano.

Great, I have sent an email to you.

amitripshtos commented 4 years ago

The SDK part is done (with tests as well). I will work on the backend side today.

rmgogogo commented 4 years ago

/cc jingzhang36

Bobgy commented 4 years ago

/cc @jingzhang36 /cc @IronPan for backend review

jingzhang36 commented 4 years ago

@amitripshtos Thanks for the detailed user scenarios! One thing I would like to double-check is whether the human intervention at a particular step would impact the output of this step and hence impact the following steps' execution. In other words, could the final result (say, the prediction model) be different depending on how the human intervenes at some intermediate steps? The reason I am asking is to estimate the possible impact on our metadata system. Otherwise, this feature proposal looks fantastic to me! And thanks a lot for contributing to the KFP codebase :D

amitripshtos commented 4 years ago

Thanks for the quick response! I think it's up to the pipeline author to decide whether it will change the data or not, and I will explain:

There are cases where there must be a human validation in order to continue the pipeline, such as:

1. A step that does a batch prediction and provides samples as an artifact for manual validation. Once the user resumes the pipeline, the next step can send an email or upload to a database or bucket.

There are other cases that involve external systems; a real example:

1. A pipeline that implements a full ML flow and has a step that clusters a dataset. The next step can send the clustered data to an external service in order to label the clusters, so we need to suspend the pipeline until the user finishes labeling the data, and the external system can trigger the pipeline resume.

Without this feature there is no good way of doing that, unless you split the pipelines, which makes the MLMD lineage much less useful.

Thank you very much for taking a closer look and I hope this feature will help many more data teams integrating their pipelines with human interactions / external systems.

Bobgy commented 4 years ago

> Without this feature there is no good way of doing that, unless you split the pipelines, which makes the MLMD lineage much less useful.

Can you elaborate on separating pipelines cause MLMD to be useless?

We can find the lineage relationship from the final artifact back to the artifacts the second pipeline used, and therefore back to the first pipeline. Maybe I'm missing some context here.

amitripshtos commented 4 years ago

Sure. Let's say the first pipeline ends with an artifact, and after this pipeline ends we need some manual intervention.

When this intervention completes, we want to start the second pipeline (not by hand, but via the API). Without the suspend feature, the only way to do that is to make the first step of the second pipeline retrieve the artifact from the first pipeline using a MinIO client + an access token + the output artifact URL from the first pipeline. That makes the first step of the second pipeline very non-generic, since it must be your own custom code that retrieves the artifact.

The artifact path would have to be either fetched from MLMD or passed manually as a parameter when starting the pipeline. (A sketch of such a custom step follows below.)
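For illustration (not from the thread): a minimal sketch of that custom fetch step, where the MinIO endpoint, credentials, and artifact URI layout are hypothetical placeholders.

```python
# Hedged sketch of the custom "fetch the previous pipeline's artifact"
# step described above. Endpoint, credentials, and URI layout are
# illustrative assumptions, not values from the thread.
from minio import Minio

def fetch_previous_artifact(artifact_uri: str, local_path: str) -> None:
    # e.g. artifact_uri = "minio://mlpipeline/artifacts/<run-id>/step/data.csv"
    bucket, _, key = artifact_uri.removeprefix("minio://").partition("/")
    client = Minio(
        "minio-service.kubeflow:9000",  # assumed in-cluster endpoint
        access_key="minio",             # placeholder credentials
        secret_key="minio123",
        secure=False,
    )
    client.fget_object(bucket, key, local_path)
```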

All of this is hard work: it requires engineers to write this custom step, and when you look at the MLMD lineage you won't see one long DAG but 2 separate ones, which makes it hard to understand the executions and the artifact flow.

Also, there is no feature to start a pipeline where one of the parameters is an input artifact from another pipeline (might be a cool idea, though).

jingzhang36 commented 4 years ago

@amitripshtos Thanks again for the investigation. I'll take a look at your PR, and meanwhile I'll let @neuromage verify the MLMD usage.

neuromage commented 4 years ago

Hi @amitripshtos

> All of this is hard work: it requires engineers to write this custom step, and when you look at the MLMD lineage you won't see one long DAG but 2 separate ones, which makes it hard to understand the executions and the artifact flow.

I'm not sure this is true. The lineage view in this case will be exactly the same. You will not see two DAGs but a single one, since the output of the producer pipeline is used as input to the consumer pipeline.

Agreed, you will need a custom step to import the artifact in the consumer pipeline. But that may be preferable to adding this feature, since the feature adds some complexity to the system; unless we can think of more use cases here which would justify it, of course. Thoughts?

/cc @IronPan

amitripshtos commented 4 years ago

I see your points, and I wanted to ask for everyone's overall opinion before I keep working on the pull requests and fixing issues.

If the majority of you think this feature may introduce extra complexity, I think we can close this request - otherwise, I can work on the pull requests.

What do you think?

rmgogogo commented 4 years ago

Thanks, Amit, for driving this feature. It's a great feature, but it also has a side that would introduce complexity, resource holding, and other costs we may not be aware of yet. Let me double-check whether it can be treated from another angle.

I would prefer we discuss this FR based on the detailed use cases.

> A step that does a batch prediction and provides samples as an artifact for manual validation. Once the user resumes the pipeline, the next step can send an email or upload to a database or bucket.

"the next step can be sending email/upload to database or bucket." Is it done in KFP pipeline? If so, it also can be handled in another Pipeline Run, right? Or we treat it as a data/event driven trigger case, via using CloudFunction, a simple trigger can be made to help user easily trigger a "sendingEmail/uploadToDatase" either via KFP or without KFP etc.

> A pipeline that implements a full ML flow and has a step that clusters a dataset. The next step can send the clustered data to an external service in order to label the clusters, so we need to suspend the pipeline until the user finishes labeling the data, and the external system can trigger the pipeline resume.

This can also be treated as a data/event trigger case, right? When the external service finishes something, it triggers a KFP pipeline run. May I assume you don't want to split the pipeline into two? I would prefer to treat it as two pipelines and trigger the run. It would help concatenate multiple, more complex pipelines together so they act like one virtual bigger pipeline.

E.g. the external system outputs 3 cases: Success-A, Success-B, Fail. Success-A triggers a pipeline that runs XGBoost (e.g. less labelled data). Success-B triggers a pipeline that runs a deep learning model (e.g. enough labelled data). Fail triggers a pipeline that sends out an email or alert, etc. In my understanding, treating it as a data/event-driven trigger helps decouple the systems. A sketch of such a trigger follows below.
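For illustration (not from the thread): a minimal sketch of that trigger using the KFP SDK client, where the host, pipeline IDs, and handler shape are hypothetical placeholders.

```python
# Hedged sketch of the event-driven alternative: an external callback
# starts one of several pipelines based on the labelling outcome.
import kfp

PIPELINE_BY_OUTCOME = {
    "success-a": "xgboost-pipeline-id",        # placeholder pipeline IDs
    "success-b": "deep-learning-pipeline-id",
    "fail": "alerting-pipeline-id",
}

def on_labelling_finished(outcome: str, experiment_id: str) -> None:
    client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")  # assumed host
    client.run_pipeline(
        experiment_id=experiment_id,
        job_name=f"triggered-by-{outcome}",
        pipeline_id=PIPELINE_BY_OUTCOME[outcome.lower()],
    )
```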

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

harshsaini commented 3 years ago

Hey folks, I noticed that the PRs for this feature are all closed along with this issue as well. Has this feature been added to Kubeflow Pipelines or has it been abandoned?

amitripshtos commented 3 years ago

Unfortunately, the Kubeflow Pipelines maintainers thought that the feature should not be part of the engine at the moment. I think the discussion exists in this issue or one of the PRs.


Bobgy commented 3 years ago

We are tracking the feature request in https://github.com/kubeflow/pipelines/issues/3454.

Do you have any use cases besides that?

Thumbs up on the issue to bump priority 🙂

Currently, we are working on bit.ly/kfp-v2, so the decision was to limit the expansion of features until we have a more solid foundation. We can reconsider the open feature request in the future.

harshsaini commented 3 years ago

Thanks @amitripshtos and @Bobgy for replying.

I do have a few use cases that are different from the ones listed above and I am mostly interested in this feature as a workaround for my use case.

The use case I am trying to resolve is avoiding busy-waiting in pipelines when a long-running external task is started via a pipeline. Currently, the only way to detect completion of such tasks is via a busy-wait polling operation, which comes with a few problems. I have described my use case in more detail here: https://github.com/kubeflow/pipelines/issues/5995 (a sketch of the polling workaround follows below).
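For illustration (not from the linked issue): a minimal sketch of the busy-wait workaround being described, where the status URL and response shape are hypothetical placeholders.

```python
# Hedged sketch of a busy-wait polling step: the pod stays alive (holding
# its resources) for the entire wait, which is the cost that a native
# suspend/resume would avoid.
import json
import time
import urllib.request

def wait_for_external_task(status_url: str, poll_seconds: int = 60) -> None:
    while True:
        with urllib.request.urlopen(status_url) as resp:
            status = json.load(resp).get("status")  # assumed JSON shape
        if status == "done":
            return
        if status == "failed":
            raise RuntimeError("external task failed")
        time.sleep(poll_seconds)
```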

google-oss-prow[bot] commented 1 year ago

@mohifaruq: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubeflow/pipelines/issues/3101#issuecomment-1644070389):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.