kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.61k stars 1.63k forks source link

[Discussion] Future of recurring run (a.k.a job or scheduled workflow) #4752

Open Bobgy opened 4 years ago

Bobgy commented 4 years ago

There were a lot of discussion in https://github.com/kubeflow/pipelines/issues/3789 that I think worth grouping in a new issue.

Our conclusion for Job (a.k.a recurring run/scheduled workflow) is: We may work on a refactoring that swaps out underlying ScheduledWorkflow implementation with kubernetes CronJob, that will be best for long term maintenance. We don't want to maintain our own ScheduledWorkflowController that is a feature subset of CronJob. Using CronJob as implementation details allow first party integration like UI management, metadata for pipeline runs for CronJob, while keeping maintenance cost low for the controller.

We don't have an ETA for the refactoring right now, because Job is not our priority, we'll triage based on future user requirements.

Originally posted in https://github.com/kubeflow/pipelines/issues/3789#issuecomment-662296647

ekesken commented 4 years ago

I would vote for using CronWorkflow coming from Argo instead of a CronJob.

Bobgy commented 4 years ago

Argo is an implementation detail of KFP, at some point of time in the future, we'll change the way how we integrate with Argo, so CronWorkflow would not be a good fit. A CronJob that calls KFP API server will be guaranteed to work despite whatever implementation detail change we decide for KFP.

Bobgy commented 4 years ago

Also /cc @NikeNano

NikeNano commented 4 years ago

Sounds good, I think it would be great to make use of k8s native scheduling instead. Also agree that argo would not be the best solution in our case.

Jeffwan commented 3 years ago

Argo is an implementation detail of KFP, at some point of time in the future, we'll change the way how we integrate with Argo, so CronWorkflow would not be a good fi

it makes sense. I assume integrating CronWorkflow will be easier at this moment? If KFP plans to change orchestrator some time, that will be a big refactoring and scheduled workflow can be part of it as well?

K8s CronJob itself only gives repeating schedule over Job. It still need to invoke workflow which will be picked by Argo. This means even we determine to use CronJob, it still ties with Argo. The time to change the way to integrate with Argo, we need to refactor this part as well?

Bobgy commented 3 years ago

@Jeffwan right now, the scheduledworkflow spec contains entire workflow details. However, we'd prefer that it only contains a cron schedule + a pipeline/version id, so it can call KFP API server for triggering each run.

(We are already planning on changing, at least, how we integrate with argo in KFP IR efforts, so the change comes in near future.)

ekesken commented 3 years ago

how can we follow the plans for this refactoring?

I would say better not to use terms from Argo if there is a plan to decouple from it.

And it's not just about naming, whole interface is borrowed from Argo for now: https://github.com/kubeflow/pipelines/blob/1.0.4/backend/src/common/util/workflow.go#L29-L32

So now we have Workflow term exists both in Argo and Kubeflow Pipelines, and actually in implementation as well they are same things, it even uses Argo's CRD in k8s side.

And as I understand, now we want to keep ScheduledWorkflow term in Kubeflow Pipelines, and use CronJob interface to implement it.

If the purpose is decoupling, I would say using outsider interfaces as they are won't end well in general, and inventing a new name for Workflow terms in Kubeflow side would reduce confusions a lot.

Bobgy commented 3 years ago

@ekesken completely agree, it will be called pipeline spec and completely decoupled from argo: https://github.com/kubeflow/pipelines/blob/master/api/v2alpha1/pipeline_spec.proto

NikeNano commented 3 years ago

@Bobgy for changes like this one, are we aiming for some form of technical design document before someone starts on the actual work to align on how to best implement an solution?

Bobgy commented 3 years ago

@NikeNano yes, a design doc should be drafted and discussed in KFP community meeting

NikeNano commented 3 years ago

Sounds great @Bobgy. I will start to work on make a small proof of concept/test and then start to draft a design doc.

StefanoFioravanzo commented 3 years ago

@NikeNano @Bobgy First of all, I completely endorse the refactoring and move to native Cron Job, it will definitely improve the code quality of the controller.

A very important feature of ScheduledWorkflow is to be able to set a backfilling policy (i.e. If the start time is in the past, then run all the jobs that should have run until now, respecting the max parallel jobs limit). Doing a quick search I didn't find an obvious counterpart in the CronJob CR, maybe I just missed it. Just mentioning it so that we take it into consideration.

Looking forward to reading the design doc! :)

Bobgy commented 3 years ago

That's a good point Stefano, we'll need to consider other options as well if cronjob itself is not enough.

NikeNano commented 3 years ago

Based upon what you pointed out @StefanoFioravanzo I agree with you @Bobgy that k8s cron jobs wont do it since it would require quite som logic to support backfilling. One option would be to make use of the underlying workflow engine(argo, tekton) and based upon the job configurations create the necessary resources and pass it down to the workflow engine instead. There are some examples for argo. We would then have to create some intermeddiate representation which can be consumed by the workflow engine of interest. However I am not sure this would simplify the code base and reduce the maintenance. WDYT @Bobgy

Bobgy commented 3 years ago

scheduledworkflow CRD is already embedding an argo workflow resource in it, so I think it would be very straightforward to switch to argo cronworkflow (and get rid of scheduledworkflow CRD and controller).

The key problem is investigating e.g. whether argo cronworkflow supports backfill functionalities like we do.

NikeNano commented 3 years ago

scheduledworkflow CRD is already embedding an argo workflow resource in it, so I think it would be very straightforward to switch to argo cronworkflow (and get rid of scheduledworkflow CRD and controller).

But in this case we will lock into argo more tightly as you commented here: https://github.com/kubeflow/pipelines/issues/4752#issuecomment-725333392.

I also think it will be harder to support multiple runners if we mote all the logic to the underlying runner since this might be an edge case that are not natively supported. We could separate the backfilling logic from the scheduled job, and create the needed backfilling as separate jobs.

Will check out how to do it with argo further.

NikeNano commented 3 years ago

I missed to follow up on this, will make a small proof of concept with argo cronjobs to try it out.

NikeNano commented 3 years ago

@Bobgy after looking further in to this it seems possible to get similar(not exactly the same in terms of parallism restrictions between backfilling and runing cronjobs) behaviour as we have today for scheduled jobs using argo cronjobs and workflowtemplates for backfilling logic. However also setting it up for tekton seems to be a different story. How do we go about these things, do we need to have a solution that gives full support for both or are we focusing on argo and then leave it up to the kfp/tekton team to look further in to how it can be handled for tekton. If we aim for fully supporting both workflow engines, which are quite differently I think it might be better to keep the current solution until more features are available in tekton and https://github.com/tektoncd/triggers.

Bobgy commented 3 years ago

Good point.

@Tomcli @animeshsingh may I hear your thoughts on this issue?

Tomcli commented 3 years ago

I agree to keep the current solution to avoid further divergent between Argo and Tekton. But if KFP need to use Argo cronjobs, then we want the recurring run API to have a common gRPC interface that share between the current solution and the new implementation with Argo cronjobs. This way we know what features KFP need on have on Tekton triggers, and the Tekton backend can keep using the current solution while we pushing for the necessary features to Tekton triggers.

NikeNano commented 3 years ago

Sound like a good solutions as well, since it will also allow for making new updates.

But if KFP need to use Argo cronjobs, then we want the recurring run API to have a common gRPC interface that share between the current solution and the new implementation with Argo cronjobs. This way we know what features KFP need on have on Tekton triggers, and the Tekton backend can keep using the current solution while we pushing for the necessary features to Tekton triggers.

Keeping the controller gives us more control over the runtime which I guess is also a bit more aligned with the thoughs of V2 where we go from a smart compiler to a smart run time. I think that the changes we like to do to update the features of the scheduled workflow should work in eaither case.

Bobgy commented 3 years ago

Makes sense to me!

To reduce coupling with argo, what do you think about this proposal:

Keep using scheduledworkflow controller, but change its interface from argo workflow to:

So that, previously scheduledworkflow controller would directly create argo workflows, now it will always call KFP API server to create runs.

This structure will make scheduledworkflow controller reusable between KFP v1, KFP v2 and KFP on Tekton. Also, it can reduce quite some technical debt of having two possible code paths (mostly duplicate) for creating a workflow: https://github.com/kubeflow/pipelines/blob/42630cb5ff07cf5a5c0b890b3c709c921adb02d3/backend/src/apiserver/resource/resource_manager.go#L327 and https://github.com/kubeflow/pipelines/blob/42630cb5ff07cf5a5c0b890b3c709c921adb02d3/backend/src/apiserver/resource/resource_manager.go#L655.

@Tomcli what do you think? Will you be interested to reuse this new scheduledworkflow controller? How do you currently support cron tasks for KFP on Tekton?

Tomcli commented 3 years ago

Makes sense to me!

To reduce coupling with argo, what do you think about this proposal:

Keep using scheduledworkflow controller, but change its interface from argo workflow to:

  • pipeline ID + version ID
  • parameter map

So that, previously scheduledworkflow controller would directly create argo workflows, now it will always call KFP API server to create runs.

This structure will make scheduledworkflow controller reusable between KFP v1, KFP v2 and KFP on Tekton. Also, it can reduce quite some technical debt of having two possible code paths (mostly duplicate) for creating a workflow:

https://github.com/kubeflow/pipelines/blob/42630cb5ff07cf5a5c0b890b3c709c921adb02d3/backend/src/apiserver/resource/resource_manager.go#L327

and https://github.com/kubeflow/pipelines/blob/42630cb5ff07cf5a5c0b890b3c709c921adb02d3/backend/src/apiserver/resource/resource_manager.go#L655

. @Tomcli what do you think? Will you be interested to reuse this new scheduledworkflow controller? How do you currently support cron tasks for KFP on Tekton?

Thanks @Bobgy, this sounds good to me. For the current scheduledworkflow, KFP-Tekton followed the same implementation as KFP to create the Tekton pipelineRuns directly using the Tekton go client. So calling the KFP API server for creating the runs can reduce the amount of code we need to maintain as well.

NikeNano commented 3 years ago

Great, I will write a short suggestion on the design in a google docs and start the work on it.

NikeNano commented 3 years ago

Here is an early draft super happy for all feedback: https://docs.google.com/document/d/15tf4lPUf3cBlsgOhRep9Jj2DAXxGT-oCK0BBB4G9uRk/edit?usp=sharing @Bobgy @Tomcli thanks!

Bobgy commented 3 years ago

That's awesome💯💯 Can you share comment permission to me?

NikeNano commented 3 years ago

That's awesome💯💯 Can you share comment permission to me?

Should be fixed now.

StefanoFioravanzo commented 3 years ago

@NikeNano This is a great doc! I have two comments, I am posting them here since I don't have comment permissions (maybe you could open comments to anyone?)

The suggested change is to add the pipeline id and pipeline version to the ScheduleWorkflow CRD and deprecate the workflow field

I like this. I think it's the best way to make the least amount of changes in the current controller while reducing maintenance burden.

Regarding the ability to update a Scheduled Workflow

Allowing a user to change a pipeline will allow for keeping a complete history of runs within the same ScheduledWorkflow and make traceability easier between scheduled runs.

I think that is very important to maintain a persistent history of all the changes that have been made to a Job. This change history should also be surfaced to the KFP UI. After a pipeline/version/config update to a Job, the runs could be failing, results could be skewed, anything could happen. Once would ideally want to either roll-back these changes (this can be done manually, if one does have the history of changes) or at least know when/why this is happening. I'd love yo hear your thoughts on this.

StefanoFioravanzo commented 3 years ago

@Bobgy @NikeNano Also, would this be a good time to pick one single name, used across the entire code base? It's funny that even this issue and design doc reference the same feature with 3 different names. I think "Recurring Run" is the more idiomatic term, but really any name is fine as long as we agree on one.

Bobgy commented 3 years ago

@Bobgy @NikeNano Also, would this be a good time to pick one single name, used across the entire code base? It's funny that even this issue and design doc reference the same feature with 3 different names. I think "Recurring Run" is the more idiomatic term, but really any name is fine as long as we agree on one.

I couldn't agree more. 💯

NikeNano commented 3 years ago

@NikeNano This is a great doc! I have two comments, I am posting them here since I don't have comment permissions (maybe you could open comments to anyone?)

Tried to update the link: https://docs.google.com/document/d/15tf4lPUf3cBlsgOhRep9Jj2DAXxGT-oCK0BBB4G9uRk/edit?usp=sharing

The suggested change is to add the pipeline id and pipeline version to the ScheduleWorkflow CRD and deprecate the workflow field

I like this. I think it's the best way to make the least amount of changes in the current controller while reducing maintenance burden.

Cool!

Regarding the ability to update a Scheduled Workflow

Allowing a user to change a pipeline will allow for keeping a complete history of runs within the same ScheduledWorkflow and make traceability easier between scheduled runs.

I think that is very important to maintain a persistent history of all the changes that have been made to a Job. This change history should also be surfaced to the KFP UI. After a pipeline/version/config update to a Job, the runs could be failing, results could be skewed, anything could happen. Once would ideally want to either roll-back these changes (this can be done manually, if one does have the history of changes) or at least know when/why this is happening. I'd love yo hear your thoughts on this.

Good point, both the history and the rollback make sens to keep. Will look into how we can achive this and add it to the doc.

@Bobgy @NikeNano Also, would this be a good time to pick one single name, used across the entire code base? It's funny that even this issue and design doc reference the same feature with 3 different names. I think "Recurring Run" is the more idiomatic term, but really any name is fine as long as we agree on one.

Yes, we should update this and decouple the naming from argo to make it more clear as well. Think Recurring Run sounds good.

Thanks for the feedback @StefanoFioravanzo

Bobgy commented 3 years ago

I agree history and rollback seems useful when supporting edit, but the complexity of design would grow a lot.

I'd suggest separating discussion for support updates and refactoring to be agnostic of argo, because as far as I can tell they seem orthogonal problems.

Bobgy commented 3 years ago

For updating recurring runs, I'd suggest continuing the previous thread https://github.com/kubeflow/pipelines/issues/3789

NikeNano commented 3 years ago

For updating recurring runs, I'd suggest continuing the previous thread #3789

Sounds good.

What are your @Bobgy, @StefanoFioravanzo thoughts about the migrations strategies mention in the docs? in general how do we view being backward compatible, should we support both alternative(having the complete workflow vs PipelineId and versionid ) for a couple of release or create migration tools to help people move over?

Bobgy commented 3 years ago

Suggested in the design doc

NikeNano commented 3 years ago

Thanks for the feedback, will start the actual work next week.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

NikeNano commented 3 years ago

Working on it.

Bobgy commented 3 years ago

/lifecycle frozen