kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Logical Intermediate Pipeline Representation #3703

Closed talebzeghmi closed 1 year ago

talebzeghmi commented 4 years ago

There are currently two planned Kubeflow Pipelines (KFP) efforts to compile to an intermediate representation.

  1. Kubeflow Pipelines and Tekton #3647: design doc
  2. Merged TFX and KFP SDK: design doc

I'm creating this issue to coordinate and single out an intermediate representation (IR). The IR would ideally be a neutral project outside of both KFP and TFX, and so should the Python SDK that produces it, as @animeshsingh suggested.

Possible intermediate representations (in no particular order):

  1. Common Workflow Language: https://en.wikipedia.org/wiki/Common_Workflow_Language

  2. https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json

See also:

  1. MLGraph or subset of it.

  2. PFA (successor to PMML).

  3. https://github.com/openml/flow2 from https://www.openml.org/

animeshsingh commented 4 years ago

@talebzeghmi this is something we have been discussing with the KFP team, and we hope to reach a conclusion in the very near future around the state of the IR, and get the effort moved into the community soon.

@paveldournov @neuromage @jessiezcc @Ark-kun

kumare3 commented 4 years ago

Hello all, I lead an effort called Flyte. We could explore whether Flyte could be a potential compilation target for KFP. We have an experimental (incomplete) compiler, and would love to hear your thoughts and opinions and to collaborate on options.

Ketan

talebzeghmi commented 4 years ago

The benefit of an intermediate representation (IR), especially if residing in a neutral repo outside from KFP, is that disparate ML SDKs can compile to the same IR, and the IR can be executed by disparate engines.

Possible SDKs:

Possible execution engines:

kumare3 commented 4 years ago

@talebzeghmi at the moment Flyte has an intermediate representation, specified in protobuf - https://github.com/lyft/flyteidl/blob/master/protos/flyteidl/core/workflow.proto#L147

This allows FlyteAdmin (the control plane) to JIT-compile from this representation to the executable representation, in our case FlytePropeller's Workflow CRD. We could alternatively target the Argo CRD or Tekton too; I am sure the mapping is not 1:1, and therein lies the challenge.

karlschriek commented 4 years ago

@kumare3 I work with quite a few teams who are currently thinking of moving their workflows to Kubeflow, but they are also interested in initiatives like Flyte (and also Metaflow). At the moment it is hard to concretely show them how these could work together.

Metaflow currently comes across as very much its own thing (although it claims architectural independence in its design, at the moment it only supports AWS managed services in any meaningful way). The fact that Flyte runs on Kubernetes makes it seem a natural fit for Kubeflow. Anyway, I don't want to hijack the specific discussion here; I just wanted to point out that having Flyte play nicely with Kubeflow (Pipelines in particular) would likely be greeted with enthusiasm by a lot of users.

kumare3 commented 4 years ago

@karlschriek I am absolutely open to all conversations; we would love to serve the Kubeflow community in general. We are also planning to have an experimental "Kubeflow Pipelines"-to-"Flyte" compiler soon, which should allow using Flyte to run Kubeflow pipelines.

This is not the ideal solution because it will not use Flyte's native plugin system, and it will not use KFP's UI (as there are no hooks today). But we could start pushing for the right integration?

In the meantime @karlschriek please join the Flyte Slack and ping me if you want to discuss ideas.

jlewi commented 4 years ago

@kumare3 Would you be interested in presenting flyte at an upcoming Kubeflow community meeting? You can find the calendar here

Just leave a comment in the Google doc to let us know which date you are interested in presenting on.

savingoyal commented 4 years ago

@karlschriek Although most of our (metaflow) integrations are with AWS managed services, there is work underway to integrate with GCP and K8S.

We also have a PR out for compiling Metaflow workflows into AWS Step Functions specification, and there is significant interest within the community to have a similar integration with KfP.

kumare3 commented 4 years ago

@jlewi I would definitely be interested in presenting Flyte at the community meeting. Do you think a general presentation about Flyte and our design decisions would be good?

jlewi commented 4 years ago

@kumare3 yes. Could you email the mailing list at kubeflow-discuss@googlegroups.com to coordinate a presentation at an upcoming community meeting?

talebzeghmi commented 4 years ago

Mitar Milutinovic suggested [1] looking at https://github.com/openml/flow2

[1] https://gitlab.com/datadrivendiscovery/d3m/-/issues/458

Ark-kun commented 4 years ago

Please take a look at the TFX IR for a Logical pipeline representation. You can author a pipeline using the TFX SDK and then submit it for execution on Kubeflow.

The TFX IR is the suggested way forward for future development.

If for some reason you cannot use the TFX SDK and TFX IR, the KFP SDK already has a structure for pipeline persistence.

Sharing and persistence

For component sharing, the KFP SDK already has a portable, platform-independent structure called ComponentSpec (schema, outline, approximate proto). This structure is usually serialized into component.yaml definition files, and we have built a big ecosystem of those components (hundreds of them). This format has been stable since the first release of KFP, and we are pledging support for components authored and persisted in it. See an example of a component yaml.
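For illustration, a minimal container component.yaml might look like the sketch below. The image and the inline script are hypothetical; see the linked schema for the authoritative field list.

```yaml
# Hypothetical minimal component.yaml following the ComponentSpec outline.
name: Add two numbers
inputs:
- {name: a, type: Integer}
- {name: b, type: Integer}
outputs:
- {name: sum, type: Integer}
implementation:
  container:
    image: python:3.9        # hypothetical base image
    command:
    - python3
    - -c
    - |
      import sys, pathlib
      a, b, out_path = sys.argv[1], sys.argv[2], sys.argv[3]
      pathlib.Path(out_path).parent.mkdir(parents=True, exist_ok=True)
      pathlib.Path(out_path).write_text(str(int(a) + int(b)))
    - {inputValue: a}        # placeholder resolved to the input's value
    - {inputValue: b}
    - {outputPath: sum}      # placeholder resolved to the output file path
```

The {inputValue: ...} and {outputPath: ...} placeholders are how the ComponentSpec keeps the container command platform-independent: the executing backend substitutes concrete values and paths at run time.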

Graph components

While most components are backed by a container, the ComponentSpec structure also allows having a graph implementation. (schema, outline, approximate proto) This essentially enables 'pipeline-as-component' feature. See an example of an end-to-end pipeline saved as a graph component yaml.

KFP's Python SDK provides a way to compile a pipeline function to a graph component (create_graph_component_from_pipeline_func).

A pipeline saved as a graph component can be easily sent for execution:

import kfp

# Load a pipeline saved as a graph component and submit it for execution.
my_pipeline_op = kfp.components.load_component_from_url(...)
kfp.Client().create_run_from_pipeline_func(my_pipeline_op, arguments={})

Being part of the ComponentSpec format, graph components provide a portable, platform-independent pipeline persistence format. The format is simple and minimalistic, so it is fairly easy to convert to another workflow format for execution: Argo, Tekton, Airflow, etc.
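As a toy sketch of such a conversion (not the real KFP compiler; all field names here are simplified stand-ins for the actual schemas), a graph-component-like structure could be mapped to an Argo-Workflow-like dict:

```python
# Toy sketch: converting a simplified graph-component structure into a
# simplified Argo-Workflow-style dict. Field names are illustrative only.

def graph_component_to_argo(component: dict) -> dict:
    """Map each task in the graph to an Argo DAG task."""
    tasks = component["implementation"]["graph"]["tasks"]
    dag_tasks = []
    for name, task in tasks.items():
        dag_tasks.append({
            "name": name,
            "template": task["componentRef"],       # which template to run
            "dependencies": task.get("after", []),  # simplified ordering info
        })
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "spec": {
            "entrypoint": "main",
            "templates": [{"name": "main", "dag": {"tasks": dag_tasks}}],
        },
    }

pipeline = {
    "name": "demo",
    "implementation": {"graph": {"tasks": {
        "train": {"componentRef": "trainer"},
        "eval": {"componentRef": "evaluator", "after": ["train"]},
    }}},
}
workflow = graph_component_to_argo(pipeline)
```

A converter for Tekton or Airflow would follow the same shape, emitting that engine's task structure instead of an Argo DAG.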

The advice from the KFP team is:

If you want to persist a pipeline, Python code is the most supported way for now; if you want to persist the pipeline in a non-Python format, use the graph component file format.

We'd like the graph component format to be KFP's logical intermediate pipeline representation.

Open questions:

Adding Backend support for IR/graph-components

This has been considered, but not all stakeholders were on board. The implementation is doable.

Making Frontend IR-native

The frontend works directly with the workflow status object updated by the orchestrator. Since the orchestrator is Argo, the frontend is currently tied to the Argo WorkflowStatus structure (in addition to the WorkflowSpec structure).

Refactoring the DSL to be based on the Graph Component structures.

This might be a good idea that could improve the capabilities of the create_graph_component_from_pipeline_func function, which essentially compiles a Python pipeline to the IR.

Implementing those major refactorings does not seem to bring immediate improvements for KFP users, and the KFP team might lack the bandwidth to implement them.

We're seeking feedback for improving the pipeline --> graph component transformation code so that more features become available. We're also encouraging compiler authors to consider the graph component format as an intermediate representation they can work with.

animeshsingh commented 4 years ago

Thanks @Ark-kun. Is this the final word on the IR - using GraphComponentSpec? We were told there was another IR in consideration with the TFX team - has that goal been dropped?

If the answer to the above is yes, then this makes sense. Now, apart from the issues you mentioned

  1. Adding Backend support for IR/graph-components
  2. Making Frontend IR-native
  3. Refactoring the DSL to be based on the Graph Component structures.

a few other things are needed:

  1. Pipelines either are persisted in Python, or using this IR, and that's how they are shared e.g. on Google's AIHub
  2. Additionally these are the limitations in the IR

cc @neuromage @paveldournov @jessiezcc

animeshsingh commented 4 years ago

Some more details are in the slides I used for the pipeline community meeting: https://www.slideshare.net/AnimeshSingh/kubeflow-pipelines-with-tekton

Ark-kun commented 4 years ago

Is this the final word on IR - Using GraphComponentSpec? We were being told there another IR in consideration with TFX team - has that goal been dropped?

No, that goal has not been dropped. I guess my answer was ambiguous; I've reworded it. The information about the graph ComponentSpec was only applicable to people who only use the KFP SDK and want to persist their pipelines right now. The graph ComponentSpec is only an alternative to inventing a new IR, not an alternative to the TFX IR.

The TFX IR is the road forward for future development. Please try using it.

Pipelines either are persisted in Python, or using this IR, and that's how they are shared e.g. on Google's AIHub

I'm not sure I fully understand what you want to say with this item.

The reason that the sample pipelines use Python is so that they are easier to understand for the users (Python DSL vs YAML). We do not have good editing tools for YAML-based pipelines.

You can upload graph component.yaml files to AI Hub. You can even have zip files with both Argo's pipeline.yaml and component.yaml. load_component_from_url has supported loading them since AI-Hub launch.

Additionally these are the limitations in the IR

I think most of the limitations are not really limitations of the graph ComponentSpec format, but rather of the particular SDK and how it consumes/produces ComponentSpec. If you change the pipeline persistence format, the missing feature implementations won't magically fix themselves.

We're practicing "demand-driven development", where we implement features when there are requests for them. Please file issues so we can learn about the demand. For example, we have an open PR for making use of the Kubernetes options from loaded graph component tasks (passing kubernetes_options through to ContainerOp), but the PR is not moving forward due to lack of demand.

Will not work on ResourceOp, VolumeOp, VolumeSnapShotOp

I'd consider these to be pretty Argo-specific.

Argo implements those as containers that just run kubectl. I think it would make the SDK more portable if we changed ResourceOp to use an explicit container. I already created a PR last week to do that for ResourceOp.delete(): https://github.com/kubeflow/pipelines/pull/3841/files. ResourceOp.create() will follow.

ExitOp

The per-pipeline ExitOp probably belongs in the PipelineRunSpec, not the ComponentSpec.

Conditionals

Conditionals are supported by the format (although not applied during loading). See TaskSpec.is_enabled.

nested pipeline

The graph ComponentSpec supports that. Every TaskSpec has a component_ref which can point to any component - container or graph. Loading should also work (I need to check and add a test for that).

Loops

This is the only real limitation of the ComponentSpec as of now. Loops are not easy to design (especially portably), and loops usually require advanced output-aggregation capabilities, which are even harder to design. I have plans to add a for-style loop, with foreach-style loops (like Argo's withItems) implemented by compilers as syntactic sugar.
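For reference, Argo's foreach-style loop is expressed with withItems, which fans one step out into one task per item - the kind of construct a compiler could expand as syntactic sugar. A minimal Argo snippet (step and template names here are hypothetical):

```yaml
# Argo Workflows foreach-style loop via withItems (names hypothetical).
steps:
- - name: process-item
    template: process            # hypothetical container template
    arguments:
      parameters:
      - {name: message, value: "{{item}}"}   # current loop item
    withItems: [apple, banana, cherry]       # one task per item
```

Output aggregation (collecting the per-item results back into a single value) is the part withItems alone does not solve, which is why it is the harder half of the design.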

executionOptions for adding Kubernetes spec doesn’t seem to work.

This is a missing SDK feature which was moving slowly due to lack of demand. Please create/upvote issues, so that features can be prioritized. See https://github.com/kubeflow/pipelines/pull/3447 and https://github.com/kubeflow/pipelines/pull/3448

Features such as input artifacts

Can you explain the feature and the limitation? I think ComponentSpec has the same or better support for artifacts than even ContainerOp.

runAfter, and timeout

Please create feature request issues. The overall design is bigger than the implemented parts, since we do not want to implement features prematurely before there is demand.

When features are implemented prematurely, without first collecting demand feedback, the design can be suboptimal and require breaking changes in the future. We're trying to avoid that by keeping the design minimal.

P.S. For some of the features you've listed, there are sizeable gaps in the DSL -> graph ComponentSpec and graph ComponentSpec -> ContainerOp conversions, but these are not issues in the format itself. Whatever is used as the IR, these gaps will need to be filled.

P.P.S. One reason for some of the DSL -> graph ComponentSpec gaps is that the component library part of KFP (kfp.components) is deliberately independent from the DSL and compiler.

Ark-kun commented 4 years ago

some more details in the slides I used for pipeline community meeting https://www.slideshare.net/AnimeshSingh/kubeflow-pipelines-with-tekton

P.S. I really liked your presentation and the visual pipeline editing tool.

talebzeghmi commented 4 years ago

The TFX IR is the road forward for the future development. Please try using it.

@Ark-kun can you please share what the TFX IR is?

I ask because Metaflow are asking for the KFP IR to interface with KFP here https://github.com/Netflix/metaflow/issues/16#issuecomment-631004976

thanks!

Ark-kun commented 4 years ago

A clarification:

My initial answer might have been ambiguous. The TFX IR is the recommended way forward for future development. The information about KFP's graph ComponentSpec was only applicable to people who only use the KFP SDK and want to persist their pipelines right now; it was suggested only as an alternative to inventing a new IR, not as an alternative to the TFX IR.

talebzeghmi commented 4 years ago

Thanks @Ark-kun is the TFX IR in development or ready to share? Are you able to share links? I ask because I'm interested in having Metaflow compile to the IR and run on KFP. thank you!

rmgogogo commented 4 years ago

Thanks @Ark-kun is the TFX IR in development or ready to share? Are you able to share links? I ask because I'm interested in having Metaflow compile to the IR and run on KFP. thank you!

@zhitaoli on TFX IR.

We are actively discussing this topic now and should be able to provide an initial update here around the middle of next week. It's possible we may define a layered IR (a core plus different extensions).

animeshsingh commented 4 years ago

Thanks @Ark-kun for the clarification. So it's established that the TFX IR is the future.

@rmgogogo looking forward to the update

rmgogogo commented 4 years ago

The first initial checkin for IR is here. https://github.com/tensorflow/tfx/blob/master/tfx/proto/orchestration/pipeline.proto#L318

@zhitaoli to correct me.

As for how the KFP side makes corresponding changes, I'm still evaluating the details. It's a big change and may be worth a 2.0 version number.

My current thoughts / design goals are:

  1. Decouple the orchestrator (e.g. Argo) from pipeline authoring (the SDK) via the IR.

The IR is expected to abstract away the K8s concepts (e.g. PVC, ConfigMap, etc.), so that the IR can be quickly tested in other environments (e.g. a local/dev environment without a K8s cluster).

  2. Decouple the orchestrator (e.g. Argo) from the front-end / visualizer.

Currently our FE has many K8s/Argo concepts, while much of the data can be fetched from the file system and is normally indexed by MLMD. Alexey also mentioned the same in a previous reply: "Making Frontend IR-native". I may extend it to "Making Frontend IR & MLMD native". So if the Tekton solution proposed by Animesh generates the same data, the new visualizer should work.

Welcome more inputs.

jlewi commented 4 years ago

@rmgogogo or @zhitaoli is there a corresponding RFC/doc that describes the thought behind the IR in more details?

@rmgogogo @talebzeghmi What are the implications of using proto as opposed to say OpenAPI/Swagger as the IDL?

The IR is expected to abstract the K8s concepts (e.x. PVC, ConfigMap etc.).

What does this mean in terms of how K8s concepts get surfaced? e.g. if people want to be able to attach PVCs do we first need to define a suitable abstraction in the IR?

/cc @animeshsingh

zhitaoli commented 4 years ago

We are working towards publishing a doc about the IR. Because it also includes semantics for async pipelines, which might be foreign to batch-based pipelines, we intend to take a gradual approach: first discuss those semantics, then present an IR proposal which can model them.


animeshsingh commented 4 years ago

Thanks @rmgogogo and @zhitaoli - great news! Will dive deeper into the IDL.

Vis-à-vis OpenAPI/Swagger would be perfect, but proto is not a deal breaker here. What may be an issue is if we are not able to surface K8s constructs (PVC, ConfigMap, and there are quite a few more in the KFP DSL) either through the core IDL or an IDL extension for folks running on Kube, which is what more than half of enterprises are using in production right now.

Our effort below to map the Kubeflow Pipelines DSL to a Tekton backend lists all the DSL functionalities we have to implement for Tekton, and we would expect the IDL to be able to capture those from the DSL so that they can be relayed back to Argo or Tekton.

https://github.com/kubeflow/kfp-tekton/blob/master/sdk/FEATURES.md

And again, the more collaborative we can be on this effort, the better for the project, so moving an IR proposal forward as quickly as possible so the community can align and help out will work to everyone's mutual advantage here.

rmgogogo commented 4 years ago

"which is what more than half of the enterprises are using in production right now."

+1, it's important from runtime perspective.

From the ML perspective, this one is worth a look to understand more of @zhitaoli's previous replies: https://github.com/tensorflow/community/pull/253

animeshsingh commented 4 years ago

@rmgogogo @zhitaoli any update on the IR RFC/doc?

rmgogogo commented 4 years ago

@rmgogogo @zhitaoli any update on the IR RFC/doc?

@hongye-sun (didn't find Ruoyu's Github account)

Some high level info around how we plan to provide an IR and support it.

rmgogogo commented 4 years ago

As for the Tekton runner, one big difference from Argo is that Tekton can run multiple steps in one pod so that they can share common setup, e.g. secrets and volumes. Any others I missed? I'm thinking about whether to bring this concept into the IR, but I would like to learn more about Tekton's benefits.

In an ML pipeline, putting multiple steps in one pod may actually have disadvantages, e.g. one step requires a lot of CPU/MEM/GPU while the others don't, so it's better not to put them in one pod. In Tekton they would then have to be in two pods, which also means losing the benefit of that differentiating Tekton feature.

(BTW, it's also possible we'll implement another orchestration engine, not on Argo or Tekton but more MLMD-friendly, as MLMD itself is highly related to orchestration. That's more of a long-term plan. As a first step (Q3/Q4), we may still make delta changes based on Argo.)

jlewi commented 4 years ago

@rmgogogo For the execution spec; is there a proto or OpenAPI Spec somewhere that indicates what this might look like?

jlewi commented 4 years ago

@rmgogogo ping?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rmgogogo commented 4 years ago

Hongye's PR, Pipeline IR #4371, contains the detailed info.

Bobgy commented 4 years ago

/lifecycle frozen

RobbeSneyders commented 1 year ago

Is there still an ongoing effort to get the IR YAML format supported by different SDKs and execution engines? And if so, where could I find the status?

We are now compiling Fondant pipelines to IR YAML, partially relying on the KfP SDK, which allows us to run them on both KfP and Vertex AI. We would be interested in the ability to execute Fondant pipelines on more execution engines leveraging IR YAML.

We have also implemented a simple LocalRunner based on Docker Compose. We currently compile to Docker Compose directly, but if there were wider support for IR YAML, we would be interested in using it as an intermediate representation and compiling to Docker Compose from there. This could lead to a Fondant SDK and a simple Docker Compose execution engine for IR YAML.
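As a toy illustration of that idea (not Fondant's actual implementation; all field names are simplified stand-ins for the real IR YAML and Compose schemas), compiling an IR-like list of container tasks into a Docker Compose-style service map could look like:

```python
# Toy sketch: mapping a simplified IR-like task list to a Docker
# Compose-style dict. Field names are illustrative, not the real schemas.

def ir_to_compose(tasks: list[dict]) -> dict:
    services = {}
    for task in tasks:
        services[task["name"]] = {
            "image": task["image"],
            "command": task.get("command", []),
            # Compose's depends_on gives a rough analogue of DAG ordering.
            "depends_on": task.get("upstream", []),
        }
    return {"services": services}

compose = ir_to_compose([
    {"name": "load", "image": "busybox", "command": ["echo", "load"]},
    {"name": "embed", "image": "busybox", "upstream": ["load"]},
])
```

Note that depends_on only orders container startup; a real runner would also need to wait for each upstream task to finish before starting the next one.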

RobbeSneyders commented 11 months ago

@Ark-kun could you tell me where I can find the latest status on this? See my message above.