kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Expose Pipelines as a CRD and enable easy migration from Argo workflow #1132

Closed: inc0 closed this issue 1 year ago

inc0 commented 5 years ago

Currently, to use Pipelines you have to go through the Python SDK, which only generates an Argo workflow underneath anyway. I think this is very limiting because:

  1. not everyone uses Python
  2. it requires learning a whole new API and DSL
  3. Argo already has a lot of examples; it's a shame we can't tap into that knowledge source
  4. Argo can do much more than just data pipelines: you can learn one syntax and use it for data, CI, CD, etc.

I propose creating a new CRD that would effectively be an Argo workflow with additional options. For example:

apiVersion: kubeflow.org/v1alpha1  # hypothetical group/version for the proposed CRD
kind: Pipeline
metadata:
  generateName:  mlapp-
  labels:
    workflow: mlapp
spec:
# Add some useful pipeline specific data
  model_name: foobar
  model_version: 1
# This is just argo workflow spec
  entrypoint: mlapp
  templates:
  - name: mlapp
    dag:
      tasks:
      - name: preprocess
        template: preprocess

      - name: model1
        dependencies: [preprocess]
        template: train
        arguments:
          artifacts:
          - name: dataset
            from: "{{tasks.preprocess.outputs.artifacts.dataset}}"

  - name: preprocess
    container:
      image: myimage:latest
      name: preprocess
      command: ["python", "/src/preprocess.py"]
      env:
        - name: SOMEENV
          value: foobar
    outputs:
      artifacts:
      - name: dataset
        path: /data

  - name: train
    inputs:
      artifacts:
      - name: dataset
        path: /data
    outputs:
      artifacts:
      - name: model
        path: /output
    container:
      image: myimage:latest
      name: trainer
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
      command: ["python", "/src/train.py"]

This would make the transition to Pipelines much easier, as Operators are already a well-known pattern and they handle a lot of things for us, including RBAC, multi-tenancy, API auth, etc.

Ark-kun commented 5 years ago

I'm not sure this is needed. Currently KF Pipelines uses the Argo Workflow CRD without changes. Pipelines do not extend it; there are no extra pipeline-specific fields.
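
For reference, the object that actually gets created is a stock Argo Workflow. A minimal sketch of the example above, expressed with Argo's own apiVersion/kind (names reused from the proposal, nothing KFP-specific added):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: mlapp-
spec:
  entrypoint: mlapp
  templates:
  - name: mlapp
    dag:
      tasks:
      - name: preprocess
        template: preprocess
  # ... remaining templates exactly as in the proposal, minus the pipeline-specific fields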

If we decide to replace Argo, then we'll create a new CRD.

not everyone uses Python
it requires learning a whole new API and DSL

I do not think KF Pipelines requires you to do that. Pipelines Python SDK just allows some people to write

from kfp.components import load_component
from kfp.dsl import pipeline

preprocess = load_component(...)
train = load_component(...)

@pipeline(name='mlapp')
def mlapp(train_set):
    train(preprocess(train_set).output)

instead of writing the YAML manually.

inc0 commented 5 years ago

So, if I submitted an Argo workflow directly, would it be picked up by Pipelines immediately? How, for example, would it save metrics?

vicaire commented 5 years ago

Hi inc0@, having a CRD for pipelines is being considered. We are planning to implement this in multiple steps:

  • First, we will create a pipeline spec that combines an Argo workflow + additional data needed for ML pipelines.
  • Initially, this spec will be processed by the pipeline API server and turned into an Argo workflow.
  • Later on, we could turn this pipeline spec into a standalone CRD.
  • The long-term expectation is that the pipeline CRD will let us combine multiple orchestration CRDs useful for ML (Argo workflow, HP tuning, etc.) and let users specify additional, optional ML metadata.

Ark-kun commented 5 years ago

How, for example, will it save metrics?

To provide metrics the workflow task must have an output artifact called 'mlpipeline-metrics'.
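
For example, extending the train template from the proposal above, the metrics file just needs to be declared as that artifact (a sketch; the path is arbitrary, and the file content is expected to be the metrics JSON format KFP understands):

  - name: train
    container:
      image: myimage:latest
      command: ["python", "/src/train.py"]    # assumed to write the metrics JSON below
    outputs:
      artifacts:
      - name: mlpipeline-metrics               # the artifact name must be exactly this
        path: /tmp/mlpipeline-metrics.json     # e.g. {"metrics": [{"name": "accuracy-score", "numberValue": 0.95}]}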

So, if I'd submit argo workflow, it will be picked up by pipelines immediatly?

You have to submit the workflow through the Pipelines API. You can use either the Python client (kfp.Client(...).run_pipeline(...)) or the CLI: https://github.com/kubeflow/pipelines/tree/master/backend/src/cmd/ml

Note that this is not considered a supported mode of operation. It may break in the future.

vicaire commented 5 years ago

@Ark-kun, having a CRD for pipeline is something that we are considering. Let's please keep this open.

yanniszark commented 5 years ago

Adding to this, having a Pipelines CRD would also provide a path for multi-user pipelines, as Kubernetes CRDs have built-in authentication and authorization via the API Server, like any other Kubernetes Object. As such, maybe there is some overlap with https://github.com/kubeflow/pipelines/issues/1223
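
For illustration, if pipelines were a CRD, per-namespace access could be granted with ordinary RBAC. A sketch, assuming a hypothetical pipelines resource under the kubeflow.org group:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pipeline-editor
  namespace: team-a                  # hypothetical tenant namespace
rules:
- apiGroups: ["kubeflow.org"]        # hypothetical group for the proposed Pipeline CRD
  resources: ["pipelines"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]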

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Bobgy commented 4 years ago

/lifecycle frozen

I think this is something we'd want to consider for the long term.

alexlatchford commented 3 years ago

Chiming in here; there's more background in this Slack thread.

Our use case at Zillow is to be able to deploy monitoring alongside scheduled pipelines. We use Datadog internally and have created a K8s operator for creating Datadog Monitors (essentially alerts triggered by metrics crossing thresholds); it just reconciles the state of the resources with the Datadog API.

We would like to be able to use a standard kubectl apply (or better, a kubectl apply -k with kustomize) to deploy a ScheduledWorkflow CRD resource (see these samples) alongside the custom DatadogMonitor CRD resources. This is an extensible pattern, and in the future we are planning to produce a Datadog Dashboards operator so we could dynamically create dashboards on a per-ScheduledWorkflow basis (useful for defining and monitoring SLOs, for instance).
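
For instance, a kustomization along these lines could bundle both kinds of resources into a single kubectl apply -k (a sketch; the file names and labels are hypothetical):

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-team
resources:
- scheduled-workflow.yaml    # the pipeline's ScheduledWorkflow resource
- datadog-monitors.yaml      # the accompanying DatadogMonitor resources
commonLabels:
  app.kubernetes.io/part-of: mlapp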

This would also allow us to unify our CI/CD pipeline with KFServing. Essentially we have the same pattern there: we generate a set of resource manifests using kustomize, and in that case it's an InferenceService plus a set of DatadogMonitors. Since we have an underlying core K8s team, they already have CI/CD pipelines for running kubectl apply -k internally, so instead of maintaining custom CI/CD pipelines on top of the kfp CLI/SDK tooling (the public interfaces KFP currently exposes), this would allow us to align wholly with the rest of our company and reduce maintenance overhead!

Bobgy commented 3 years ago

@alexlatchford for clarification, does the use case only apply to ScheduledWorkflow?

It sounds to me like one-time pipeline runs do not need a CRD interface.

alexlatchford commented 3 years ago

I think we'd ideally prefer to use the same CI/CD pipeline regardless, so I imagine we'd use ScheduledWorkflow in this mode just to unify the deployment process.

rubenaranamorera commented 3 years ago

Is this something that is still possible? It would be nice to have pipeline CRDs so we could integrate pipelines with GitOps without losing all the UI capabilities.

kujon commented 3 years ago
  • First, we will create a pipeline spec that combines an Argo workflow + additional data needed for ML pipelines.
  • Initially, this spec will be processed by the pipeline API server and turned into an Argo workflow.
  • Later on, we could turn this pipeline spec into a standalone CRD.
  • The long-term expectation is that the pipeline CRD will let us combine multiple orchestration CRDs useful for ML (Argo workflow, HP tuning, etc.) and let users specify additional, optional ML metadata.

@vicaire as I understand it, steps 1 + 2 above have been completed; are there still plans to introduce a standalone CRD? Having to rely on the Python SDK and submit files to the Kubeflow API instead of the Kubernetes API makes Kubeflow a really hard sell. In our case, dedicated CI/CD workflows need to be developed, and we can't rely on any of the tooling (e.g. helm-secrets) that works with virtually anything else deployed onto Kubernetes.

chensun commented 1 year ago

Currently, there's no plan to make pipeline a CRD. In fact, we are moving to make pipeline platform-agnostic.

laurence-hudson-mindfoundry commented 1 year ago

I have the same use case as kujon, rubenaranamorera, and alexlatchford. We deploy things using a Flux-based GitOps workflow. The lack of an option to declaratively define Kubeflow pipelines as Kubernetes resource objects that can be kubectl apply'd is a pain, and seems like a departure from K8s norms. It also seems inconsistent with other KF components like KServe, where you have InferenceService resource objects, etc.
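
For comparison, this is roughly what that looks like on the KServe side: an InferenceService is a plain Kubernetes object that Flux (or kubectl apply) can manage directly (a minimal sketch; the model name and storage URI are placeholders):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mlapp-model
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/mlapp/model   # placeholder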