darthsuogles closed this issue 2 weeks ago
I have some personal notes on the topic, will try to document them.
Thank you! Any chance you've had time to work on this in the past couple of weeks?
@Bobgy in terms of production, some guidance on which components can have > 1 replica would be very useful. Initially, I'm planning to increase the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.
The other things that I think could have a replica count > 1 are:
ml-pipeline
metadata-grpc-service
ml-pipeline-visualizationserver
Things I'm not sure about are:
controller-manager-service
Posting my unedited notes first, will try to revisit. Looking forward to any feedback.
Some of these tips are Google Cloud specific, but most of them are general advice.
Deploy in a regional cluster, even if your workload runs on zonal nodepools. Regional clusters run multiple instances of the K8s API server, so the K8s API is highly available. During scaling, upgrades, and many other maintenance operations, a zonal cluster's K8s API server is not responsive.
For KFP on GCP configure a nodepool default Google Service Account (GSA) with minimal permissions. You can grant serviceAccountUser permission to users/GSAs on this GSA to allow access to the proxy.
Enabling nodepool autoscaling is recommended so the cluster can handle periods with many workloads.
Set memory/CPU requests/limits on pipeline steps to guarantee they are not evicted when the cluster is under resource pressure. Also, Kubernetes uses resource requests as the signal for nodepool scaling, so when you enable autoscaling you should always set resource requests so that Kubernetes can properly identify when to scale up/down. Resource requests/limits can be set using the KFP DSL (see the example pipeline). Reference: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html#kfp.dsl.Sidecar.set_memory_limit.
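For reference, those DSL calls ultimately attach a standard Kubernetes resources block to the step's container. A minimal sketch, with placeholder values rather than recommendations:

```yaml
# Kubernetes resources block a pipeline step ends up with after calling
# e.g. set_memory_request / set_memory_limit / set_cpu_request / set_cpu_limit.
# Values below are placeholders; size them to your workload.
resources:
  requests:
    memory: 512Mi
    cpu: 500m
  limits:
    memory: 1Gi
    cpu: "1"
```

The scheduler and cluster autoscaler act on the requests; the limits cap what the container can consume.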
Set memory/CPU requests/limits on system services; the latest KFP release already has sane default values. However, the memory/CPU needs of the KFP API server (ml-pipeline deployment), the KFP persistence agent (ml-pipeline-persistence-agent deployment) and the Argo workflow controller (workflow-controller deployment) are roughly linear in the number of concurrent workflows (even completed ones). Therefore:
Reduce workflow TTL of completed workflows to match your use-case. Default is 1 day.
Monitor these deployments and set requests/limits based on real usage + some buffer.
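For the TTL point, with Argo this can be expressed on the Workflow spec; a sketch (the exact field depends on your Argo version — newer versions use ttlStrategy, older ones a single ttlSecondsAfterFinished field):

```yaml
# Argo Workflow spec fragment: garbage-collect workflows after completion.
# 86400 seconds = 1 day (the default mentioned above); shorten to fit your use case.
spec:
  ttlStrategy:
    secondsAfterCompletion: 86400
```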
Set up retry strategies for steps in the error state. There are two types of failures: error and failure. Error refers to orchestration-system problems, while failure refers to user-container failures. So it's recommended to specify a retryStrategy at least for errors and, depending on your use case, also for failures. Example: you can call set_retry(policy="Always") # or "OnError".
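Under the hood, set_retry maps onto an Argo retryStrategy on the step's template. A sketch of the resulting fragment (the limit value is illustrative):

```yaml
# Argo retryStrategy fragment. "OnError" retries only orchestration errors;
# "Always" also retries user-container failures.
retryStrategy:
  limit: 3
  retryPolicy: OnError
```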
If you need to customize the deployment, pull KFP manifests as an upstream and follow the off the shelf application workflow of kustomize. This allows infrastructure as code and easy upgrades.
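As a sketch of that off-the-shelf workflow, a kustomization.yaml can reference the upstream KFP manifests as a remote base and layer local patches on top. The path, ref and patch filename below are illustrative — pin them to the release and files you actually use:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Upstream KFP manifests as a remote base; pin ?ref= to your release.
  - github.com/kubeflow/pipelines//manifests/kustomize/env/platform-agnostic?ref=1.7.0
patchesStrategicMerge:
  # Your local overrides live next to this file and survive upgrades.
  - my-overrides.yaml
```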
A bonus is to use GitOps (there are many tools for this purpose): put your infrastructure-as-code in a repo and use a GitOps tool to sync it to production. This way you get version control, rollbacks, etc.
Use managed storage (Cloud SQL & Cloud Storage) to simplify lifecycle management: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample .
Configure a lifecycle policy (e.g. clean up intermediate artifacts after 7 days) for the object store you are using, e.g. for minio and for GCS. Note that in the default minio bucket, intermediate artifacts are stored in minio://mlpipeline/artifacts and pipeline templates are stored in minio://mlpipeline/pipelines, so do not set a lifecycle policy on the pipeline templates; they should be kept.
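For GCS, for example, an age-based lifecycle config (applied with `gsutil lifecycle set`) looks roughly like this. Note that such a rule applies to the whole bucket, which is exactly why intermediate artifacts and pipeline templates should not share a bucket and policy:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7}
    }
  ]
}
```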
(Quoting the question above about which components can have > 1 replica: ml-pipeline-ui, ml-pipeline, metadata-grpc-service, ml-pipeline-visualizationserver, and the uncertain controller-manager-service.)
This is something I haven't experimented with much, but from my understanding, ml-pipeline, metadata-grpc-service and ml-pipeline-visualizationserver can be made multi-replica right now.
There is a caveat that ml-pipeline and metadata-grpc-service upgrade the DB schema on startup, so if you are doing an upgrade, it's recommended to change the replica count to 1 first.
The controllers should be able to run in leader election mode: one instance is leader, one instance is standby, whenever the leader dies, the standby instance takes over. However, I believe for KFP controllers some dependency upgrade might be necessary and we need to expose flags. Argo workflow controller can be set up this way now. https://argoproj.github.io/argo-workflows/high-availability/
@Bobgy I've taken the suggestions above for the components that can have a replica count greater than one. I've also added PodDisruptionBudgets and Pod Topology Spread Constraints to avoid all the replicas landing on a single node.
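For reference, a PodDisruptionBudget for one of those components might look like the sketch below (the label selector and namespace are assumptions — match whatever labels your deployment actually uses). A topologySpreadConstraints entry with topologyKey: kubernetes.io/hostname on the pod template then keeps the replicas on different nodes.

```yaml
# Keep at least one ml-pipeline-ui pod available during voluntary disruptions.
# Selector labels are an assumption; check your actual deployment's labels.
apiVersion: policy/v1          # use policy/v1beta1 on Kubernetes < 1.21
kind: PodDisruptionBudget
metadata:
  name: ml-pipeline-ui
  namespace: kubeflow
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ml-pipeline-ui
```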
I'll have to look into getting the argo workflow controller to have an active / passive mode.
Thanks for the suggestions and advice. It's really appreciated.
Cool, interested to see how that plays out.
@Bobgy Is there an easy way to integrate Kubeflow Pipelines directly with GitOps? Currently we are just converting our pipelines to Argo Workflows. We can run and schedule those pipelines, but we lose all the fancy Kubeflow capabilities from the UI, which complicates things for data scientists. Any ideas on this?
@rubenaranamorera There's a feature request in https://github.com/kubeflow/pipelines/issues/6001.
minor update, I added a last point in my comment above about configuring a lifecycle policy for the object store.
You (@rubenaranamorera) can use the SDK if you'd like to. I did some work on this for GitHub Actions (it has not been updated in quite some time, so it might need some love to work for you): https://github.com/NikeNano/kubeflow-github-action.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle freeze
So after more than 6 months of running the configuration with replicas > 1, there haven't been any issues.
Also, the Argo workflow controller may not need to be run with more than one replica / sharded unless there is a huge number of workflows. The pod gracefully restarts on other nodes and is able to pick up work where it left off.
Would there be interest in creating an overlay for HA?
@vinayan3 could you please sum up which components could be scaled easily without causing any malfunction in your deployment? Thanks in advance.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Feature Area
/area documentation /area samples /area deployment
What feature would you like to see?
Documentation for production-grade deployment of kubeflow pipelines.
What is the use case or pain point?
Is there a workaround currently?
Unaware
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.