kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] documentation for production grade deployment of kubeflow pipelines #6204

Closed darthsuogles closed 2 weeks ago

darthsuogles commented 2 years ago

Feature Area

/area documentation /area samples /area deployment

What feature would you like to see?

Documentation for production-grade deployment of kubeflow pipelines.

What is the use case or pain point?

Is there a workaround currently?

Unaware


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

Bobgy commented 2 years ago

I have some personal notes on the topic, will try to document them.

darthsuogles commented 2 years ago

Thank you! Any chance you have had time to work on this in the past couple of weeks?

vinayan3 commented 2 years ago

@Bobgy in terms of production some guidance on what components can have > 1 replica would be very useful. Initially, I'm planning to try to increase the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.

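The other things that I think could have replica count >1 are:

  • ml-pipeline
  • metadata-grpc-service
  • ml-pipeline-visualizationserver

Things I'm not sure about are:

  • controller-manager-service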

Bobgy commented 2 years ago

Posting my unedited notes first, will try to revisit. Looking forward to any feedback.

Some of these tips are Google Cloud specific, but most of them are general advice.

Bobgy commented 2 years ago

@Bobgy in terms of production some guidance on what components can have > 1 replica would be very useful. Initially, I'm planning to try to increase the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.

The other things that I think could have replica count >1 are:

  • ml-pipeline
  • metadata-grpc-service
  • ml-pipeline-visualizationserver

Things I'm not sure about are:

  • controller-manager-service

I haven't experimented with this much. From my understanding:

can be made multi replica right now.

There is a caveat: ml-pipeline and metadata-grpc-service upgrade the DB schema on startup, so if you are doing an upgrade, I recommend scaling them down to 1 replica first.
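A minimal sketch of that upgrade order, not a tested procedure. It only builds `kubectl scale` command lines so they can be reviewed before anything is run. The `kubeflow` namespace and the Deployment names are assumptions taken from the names used in this thread; in particular, the Deployment behind `metadata-grpc-service` may be named differently in your installation.

```python
from __future__ import annotations

import shlex

# Assumptions for illustration only: verify both with
# `kubectl get deployments -n <namespace>` before using them.
NAMESPACE = "kubeflow"
SCHEMA_MIGRATING = ["ml-pipeline", "metadata-grpc-service"]


def scale_command(deployment: str, replicas: int, namespace: str = NAMESPACE) -> list[str]:
    """Build a `kubectl scale` invocation for one Deployment."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]


def upgrade_plan(ha_replicas: int = 2) -> tuple[list[list[str]], list[list[str]]]:
    """Return (commands to run before the upgrade, commands to run after it).

    Before: scale every schema-migrating component down to 1 replica.
    After: restore the multi-replica count once the upgrade has finished.
    """
    before = [scale_command(name, 1) for name in SCHEMA_MIGRATING]
    after = [scale_command(name, ha_replicas) for name in SCHEMA_MIGRATING]
    return before, after


if __name__ == "__main__":
    before, after = upgrade_plan()
    for cmd in before:
        print(shlex.join(cmd))
    print("# ... apply the upgraded manifests here ...")
    for cmd in after:
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; piping the output into a shell is left as a deliberate manual step.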

The controllers should be able to run in leader election mode: one instance is the leader and one is on standby, and whenever the leader dies, the standby instance takes over. However, I believe the KFP controllers might need some dependency upgrades, and we would need to expose the relevant flags. The Argo workflow controller can be set up this way now: https://argoproj.github.io/argo-workflows/high-availability/
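To make the leader/standby behaviour concrete, here is a toy, single-process sketch of lease-based leader election. It is not how the KFP or Argo controllers are implemented; real controllers use a shared lock object in the cluster with atomic updates, which this sketch does not model. The `Lease` class, the function name, and the 15-second TTL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """A shared record saying who leads and until when (hypothetical type)."""
    holder: Optional[str] = None
    expires_at: float = 0.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float, ttl: float) -> bool:
    """Return True if `candidate` is the leader after this attempt.

    A candidate becomes (or stays) leader when it already holds the lease,
    when nobody holds it, or when the current holder let it expire.
    """
    if lease.holder == candidate or lease.holder is None or now >= lease.expires_at:
        lease.holder = candidate
        lease.expires_at = now + ttl
        return True
    return False


if __name__ == "__main__":
    lease = Lease()
    ttl = 15.0
    print(try_acquire_or_renew(lease, "controller-a", now=0, ttl=ttl))   # a becomes leader
    print(try_acquire_or_renew(lease, "controller-b", now=5, ttl=ttl))   # b stays on standby
    print(try_acquire_or_renew(lease, "controller-a", now=10, ttl=ttl))  # a renews
    # controller-a dies here and stops renewing its lease
    print(try_acquire_or_renew(lease, "controller-b", now=25, ttl=ttl))  # b takes over
```

The standby only takes over after the lease expires, so the TTL bounds how long the cluster can be without an active controller.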

vinayan3 commented 2 years ago

@Bobgy I've taken the suggestions above for the components that can have a replica count greater than one. I've also added PodDisruptionBudgets and Pod Topology Spread Constraints to keep all the replicas from landing on a single node.
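For readers who want a starting point, the following is a rough sketch of what such objects could look like, generated as JSON (which `kubectl apply -f` accepts). It is not the exact configuration described in this comment. The `app: <name>` label selector, the `kubeflow` namespace, the `-pdb` name suffix, and the chosen values (`minAvailable: 1`, `maxSkew: 1`, `ScheduleAnyway`) are assumptions to adapt to your own manifests.

```python
import json


def pod_disruption_budget(app: str, namespace: str = "kubeflow", min_available: int = 1) -> dict:
    """PodDisruptionBudget that keeps at least `min_available` pods of `app` running
    during voluntary disruptions such as node drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb", "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            # Assumption: the pods are labelled `app: <deployment name>`.
            "selector": {"matchLabels": {"app": app}},
        },
    }


def spread_fragment(app: str, replicas: int = 2) -> dict:
    """Deployment fragment that raises the replica count and asks the scheduler
    to spread the pods across nodes."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "topologySpreadConstraints": [
                        {
                            "maxSkew": 1,
                            "topologyKey": "kubernetes.io/hostname",
                            # ScheduleAnyway is a soft preference; DoNotSchedule
                            # would make it a hard requirement.
                            "whenUnsatisfiable": "ScheduleAnyway",
                            "labelSelector": {"matchLabels": {"app": app}},
                        }
                    ]
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(pod_disruption_budget("ml-pipeline-ui"), indent=2))
    print(json.dumps(spread_fragment("ml-pipeline-ui"), indent=2))
```

The second fragment is meant to be merged into the existing Deployment, for example as a Kustomize patch, rather than applied on its own.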

I'll have to look into getting the argo workflow controller to have an active / passive mode.

Thanks for the suggestions and advice. It's really appreciated.

Bobgy commented 2 years ago

Cool, interested to see how that plays out.

rubenaranamorera commented 2 years ago

@Bobgy Is there any easy way to integrate kubeflow pipelines directly with gitops? Currently we are just converting our pipelines to Argo workflows. We can run and schedule those pipelines, but we lose all the fancy kubeflow capabilities from the UI, and it complicates things for data scientists. Any ideas on this?

Bobgy commented 2 years ago

@rubenaranamorera There's a feature request in https://github.com/kubeflow/pipelines/issues/6001.

Bobgy commented 2 years ago

Minor update: I added a last point to my comment above about configuring a lifecycle policy for the object store.

NikeNano commented 2 years ago

@Bobgy Is there any easy way to integrate kubeflow pipelines directly with gitops? Currently we are just converting our pipelines to Argo workflows. We can run and schedule those pipelines, but we lose all the fancy kubeflow capabilities from the UI, and it complicates things for data scientists. Any ideas on this?

You (@rubenaranamorera) can use the SDK if you like. I did some work on this for GitHub Actions (it has not been updated in quite some time, so it might need some love to work for you): https://github.com/NikeNano/kubeflow-github-action.
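As a rough illustration of the SDK route (this is not the code from the linked action), a CI job could upload a compiled pipeline on every merge. The sketch assumes the `kfp` package is installed and that `kfp.Client` offers `get_pipeline_id`, `upload_pipeline`, and `upload_pipeline_version` with these keyword arguments; signatures differ between SDK versions, so check the one you use. `KFP_HOST`, `pipeline.yaml`, and `my-pipeline` are hypothetical placeholders.

```python
import os


def sync_pipeline(client, package_path: str, pipeline_name: str, version_name: str):
    """Upload a compiled pipeline, or a new version if the pipeline already exists.

    `client` is expected to behave like `kfp.Client`; it is passed in so the
    logic can be exercised without a running cluster.
    """
    pipeline_id = client.get_pipeline_id(pipeline_name)
    if pipeline_id is None:
        # First upload: create the pipeline itself.
        return client.upload_pipeline(
            pipeline_package_path=package_path,
            pipeline_name=pipeline_name,
        )
    # Pipeline exists: add a new version, e.g. named after the commit SHA.
    return client.upload_pipeline_version(
        pipeline_package_path=package_path,
        pipeline_version_name=version_name,
        pipeline_id=pipeline_id,
    )


def main() -> None:
    """Entry point for a CI job; call this explicitly from the job script."""
    import kfp  # imported here so the helper above stays importable without kfp

    client = kfp.Client(host=os.environ["KFP_HOST"])  # hypothetical variable name
    version = os.environ.get("GITHUB_SHA", "dev")     # set by GitHub Actions
    sync_pipeline(client, "pipeline.yaml", "my-pipeline", version)
```

Because the pipelines stay registered in KFP, runs started this way still show up in the KFP UI, which is the part that gets lost when converting to plain Argo workflows.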

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Bobgy commented 2 years ago

/lifecycle freeze

vinayan3 commented 2 years ago

So after more than 6 months of running the configuration with replica > 1, there haven't been any issues.

Also, for Argo Workflows the controller may not need to be run with more than one replica or sharded unless there is a huge number of workflows. The pod gracefully restarts on other nodes and is able to pick up work where it left off.

Would there be interest in creating an overlay for HA?

daro1337 commented 4 months ago

@vinayan3 could you please sum up which components could be scaled easily without causing any malfunction in your deployment? Thanks in advance.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.