kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

kubeflow pipeline add support of postgresql #7512

Open yiyuanyu17 opened 2 years ago

yiyuanyu17 commented 2 years ago

Feature Area

What feature would you like to see?

Add PostgreSQL support to Kubeflow Pipelines.

What is the use case or pain point?

In some cases we cannot use MySQL for Kubeflow Pipelines. We hope Kubeflow Pipelines can add support for PostgreSQL.

Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

zijianjoy commented 2 years ago

Hello @yiyuanyu17, can you help us understand why you are not able to use MySQL? Do you want to use PostgreSQL inside the cluster or outside it? Or are you looking for a way to configure PostgreSQL as an alternative to Cloud SQL?

yiyuanyu17 commented 2 years ago

> Hello @yiyuanyu17, can you help us understand why you are not able to use MySQL? Do you want to use PostgreSQL inside the cluster or outside it? Or are you looking for a way to configure PostgreSQL as an alternative to Cloud SQL?

Hello, we use Kubeflow Pipelines for AI model training on our platform. When we deliver private (on-premises) deployments, some customers explicitly prohibit a self-hosted MySQL and require us to use the PostgreSQL instance they provide. We have therefore moved our own applications to an ORM framework so that they can adapt to different database types. However, the Kubeflow Pipelines server does not yet support PostgreSQL, so we opened this issue in the hope of getting help from the community.

zijianjoy commented 2 years ago

Thank you for the info, @yiyuanyu17. I will keep this issue open so people can upvote it if they are also interested in PostgreSQL support. People can create an overlay that connects to PostgreSQL, but such support is not available in this repo yet.

imiller445 commented 2 years ago

We would also be interested in this feature. We do a lot of on-prem and disconnected/air-gapped deployments, so cloud-vendor-hosted databases are not an option. In most scenarios it is easiest to run our own database clusters colocated in the same k8s environment where we run Kubeflow. The Crunchy Postgres experience on k8s is the best we've found for operating RDBMS clusters on k8s, and we leverage it for other tooling. It would be nice to leverage it from Kubeflow as well, as operating MySQL clusters on k8s is not as seamless an experience.

vasireddyvkl commented 2 years ago

In our case, Postgres is the only option we have for a managed on-prem DB, so we are looking for out-of-the-box Postgres support. @zijianjoy Can you please elaborate on what creating an overlay means? If that helps connect Kubeflow to Postgres, I am interested in giving it a try. Thanks!

RoyerRamirez commented 2 years ago

Hi @zijianjoy,

We also strictly use PostgreSQL internally, since it's better suited for data warehousing purposes.

zijianjoy commented 2 years ago

An overlay is a kustomize concept, as described in https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/. It is a Kubernetes resource package that acts as a variant of the base KFP package.

Here is a list of KFP overlays: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env

If you look at the platform-agnostic folder, you will find that it depends on MySQL: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/third-party/mysql

So if you want to introduce postgresql, what you need to do is:

  1. Create a postgresql folder under https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/third-party and define the postgresql resources in this folder.
  2. Define an overlay in https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env which allows people to use postgresql.

I would recommend testing this PostgreSQL integration in your own environment first before committing it to the KFP repo, because there is currently no testing or guarantee that KFP works with PostgreSQL.
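To make the two steps above more concrete, here is a minimal sketch of what such an overlay's `kustomization.yaml` could look like. The overlay directory name (`platform-agnostic-postgresql`), the `third-party/postgresql/base` folder, and the patch file name are all hypothetical, and the relative path to the base install is an assumption about the repo layout. None of this is an existing or tested part of KFP.

```yaml
# Hypothetical file:
#   manifests/kustomize/env/platform-agnostic-postgresql/kustomization.yaml
# Every path and file name below is an illustrative assumption.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: kubeflow

resources:
  # Base KFP install, referenced the way an existing overlay might
  # reference it (assumed path).
  - ../../base/installs/generic
  # Step 1: the new folder that defines the PostgreSQL resources
  # (hypothetical; it would hold e.g. a Deployment and a Service).
  - ../../third-party/postgresql/base

patches:
  # Step 2: a patch that points KFP at PostgreSQL instead of MySQL.
  # The file name and its contents are placeholders; the actual database
  # settings must be taken from the KFP manifests and source.
  - path: ml-pipeline-db-patch.yaml
```

Once the referenced folders and the patch file exist, the rendered manifests can be inspected with `kubectl kustomize <overlay-dir>` before anything is applied. An overlay like this only rewires the manifests; whether the KFP services can actually talk to PostgreSQL is a separate question.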

javen218 commented 2 years ago

It would be great if Kubeflow Pipelines supported Postgres! For various reasons, our company also cannot use MySQL. We strongly recommend that the community make the database optional. @zijianjoy

javen218 commented 2 years ago

It is time-consuming for every user to implement PostgreSQL support for Pipelines on their own, so we are eagerly waiting for someone to contribute it.

There is already a pull request implementing PostgreSQL support for Kubeflow Katib (https://github.com/kubeflow/katib/pull/1921). Is there any plan for KFP to support PostgreSQL?

Edward-liang commented 2 years ago

MySQL is without doubt an excellent database, but Oracle's acquisition brought uncertainty to its future. Like the others above, I sincerely hope kubeflow/pipelines can support PostgreSQL soon; it is license-friendly and has many advanced features.

chensun commented 2 years ago

Also note that google/mlmd doesn't support Postgres yet: https://github.com/google/ml-metadata/issues/26

shalberd commented 2 years ago

As others have hinted here, PostgreSQL, especially with Operator Lifecycle Manager and, if wanted, as a Red Hat certified operator, is the way to go in an enterprise environment that is Kubernetes-based. I wholeheartedly agree with everyone who posted here.

The database should not come pre-packaged with Kubeflow, as it is not a core component. Let people who really know their stuff handle things like database ops and deployment, e.g. Crunchy with PostgreSQL, and then use Postgres as the database for Kubeflow. Seriously, a replication factor of 1, no pgbouncer proxy to improve load handling, no backup strategy ... https://github.com/kubeflow/katib/blob/9fce9dd03bc476b4e1f3d385e9692ac5cef681f4/manifests/v1beta1/components/postgres/postgres.yaml

That cannot seriously be the approach of a project that has its origins with one of the big tech firms. The same goes for air-gapped functionality support with custom Docker registries, HTTP_PROXY support via env variables, and a custom CA ConfigMap for PKI trust.

zijianjoy commented 2 years ago

Currently we would like help from the community to support PostgreSQL integration. For anyone who wants to contribute to making Kubeflow Pipelines runnable with PostgreSQL:

  1. ~Create a postgresql folder under https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/third-party and define the postgresql resources in this folder.~ (Done)
  2. Implement PostgreSQL integration in KFP API server and cache server.
  3. Define an overlay in https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env which integrates KFP with postgresql
  4. ~Change MLMD's use of database as PostgreSQL: https://github.com/google/ml-metadata/issues/26~ (Done)

rimolive commented 6 months ago

Since we have #9813 to track this work, I'll close this issue. Please follow updates in that tracker issue.

/close

google-oss-prow[bot] commented 6 months ago

@rimolive: Closing this issue.

zijianjoy commented 6 months ago

@rimolive Sorry, since this work is not finished yet, the feature request is still valid. (Note: we use the upvote count of the original issue to track the community's interest across the org, so I am reopening this issue.)

tarilabs commented 6 months ago

May I suggest keeping track of this MLMD issue for the KFP-with-PostgreSQL scope: https://github.com/google/ml-metadata/issues/194#issuecomment-1975207465

The reason is that, when MLMD is backed by PostgreSQL, there is allegedly a practical limit of only ~2K characters on MLMD string properties.

Potential solutions are mentioned (and one is presented) in: https://github.com/google/ml-metadata/pull/195

hope this helps!

rimolive commented 6 months ago

Thanks @tarilabs for letting us know!

@zijianjoy Can you add this issue as a work item for MLMD integration in #9813? I think it's a good first issue and a good fit for GSoC.

zijianjoy commented 6 months ago

> Thanks @tarilabs for letting us know!
>
> @zijianjoy Can you add this issue as a work item for MLMD integration in #9813? I think it's a good first issue and a good fit for GSoC.

Added. Please note, however, that it is an optional task as far as PostgreSQL integration with KFP is concerned, though it is a good item to contribute to.

rimolive commented 6 months ago

Agreed, the idea to add this issue is for tracking purposes.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

rimolive commented 3 months ago

/reopen
/lifecycle frozen

google-oss-prow[bot] commented 3 months ago

@rimolive: Reopened this issue.
