kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Kubeflow pipelines with pachyderm. #1162

Closed nirmalsinghania2008 closed 4 years ago

nirmalsinghania2008 commented 5 years ago

Hi, thanks for this amazing project. I am wondering if there is a way to integrate Pachyderm features (data versioning, full data provenance) with Kubeflow Pipelines. I went through the Pachyderm example (https://github.com/kubeflow/examples/tree/master/github_issue_summarization/Pachyderm_Example), and it is good for using Kubeflow core with Pachyderm. But what if someone wants to use Pachyderm within Kubeflow Pipelines?

Do we need to create pipeline components (using Pachyderm under the hood)? And if that is the case, how can we proceed? Maybe it's trivial to do and I am missing something here.

Thanks for all your efforts.

Ark-kun commented 5 years ago

What do you think the integrated system should look like? What Pachyderm features would you like to see in Pipelines?

Pachyderm might not be easy to integrate, as it's a pretty similar project. It does its own orchestration, just differently. It's more likely that Pipelines will gain Pachyderm's features, like data provenance. Of course, you can create a Pachyderm component for Pipelines, so that one step of the pipeline runs a Pachyderm pipeline.
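To make that last suggestion concrete, here is a hedged sketch of what such a component could look like: a KFP `component.yaml` that wraps `pachctl` in a container, so one pipeline step commits a file to a Pachyderm repo (which triggers any Pachyderm pipeline subscribed to that repo). The image tag, input names, and repo layout below are assumptions for illustration, not anything from this thread:

```yaml
# Hypothetical component.yaml for a KFP step that commits data to Pachyderm.
# The pachctl image tag and input names are assumptions; match your cluster.
name: Commit to Pachyderm
description: Put a file into a Pachyderm repo so downstream Pachyderm pipelines trigger
inputs:
- {name: local_path, type: String}    # path to a file available inside the container
- {name: target, type: String}        # e.g. "training-data@master:/data.csv"
implementation:
  container:
    image: pachyderm/pachctl:1.9.7    # assumed tag
    command:
    - sh
    - -ec
    # With `sh -c`, the first trailing argument becomes $0, the second $1.
    - pachctl put file "$1" -f "$0"
    - {inputValue: local_path}
    - {inputValue: target}
```

The container would also need credentials to reach `pachd` (for example via a mounted Pachyderm config), which is omitted here for brevity.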

nirmalsinghania2008 commented 5 years ago

Yeah, that was my doubt: Pipelines and Pachyderm are doing a similar job. I like Pachyderm's data versioning, change-based build triggers, and data provenance, and I want to incorporate these features into Pipelines. I think these features are very important for an end-to-end ML pipeline. Is there currently any way (it can be hacky) to achieve this? If not, I would like to contribute to this.

animeshsingh commented 5 years ago

@nirmalsinghania2008 fellow IBMer here. Let's sync up (ping me on Slack) about what's driving your requirement, and we can discuss.

jdoliner commented 5 years ago

Weighing in from Pachyderm on this. We're planning to integrate first-class support for TF-job into our pipelines, which will allow you to write Kubeflow jobs that benefit from Pachyderm's provenance tracking, versioning, and dedupe. We're planning to expose the data to the TF-job as either an S3 endpoint that it can query or a Kubernetes CSI volume that gets mounted in. We'd love to chat more with you, either in our Slack channel or through GitHub issues, if you'd like to help guide the development process.

animeshsingh commented 5 years ago

Hi @jdoliner - wouldn't it make sense to support Pachyderm as an underlying construct in Kubeflow Pipelines itself? Currently KFP works with Argo; it would be great if we could also support Pachyderm under the same constructs, so that for data-oriented use cases we can run Pachyderm.

jdoliner commented 5 years ago

@animeshsingh I certainly think that makes sense, and we'd love to see Pachyderm support KFP. For the time being, though, it makes more sense for us to focus our resources on bringing Kubeflow support to Pachyderm and making it really good, since that's what will benefit our users. If you're up for it, we'd love to chat more about your use case and see if Pachyderm might be a good fit for what you're doing.

animeshsingh commented 5 years ago

Thanks @jdoliner. I believe that for folks entering from an end-to-end ML platform perspective, KFP would be the integrated entry point. But I hear you. I have a meeting coming up with Nick from your team to discuss some of this...

dcyoung commented 4 years ago

Have there been any developments here regarding KFP as the entry point, leveraging KFP for orchestration and Pachyderm more intentionally for data management? If so, could you point me to relevant examples or documentation?

jdoliner commented 4 years ago

@dcyoung Pachyderm recently released first-class support for Kubeflow Pipelines. You can read more about it here: https://www.pachyderm.com/blog/pachyderm-1-10-s3-gateway-expansion-brings-support-for-kubeflow/ This allows you to run KFPs from within Pachyderm pipelines, which automatically tracks the data's lineage. It also should give you a good general idea of how to get Kubeflow reading data out of Pachyderm. Pachyderm now exposes its data through an S3 interface, so your code can just use PFS if it wants, and you can orchestrate the pipelines separately, reading and writing through the S3 gateway.
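As a minimal sketch of the S3-gateway idea: the gateway exposes each Pachyderm branch as an S3-compatible bucket named `<branch>.<repo>`, so any HTTP or S3 client can read versioned files from it. The host/port, repo, and file names below are assumptions for illustration (30600 was the typical NodePort for the gateway in Pachyderm 1.x deployments):

```python
import urllib.request


def object_url(gateway: str, repo: str, branch: str, path: str) -> str:
    """Build the URL for a file behind Pachyderm's S3 gateway.

    The gateway exposes each branch as a bucket named "<branch>.<repo>".
    """
    return f"{gateway}/{branch}.{repo}/{path.lstrip('/')}"


def read_file(gateway: str, repo: str, branch: str, path: str) -> bytes:
    """Fetch a file from the gateway with a plain (unauthenticated) HTTP GET."""
    with urllib.request.urlopen(object_url(gateway, repo, branch, path)) as resp:
        return resp.read()


# Hypothetical usage against a local deployment:
# data = read_file("http://localhost:30600", "training-data", "master", "data.csv")
```

A full S3 client (boto3, MinIO client, etc.) works the same way, pointed at the gateway endpoint with the `<branch>.<repo>` bucket name.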

Feel free to drop by our Slack channel if you'd like some more hands on help setting this stuff up.

dcyoung commented 4 years ago

@jdoliner thanks for the quick reply and links.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.