kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] Store Pipeline IR in database, not object storage #10509

Open HumairAK opened 6 months ago

HumairAK commented 6 months ago

Feature Area

What feature would you like to see?

Currently the Object Store in KFP is largely used for artifacts, except for one outlier, which is the Pipeline IR.

I agree with the inline comments that this should be stored in the DB just like everything else that's not an artifact.

What is the use case or pain point?

Storing this in the DB removes the API server's dependency on the object store. It also makes it easier to support different artifact store implementations in the future, without having to worry about the API server.

Is there a workaround currently?

No

Anything else?

There's also archive logging, but this seems to be delegated to the backend engine (currently Argo, but soon Tekton as well). I'm not sure what to do about this one.



HumairAK commented 6 months ago

Related: https://github.com/kubeflow/pipelines/issues/10510

HumairAK commented 5 months ago

follow up from Feb 02, 2024 call

@chensun suggests we might actually be storing the pipeline IR in both the DB and object storage.

It is not clear whether the object store is still being used for the pipeline IR. We should confirm whether the stored copy is indeed unused. If so, we should remove this from the API server and just rely on the DB.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gmfrasca commented 3 months ago

bumping to unstale.

I've looked into this. From a reasonably thorough look, it does appear that the Pipeline IR stored in Object Storage goes unused*. I believe we can remove that copy of the definition, since it creates duplicate sources of truth, and just rely on the definition stored in the DB.

A couple other findings:

  1. I did find one area of code that checks the object store for a PipelineVersion if it can't find it in the DB. Since it's a failsafe, we can likely leave it, at least temporarily, even though data wouldn't be placed in those 'backup' destinations.
  2. It does appear that PipelineURI (which points to the pipeline definition's location in the object store) needs to remain, as it seems to be leveraged by the upload-from-web mechanism.
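
To make the failsafe in item 1 concrete, here is a rough sketch of a DB-first lookup with an object-store fallback. It is not the actual API server code: `SpecSource`, `InMemorySource`, and `load_pipeline_spec` are hypothetical names made up for illustration, and both backends are faked with in-memory dicts.

```python
from typing import Dict, Optional, Protocol, Tuple


class SpecSource(Protocol):
    """Hypothetical interface: anything that can return a pipeline spec by version ID."""

    def get(self, version_id: str) -> Optional[str]: ...


class InMemorySource:
    """Stand-in for either the database or the object store (illustration only)."""

    def __init__(self, specs: Dict[str, str]) -> None:
        self._specs = specs

    def get(self, version_id: str) -> Optional[str]:
        return self._specs.get(version_id)


def load_pipeline_spec(
    version_id: str, db: SpecSource, object_store: SpecSource
) -> Tuple[str, str]:
    """Return (spec, origin), preferring the DB and falling back to the object store."""
    spec = db.get(version_id)
    if spec is not None:
        return spec, "db"
    # Failsafe path: only reached when the DB has no copy of this version.
    spec = object_store.get(version_id)
    if spec is not None:
        return spec, "object_store"
    raise KeyError(f"pipeline version {version_id!r} not found in DB or object store")


db = InMemorySource({"v2": "spec stored in db"})
object_store = InMemorySource({"v1": "legacy spec stored in object store"})
print(load_pipeline_spec("v2", db, object_store))
print(load_pipeline_spec("v1", db, object_store))
```

If uploads stop writing to the object store, the fallback branch would only ever serve versions uploaded before that change, which is why it seems safe to leave in place for a while.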

gmfrasca commented 3 months ago

/assign @gmfrasca

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

HumairAK commented 1 month ago

/remove-lifecycle stale