kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] Post Publish Hooks to capture Lineage in external systems. #6103

Closed Nagarajj closed 2 months ago

Nagarajj commented 3 years ago

Feature Area

What feature would you like to see?

The IR spec specifies lifecycle hooks (pre-cache) which could be used to override the caching logic in the driver. A similar hook (post_publish), fired after the publisher, would be useful to help capture the entire lineage metadata: output values/artifacts, input values/artifacts, pipeline/task status, etc.

It would be ideal if the hook could be:

  1. Specified as a container which runs after publisher actions (Python Container rather than Go Binary).
  2. Executed synchronously, for reliability reasons (synchronous over asynchronous).
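
To make the request concrete, here is a minimal sketch of what such a post_publish hook container entrypoint could look like in Python. The CLI flags (--execution-id, --status, --metadata-json, --lineage-endpoint) and the payload shape are purely illustrative assumptions, not part of the IR spec:

```python
# Hypothetical post_publish hook entrypoint (sketch only).
# The flags and payload shape below are assumptions for illustration,
# not part of the KFP IR spec.
import argparse
import json
import urllib.request


def push_lineage(endpoint: str, payload: dict) -> None:
    """POST a lineage record to an external metadata management system."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def main() -> None:
    parser = argparse.ArgumentParser(description="post_publish lineage hook (sketch)")
    parser.add_argument("--execution-id", required=True)
    parser.add_argument("--status", required=True)
    parser.add_argument("--metadata-json", required=True,
                        help="JSON with input/output values and artifact URIs")
    parser.add_argument("--lineage-endpoint", required=True)
    args = parser.parse_args()

    metadata = json.loads(args.metadata_json)
    push_lineage(args.lineage_endpoint, {
        "execution_id": args.execution_id,
        "status": args.status,
        "inputs": metadata.get("inputs", {}),
        "outputs": metadata.get("outputs", {}),
    })


if __name__ == "__main__":
    main()
```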

What is the use case or pain point?

Having these hooks would help integrate with existing external metadata management systems, which are already in use within the company for governance, compliance, and audit reasons in addition to metadata management.

Is there a workaround currently?

It is not possible to achieve this with the current IR spec.


Love this idea? Give it a šŸ‘. We prioritize fulfilling features with the most šŸ‘.

Bobgy commented 3 years ago

Hi @Nagarajj, thank you for drafting this detailed proposal!

Specified as a container which runs after publisher actions (Python Container rather than Go Binary).

Are you OK with the performance implication? A container would have to run as a separate Pod, and the overhead of scheduling and starting a Pod is at least 3~5 seconds from what I observe on GKE.

Maybe we can improve this using https://argoproj.github.io/argo-workflows/container-set-template/, but it's an Alpha feature and it's only supported by the Argo emissary executor.

I was talking about a Go binary as an option because it would have minimal performance overhead. A binary can be mounted via a volume into the same main container and executed after your main command (and typically it should finish really fast).
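
For reference, the "run the hook after the main command in the same container" idea could be approximated with a small wrapper along these lines; the /hooks/post_publish mount path and the --status flag are assumptions, not an existing KFP mechanism:

```python
# Sketch of chaining a hook after the user command inside one container.
# /hooks/post_publish is a hypothetical path where the hook binary is mounted.
import subprocess
import sys


def main() -> int:
    # Run the user's main command (passed as arguments to this wrapper).
    user = subprocess.run(sys.argv[1:])

    # Then run the hook synchronously, passing the exit status along.
    hook = subprocess.run(["/hooks/post_publish", "--status", str(user.returncode)])

    # Fail the step if either the user command or the hook failed,
    # matching the synchronous/reliability requirement in this issue.
    return user.returncode or hook.returncode


if __name__ == "__main__":
    sys.exit(main())
```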

Bobgy commented 3 years ago

Executed synchronously, for reliability reasons (synchronous over asynchronous).

+1, I've heard similar requirements before

Bobgy commented 3 years ago

output values/artifacts, input values/artifacts

Do you need access to the artifact files? This would also not be very efficient if the post-publish hook runs in a separate Pod.

zzxvictor commented 3 years ago

At the highest level, we have two objectives:

  1. Run a container before and after each user component as part of our custom caching layer. This could be further optimized by reusing the built-in caching logic, since starting new containers leads to performance overhead. Is it possible to override the built-in caching logic with customized code?

  2. For each component, we want to capture where its arguments come from in order to build data lineage. This is particularly critical for reproducibility and debugging. Unfortunately, right now there's no way to achieve this.

Nagarajj commented 3 years ago

Hi @Nagarajj, thank you for drafting this detailed proposal!

Specified as a container which runs after publisher actions (Python Container rather than Go Binary).

Are you OK with the performance implication? A container would have to run as a separate Pod, and the overhead of scheduling and starting a Pod is at least 3~5 seconds from what I observe on GKE.

Maybe we can improve this using https://argoproj.github.io/argo-workflows/container-set-template/, but it's an Alpha feature and it's only supported by the Argo emissary executor.

I was talking about a Go binary as an option because it would have minimal performance overhead. A binary can be mounted via a volume into the same main container and executed after your main command (and typically it should finish really fast).

Thanks @Bobgy for looking into this.

Perf should not be a concern in our case, as these are batch pipelines. The only concern with a Go binary is that Python has a lower barrier to entry and maintenance. Yes, the ContainerSet template would work well for these scenarios.

Nagarajj commented 3 years ago

output values/artifacts, input values/artifacts

Do you need access to the artifact files? This would also not be very efficient if the post-publish hook runs in a separate Pod.

@Bobgy Would the ContainerSet template help here as well?

Bobgy commented 3 years ago

Yes, I believe the ContainerSet template is a good fit. It has just been released as Alpha in Argo, so I encourage anyone interested to try it out and help mature it.

Run a container before and after each user component as part of our custom caching layer. This could be further optimized by reusing the built-in caching logic, since starting new containers leads to performance overhead. Is it possible to override the built-in caching logic with customized code?

Will ContainerSet be enough for your case, @zzxvictor? As far as I can tell, multiple containers in one Pod have less overhead than multiple Pods, because they are scheduled only once and they share local volumes.

Another option I mentioned before is using Go binaries: if you mount a Go binary that the KFP cacher can call into to get the caching decision, that will be faster than a container set, I believe. However, as mentioned, it has a higher barrier to entry, because the language is then limited to those that can be compiled to binaries.

As a last resort, you can always fork. We plan to build the built-in caching logic as an HTTP template in Argo (a long-running service receiving cache requests from Argo Workflows), so if you fork our handler code and replace it with your own caching code, you can achieve what you want. However, I'm not sure there are enough people who want to customize so deeply.
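
For anyone exploring the fork route, a cache-decision handler could look roughly like the stdlib sketch below. The request/response fields (fingerprint, cached, uri) are made up for illustration and are not the actual KFP or Argo HTTP template contract:

```python
# Rough sketch of a long-running cache-decision service (not the real KFP handler).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory cache keyed by a task fingerprint.
CACHE = {}


class CacheDecisionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        fingerprint = request.get("fingerprint", "")

        # Custom caching policy would go here instead of a plain dict lookup.
        cached_uri = CACHE.get(fingerprint)
        body = json.dumps({"cached": cached_uri is not None, "uri": cached_uri}).encode("utf-8")

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CacheDecisionHandler).serve_forever()
```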

Bobgy commented 3 years ago

For each component, we want to capture where its arguments come from in order to build data lineage. This is particularly critical for reproducibility and debugging. Unfortunately, right now there's no way to achieve this.

In v2 compatible mode and in the KFP v2 we are building, this information is already captured in the ml-metadata (MLMD) store deployed with KFP. What's missing there?
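
As a rough illustration of what is already queryable, the sketch below reads executions, events, and artifacts from the MLMD gRPC service that ships with KFP. The service host/port are assumptions for a typical in-cluster deployment and may differ in your setup:

```python
# Sketch: reading the lineage KFP v2 records in ml-metadata (MLMD).
# Requires the ml-metadata Python package; host/port are assumptions.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.MetadataStoreClientConfig(
    host="metadata-grpc-service.kubeflow",  # assumed in-cluster service name
    port=8080,
)
store = metadata_store.MetadataStore(config)

# Executions correspond to task runs; events link executions to their
# input/output artifacts, which is the lineage information discussed here.
executions = store.get_executions()
events = store.get_events_by_execution_ids([e.id for e in executions])

for ev in events:
    kind = metadata_store_pb2.Event.Type.Name(ev.type)
    print(f"execution={ev.execution_id} {kind} artifact={ev.artifact_id}")
```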

Bobgy commented 3 years ago

Just to clarify our process for moving this forward:

We are currently collecting feedback and different use cases for this. Here are some questions that I am still unclear about:

  1. Who will configure post-publish hooks: pipeline authors, platform admins, or are there use cases for both?
  2. What will be the most appropriate interface for the post-publish hook: a Go binary or a container?
  3. Following 2, what will be the minimum performance requirement for the hook?

Nagarajj commented 3 years ago

If we can have post_publish be a lifecycle hook, similar to the pre-cache check, that would be the ideal experience.

  1. These lifecycle hooks would let platform admins extend the capabilities of the Kubeflow Pipelines runtime in interesting ways, without having to hack things like the launcher (integration with existing metadata systems, etc.).
  2. If platform admins are building this, it should be OK for it to be a Go binary.
  3. If these are built as optional/opt-in hooks (and in Go), platform teams should be able to decide whether the added performance overhead provides enough value.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.