Hi @Nagarajj, thank you for drafting this detailed proposal!
> Specified as a container which runs after publisher actions (Python Container rather than Go Binary).
Are you OK with the performance implication? A container would have to run as a separate Pod, and the overhead of scheduling and starting a Pod is at least 3–5 seconds from what I have observed on GKE.
Maybe we can improve on this using https://argoproj.github.io/argo-workflows/container-set-template/, but it's an Alpha feature and it's only supported by the Argo emissary executor.
I mentioned a Go binary as an option because it would have minimal perf overhead: the binary can be mounted via a volume into the main container and executed after your main command (and it should typically finish very fast).
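For illustration, here is a rough sketch of that pattern at the Pod level; the image names and paths are made up:

```yaml
# Sketch only: inject a hook binary into the main container via a shared
# volume. Image names and paths are hypothetical.
initContainers:
  - name: install-hook
    image: example.com/post-publish-hook:latest
    command: [cp, /post-publish-hook, /kfp-hooks/post-publish-hook]
    volumeMounts:
      - name: kfp-hooks
        mountPath: /kfp-hooks
containers:
  - name: main
    image: python:3.9
    command: [sh, -c]
    # Run the user command first, then the hook binary from the shared volume.
    args: ["python user_component.py && /kfp-hooks/post-publish-hook"]
    volumeMounts:
      - name: kfp-hooks
        mountPath: /kfp-hooks
volumes:
  - name: kfp-hooks
    emptyDir: {}
```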
> Synchronous execution for reliability reasons (Synchronous over asynchronous).
+1, I've heard similar requirements before
> output values/artifacts, input values/artifacts
Do you need access to the artifact files? This will also not be very efficient if the post_publish hook runs on a separate Pod.
At the highest level, we have two objectives:
1. Run a container before and after each user component, as part of our custom caching layer. Since starting new containers adds performance overhead, this could be further optimized by the built-in caching logic. Is it possible to override the built-in caching logic with customized code?
2. For each component, we want to capture where its arguments come from, to build data lineage. This is particularly critical for reproducibility and debugging. Unfortunately, right now there's no way to achieve this.
> Are you OK with the performance implication? A container would have to run as a separate Pod, and the overhead of scheduling and starting a Pod is at least 3–5 seconds from what I have observed on GKE.
> Maybe we can improve on this using https://argoproj.github.io/argo-workflows/container-set-template/, but it's an Alpha feature and it's only supported by the Argo emissary executor.
> I mentioned a Go binary as an option because it would have minimal perf overhead: the binary can be mounted via a volume into the main container and executed after your main command (and it should typically finish very fast).
Thanks @Bobgy for looking into this.
Perf should not be a concern in our case, as these are batch pipelines. The only concern with a Go binary is that Python has a lower barrier to entry and maintenance. Yes, the ContainerSet template would work well for these scenarios.
> output values/artifacts, input values/artifacts
> Do you need access to the artifact files? This will also not be very efficient if the post_publish hook runs on a separate Pod.
@Bobgy Would the ContainerSet template help here as well?
Yes, I believe the ContainerSet template is a good fit. It was just released as Alpha in Argo, so I encourage anyone interested to try it out and help mature it.
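For anyone who wants to experiment, a minimal sketch of what such a ContainerSet template could look like (images and commands are placeholders; see the Argo docs linked above for the real schema):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: post-publish-hook-
spec:
  entrypoint: main
  templates:
    - name: main
      volumes:
        - name: workspace
          emptyDir: {}
      containerSet:
        # All containers run in the same Pod and share this volume.
        volumeMounts:
          - name: workspace
            mountPath: /workspace
        containers:
          - name: main
            image: python:3.9   # placeholder user component
            command: [python, -c, "open('/workspace/out.txt', 'w').write('done')"]
          - name: post-publish
            image: python:3.9   # placeholder hook image
            command: [python, -c, "print(open('/workspace/out.txt').read())"]
            dependencies: [main]  # runs after the main container, no extra Pod
```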
> Run a container before and after each user component, as part of our custom caching layer. Since starting new containers adds performance overhead, this could be further optimized by the built-in caching logic. Is it possible to override the built-in caching logic with customized code?
Will ContainerSet be enough for your case, @zzxvictor? As far as I can tell, multiple containers in one Pod have less overhead than multiple Pods, because they are scheduled only once and they share local volumes.
Another option I mentioned before is using Go binaries: if you mount a Go binary that the KFP cacher can call into to get the caching decision, that should be faster than a ContainerSet, I believe. However, as mentioned, it has a higher barrier to entry, because the language is then limited to those that can be compiled to binaries.
As a last resort, you can always fork. We plan to build the built-in caching logic as an HTTP template in Argo (a long-running service receiving cache requests from Argo Workflows), so if you fork our handler code and replace it with your own caching code, you can achieve the goal you want. However, I'm not sure there are enough people who want to customize so deeply.
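To sketch that HTTP template idea (the cache service URL, request body, and parameter here are hypothetical; Argo's HTTP template itself is the real feature):

```yaml
# Sketch only: an Argo HTTP template that delegates the caching decision
# to an external cache service. URL and body are hypothetical.
- name: cache-check
  inputs:
    parameters:
      - name: fingerprint
  http:
    url: "http://kfp-cache-server.kubeflow.svc.cluster.local/v1/check"
    method: POST
    body: '{"fingerprint": "{{inputs.parameters.fingerprint}}"}'
    successCondition: response.statusCode == 200
```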
> For each component, we want to capture where its arguments come from, to build data lineage. This is particularly critical for reproducibility and debugging. Unfortunately, right now there's no way to achieve this.
In v2 compatible mode and the KFP v2 we are building, this info is already captured in the ml-metadata store deployed with KFP. What's missing there?
Just to clarify our process to move this forward. Either
We are currently collecting feedback and different use cases for this. Here are some questions that I am still unclear about:
If we could have post_publish as a lifecycle hook, similar to the PreCache check, that would be an ideal experience.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Feature Area
What feature would you like to see?
The IR spec specifies lifecycle hooks (pre-cache) that can be used to override the caching logic in the driver. A similar hook (post_publish), fired after the publisher, would be useful to help capture the complete lineage metadata: output values/artifacts, input values/artifacts, pipeline/task status, etc.
It would be ideal if the hook could be:
- Specified as a container which runs after publisher actions (Python Container rather than Go Binary).
- Synchronous execution for reliability reasons (Synchronous over asynchronous).
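A purely hypothetical sketch of how this could look in the IR, with illustrative field names only (not the actual PipelineSpec schema):

```yaml
# Hypothetical IR fragment; field names are illustrative only.
tasks:
  train-model:
    lifecycle:
      preCacheCheck:   # exists today: can override the caching decision
        container:
          image: example.com/custom-cache-check:latest
      postPublish:     # proposed: fires after the publisher
        container:
          image: example.com/lineage-capture:latest
          command: [python, capture_lineage.py]
```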
What is the use case or pain point?
Having these hooks would help integrate with existing external metadata management systems, which are already in use within the company for governance, compliance, and audit, in addition to general metadata management.
Is there a workaround currently?
It is not possible to achieve this with the current IR spec.
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.