tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0
2.11k stars 706 forks

Parameterized pipelines using Kubeflow? #362

Closed htahir1 closed 2 years ago

htahir1 commented 5 years ago

Hi again! As you can see, I'm a new and very excited user of this technology, and I have many questions, since I want to implement TFX in our company's workflows and convince everyone else of it as well! (P.S. If there's a way for me to ask questions somewhere other than this board, I'd be happy to do that.)

This time I wanted to ask about Kubeflow pipelines. In Kubeflow, it is relatively easy to specify parameters that can be changed in the UI (see image below). This lets you run the same pipeline with different configurations (i.e. different feature transforms, different hyperparameters, even different model types).

However, I am not sure how TFX solves this problem. It is surely an important point: if pipelines are not parameterized, how can we reuse them and run them again and again? Do we always have to edit our codebase (i.e. the tf.Transform preprocessing_fn, training_fn, etc.) and upload a new pipeline each time? If that is the case, is the ML metadata shared across pipelines, or is there an ML metadata store per pipeline? Specifically, suppose caching is enabled and we have two pipelines (A and B) with the same ExampleGen, SchemaGen, and StatisticsGen code but different Transform and Trainer code. If we run pipeline A and then run pipeline B on the same inputs, would pipeline B be 'aware' enough to skip the first three component steps?

I hope the example illustrates what I'm asking; if not, I can clarify. Thanks in advance!

[image: pipelines-start-xgboost-run]

neuromage commented 5 years ago

Thanks for the detailed request. We are planning on adding pipeline-level parameters that should ideally translate to configurable runtime parameters in both Kubeflow and Airflow. We will update this issue once we have a more concrete plan for how to do this.

/cc @ruoyu90

htahir1 commented 5 years ago

Okay, thanks for the update. Is there a workaround in the meantime? How do we re-run pipelines with slightly different configurations?

1025KB commented 5 years ago

As long as the inputs are the same, a component will be cached.

In our Chicago Taxi example, metadata is per pipeline because the metadata path contains the pipeline name, but that is not required.

If your two setups (A and B) use the same metadata.db, and the first three components are unchanged, then they will be cached.
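To make the shared-metadata scenario concrete, here is a toy sketch (all names are hypothetical, not TFX code) of why pipeline B's first components hit the cache when both pipelines record executions in the same store:

```python
class MetadataStore:
    """Toy stand-in for a shared ML Metadata database: it records
    completed executions keyed by component and inputs."""

    def __init__(self):
        self._executions = {}

    def lookup(self, component_id, inputs):
        return self._executions.get((component_id, tuple(sorted(inputs))))

    def record(self, component_id, inputs, outputs):
        self._executions[(component_id, tuple(sorted(inputs)))] = outputs


def run_component(store, component_id, inputs):
    """Return (outputs, cache_hit): skip execution if an identical
    run is already recorded in this metadata store."""
    cached = store.lookup(component_id, inputs)
    if cached is not None:
        return cached, True  # cache hit: component is skipped
    outputs = f"{component_id}-output"  # pretend to do the real work
    store.record(component_id, inputs, outputs)
    return outputs, False


shared = MetadataStore()

# Pipeline A runs ExampleGen for the first time: executes for real.
_, hit = run_component(shared, "ExampleGen", ["gs://bucket/data"])
assert not hit

# Pipeline B, same metadata store, same component and inputs: cached.
_, hit = run_component(shared, "ExampleGen", ["gs://bucket/data"])
assert hit

# A separate per-pipeline metadata store sees no prior execution.
_, hit = run_component(MetadataStore(), "ExampleGen", ["gs://bucket/data"])
assert not hit
```

The key design point is that the cache lives in the metadata store, not in the pipeline: two pipeline definitions that point at the same store can reuse each other's results.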

htahir1 commented 5 years ago

I see. Can you also please explain how the inputs are compared? For example, if the input is a Google Cloud Storage bucket, will it go in and check what is in that bucket, or will it just see that the URL is the same and skip the whole thing?

1025KB commented 5 years ago

For Kubeflow, caching isn't enabled yet. For the Airflow or Beam orchestrators (in the 1.14 release), we check all input artifacts and exec_properties to decide whether to skip a component; details can be found here.

For inputs, we check the URL instead of the content, because a component's input artifact is immutable. For the external input of ExampleGen, we add a fingerprint (1.14 release) to the input artifact to make sure an updated input is treated as a new input artifact.
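The decision described above can be sketched as a cache key built from artifact URIs and exec_properties (a simplified toy model, not the actual TFX implementation; all names here are hypothetical):

```python
import hashlib
import json


def cache_key(component_id, input_artifacts, exec_properties):
    """Build a cache key from input artifact URIs (not their contents)
    plus the component's execution properties."""
    payload = {
        "component": component_id,
        # Only the URI and an optional fingerprint identify an artifact;
        # the bytes behind the URI are never read, since artifacts are
        # treated as immutable once registered.
        "inputs": sorted(
            (a["uri"], a.get("fingerprint", "")) for a in input_artifacts
        ),
        "exec_properties": sorted(exec_properties.items()),
    }
    return hashlib.sha256(json.dumps(payload).encode()).hexdigest()


# Same URI, same properties -> same key -> the component can be skipped.
a = cache_key("ExampleGen", [{"uri": "gs://bucket/data"}], {"split": "0.8"})
b = cache_key("ExampleGen", [{"uri": "gs://bucket/data"}], {"split": "0.8"})
assert a == b

# An updated external input gets a new fingerprint, so the key changes
# even though the URI is identical.
c = cache_key(
    "ExampleGen",
    [{"uri": "gs://bucket/data", "fingerprint": "v2"}],
    {"split": "0.8"},
)
assert a != c
```

This also answers the question above: without the fingerprint, overwriting the data behind an unchanged URL would silently reuse the stale cached result.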

jinnovation commented 5 years ago

@neuromage: Thanks for the updates. I'd be eager to hear about any news that comes out of this. My team at Twitter is looking into adopting TFX for our own Airflow-based workflow product and I think a story around runtime parametrization with Airflow support will be crucial moving forward.

I see, for instance, that #373 introduces RuntimeParameter. Looking forward to seeing what new developments arise.

hongye-sun commented 5 years ago

@ruoyu90 @numerology

numerology commented 5 years ago

Hi @jinnovation, Kubeflow Pipelines has native support for runtime parameters, but for the Airflow DAG runner things are a bit different. Can you provide some examples of your use cases demonstrating what type of parameterization you need? Thanks!

jinnovation commented 5 years ago

Kubeflow Pipelines has native support for runtime parameters

I understand; that's one thing about the platform that I'm particularly excited about. 😄

...for the Airflow DAG runner things are a bit different. Can you provide some examples of your use cases demonstrating what type of parameterization you need?

Sure. Essentially, we would like runtime parametrization with some form of management UI to cross-associate parameter bundles with corresponding pipeline executions. In other words, we'd like functionality similar to Kubeflow Pipelines, but for Airflow.

To provide some context, my team has implemented a runtime parametrization extension (internal) to Airflow that provides a very Kubeflow-esque GUI to users, like the following: [image]

You can find more details in a blog post from last year detailing our platform.

Currently, my team is eager to integrate TFX into our product—ML Workflows—as a first-class citizen. This will, to some degree, involve finding TFX equivalents to a lot of the extensions we've built on top of Airflow. Runtime parametrization with some form of management UI to cross-associate parameter bundles with corresponding pipeline executions will be one of these.

As mentioned, my team's solution is internal and, in several ways, not appropriate for contribution to core Airflow in its current state. This is why I'm curious and very eager to hear what the TFX team's thoughts are regarding Airflow-compatible runtime parametrization in TFX. 👍

jiyongjung0 commented 2 years ago

We added runtime parameter support a while ago. For example, you can use runtime parameters via the TFX CLI as of 1.3.0: https://github.com/tensorflow/tfx/blob/master/RELEASE.md#version-130
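The mechanism can be sketched without TFX at all (all names below are hypothetical, not the TFX RuntimeParameter API): a placeholder object stands in for a value in the pipeline definition, and the orchestrator substitutes the concrete value at run time:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeParam:
    """Placeholder resolved by the orchestrator when a run starts."""
    name: str
    default: str


def resolve(config, params):
    """Replace every RuntimeParam in a pipeline config with the value
    supplied for this run, falling back to the parameter's default."""
    return {
        key: params.get(value.name, value.default)
        if isinstance(value, RuntimeParam)
        else value
        for key, value in config.items()
    }


# One pipeline definition, parameterized...
pipeline_config = {
    "data_root": RuntimeParam("data-root", default="gs://bucket/default"),
    "train_steps": RuntimeParam("train-steps", default="1000"),
    "module_file": "taxi_utils.py",
}

# ...run twice with different values, no code changes required.
run_a = resolve(pipeline_config, {"train-steps": "5000"})
run_b = resolve(pipeline_config, {"data-root": "gs://bucket/experiment"})
assert run_a["train_steps"] == "5000"
assert run_b["data_root"] == "gs://bucket/experiment"
assert run_b["train_steps"] == "1000"
```

This is the same idea a parameter UI exposes: the form fields map onto the placeholders, and each submission produces one resolved run.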

Let me close this for now. Please reopen if needed.