tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

How to propagate mlpipeline-metrics from custom Python function TFX component? #3094

Open axeltidemann opened 3 years ago

axeltidemann commented 3 years ago

I want to export mlpipeline-metrics from my custom Python function TFX component so that it is displayed in the KubeFlow UI, as described here: https://www.kubeflow.org/docs/pipelines/sdk/pipelines-metrics/

This is a minimal example of what I am trying to do:

import json

from tfx.dsl.component.experimental.annotations import OutputArtifact
from tfx.dsl.component.experimental.decorators import component
from tfx.types.standard_artifacts import Artifact

class Metric(Artifact):
    TYPE_NAME = 'Metric'

@component
def ShowMetric(MLPipeline_Metrics: OutputArtifact[Metric]):

    rmse_eval = 333.33

    metrics = {
        'metrics':[
            {
                'name': 'RMSE-validation',
                'numberValue': rmse_eval,
                'format': 'RAW'
            }
        ]
    }

    path = '/tmp/mlpipeline-metrics.json'

    with open(path, 'w') as _file:
        json.dump(metrics, _file)

    MLPipeline_Metrics.uri = path

In the KubeFlow UI, the "Run output" tab says "No metrics found for this run." However, the output artifact shows up in ML Metadata (see screenshot). Any help on how to accomplish this would be greatly appreciated. Thanks!

[Screenshot 2021-01-20: the output artifact as it appears in ML Metadata]
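
For reference, my reading of the linked Kubeflow docs is that a plain (non-TFX) KFP lightweight component would export metrics roughly like the sketch below (names are just illustrative, not tested here):

from kfp.components import OutputPath, create_component_from_func

def produce_metrics(mlpipeline_metrics_path: OutputPath('Metrics')):
    # Imports go inside the function because KFP serializes the function body.
    import json

    rmse_eval = 333.33
    metrics = {
        'metrics': [
            {
                'name': 'RMSE-validation',
                'numberValue': rmse_eval,
                'format': 'RAW'
            }
        ]
    }
    # The output parameter name maps to an output artifact called
    # mlpipeline-metrics, which the KFP UI looks for.
    with open(mlpipeline_metrics_path, 'w') as f:
        json.dump(metrics, f)

produce_metrics_op = create_component_from_func(
    produce_metrics, base_image='python:3.7')

This is exactly the behaviour I am trying to reproduce from a TFX custom Python function component.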
arghyaganguly commented 3 years ago

@axeltidemann, this issue seems more relevant to Kubeflow. Please raise it in the Kubeflow issue tracker. Please confirm if this seems okay to you. Thanks.

axeltidemann commented 3 years ago

@arghyaganguly But I see mlpipeline-ui-metadata showing up automatically in the KubeFlow UI, and it also comes from TFX (see https://github.com/tensorflow/tfx/blob/e0cb043ff5d3a9fc33f20b1ce6348518e68352ff/tfx/orchestration/kubeflow/base_component.py). Given that TFX is built on top of KubeFlow and I am using a TFX custom component, it must be a TFX-relevant issue, no? How would the KubeFlow team know the answers to TFX-specific questions? (There could be some overlap, of course; I am happy to stand corrected.)

axeltidemann commented 3 years ago

Sorry to bother you @jiyongjung0, but I'd really appreciate your input when you have the time. Thanks.

jiyongjung0 commented 3 years ago

I'm sorry for the late response. I'm not very familiar with the Kubeflow side and was trying to find a better person to respond. @neuromage could you give some help on this issue?

axeltidemann commented 3 years ago

It seems mlpipeline-metrics does not get propagated at all; if it were, it would have been added to the output_artifact_paths dictionary: https://github.com/tensorflow/tfx/blob/e0cb043ff5d3a9fc33f20b1ce6348518e68352ff/tfx/orchestration/kubeflow/base_component.py#L131

In addition, it should have been dealt with in the container entry point, like mlpipeline-ui-metadata: https://github.com/tensorflow/tfx/blob/511763835e8f982ecb05f31be3903040179f3968/tfx/orchestration/kubeflow/container_entrypoint.py#L291

Is there a specific reason for this omission? Or maybe something for a pull request?

axeltidemann commented 3 years ago

I tried to make changes to the source code of TFX itself (following the instructions here), where I basically implemented the changes above, i.e.

output_artifact_paths={
            'mlpipeline-ui-metadata': '/mlpipeline-ui-metadata.json',
            'mlpipeline-metrics': '/mlpipeline-metrics.json'
        }

in tfx/tfx/orchestration/kubeflow/base_component.py and also hardcoded metrics and file output in tfx/tfx/orchestration/kubeflow/container_entrypoint.py like so:

metrics = {
    'metrics': [
        {
            'name': 'RMSE-validation',
            'numberValue': 777.77,
            'format': 'RAW'
        }
    ]
}

with open('/mlpipeline-metrics.json', 'w') as _file:
    json.dump(metrics, _file)

This was still not picked up by the KubeFlow UI. I assume there are some deeper changes needed, then. Maybe @neuromage can shed some light on this?

neuromage commented 3 years ago

Hi @axeltidemann, those changes look correct to me.

/cc @numerology and @chensun, any ideas why the above may not be working?

numerology commented 3 years ago

Changing output_artifact_paths in base_component.py should suffice. If that is not picked up by the UI then it seems like a bug to me.
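
In plain KFP terms, that mapping amounts to roughly the following sketch (the image and command here are placeholders, not what base_component.py actually emits):

import kfp.dsl as dsl

# Rough plain-KFP equivalent of the output artifact declaration; the named
# output paths are what the KFP UI inspects for metrics and UI metadata.
op = dsl.ContainerOp(
    name='show-metric',
    image='eu.gcr.io/my-project/my-tfx-pipeline',   # placeholder image
    command=['python', '-m', 'my_entrypoint'],      # placeholder command
    output_artifact_paths={
        'mlpipeline-ui-metadata': '/mlpipeline-ui-metadata.json',
        'mlpipeline-metrics': '/mlpipeline-metrics.json',
    },
)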

May I ask which KFP version you are using (both SDK and deployment)?

axeltidemann commented 3 years ago

Good question. I don't specify which KFP version to use in deployment; I use the tfx CLI. My assumption was that it creates a Docker image from my local installation and uploads it to eu.gcr.io, and would therefore use my local KFP version, but I can't figure out how to determine which KFP version is actually used on the cluster. Is there a way to find that out?

These are my local versions, in any case:

>python
Python 3.7.9 (default, Nov 20 2020, 18:45:38) 
[Clang 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tfx
>>> tfx.__version__
'0.28.0.dev'
>>> import kfp
>>> kfp.__version__
'1.3.0'
chensun commented 3 years ago

but I can't figure out how to determine which KFP version is actually used on the cluster. Is there a way to find that out?

If you have access to the KFP UI, it's shown in the bottom-left corner, for instance: [screenshot]

or if you have kubectl connected to your cluster, you can describe any KFP pod, for example: kubectl describe pod ml-pipeline-76fddff986-h7hsh -n kubeflow

and the container image tag (1.2.0 in the output below) is the KFP backend version.

Containers:
  ml-pipeline-api-server:
    Container ID:   docker://a84dc475d6b6fb6e9dc58204e58e6c606498239f38fa12145e93953458bdd045
    Image:          gcr.io/ml-pipeline/api-server:1.2.0
axeltidemann commented 3 years ago

Thanks, @chensun. The version displayed is indeed 1.0.4, and the container image label is in the YAML file in the KubeFlow UI: ml-pipeline-api-server: gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/apiserver:1.0.4.

However, could it be that the local changes I make to TFX are not packaged and uploaded to the KubeFlow cluster at all?

axeltidemann commented 3 years ago

@numerology I suppose I should create a separate Docker image with my changes to TFX, push it to Docker Hub, and make the tfx CLI use that image. I see when running

tfx pipeline update --pipeline-path=kubeflow_runner.py --endpoint=$ENDPOINT

that the tensorflow/tfx:0.25.0 image is used:

[truncated]
[Skaffold] #3 [internal] load metadata for docker.io/tensorflow/tfx:0.25.0
[Skaffold] #3 sha256:0de1d35ca0abce93f6f1d57543269f062bb56777e77abd8be41593a801cd2d61
[Skaffold] #3 DONE 2.8s
[Skaffold]
[Skaffold] #7 [1/3] FROM docker.io/tensorflow/tfx:0.25.0@sha256:0700c27c6492b8b2998e7d543ca13088db8d40ef26bd5c6eec58245ff8cdec35
[Skaffold] #7 sha256:8e5e2c00eb5ed31ca14860fd9aa40e783fe78ad12be31dc9da89ddad19876dc9
[Skaffold] #7 DONE 0.0s
[truncated]

However, I cannot figure out where to set which Docker image to use. I have even tried searching the repository for "load metadata for", but no results came up. Any ideas?

numerology commented 3 years ago

@axeltidemann Indeed, in order to do that I believe you'll need to specify the base image when running the CLI command. For example:

tfx pipeline create --pipeline-path=kubeflow_runner.py --endpoint=$ENDPOINT --build_base_image your-docker-hub-repo/your-tfx-image --build_target_image your-docker-hub-repo/your-image-for-this-pipeline

Also, please refer to the help message for the --build_target_image option in https://github.com/tensorflow/tfx/blob/HEAD/tfx/tools/cli/commands/pipeline.py for advanced image-building options.
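
If I recall correctly, the image can also be pinned in the runner module itself via KubeflowDagRunnerConfig's tfx_image argument. A rough sketch of a kubeflow_runner.py (the create_pipeline function here is a placeholder for your actual pipeline factory):

from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner

def create_pipeline():
    # Placeholder; in practice this returns your real pipeline with its
    # components, pipeline_name and pipeline_root.
    return pipeline.Pipeline(
        pipeline_name='my-pipeline',
        pipeline_root='gs://my-bucket/pipeline-root',
        components=[],
    )

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=(
        kubeflow_dag_runner.get_default_kubeflow_metadata_config()),
    # Container image used for every component of the compiled pipeline.
    tfx_image='eu.gcr.io/my-project/custom-tfx-image',
)

kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(create_pipeline())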

easadler commented 3 years ago

I wanted to mention how important getting metrics from TFX into the Kubeflow UI is for my team. I'm curious whether this is still an issue with the Kubeflow V2 runner? I haven't been able to try it out.

numerology commented 3 years ago

@easadler

The Kubeflow V2 runner is still being developed. Currently it only compiles TFX DSL objects into the KFP IR spec. Visualization support in the Kubeflow V2 runner is still being discussed.

/cc @neuromage

axeltidemann commented 3 years ago

@numerology I was able to create a custom build-base-image of TFX with the changes I referenced above:

  1. Pulled the TFX code from GitHub.
  2. Made changes as I wrote above.
  3. Created image by following these instructions (in essence ./tfx/tools/docker/build_docker_image.sh)
  4. Renamed (i.e. gave the image the tag eu.gcr.io/my-project/custom-tfx-image), pushed it to GCR.
  5. Specified both build and target image: tfx pipeline create --engine kubeflow --build-target-image eu.gcr.io/my-project/my-tfx-pipeline --build-base-image eu.gcr.io/my-project/custom-tfx-image --endpoint $ENDPOINT --pipeline-path kubeflow_runner.py
  6. I can verify that the changes above are applied, because I can see [Skaffold] Step 1/4 : FROM eu.gcr.io/my-project/custom-tfx-image when creating the pipeline.
  7. Another verification step: I print out the hardcoded JSON file in the container_entrypoint.py after writing it, so I am sure it is successfully written.
  8. But alas, no mlpipeline-metrics in the KubeFlow UI.

The KFP SDK version is 1.4, and the KubeFlow (deployment) version is 1.0.4; could this be an issue? It was my understanding that KubeFlow Pipelines (1.4) and the deployment of KubeFlow on Kubernetes (1.0.4) are two different things, and that comparing the version numbers is meaningless (but please correct me if I am wrong). Do you have any other ideas?
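
One more local sanity check I could add to the steps above, assuming the patched TFX is installed in my local environment: compile the pipeline locally and confirm the mlpipeline-metrics output artifact actually made it into the generated Argo workflow spec. A rough sketch (create_pipeline stands in for the pipeline factory in my kubeflow_runner.py, and I'm assuming the compiled tarball contains a pipeline.yaml member):

import tarfile

from tfx.orchestration.kubeflow import kubeflow_dag_runner

from kubeflow_runner import create_pipeline  # placeholder import

# Compile the pipeline locally with the patched TFX installation.
kubeflow_dag_runner.KubeflowDagRunner(
    output_filename='pipeline.tar.gz').run(create_pipeline())

# Inspect the compiled Argo workflow for the metrics output artifact.
with tarfile.open('pipeline.tar.gz') as tar:
    workflow = tar.extractfile('pipeline.yaml').read().decode()

print('mlpipeline-metrics found:', 'mlpipeline-metrics' in workflow)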

axeltidemann commented 3 years ago

@neuromage maybe you have some ideas why the above approach does not work?

axeltidemann commented 3 years ago

@neuromage @numerology sorry to bother you again, but do you have any thoughts on this?

ConverJens commented 2 years ago

@neuromage @numerology @axeltidemann What is the status on this? Is it possible to export metrics and custom metadata with TFX in KubeFlow nowadays?

axeltidemann commented 2 years ago

No progress from my side; when I have time, I'd like to re-try the suggestions I outlined above, just to verify.

github-actions[bot] commented 1 year ago

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

axeltidemann commented 1 year ago

I still haven't had the time, but I'd very much like to keep this issue open.