kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.6k stars 1.62k forks source link

[backend] Metadata/Executions not written in 1.7.0-rc1 (New visualisations not working as a result) #6138

Closed vaskozl closed 7 months ago

vaskozl commented 3 years ago

/kind bug

I upgraded Kubeflow from 1.4.0 to 1.7.0-rc1 with the platnform-agnostic manifests.

While I now see correct visualizations of statistics from runs that happened before upgrading to 1.7.0-rc1, new runs only display the markdown details.

The TFX pipelines I submit are exactly the same. On the new runs the ML Metadata tab of the components prints:

"Corresponding ML Metadata not found."

Furthermore I don't see any new executions on the executions page despite running many pipelines since upgrading.

I don't see anything special in the logs of the TFX pods except:

WARNING:absl:metadata_connection_config is not provided by IR.

But that was present before upgrading to 1.7.0-rc1.

The only errors I see in the metadata-grpc-deployment pod is:

name: "sp-lstm-rh6xt"
Internal: mysql_query failed: errno: 1062, error: Duplicate entry '48-sp-lstm-rh6xt' for key 'type_id'
    Cannot create node for type_id: 48 name: "sp-lstm-rh6xt"

Which I also think is normal?

Basically I don't think executions and artifacts are getting written to the DB for some reason in 1.7.0-rc1. Not sure how to debug this. This causes the visualizations to not show up as far as I can see.

Metadata in the TFX pipelines is configured via the get_default_kubeflow_metadata_config tfx.orchestration.kubeflow function.

Environment:

Kubeflow version: 1.4.0 -> 1.7.0-rc1 kfctl version: Not used. Using tfx.orchestration.kubeflow to submit pipelines. Kubernetes platform: Upstream kubeadm: k8s v1.20.5 Kubernetes version: (use kubectl version): OS (e.g. from /etc/os-release): Centos 8

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

vaskozl commented 3 years ago

I've also just tested the issues with 1.7.0-rc2.

I've tried kfp 1.4.0 and 1.6.6.

vaskozl commented 3 years ago

Downgrading the metadata images seems to have solved this for me:

images:
  - name: gcr.io/ml-pipeline/metadata-envoy
    newTag: 1.4.0
  - name: gcr.io/tfx-oss-public/ml_metadata_store_server
    newTag: 0.25.1
  - name: gcr.io/ml-pipeline/metadata-writer
    newTag: 1.4.0

I believe the metadata-writer container version is the most important one here, though I have no context.

vaskozl commented 3 years ago

Actually reverting doesn't solve the problem completely, I still find the Statistics and Evaluator artifacts don't display.

In the markdown they have zeroed out id's:

id: 0
span: None
type_id: 0
vaskozl commented 3 years ago

There seems to be a problem with ml-metadata 1.0.0 which is required by tfx 1.0.0 and matches the grpc server of 1.7.1.

Downgrading the the metadata-grpc-server doesn't work solve anything.

I've gone back down to tfx 0.30 which uses metadata 0.30.0

Visualisations work when I downgrade gcr.io/tfx-oss-public/ml_metadata_store_server to 0.25.1 as well.

zijianjoy commented 3 years ago

Hello @vaskozl , what is the TFX version you are using when you are using KFP 1.4.0?

Bobgy commented 3 years ago

/cc @jiyongjung0 do you know any context?

zijianjoy commented 3 years ago

Possible related: https://github.com/tensorflow/tensorflow/issues/50045

vaskozl commented 3 years ago

When on 1.4.0 everything from TFX 0.27 to 1.0.0 wrote metadata. (visualizations didn't work but they do in the 1.7.0 release). I do recall having to bump the grpcio version in requirements up and down while hopping up to TFX 1.0.0 .

TFX 1.0.0 -> Kubeflow 1.7.0 (with metadata) No metadata written TFX 1.0.0 -> Kubeflow 1.4.0 Partial metadata written it seemed TFX 0.30.0 -> Kubeflow 1.4.0 All as expected

As I said I'm now using 1.7.0 with just the ml_metadata_store_server downgraded below 1.0.0.

jiyongjung0 commented 3 years ago

I found some possible cause. It is related to the changes in the way TFX stores their contexts since 1.0.0 (which is related to the changes in the execution stack using TFX IR).

In TFX 0.X, the context were

However in TFX 1.0, the context became

Related code

So it seems like Kubeflow Pipelines cannot find context (and artifacts) properly. I think that we should change mlmd access code like here.

CC. @zhitaoli , @1025KB , @Bobgy

zijianjoy commented 3 years ago

Thank you @jiyongjung0 for finding the difference between TFX 0.X and 1.0+!

For backward compatibility, what should KFP frontend do to detect if a TFX pipeline is TFX 0.X or TFX 1.0+?

jiyongjung0 commented 3 years ago

Unfortunately, it seems that there is no direct clue when finding executions. (Artifacts has tfx_version property, but there is no such information in Context / Execution.)

I think that we can try to find 1.0 context first, and fallback to 0.X context if not found.

Bobgy commented 3 years ago

Makes sense to me to find 1.0 context first and then fall back to 0.X

zijianjoy commented 3 years ago

Sounds good, agree with the fallback strategy here.

zijianjoy commented 3 years ago

Hello @jiyongjung0 ,

Currently we use this code logic to identify the execution for a specific node using node ID (which is the corresponding Pod ID). However, with the latest integration with TFX, we are unable to find this connection from Execution, see one of the following example for a statisticsGen execution:

tfxemptyexecution

How do I fix the TFX integration that I get the properties correctly like the following? (From an old deployment)

tfxstepnotempty

jiyongjung0 commented 3 years ago

Hi, this is a bug from TFX side introduced in https://github.com/tensorflow/tfx/commit/24fc5d1db9198a75db11af25cf05c4d3ae05491f. It seems like we don't record pod names in TFX 1.0.0. I'll try to fix this ASAP, and will update the release plan.

jiyongjung0 commented 3 years ago

I'm trying to include the above fix in the TFX 1.2.0 which is expected to be released tomorrow. I'll update you when the release is finalized.

jiyongjung0 commented 3 years ago

TFX 1.2.0 was released today and this should be fixed. Thank you again for reporting this!

zijianjoy commented 3 years ago

Appreciate it a lot @jiyongjung0 for the quick fix and release of TFX 1.2.0!

The following is what I found:

  1. I need to make changes to notebook example for TFX v1.2.0 for pip installation and image. I send a PR for this: https://github.com/kubeflow/pipelines/pull/6381, appreciate for your review in advance!

  2. I am able to see HTML visualization for staticsgen, schemagen, etc. (yay!), but I am not able to see the visualization of transform step, for the artifact like pre_transform_stats. Because KFP is trying to visit Split-eval and Split-train in the code files = tf.io.gfile.listdir('${specificUri}'): https://github.com/kubeflow/pipelines/blob/master/frontend/src/lib/OutputArtifactLoader.ts#L304, where I don't have those files, see the screenshot below: pretransformstat

  3. The step evaluator fails with the following logs, how does TFX utilize the KFP visualization feature?

    File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
    File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 779, in process
    result.extend(self._batch_reducible_process(unbatched_element))
    File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 928, in _batch_reducible_process
    input_specs = get_input_specs(model, signature_name, required) or {}
    File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 472, in get_input_specs
    signature_name, model.signatures))
    ValueError: tft_layer not found in model signatures: _SignatureMap({'serving_default': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E62D0>, 'transform_features': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E3410>}) [while running 'ExtractEvaluateAndWriteResults/ExtractAndEvaluate/ExtractTransformedFeatures/Predict']
    time="2021-08-18T16:09:09.989Z" level=error msg="cannot save artifact /mlpipeline-ui-metadata.json" argo=true error="stat /mlpipeline-ui-metadata.json: no such file or directory"
    Error: exit status 1
jiyongjung0 commented 3 years ago

@zijianjoy

Thank you so much!!

For 2.(Transform output) It seems like an inconsistency in TFX implementation. The artifact from Transform was added recently and I never tried before. I'll talk with other TFX folks and get back to you.

For 3.(Evaluator), We need to change the preprocessing_function_names in the evaluator config in 1.2.0, because the example was changed in 1.2.0. Please see https://github.com/tensorflow/tfx/blob/34bdbc8c0f7c2d0da36559c9cb7afd603e44a5e3/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_native_keras.py#L126

zijianjoy commented 3 years ago

Document the conversation:

Now TFX moves the information in Execution to Context with type node. see code: https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/metadata.py#L428-L435.

KFP will consider the possibility to pulling context for TFX.

ConverJens commented 3 years ago

@zijianjoy Was the TFX > 1.0.0 fix included in the KFP 1.7.0 release?

vaskozl commented 3 years ago

@ConverJens I can confirm it has.

TFX 1.2.0 and Pipelines 1.7.0 work perfectly with no patches.

ConverJens commented 3 years ago

@vaskozl Great news, thank you for the information!

zijianjoy commented 3 years ago

Yes @ConverJens , it is as confirmed by @vaskozl .

BTW, @Bobgy is working on a compatibility matrix for KFP and TFX (and more) which shows the working version combinations in the future.

ConverJens commented 3 years ago

@zijianjoy Great! A compatibility matrix would be awsome!

zijianjoy commented 3 years ago

Hello @ConverJens , you can check out the compatibility matrix in https://www.kubeflow.org/docs/components/pipelines/installation/compatibility-matrix/ now.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wyljpn commented 2 years ago

Hi, there. I am using kfp v.1.5.1 with tfx v0.28.0. I faced the issue of loading ML metadata in the Runs page, although in the compatibility matrix it is "Fully Compatible". That was because of the contextName inconsistency between what frontend wants to find and that stored in the MySQL. frontend wanted to find something like "mypipeline.my-pipeline-xaew1" (some hash is appended in the second part.), but what is stored in the MySQL is "my-pipeline.my-pipeline-xaew1", so error happened. See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L66 image image After changing it from .join('') to .join('-'), it works.

  const pipelineName = argoWorkflowName
    .split('-')
    .slice(0, -1)
    .join('-');

And it also had an issue with getKfpPod using KFP_POD_NAME, because kfpRun wasn't written into the DB. See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L146 Took a workaround, using POD_NAME, made it works.

export enum KfpExecutionProperties {
  KFP_POD_NAME = 'kfp_pod_name',
  POD_NAME = 'pod_name',
}
...
  getKfpPod(execution: Execution): string | number | undefined {
    return (
      getResourceProperty(execution, KfpExecutionProperties.POD_NAME, true) ||
      getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME) ||
      getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME, true) ||
      undefined
    );
  },

image

rimolive commented 7 months ago

Closing this issue as there is no activity since 2022.

/close

google-oss-prow[bot] commented 7 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/6138#issuecomment-1998505041): >Closing this issue as there is no activity since 2022. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.