vaskozl closed this issue 7 months ago.
I've also just tested the issues with 1.7.0-rc2.
I've tried kfp 1.4.0 and 1.6.6.
Downgrading the metadata images seems to have solved this for me:
```yaml
images:
  - name: gcr.io/ml-pipeline/metadata-envoy
    newTag: 1.4.0
  - name: gcr.io/tfx-oss-public/ml_metadata_store_server
    newTag: 0.25.1
  - name: gcr.io/ml-pipeline/metadata-writer
    newTag: 1.4.0
```
I believe the `metadata-writer` container version is the most important one here, though I have no context.
Actually reverting doesn't solve the problem completely, I still find the Statistics and Evaluator artifacts don't display.
In the markdown they have zeroed-out IDs:

```
id: 0
span: None
type_id: 0
```
There seems to be a problem with ml-metadata 1.0.0 which is required by tfx 1.0.0 and matches the grpc server of 1.7.1.
Downgrading the metadata-grpc-server alone doesn't solve anything.
I've gone back down to tfx 0.30 which uses metadata 0.30.0
Visualisations work when I downgrade gcr.io/tfx-oss-public/ml_metadata_store_server to 0.25.1 as well.
Hello @vaskozl, which TFX version were you using with KFP 1.4.0?
/cc @jiyongjung0 do you know any context?
Possible related: https://github.com/tensorflow/tensorflow/issues/50045
When on KFP 1.4.0, everything from TFX 0.27 to 1.0.0 wrote metadata (visualizations didn't work, but they do in the 1.7.0 release). I do recall having to bump the grpcio version in requirements up and down while hopping up to TFX 1.0.0.
- TFX 1.0.0 -> Kubeflow 1.7.0 (with metadata): no metadata written
- TFX 1.0.0 -> Kubeflow 1.4.0: partial metadata written, it seemed
- TFX 0.30.0 -> Kubeflow 1.4.0: all as expected
As I said, I'm now using 1.7.0 with just `ml_metadata_store_server` downgraded below 1.0.0.
I found a possible cause. It is related to the changes in the way TFX stores its contexts since 1.0.0 (which is related to the changes in the execution stack using TFX IR).
In TFX 0.X, the contexts were:
However, in TFX 1.0, the contexts became:
So it seems like Kubeflow Pipelines cannot find the context (and artifacts) properly. I think that we should change the MLMD access code like here.
CC. @zhitaoli , @1025KB , @Bobgy
Thank you @jiyongjung0 for finding the difference between TFX 0.X and 1.0+!
For backward compatibility, what should KFP frontend do to detect if a TFX pipeline is TFX 0.X or TFX 1.0+?
Unfortunately, it seems that there is no direct clue when finding executions. (Artifacts have a `tfx_version` property, but there is no such information in Context / Execution.)
I think that we can try to find 1.0 context first, and fallback to 0.X context if not found.
Makes sense to me to find 1.0 context first and then fall back to 0.X
Sounds good, agree with the fallback strategy here.
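The agreed fallback could be sketched roughly as follows. This is a plain-Python illustration, not actual KFP or MLMD API: the store is mocked as a dict, and the context type names and naming schemes are assumptions for the sake of the example.

```python
# Sketch of the proposed fallback: look for a TFX 1.0-style context
# first, then fall back to a 0.X-style one. The MLMD store is mocked
# as a dict from (context_type, context_name) to a context id; the
# type/name conventions below are illustrative assumptions.

def find_node_context(store, pipeline_name, node_id):
    """Return the context for a pipeline node, trying the TFX 1.0+
    naming first and falling back to the 0.X naming if not found."""
    candidates = [
        ("node", f"{pipeline_name}.{node_id}"),  # assumed TFX 1.0+ style
        ("component_run", node_id),              # assumed TFX 0.X style
    ]
    for type_name, context_name in candidates:
        context = store.get((type_name, context_name))
        if context is not None:
            return context
    return None

# A store written by a TFX 0.X pipeline only has the old-style context:
old_store = {("component_run", "StatisticsGen"): 7}
assert find_node_context(old_store, "taxi", "StatisticsGen") == 7

# A store written by TFX 1.0+ has the new-style context:
new_store = {("node", "taxi.StatisticsGen"): 42}
assert find_node_context(new_store, "taxi", "StatisticsGen") == 42
```

The ordering matters: querying for the newer naming first means the extra lookup cost is only paid for legacy 0.X pipelines.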
Hello @jiyongjung0 ,
Currently we use this code logic to identify the execution for a specific node using the node ID (which is the corresponding Pod ID). However, with the latest TFX integration, we are unable to find this connection from the Execution; see the following example for a StatisticsGen execution:
How do I fix the TFX integration so that I get the properties correctly, like the following (from an old deployment)?
Hi, this is a bug from TFX side introduced in https://github.com/tensorflow/tfx/commit/24fc5d1db9198a75db11af25cf05c4d3ae05491f. It seems like we don't record pod names in TFX 1.0.0. I'll try to fix this ASAP, and will update the release plan.
I'm trying to include the above fix in the TFX 1.2.0 which is expected to be released tomorrow. I'll update you when the release is finalized.
TFX 1.2.0 was released today and this should be fixed. Thank you again for reporting this!
Appreciate it a lot @jiyongjung0 for the quick fix and release of TFX 1.2.0!
The following is what I found:
I need to make changes to the notebook example for TFX v1.2.0 for pip installation and image. I sent a PR for this: https://github.com/kubeflow/pipelines/pull/6381; your review would be much appreciated!
I am able to see HTML visualizations for StatisticsGen, SchemaGen, etc. (yay!), but I am not able to see the visualization of the Transform step, for artifacts like `pre_transform_stats`. This is because KFP is trying to visit `Split-eval` and `Split-train` in the code `files = tf.io.gfile.listdir('${specificUri}')`: https://github.com/kubeflow/pipelines/blob/master/frontend/src/lib/OutputArtifactLoader.ts#L304, where I don't have those files; see the screenshot below:
The Evaluator step fails with the following logs; how does TFX utilize the KFP visualization feature?
```
File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 779, in process
  result.extend(self._batch_reducible_process(unbatched_element))
File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 928, in _batch_reducible_process
  input_specs = get_input_specs(model, signature_name, required) or {}
File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 472, in get_input_specs
  signature_name, model.signatures))
ValueError: tft_layer not found in model signatures: _SignatureMap({'serving_default': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E62D0>, 'transform_features': <ConcreteFunction signature_wrapper(*, examples) at 0x7FCBCA4E3410>}) [while running 'ExtractEvaluateAndWriteResults/ExtractAndEvaluate/ExtractTransformedFeatures/Predict']
time="2021-08-18T16:09:09.989Z" level=error msg="cannot save artifact /mlpipeline-ui-metadata.json" argo=true error="stat /mlpipeline-ui-metadata.json: no such file or directory"
Error: exit status 1
```
@zijianjoy
Thank you so much!!
For 2. (Transform output): it seems like an inconsistency in the TFX implementation. The artifact from Transform was added recently and I never tried it before. I'll talk with other TFX folks and get back to you.
For 3. (Evaluator): we need to change the `preprocessing_function_names` in the evaluator config in 1.2.0, because the example was changed in 1.2.0. Please see https://github.com/tensorflow/tfx/blob/34bdbc8c0f7c2d0da36559c9cb7afd603e44a5e3/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_native_keras.py#L126
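Concretely, the fix is to point `preprocessing_function_names` at a signature the saved model actually exposes (the `ValueError` above shows `transform_features` is available while `tft_layer` is not). A hedged sketch of the relevant `tfma.EvalConfig` model spec, as a textproto fragment; the exact field values here are illustrative and should be checked against the linked taxi example:

```proto
# Fragment of a tfma.EvalConfig; values are illustrative, based on the
# signatures listed in the ValueError above.
model_specs {
  signature_name: "serving_default"
  preprocessing_function_names: "transform_features"
}
```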
To document the conversation:
Now TFX moves the information in Execution to a Context with type `node`; see code: https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/metadata.py#L428-L435.
KFP will consider the possibility of pulling the context for TFX.
@zijianjoy Was the TFX > 1.0.0 fix included in the KFP 1.7.0 release?
@ConverJens I can confirm it has.
TFX 1.2.0 and Pipelines 1.7.0 work perfectly with no patches.
@vaskozl Great news, thank you for the information!
Yes @ConverJens , it is as confirmed by @vaskozl .
BTW, @Bobgy is working on a compatibility matrix for KFP and TFX (and more) which shows the working version combinations in the future.
@zijianjoy Great! A compatibility matrix would be awesome!
Hello @ConverJens , you can check out the compatibility matrix in https://www.kubeflow.org/docs/components/pipelines/installation/compatibility-matrix/ now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi there. I am using KFP v1.5.1 with TFX v0.28.0. I faced the issue of loading ML Metadata in the Runs page, although the compatibility matrix lists this combination as "Fully Compatible".

That was because of a contextName inconsistency between what the frontend wants to find and what is stored in MySQL: the frontend wanted to find something like "mypipeline.my-pipeline-xaew1" (some hash is appended in the second part), but what is stored in MySQL is "my-pipeline.my-pipeline-xaew1", so an error happened. See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L66. After changing it from `.join('')` to `.join('-')`, it works:
```ts
const pipelineName = argoWorkflowName
  .split('-')
  .slice(0, -1)
  .join('-');
```
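The difference is easy to see on an example workflow name. A plain-Python illustration of the same split/slice/join logic (the workflow name is made up for the example):

```python
# Reproduce the frontend's pipeline-name derivation: an Argo workflow
# name like "my-pipeline-xaew1" is split on "-", the trailing hash is
# dropped, and the remaining parts are rejoined with a separator.

def pipeline_name(argo_workflow_name, sep):
    parts = argo_workflow_name.split("-")[:-1]
    return sep.join(parts)

# join('') mangles hyphenated pipeline names...
assert pipeline_name("my-pipeline-xaew1", "") == "mypipeline"
# ...while join('-') preserves them, matching the name stored in MLMD.
assert pipeline_name("my-pipeline-xaew1", "-") == "my-pipeline"
```

Pipelines without hyphens in their names are unaffected either way, which is presumably why the bug only bites hyphenated pipeline names.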
It also had an issue with `getKfpPod` using `KFP_POD_NAME`, because kfpRun wasn't written into the DB. See: https://github.com/kubeflow/pipelines/blob/1.5.1/frontend/src/lib/MlmdUtils.ts#L146. I took a workaround using `POD_NAME`, which made it work:
```ts
export enum KfpExecutionProperties {
  KFP_POD_NAME = 'kfp_pod_name',
  POD_NAME = 'pod_name',
}

// ...

getKfpPod(execution: Execution): string | number | undefined {
  return (
    getResourceProperty(execution, KfpExecutionProperties.POD_NAME, true) ||
    getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME) ||
    getResourceProperty(execution, KfpExecutionProperties.KFP_POD_NAME, true) ||
    undefined
  );
},
```
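The lookup order in that workaround can be sketched in plain Python, with executions mocked as dicts of properties. This is a hypothetical helper for illustration, not the KFP frontend API (in the real code the third argument to `getResourceProperty` selects custom properties):

```python
# Sketch of the getKfpPod workaround: prefer the custom property
# "pod_name", then fall back to "kfp_pod_name" as a regular property,
# then as a custom property. Executions are mocked as dicts with
# "properties" and "custom_properties" maps.

def get_kfp_pod(execution):
    """Return the pod name recorded for an execution, or None."""
    custom = execution.get("custom_properties", {})
    props = execution.get("properties", {})
    return (
        custom.get("pod_name")         # workaround: TFX-written property
        or props.get("kfp_pod_name")   # original KFP lookup
        or custom.get("kfp_pod_name")  # original KFP fallback
        or None
    )

# An execution that only carries pod_name still resolves:
execution = {"custom_properties": {"pod_name": "statisticsgen-pod-abc"}}
assert get_kfp_pod(execution) == "statisticsgen-pod-abc"
# With no pod properties at all, nothing is found:
assert get_kfp_pod({"properties": {}}) is None
```

Putting the new property first keeps the original lookups as fallbacks, so older runs that recorded `kfp_pod_name` keep working.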
Closing this issue as there is no activity since 2022.
/close
@rimolive: Closing this issue.
/kind bug
I upgraded Kubeflow from 1.4.0 to 1.7.0-rc1 with the platform-agnostic manifests.
While I now see correct visualizations of statistics from runs that happened before upgrading to 1.7.0-rc1, new runs only display the markdown details.
The TFX pipelines I submit are exactly the same. On the new runs the ML Metadata tab of the components prints:
"Corresponding ML Metadata not found."
Furthermore I don't see any new executions on the executions page despite running many pipelines since upgrading.
I don't see anything special in the logs of the TFX pods except:
But that was present before upgrading to 1.7.0-rc1.
The only errors I see in the metadata-grpc-deployment pod are:
Which I also think is normal?
Basically I don't think executions and artifacts are getting written to the DB for some reason in 1.7.0-rc1. Not sure how to debug this. This causes the visualizations to not show up as far as I can see.
Metadata in the TFX pipelines is configured via the `get_default_kubeflow_metadata_config` function in `tfx.orchestration.kubeflow`.
Environment:
- Kubeflow version: 1.4.0 -> 1.7.0-rc1
- kfctl version: not used; using tfx.orchestration.kubeflow to submit pipelines
- Kubernetes platform: upstream kubeadm, k8s v1.20.5
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release): CentOS 8
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.