Closed by maciejwarchol 3 years ago
One interesting finding about the package versions:
Base TFX image
Image: gcr.io/tfx-oss-public/tfx:0.30.0
apache-beam==2.28.0
tfx==0.30.0
tensorflow==2.4.1
Base Dataflow image
Image: gcr.io/cloud-dataflow/v1beta3/python37-fnapi:2.28.0
# check available packages
pip freeze | grep tensorflow
tensorflow==2.4.1
tensorflow-estimator==2.4.0
# install tfx
pip install tfx==0.30.0
# check available packages
pip freeze | grep apache-beam
apache-beam==2.31.0
pip freeze | grep tensorflow
tensorflow==2.4.3
pip freeze | grep tfx
tfx==0.30.0
So in this case we end up with different TF and Beam versions (tensorflow 2.4.3 instead of 2.4.1, apache-beam 2.31.0 instead of 2.28.0), which may have an impact.
We used a custom Dataflow image (the same one that runs the rest of the pipeline) to rule out any version mismatch, and it still failed.
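To make this kind of comparison repeatable, here is a minimal sketch that diffs two sets of pip freeze pins. The helper names parse_freeze and diff_pins are hypothetical; the version strings are the ones quoted above.

```python
def parse_freeze(text: str) -> dict:
    """Parse `pip freeze` style lines (name==version) into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a plain name==version pin.
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            pins[name.lower()] = version
    return pins


def diff_pins(left: dict, right: dict) -> dict:
    """Return {package: (left_version, right_version)} where the pins differ."""
    return {
        name: (left[name], right[name])
        for name in sorted(left.keys() & right.keys())
        if left[name] != right[name]
    }


# Versions quoted above for the two images.
tfx_image = parse_freeze("apache-beam==2.28.0\ntfx==0.30.0\ntensorflow==2.4.1")
dataflow_image = parse_freeze("apache-beam==2.31.0\ntensorflow==2.4.3\ntfx==0.30.0")

mismatches = diff_pins(tfx_image, dataflow_image)
for name, (tfx_version, dataflow_version) in mismatches.items():
    print(f"{name}: {tfx_version} (TFX image) vs {dataflow_version} (Dataflow image)")
```

In practice the two input strings would come from running pip freeze inside each container rather than from literals.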
I've found that the following code fails with a similar error message (InvalidArgumentError: axis = 1 not in [-1, 1)) if serialized_examples is a list of size one. It doesn't matter which examples I provide.
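import os
from tensorflow_model_analysis.eval_saved_model import load
# model_path and serialized_examples are not shown here
eval_saved_model = load.EvalSavedModel(os.path.join(model_path, 'Format-TFMA'))
# Fails when serialized_examples contains exactly one example
eval_saved_model.metrics_reset_update_get_list(serialized_examples)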
We figured it out. It was failing in Dataflow because the data was distributed among workers, and sometimes a worker tried to process a batch of size one. That shouldn't be an issue, but somewhere in the computation graph we had a tf.squeeze transformation without an axis argument specified. In the rare case of a batch of size one, it unintentionally reduced the rank of the input tensor too far by also removing the batch dimension. The incorrect rank then caused other errors downstream, like the InvalidArgumentError I posted.
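The shape arithmetic behind this can be sketched in plain Python. This is only a model of the squeeze rule (drop every size-1 dimension when no axis is given), not the code from our graph. The [batch, 1, 8] shape and the helper names squeezed_shape and axis_is_valid are hypothetical, chosen for illustration.

```python
from typing import Optional, Sequence, Tuple


def squeezed_shape(shape: Sequence[int],
                   axis: Optional[Sequence[int]] = None) -> Tuple[int, ...]:
    """Shape after a squeeze: drop size-1 dims (all of them if axis is None)."""
    if axis is None:
        return tuple(d for d in shape if d != 1)
    rank = len(shape)
    # Normalize negative axes so -1 means the last dimension.
    targets = {a % rank for a in axis}
    for a in targets:
        if shape[a] != 1:
            raise ValueError(f"cannot squeeze axis {a} of size {shape[a]}")
    return tuple(d for i, d in enumerate(shape) if i not in targets)


def axis_is_valid(rank: int, axis: int) -> bool:
    """An axis is valid for a tensor of the given rank if it lies in [-rank, rank)."""
    return -rank <= axis < rank


# Hypothetical input of shape [batch, 1, 8].
for batch_size in (4, 1):
    shape = (batch_size, 1, 8)
    no_axis = squeezed_shape(shape)              # squeeze with no axis argument
    explicit = squeezed_shape(shape, axis=[1])   # squeeze only the middle dim
    print(shape, "no axis:", no_axis, "axis=[1]:", explicit,
          "axis 1 still valid after bare squeeze:",
          axis_is_valid(len(no_axis), 1))
```

With a batch of four, the bare squeeze gives (4, 8) and a later operation on axis 1 is fine. With a batch of one it gives (8,), a rank-1 tensor whose valid axis range is [-1, 1), which is consistent with the error message. Passing the axis explicitly keeps the batch dimension in both cases.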
System information
pip freeze output: This is basically the Docker image gcr.io/tfx-oss-public/tfx:0.30.0 with kfp==1.4.1rc1.
Describe the current behavior
When I run the pipeline using Kubeflow with DataflowRunner, the Evaluator always fails for some data samples with the error message InvalidArgumentError: axis = 1 not in [-1, 1). I tried running the pipeline on the same data sample on Kubeflow with DirectRunner, and locally using InteractiveContext, and it always works. I found some examples printed in the stack trace (raw_input = ...), which I replaced with "XXXXX" for confidentiality reasons. When I run the Evaluator on such an example, it fails, but only when run using Kubeflow with DataflowRunner. As far as I can tell, there is nothing unusual about these examples; I couldn't tell them apart from the ones that work.
Describe the expected behavior
The Evaluator component finishes the evaluation without any errors.
Standalone code to reproduce the issue
I don't have standalone code I could share.
Name of your Organization (Optional)
OpenX
Other info / logs
I replaced the confidential data with "XXXXX".