tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0

Evaluator works locally but fails on Dataflow #4180

Closed maciejwarchol closed 3 years ago

maciejwarchol commented 3 years ago

System information

This is basically a docker image gcr.io/tfx-oss-public/tfx:0.30.0 with kfp==1.4.1rc1.

Describe the current behavior

When I run the pipeline on Kubeflow with DataflowRunner, the Evaluator always fails for some data samples with the error message: InvalidArgumentError: axis = 1 not in [-1, 1). With the same data sample, the pipeline always succeeds on Kubeflow with DirectRunner and locally with InteractiveContext. The stack trace prints some of the failing examples (raw_input = ...), which I replaced with "XXXXX" for confidentiality reasons. Running the Evaluator on those examples fails, but only on Kubeflow with DataflowRunner. As far as I can tell, there is nothing unusual about these examples; I couldn't find any difference between them and the ones that work.

Describe the expected behavior

The Evaluator component finishes the evaluation without any errors.

Standalone code to reproduce the issue

I don't have standalone code I could share.

Name of your Organization (Optional)

OpenX

Other info / logs

I replaced the confidential data with "XXXXX".

INFO:apache_beam.runners.dataflow.dataflow_runner:2021-08-23T15:20:19.848Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: axis = 1 not in [-1, 1)
     [[{{node stack}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_metrics_graph/eval_metrics_graph.py", line 323, in _perform_metrics_update_list
    self._perform_metrics_update_fn(*[examples_list])
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1243, in _generic_run
    return self.run(fetches, feed_dict=feed_dict, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: axis = 1 not in [-1, 1)
     [[node stack (defined at usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_saved_model/load.py:169) ]]

Original stack trace for 'stack':
  File "usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/thread_pool_executor.py", line 68, in run
    self._work_item.run()
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in task
    lambda: self.create_worker().do_instruction(request), request)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute
    response = task()
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 607, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 638, in process_bundle
    instruction_id, request.process_bundle_descriptor_id)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 467, in get
    self.data_channel_factory)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 870, in __init__
    op.setup()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 631, in setup
    model_load_time_callback=self._set_model_load_seconds)
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/types.py", line 164, in load
    return self._shared_handle.acquire(construct_fn)
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 315, in acquire
    return _shared_map.acquire(self._key, constructor_fn, tag)
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 256, in acquire
    result = control_block.acquire(constructor_fn, tag)
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 150, in acquire
    result = constructor_fn()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/types.py", line 173, in with_load_times
    model = self.construct_fn()
  File "opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 588, in construct_fn
    tags=tags)
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_saved_model/load.py", line 102, in __init__
    super(EvalSavedModel, self).__init__()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_metrics_graph/eval_metrics_graph.py", line 135, in __init__
    self._construct_graph()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_saved_model/load.py", line 169, in _construct_graph
    self._session, self._tags, self._path)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 340, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 300, in load
    return loader.load(sess, tags, import_scope, **saver_kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 453, in load
    **saver_kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 383, in load_graph
    meta_graph_def, import_scope=import_scope, **saver_kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1485, in _import_meta_graph_with_return_elements
    **kwargs))
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/meta_graph.py", line 804, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 538, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 513, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3680, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3680, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3561, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute
    response = task()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 607, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1005, in process_bundle
    op.finish()
  File "apache_beam/runners/worker/operations.py", line 1080, in apache_beam.runners.worker.operations.PGBKCVOperation.finish
  File "apache_beam/runners/worker/operations.py", line 1083, in apache_beam.runners.worker.operations.PGBKCVOperation.finish
  File "apache_beam/runners/worker/operations.py", line 1098, in apache_beam.runners.worker.operations.PGBKCVOperation.output_key
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/evaluators/metrics_plots_and_validations_evaluator.py", line 396, in compact
    return super(_ComputationsCombineFn, self).compact(accumulator)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 758, in compact
    a in zip(self._combiners, accumulator)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 758, in <listcomp>
    a in zip(self._combiners, accumulator)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/evaluators/eval_saved_model_util.py", line 260, in compact
    self._maybe_do_batch(accumulator, force=True)  # Guaranteed compaction.
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/evaluators/eval_saved_model_util.py", line 225, in _maybe_do_batch
    inputs_for_metrics))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_metrics_graph/eval_metrics_graph.py", line 348, in metrics_reset_update_get_list
    self._perform_metrics_update_list(examples_list)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_metrics_graph/eval_metrics_graph.py", line 328, in _perform_metrics_update_list
    'raw_input = %s' % (examples_list))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/util.py", line 270, in reraise_augmented
    six.reraise(type(new_exception), new_exception, original_traceback)
  File "/usr/local/lib/python3.7/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_metrics_graph/eval_metrics_graph.py", line 323, in _perform_metrics_update_list
    self._perform_metrics_update_fn(*[examples_list])
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1243, in _generic_run
    return self.run(fetches, feed_dict=feed_dict, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
RuntimeError: tensorflow.python.framework.errors_impl.InvalidArgumentError: axis = 1 not in [-1, 1)
     [[node stack (defined at usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_saved_model/load.py:169) ]]

Original stack trace for 'stack':
  File "usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/thread_pool_executor.py", line 68, in run
    self._work_item.run()
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in task
    lambda: self.create_worker().do_instruction(request), request)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute
    response = task()
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 607, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 638, in process_bundle
    instruction_id, request.process_bundle_descriptor_id)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 467, in get
    self.data_channel_factory)
  File "usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 870, in __init__
    op.setup()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 631, in setup
    model_load_time_callback=self._set_model_load_seconds)
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/types.py", line 164, in load
    return self._shared_handle.acquire(construct_fn)
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 315, in acquire
    return _shared_map.acquire(self._key, constructor_fn, tag)
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 256, in acquire
    result = control_block.acquire(constructor_fn, tag)
  File "usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 150, in acquire
    result = constructor_fn()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/types.py", line 173, in with_load_times
    model = self.construct_fn()
  File "opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/model_util.py", line 588, in construct_fn
    tags=tags)
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_saved_model/load.py", line 102, in __init__
    super(EvalSavedModel, self).__init__()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_metrics_graph/eval_metrics_graph.py", line 135, in __init__
    self._construct_graph()
  File "usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/eval_saved_model/load.py", line 169, in _construct_graph
    self._session, self._tags, self._path)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 340, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 300, in load
    return loader.load(sess, tags, import_scope, **saver_kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 453, in load
    **saver_kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 383, in load_graph
    meta_graph_def, import_scope=import_scope, **saver_kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1485, in _import_meta_graph_with_return_elements
    **kwargs))
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/meta_graph.py", line 804, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 538, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 513, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3680, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3680, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3561, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()
additional message: raw_input = [b'XXXXX']

michalbrys commented 3 years ago

One interesting finding about the package versions:

Base TFX image

 Image: gcr.io/tfx-oss-public/tfx:0.30.0
 apache-beam==2.28.0
 tfx==0.30.0
 tensorflow==2.4.1

Base Dataflow image

 Image: gcr.io/cloud-dataflow/v1beta3/python37-fnapi:2.28.0

# check available packages
pip freeze | grep tensorflow
tensorflow==2.4.1
tensorflow-estimator==2.4.0

# install tfx
pip install tfx==0.30.0

# check available packages
pip freeze | grep apache-beam
apache-beam==2.31.0

pip freeze | grep tensorflow
tensorflow==2.4.3

pip freeze | grep tfx
tfx==0.30.0

So in this case the Dataflow image ends up with different TensorFlow and Beam versions (2.4.3 and 2.31.0 instead of 2.4.1 and 2.28.0), which may matter.
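
The comparison above can be automated. The following is a minimal sketch, not part of TFX or Beam: the helper names `parse_freeze` and `version_mismatches` are hypothetical, and the two input strings simply repeat the `pip freeze` values quoted in this comment.

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze`-style 'name==version' lines."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            versions[name.lower()] = version
    return versions


def version_mismatches(a, b):
    """Return {name: (version_a, version_b)} for packages pinned differently in a and b."""
    return {n: (a[n], b[n]) for n in sorted(a.keys() & b.keys()) if a[n] != b[n]}


# Versions copied from the listings in this comment.
TFX_IMAGE = """
apache-beam==2.28.0
tfx==0.30.0
tensorflow==2.4.1
"""

DATAFLOW_IMAGE_AFTER_INSTALL = """
apache-beam==2.31.0
tensorflow==2.4.3
tfx==0.30.0
"""

mismatches = version_mismatches(
    parse_freeze(TFX_IMAGE), parse_freeze(DATAFLOW_IMAGE_AFTER_INSTALL)
)
for name, (tfx_version, dataflow_version) in mismatches.items():
    print(f"{name}: {tfx_version} (TFX image) vs {dataflow_version} (Dataflow image)")
```

Running both images' full `pip freeze` output through the same helpers would show whether any other shared dependency drifts as well.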

pselden commented 3 years ago

We used a custom Dataflow image (the same one that runs the rest of the pipeline) to rule out any version mismatch, and it still failed.

maciejwarchol commented 3 years ago

I've found that the following code fails with a similar error message (InvalidArgumentError: axis = 1 not in [-1, 1)) whenever serialized_examples is a list of size one, regardless of which examples I provide.

import os
from tensorflow_model_analysis.eval_saved_model import load
# model_path and serialized_examples come from the pipeline run.
eval_saved_model = load.EvalSavedModel(os.path.join(model_path, 'Format-TFMA'))
eval_saved_model.metrics_reset_update_get_list(serialized_examples)

maciejwarchol commented 3 years ago

We figured it out. It failed on Dataflow because the data was distributed among workers, and sometimes a worker had to process a batch of size one. That shouldn't be an issue, but somewhere in the computation graph we had a tf.squeeze call without the axis argument specified. In the rare case of a batch of size one, it unintentionally removed the batch dimension as well, reducing the rank of the input tensor by one more than intended. The incorrect tensor rank then caused downstream errors such as the InvalidArgumentError I posted.
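
To make the failure mode concrete, here is a minimal pure-Python sketch of the shape arithmetic. It does not call TensorFlow. The helpers `squeeze_shape` and `stack_axis_range` are hypothetical and only model the documented behavior: tf.squeeze with no axis drops every size-1 dimension, and tf.stack accepts an axis in [-(R+1), R+1) for rank-R inputs. The [batch, 1] input shape is an assumption for illustration; the actual tensor in our graph is not shown in this issue.

```python
def squeeze_shape(shape, axis=None):
    """Shape that results from squeezing `shape`.

    axis=None drops every size-1 dimension; an explicit axis drops only that one.
    """
    if axis is None:
        return tuple(d for d in shape if d != 1)
    axis = axis % len(shape)  # allow negative axes such as -1
    if shape[axis] != 1:
        raise ValueError(f"cannot squeeze axis {axis} of size {shape[axis]}")
    return shape[:axis] + shape[axis + 1:]


def stack_axis_range(rank):
    """Half-open range of valid `axis` values when stacking tensors of this rank."""
    return (-(rank + 1), rank + 1)


for batch in (32, 1):
    shape = (batch, 1)  # hypothetical [batch, 1] tensor
    implicit = squeeze_shape(shape)           # like tf.squeeze(x)
    explicit = squeeze_shape(shape, axis=-1)  # like tf.squeeze(x, axis=-1)
    print(
        f"batch={batch}: squeeze without axis -> {implicit}, "
        f"valid stack axis range {stack_axis_range(len(implicit))}; "
        f"squeeze with axis=-1 -> {explicit}"
    )
```

Under these assumptions, a batch of one collapses to a scalar, for which the valid stack axis range is [-1, 1), the same range as in the error message. Passing an explicit axis keeps the batch dimension for every batch size.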
