tensorflow / model-analysis

Model analysis tools for TensorFlow
Apache License 2.0

Error in merge_accumulators when using keras metrics on dataflow #158

Open zywind opened 2 years ago

zywind commented 2 years ago

System information

I am using TFX's evaluator

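import tensorflow as tf
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

# model_specs and slicing_specs are defined elsewhere in the pipeline.
eval_config = tfma.EvalConfig(
  model_specs=model_specs,
  metrics_specs=tfma.metrics.specs_from_metrics([
      tf.keras.metrics.AUC(curve='ROC', name='ROCAUC'),
      tf.keras.metrics.AUC(curve='PR', name='PRAUC'),
      tf.keras.metrics.Precision(),
      tf.keras.metrics.Recall(),
      tf.keras.metrics.BinaryAccuracy(),
    ]),
  slicing_specs=slicing_specs
)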

evaluator = Evaluator(
  eval_config=eval_config,
  model=model,
  examples=transform_examples,
)

context.run(evaluator)

Describe the problem

Running the same evaluation locally with Beam's DirectRunner produces no error, but whenever I run it on Dataflow and Dataflow spawns more than one worker, I get the following error:

output.with_value(self.phased_combine_fn.apply(output.value)):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 882, in merge_only
    return self.combine_fn.merge_accumulators(accumulators)
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in merge_accumulators
    a in zip(self._combiners, zip(accumulators_batch))
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in <listcomp>
    a in zip(self._combiners, zip(accumulators_batch))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 560, in merge_accumulators
    for metric_index in range(len(self._metrics[output_name])):
TypeError: 'NoneType' object is not subscriptable
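One plausible reading of this traceback, offered only as a guess and not confirmed anywhere in this thread, is that the combiner builds its metrics lazily, so a worker that only merges accumulators (and never adds inputs) still has them set to None. The sketch below reproduces that pattern in plain Python. LazyMetricCombiner and its methods are hypothetical stand-ins, not TFMA's or Beam's actual code.

```python
class LazyMetricCombiner:
    """Hypothetical combiner that builds its metrics on first add_input."""

    def __init__(self, metric_names):
        self._metric_names = metric_names
        self._metrics = None  # not built until _setup_if_needed() runs

    def _setup_if_needed(self):
        if self._metrics is None:
            # Keyed by output name; '' stands for the default output.
            self._metrics = {"": list(self._metric_names)}

    def create_accumulator(self):
        return [0.0] * len(self._metric_names)

    def add_input(self, accumulator, value):
        self._setup_if_needed()
        return [a + value for a in accumulator]

    def merge_accumulators(self, accumulators):
        # No setup call here: fails if this instance never saw add_input.
        n = len(self._metrics[""])
        merged = [0.0] * n
        for acc in accumulators:
            for i in range(n):
                merged[i] += acc[i]
        return merged


# Single worker: the same instance adds inputs and then merges.
single = LazyMetricCombiner(["auc", "precision"])
acc = single.add_input(single.create_accumulator(), 1.0)
merged_single = single.merge_accumulators([acc, acc])

# Several workers: a fresh copy of the combiner may only ever merge.
merge_only = LazyMetricCombiner(["auc", "precision"])
try:
    merge_only.merge_accumulators([acc, acc])
    merge_error = None
except TypeError as err:
    merge_error = str(err)
```

In this toy version, calling _setup_if_needed() at the top of merge_accumulators removes the error. Whether the real wrapper can be fixed the same way is an open question here.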

Based on the Dataflow log, the failing steps were:

I see that you have this commit, which appears to address this problem, but it was immediately rolled back. Have you run into similar issues, and what would you recommend to fix the error?

zywind commented 2 years ago

I tried setting Dataflow's max_num_workers to 1 and the job succeeded. It looks like the problem is indeed in running Dataflow with multiple workers.
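For reference, a minimal sketch of how that single-worker workaround might be expressed as Beam pipeline arguments. The helper name dataflow_args and the project, region, and bucket values are placeholders invented for illustration; only the flag names are standard Beam/Dataflow options.

```python
def dataflow_args(project, region, temp_location, max_num_workers=1):
    """Build Beam pipeline args that cap Dataflow at max_num_workers workers."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # With a cap of 1, autoscaling cannot add a second worker.
        f"--max_num_workers={max_num_workers}",
    ]


# Placeholder values; substitute your own project, region, and bucket.
beam_pipeline_args = dataflow_args(
    "my-project", "us-central1", "gs://my-bucket/tmp"
)
```

In TFX a list like this is usually passed as beam_pipeline_args on the pipeline, or on a single component via with_beam_pipeline_args. This only avoids the error by giving up parallelism, so it is a workaround and not a fix.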

singhniraj08 commented 2 years ago

Hi @zywind,

As mentioned here, for distributed evaluation we use tfma.ExtractEvaluateAndWriteResults. Please refer to this example notebook and let me know if this resolves your issue.

Thank you.

zywind commented 2 years ago

Hi @singhniraj08,

I'm using the official TFX Evaluator, which internally uses tfma.ExtractEvaluateAndWriteResults as you can see here.