tensorflow / model-analysis

Model analysis tools for TensorFlow
Apache License 2.0

only integer values should be passed to num_instances metric. #171

Open zzing0907 opened 1 year ago

zzing0907 commented 1 year ago

System information

Describe the problem

When TFMA uses a Beam metric, an error occurs because the value's type is numpy.int64 rather than int. The error occurs when running evaluation with the padding option (tf-ranking metrics); the error log is below. It appears to arise while obtaining batch_size from the metric named num_instances.

Source code / logs


Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 292, in _execute
    response = task()
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 365, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 624, in do_instruction
    return getattr(self, request_type)(
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 663, in process_bundle
    monitoring_infos = bundle_processor.monitoring_infos()
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/bundle_processor.py", line 1198, in monitoring_infos
    op.monitoring_infos(transform_id, dict(tag_to_pcollection_id)))
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/operations.py", line 543, in monitoring_infos
    all_monitoring_infos.update(self.user_monitoring_infos(transform_id))
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/runners/worker/operations.py", line 584, in user_monitoring_infos
    return self.metrics_container.to_runner_api_monitoring_infos(transform_id)
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/metrics/execution.py", line 309, in to_runner_api_monitoring_infos
    all_metrics = [
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/metrics/execution.py", line 310, in <listcomp>
    cell.to_runner_api_monitoring_info(key.metric_name, transform_id)
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/metrics/cells.py", line 76, in to_runner_api_monitoring_info
    mi = self.to_runner_api_monitoring_info_impl(name, transform_id)
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/metrics/cells.py", line 150, in to_runner_api_monitoring_info_impl
    return monitoring_infos.int64_user_counter(
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/metrics/monitoring_infos.py", line 185, in int64_user_counter
    return create_monitoring_info(
  File "/usr/local/lib/python3.8/dist-packages/apache_beam/metrics/monitoring_infos.py", line 302, in create_monitoring_info
    return metrics_pb2.MonitoringInfo(
TypeError: 3367 has type numpy.int64, but expected one of: bytes

singhniraj08 commented 1 year ago

@zzing0907,

Could you please provide minimal code that reproduces the error on our end? Please refer to TensorFlow Model Analysis Metrics and Plots for TFMA-supported metrics and ranking-based metrics. Thank you!

EdwardCuiPeacock commented 1 year ago

@zzing0907 @singhniraj08 I am also facing a similar issue. The error comes and goes, so it is not reliably reproducible, but it occurs often enough to be concerning. I would also like to point out that the monitoring metric referred to by the error message is not the same as the evaluation metrics (in the data science sense) that TFMA refers to. The monitoring metrics are likely created by Apache Beam to track the progress of the workers, so I wonder whether this is actually an issue with Apache Beam. I have filed a similar issue with the Apache Beam team here: https://github.com/apache/beam/issues/27469

jrmccluskey commented 1 year ago

Coming at it from the Beam metric side, it looks like numpy.int64 values are being passed to the counter improperly somewhere. Those counters should only receive ints, as that is the only type that the Beam code will encode before passing it to a protobuf message to be reported. I provided a little context on https://github.com/apache/beam/issues/27469. If you can find where the metric is receiving numpy.int64 values and convert them to Python ints in the call, that should resolve it.
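
For anyone trying the conversion suggested above, here is a minimal sketch of one way to do it. It is not taken from the TFMA or Beam source: `to_python_int` and `FakeInt64` are hypothetical names, and `FakeInt64` is only a stand-in for numpy.int64 so the snippet runs without NumPy installed. It relies on `operator.index()` from the standard library, which returns a built-in int for any object implementing `__index__` (NumPy integer scalars do) and raises TypeError for floats.

```python
import operator


def to_python_int(value):
    """Return `value` as a built-in int (hypothetical helper).

    operator.index() accepts any object that implements __index__,
    which includes NumPy integer scalars such as numpy.int64, and
    raises TypeError for non-integer values such as floats.
    """
    return operator.index(value)


class FakeInt64:
    """Stand-in for numpy.int64 so this sketch runs without NumPy."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


# Assumption: the batch size reaches the counter as a NumPy scalar,
# for example because it was read from an array shape or a reduction.
batch_size = FakeInt64(3367)
converted = to_python_int(batch_size)

# With Beam, the converted value would then be passed to the counter,
# e.g. counter.inc(to_python_int(batch_size)) instead of counter.inc(batch_size).
print(type(converted).__name__, converted)
```

Where exactly the num_instances counter is incremented in TFMA is not shown in this thread, so the call site still has to be located before a conversion like this can be applied.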