Closed fschvart closed 2 years ago
Hi! I'm facing the same issue with the evaluation of the motion-deeplab model on the KITTI-STEP dataset.
I ran the command below, and model evaluation has been stuck there ever since.
python deeplab2/trainer/train.py --config_file=deeplab2/configs/kitti/motion_deeplab/resnet50_os32.textproto --mode=eval --model_dir=/mnt/nasfolder/ishan/experiments/motion_deeplab/trial/ --num_gpus=1
@ClaireXie @aquariusjay @joe-siyuan-qiao
Thanks for your interest in this project!
@fschvart The error was likely due to mismatches between the configurations and the dataset ground-truth labels. Could you please take a look at them and make sure the config file properly specifies the number of classes and the dataset labels are all within the valid range? As the error happened when computing the evaluation metric, it will disappear after the settings are corrected.
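A quick way to verify the config/label consistency described above is to scan each ground-truth map for values outside `[0, num_classes)`. The sketch below is illustrative and not part of deeplab2; the function name and the use of 255 as the ignore label are assumptions.

```python
import numpy as np

def out_of_range_labels(label_map, num_classes, ignore_label=255):
    """Return the label values that would trip the confusion-matrix assert."""
    values = np.unique(label_map)
    return values[(values >= num_classes) & (values != ignore_label)]

# A map containing class ids 0..4 is invalid when the config declares only 4 classes.
semantic = np.array([[0, 1, 2], [3, 4, 255]])
print(out_of_range_labels(semantic, num_classes=4))  # -> [4]
print(out_of_range_labels(semantic, num_classes=5))  # -> []
```

Running a check like this over every ground-truth file before evaluation will surface any label that exceeds the configured number of classes.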
@kernelpanic77 I think you are seeing a different issue. Please take a look at the GPU status to check the device usage. If the program is truly stuck (i.e., very low GPU/CPU usage after a long time), it's likely a bug in the TensorFlow setup. Could you please follow the solution to manually transpose the tensors for tf.where
in post_processor/panoptic_deeplab.py? It should clear the layout error in your shared screenshot and hopefully fix the bug.
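The idea behind that workaround, roughly: keep the condition and value tensors in one explicit layout before the element-wise select instead of letting the GPU layout optimizer permute them. This is a sketch only, with NumPy standing in for TensorFlow and made-up shapes:

```python
import numpy as np

# An NCHW tensor (batch, channels, height, width), as the GPU layout optimizer
# may produce, transposed back to an explicit NHWC layout before the select.
x_nchw = np.arange(24, dtype=np.float32).reshape(1, 2, 3, 4)  # (N, C, H, W)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))                   # (N, H, W, C)

mask = x_nhwc > 10.0
selected = np.where(mask, x_nhwc, 0.0)  # element-wise select in a fixed layout
```

With both operands in the same explicit layout, the "Size of values 3 does not match size of permutation 4" style of layout error should not arise.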
Hope the above information helps.
Cheers!
Sorry for the delayed response. The issue was that I had multiple instances labelling the same area so when I summed all the instances of an image to create a binary semantic mask, I had an area that got a '2' label, which threw me out of bounds.
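That failure mode can be shown in a few lines: merging overlapping instance masks with a logical OR keeps the semantic mask binary, whereas summing lets overlaps escape the [0, 1] range. The arrays here are toy examples.

```python
import numpy as np

# Two instance masks of the same class that overlap in one pixel.
inst_a = np.array([[1, 1, 0],
                   [0, 0, 0]], dtype=np.uint8)
inst_b = np.array([[0, 1, 1],
                   [0, 0, 0]], dtype=np.uint8)

summed = inst_a + inst_b                                 # overlap becomes 2 -> out of bounds
binary = np.logical_or(inst_a, inst_b).astype(np.uint8)  # stays strictly 0/1
print(summed.max(), binary.max())  # -> 2 1
```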
@kernelpanic77 Hey, could you please update the repository (and update to TF 2.6) or patch this PR https://github.com/google-research/deeplab2/pull/107 manually? This should fix the layout bug. Let me know if that solves your issue, as even though I encountered the same error message, evaluation still worked fine for me.
For me, this bug is not fixed.
I have a background class in my images that includes all pixels that are not part of any other class. Ignoring the background class resolves the error, but I need to detect this class as well.
I got the following error:
`I0819 23:08:02.059501 140155675887424 controller.py:282] eval | step: 22400 | running complete evaluation...
eval | step: 22400 | running complete evaluation...
Traceback (most recent call last):
File "trainer/train.py", line 76, in
Detected at node 'confusion_matrix/assert_less/Assert/AssertGuard/Assert' defined at (most recent call last):
File "trainer/train.py", line 76, in
(0) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [4 4 4...] [y (confusion_matrix/Cast_2:0) = ] [4]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
[[DeepLabFamilyLoss/MaXDeepLabLoss/find_augmenting_path/while/body/_59/DeepLabFamilyLoss/MaXDeepLabLoss/find_augmenting_path/while/add_1/x/_498]]
(1) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [4 4 4...] [y (confusion_matrix/Cast_2:0) = ] [4]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_eval_step_197849]
`
Had a dataset bug, was supposed to have only 0s and 1s and had an accidental 2
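A stray value like that is easy to catch before training with a dataset-wide audit of unique label values. The tiny in-memory masks below stand in for files loaded from disk:

```python
import numpy as np

# Stand-ins for decoded ground-truth masks; the second one carries a stray 2.
masks = [
    np.array([[0, 1], [1, 0]], dtype=np.uint8),
    np.array([[0, 2], [1, 1]], dtype=np.uint8),
]

values = np.unique(np.concatenate([m.ravel() for m in masks]))
unexpected = values[values > 1]  # anything outside the expected {0, 1}
print(values, unexpected)  # -> [0 1 2] [2]
```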
Hi,
I tried both running eval (after training) and running train_and_eval, and got the same error in both cases (after training, of course).
I use a custom dataset of 2,500 images with panoptic annotations. Training ran without errors (I can't tell yet how good it was). I edited the COCO dataset file and use only 2 labels (background and person).
I'm using Windows 10 and an RTX 3090. Is there something I forgot to change in the settings?
I'd really appreciate your help!
This is the error I get:
I0701 20:15:52.247402 1128 controller.py:276] eval | step: 5000 | running complete evaluation...
eval | step: 5000 | running complete evaluation...
I0701 20:15:53.003024 1128 api.py:459] Eval with scales ListWrapper([1.0])
I0701 20:15:53.006016 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0701 20:15:53.007014 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0701 20:15:53.008012 1128 api.py:459] Eval scale 1.0; setting pooling size to [68, 121]
I0701 20:15:53.969449 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0701 20:15:53.971444 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
2022-07-01 20:15:55.516803: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:903] layout failed: INVALID_ARGUMENT: Size of values 3 does not match size of permutation 4 @ fanin shape inDeepLab/PostProcessor/StatefulPartitionedCall/while/body/_85/while/SelectV2_1-1-TransposeNHWCToNCHW-LayoutOptimizer
2022-07-01 20:16:00.112171: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Traceback (most recent call last):
File "C:\deeplab\deeplab2\trainer\train.py", line 78, in
app.run(main)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\deeplab\deeplab2\trainer\train.py", line 73, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "c:\deeplab\deeplab2\trainer\train_lib.py", line 194, in run_experiment
controller.train_and_evaluate(
File "c:\deeplab\deeplab2\orbit\controller.py", line 332, in train_and_evaluate
self.evaluate(steps=eval_steps)
File "c:\deeplab\deeplab2\orbit\controller.py", line 281, in evaluate
eval_output = self.evaluator.evaluate(steps_tensor)
File "c:\deeplab\deeplab2\orbit\standard_runner.py", line 346, in evaluate
outputs = self._eval_loop_fn(
File "c:\deeplab\deeplab2\orbit\utils\loop_fns.py", line 75, in loop_fn
outputs = step_fn(iterator)
File "C:\deeplab\venv\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\deeplab\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'confusion_matrix/assert_less/Assert/AssertGuard/Assert' defined at (most recent call last):
File "C:\deeplab\deeplab2\trainer\train.py", line 78, in
app.run(main)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\deeplab\deeplab2\trainer\train.py", line 73, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "c:\deeplab\deeplab2\trainer\train_lib.py", line 194, in run_experiment
controller.train_and_evaluate(
File "c:\deeplab\deeplab2\orbit\controller.py", line 332, in train_and_evaluate
self.evaluate(steps=eval_steps)
File "c:\deeplab\deeplab2\orbit\controller.py", line 281, in evaluate
eval_output = self.evaluator.evaluate(steps_tensor)
File "c:\deeplab\deeplab2\orbit\standard_runner.py", line 346, in evaluate
outputs = self._eval_loop_fn(
File "c:\deeplab\deeplab2\orbit\utils\loop_fns.py", line 75, in loop_fn
outputs = step_fn(iterator)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 181, in eval_step
distributed_outputs = self._strategy.run(step_fn, args=(next(iterator),))
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 178, in step_fn
step_outputs = self._eval_step(inputs)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 199, in _eval_step
if self._decode_groundtruth_label:
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 214, in _eval_step
self._eval_iou_metric.update_state(
File "C:\deeplab\venv\lib\site-packages\keras\utils\metrics_utils.py", line 70, in decorated
update_op = update_state_fn(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\base_metric.py", line 140, in update_state_fn
return ag_update_state(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\metrics.py", line 2494, in update_state
current_cm = tf.math.confusion_matrix(
Node: 'confusion_matrix/assert_less/Assert/AssertGuard/Assert'
Detected at node 'confusion_matrix/assert_less/Assert/AssertGuard/Assert' defined at (most recent call last):
File "C:\deeplab\deeplab2\trainer\train.py", line 78, in
app.run(main)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\deeplab\deeplab2\trainer\train.py", line 73, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "c:\deeplab\deeplab2\trainer\train_lib.py", line 194, in run_experiment
controller.train_and_evaluate(
File "c:\deeplab\deeplab2\orbit\controller.py", line 332, in train_and_evaluate
self.evaluate(steps=eval_steps)
File "c:\deeplab\deeplab2\orbit\controller.py", line 281, in evaluate
eval_output = self.evaluator.evaluate(steps_tensor)
File "c:\deeplab\deeplab2\orbit\standard_runner.py", line 346, in evaluate
outputs = self._eval_loop_fn(
File "c:\deeplab\deeplab2\orbit\utils\loop_fns.py", line 75, in loop_fn
outputs = step_fn(iterator)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 181, in eval_step
distributed_outputs = self._strategy.run(step_fn, args=(next(iterator),))
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 178, in step_fn
step_outputs = self._eval_step(inputs)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 199, in _eval_step
if self._decode_groundtruth_label:
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 214, in _eval_step
self._eval_iou_metric.update_state(
File "C:\deeplab\venv\lib\site-packages\keras\utils\metrics_utils.py", line 70, in decorated
update_op = update_state_fn(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\base_metric.py", line 140, in update_state_fn
return ag_update_state(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\metrics.py", line 2494, in update_state
current_cm = tf.math.confusion_matrix(
Node: 'confusion_matrix/assert_less/Assert/AssertGuard/Assert'
2 root error(s) found.
(0) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (confusion_matrix/Cast_2:0) = ] [2]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
[[DeepLab/PostProcessor/StatefulPartitionedCall/PartitionedCall/while_1/body/_299/while_1/cond_1/then/_611/while_1/cond_1/cond_1/then/_700/while_1/cond_1/cond_1/while/loop_counter/_202]]
(1) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (confusion_matrix/Cast_2:0) = ] [2]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_eval_step_77340]