Closed fschvart closed 2 years ago
Hi! I'm facing the same issue with the evaluation of the motion-deeplab model on the KITTI-STEP dataset.
I ran the command below, and model evaluation has been stuck there ever since.
python deeplab2/trainer/train.py --config_file=deeplab2/configs/kitti/motion_deeplab/resnet50_os32.textproto --mode=eval --model_dir=/mnt/nasfolder/ishan/experiments/motion_deeplab/trial/ --num_gpus=1
@ClaireXie @aquariusjay @joe-siyuan-qiao
Thanks for your interest in this project!
@fschvart The error was likely due to mismatches between the configurations and the dataset ground-truth labels. Could you please take a look at them and make sure the config file properly specifies the number of classes and the dataset labels are all within the valid range? As the error happened when computing the evaluation metric, it will disappear after the settings are corrected.
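A quick way to verify the config/label consistency described above is to scan each ground-truth map for values outside `[0, num_classes)`. The sketch below is illustrative and not part of deeplab2; the function name and the use of 255 as the ignore label are assumptions.

```python
import numpy as np

def out_of_range_labels(label_map, num_classes, ignore_label=255):
    """Return the label values that would trip the confusion-matrix assert."""
    values = np.unique(label_map)
    return values[(values >= num_classes) & (values != ignore_label)]

# A map containing class ids 0..4 is invalid when the config declares only 4 classes.
semantic = np.array([[0, 1, 2], [3, 4, 255]])
print(out_of_range_labels(semantic, num_classes=4))  # -> [4]
print(out_of_range_labels(semantic, num_classes=5))  # -> []
```

Running a check like this over every ground-truth file before evaluation will surface any label that exceeds the configured number of classes.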
@kernelpanic77 I think you are seeing a different issue. Please take a look at the GPU status to check the device usage. If the program is truly stuck (i.e., very low GPU/CPU usage after a long time), it's likely a bug in the TensorFlow setup. Could you please follow the solution to manually transpose the tensors for tf.where
in post_processor/panoptic_deeplab.py? It should clear the layout error in your shared screenshot and hopefully fix the bug.
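The idea behind that workaround, roughly: keep the condition and value tensors in one explicit layout before the element-wise select instead of letting the GPU layout optimizer permute them. This is a sketch only, with NumPy standing in for TensorFlow and made-up shapes:

```python
import numpy as np

# An NCHW tensor (batch, channels, height, width), as the GPU layout optimizer
# may produce, transposed back to an explicit NHWC layout before the select.
x_nchw = np.arange(24, dtype=np.float32).reshape(1, 2, 3, 4)  # (N, C, H, W)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))                   # (N, H, W, C)

mask = x_nhwc > 10.0
selected = np.where(mask, x_nhwc, 0.0)  # element-wise select in a fixed layout
```

With both operands in the same explicit layout, the "Size of values 3 does not match size of permutation 4" style of layout error should not arise.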
Hope the above information helps.
Cheers!
Sorry for the delayed response. The issue was that I had multiple instances labelling the same area so when I summed all the instances of an image to create a binary semantic mask, I had an area that got a '2' label, which threw me out of bounds.
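That failure mode can be shown in a few lines: merging overlapping instance masks with a logical OR keeps the semantic mask binary, whereas summing lets overlaps escape the [0, 1] range. The arrays here are toy examples.

```python
import numpy as np

# Two instance masks of the same class that overlap in one pixel.
inst_a = np.array([[1, 1, 0],
                   [0, 0, 0]], dtype=np.uint8)
inst_b = np.array([[0, 1, 1],
                   [0, 0, 0]], dtype=np.uint8)

summed = inst_a + inst_b                                 # overlap becomes 2 -> out of bounds
binary = np.logical_or(inst_a, inst_b).astype(np.uint8)  # stays strictly 0/1
print(summed.max(), binary.max())  # -> 2 1
```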
@kernelpanic77 Hey, could you please update the repository (and update to TF 2.6) or patch this PR https://github.com/google-research/deeplab2/pull/107 manually? This should fix the layout bug. Let me know if that solves your issue, as even though I encountered the same error message, evaluation still worked fine for me.
For me, this bug is not fixed.
I have a background class in my images that includes all pixels that are not part of any other class. Ignoring the background class resolves the error, but I need to detect this class as well.
I got the following error:
`I0819 23:08:02.059501 140155675887424 controller.py:282] eval | step: 22400 | running complete evaluation...
eval | step: 22400 | running complete evaluation...
Traceback (most recent call last):
File "trainer/train.py", line 76, in
Detected at node 'confusion_matrix/assert_less/Assert/AssertGuard/Assert' defined at (most recent call last):
File "trainer/train.py", line 76, in
(0) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [4 4 4...] [y (confusion_matrix/Cast_2:0) = ] [4]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
[[DeepLabFamilyLoss/MaXDeepLabLoss/find_augmenting_path/while/body/_59/DeepLabFamilyLoss/MaXDeepLabLoss/find_augmenting_path/while/add_1/x/_498]]
(1) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [4 4 4...] [y (confusion_matrix/Cast_2:0) = ] [4]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_eval_step_197849]
`
Had a dataset bug, was supposed to have only 0s and 1s and had an accidental 2
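A stray value like that is easy to catch before training with a dataset-wide audit of unique label values. The tiny in-memory masks below stand in for files loaded from disk:

```python
import numpy as np

# Stand-ins for decoded ground-truth masks; the second one carries a stray 2.
masks = [
    np.array([[0, 1], [1, 0]], dtype=np.uint8),
    np.array([[0, 2], [1, 1]], dtype=np.uint8),
]

values = np.unique(np.concatenate([m.ravel() for m in masks]))
unexpected = values[values > 1]  # anything outside the expected {0, 1}
print(values, unexpected)  # -> [0 1 2] [2]
```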
Hi,
I tried both running eval (after training) and running train_and_eval, and got the same error in both cases (after training, of course).
I use a custom dataset of 2,500 images with panoptic annotations. Training ran without errors (I can't tell yet how good it was). I edited the COCO dataset file and use only 2 labels (background and person).
I'm using Windows 10 and an RTX 3090. Is there something I forgot to change in the settings?
I'd really appreciate your help!
This is the error I get:
I0701 20:15:52.247402 1128 controller.py:276] eval | step: 5000 | running complete evaluation...
eval | step: 5000 | running complete evaluation...
I0701 20:15:53.003024 1128 api.py:459] Eval with scales ListWrapper([1.0])
I0701 20:15:53.006016 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0701 20:15:53.007014 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0701 20:15:53.008012 1128 api.py:459] Eval scale 1.0; setting pooling size to [68, 121]
I0701 20:15:53.969449 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0701 20:15:53.971444 1128 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
2022-07-01 20:15:55.516803: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:903] layout failed: INVALID_ARGUMENT: Size of values 3 does not match size of permutation 4 @ fanin shape inDeepLab/PostProcessor/StatefulPartitionedCall/while/body/_85/while/SelectV2_1-1-TransposeNHWCToNCHW-LayoutOptimizer
2022-07-01 20:16:00.112171: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Traceback (most recent call last):
File "C:\deeplab\deeplab2\trainer\train.py", line 78, in
app.run(main)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\deeplab\deeplab2\trainer\train.py", line 73, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "c:\deeplab\deeplab2\trainer\train_lib.py", line 194, in run_experiment
controller.train_and_evaluate(
File "c:\deeplab\deeplab2\orbit\controller.py", line 332, in train_and_evaluate
self.evaluate(steps=eval_steps)
File "c:\deeplab\deeplab2\orbit\controller.py", line 281, in evaluate
eval_output = self.evaluator.evaluate(steps_tensor)
File "c:\deeplab\deeplab2\orbit\standard_runner.py", line 346, in evaluate
outputs = self._eval_loop_fn(
File "c:\deeplab\deeplab2\orbit\utils\loop_fns.py", line 75, in loop_fn
outputs = step_fn(iterator)
File "C:\deeplab\venv\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\deeplab\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'confusion_matrix/assert_less/Assert/AssertGuard/Assert' defined at (most recent call last):
File "C:\deeplab\deeplab2\trainer\train.py", line 78, in
app.run(main)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\deeplab\deeplab2\trainer\train.py", line 73, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "c:\deeplab\deeplab2\trainer\train_lib.py", line 194, in run_experiment
controller.train_and_evaluate(
File "c:\deeplab\deeplab2\orbit\controller.py", line 332, in train_and_evaluate
self.evaluate(steps=eval_steps)
File "c:\deeplab\deeplab2\orbit\controller.py", line 281, in evaluate
eval_output = self.evaluator.evaluate(steps_tensor)
File "c:\deeplab\deeplab2\orbit\standard_runner.py", line 346, in evaluate
outputs = self._eval_loop_fn(
File "c:\deeplab\deeplab2\orbit\utils\loop_fns.py", line 75, in loop_fn
outputs = step_fn(iterator)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 181, in eval_step
distributed_outputs = self._strategy.run(step_fn, args=(next(iterator),))
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 178, in step_fn
step_outputs = self._eval_step(inputs)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 199, in _eval_step
if self._decode_groundtruth_label:
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 214, in _eval_step
self._eval_iou_metric.update_state(
File "C:\deeplab\venv\lib\site-packages\keras\utils\metrics_utils.py", line 70, in decorated
update_op = update_state_fn(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\base_metric.py", line 140, in update_state_fn
return ag_update_state(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\metrics.py", line 2494, in update_state
current_cm = tf.math.confusion_matrix(
Node: 'confusion_matrix/assert_less/Assert/AssertGuard/Assert'
Detected at node 'confusion_matrix/assert_less/Assert/AssertGuard/Assert' defined at (most recent call last):
File "C:\deeplab\deeplab2\trainer\train.py", line 78, in
app.run(main)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\deeplab\venv\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\deeplab\deeplab2\trainer\train.py", line 73, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "c:\deeplab\deeplab2\trainer\train_lib.py", line 194, in run_experiment
controller.train_and_evaluate(
File "c:\deeplab\deeplab2\orbit\controller.py", line 332, in train_and_evaluate
self.evaluate(steps=eval_steps)
File "c:\deeplab\deeplab2\orbit\controller.py", line 281, in evaluate
eval_output = self.evaluator.evaluate(steps_tensor)
File "c:\deeplab\deeplab2\orbit\standard_runner.py", line 346, in evaluate
outputs = self._eval_loop_fn(
File "c:\deeplab\deeplab2\orbit\utils\loop_fns.py", line 75, in loop_fn
outputs = step_fn(iterator)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 181, in eval_step
distributed_outputs = self._strategy.run(step_fn, args=(next(iterator),))
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 178, in step_fn
step_outputs = self._eval_step(inputs)
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 199, in _eval_step
if self._decode_groundtruth_label:
File "c:\deeplab\deeplab2\trainer\evaluator.py", line 214, in _eval_step
self._eval_iou_metric.update_state(
File "C:\deeplab\venv\lib\site-packages\keras\utils\metrics_utils.py", line 70, in decorated
update_op = update_state_fn(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\base_metric.py", line 140, in update_state_fn
return ag_update_state(*args, **kwargs)
File "C:\deeplab\venv\lib\site-packages\keras\metrics\metrics.py", line 2494, in update_state
current_cm = tf.math.confusion_matrix(
Node: 'confusion_matrix/assert_less/Assert/AssertGuard/Assert'
2 root error(s) found.
(0) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (confusion_matrix/Cast_2:0) = ] [2]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
[[DeepLab/PostProcessor/StatefulPartitionedCall/PartitionedCall/while_1/body/_299/while_1/cond_1/then/_611/while_1/cond_1/cond_1/then/_700/while_1/cond_1/cond_1/while/loop_counter/_202]]
(1) INVALID_ARGUMENT: assertion failed: [labels
out of bound] [Condition x < y did not hold element-wise:] [x (confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (confusion_matrix/Cast_2:0) = ] [2]
[[{{node confusion_matrix/assert_less/Assert/AssertGuard/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_eval_step_77340]