tensorflow / models

Models and examples built with TensorFlow

mask rcnn inception resnet v2 error in train model #9527

Open mashalahmad opened 3 years ago

mashalahmad commented 3 years ago

Has anybody faced this issue while training Mask R-CNN Inception ResNet v2? Here is my pipeline.config:

`# Mask R-CNN with Inception Resnet v2 (no atrous)
# Sync-trained on COCO (with 8 GPUs) with batch size 16 (1024x1024 resolution)
# Initialized from Imagenet classification checkpoint
# TF2-Compatible, Not TPU-Compatible
#
# Achieves XXX mAP on COCO

model {
  faster_rcnn {
    number_of_stages: 3
    num_classes: 3
    image_resizer {
      fixed_shape_resizer {
        height: 1024
        width: 1024
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2_keras'
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        mask_height: 33
        mask_width: 33
        mask_prediction_conv_depth: 0
        mask_prediction_num_conv_layers: 4
        conv_hyperparams {
          op: CONV
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.01
            }
          }
        }
        predict_instance_masks: true
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_mask_prediction_loss_weight: 4.0
    resize_masks: false
  }
}

train_config: {
  batch_size: 5
  num_steps: 200000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 0.008
          total_steps: 200000
          warmup_learning_rate: 0.0
          warmup_steps: 5000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "path/to/mymodel/checkpoint"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
  load_instance_masks: true
  mask_type: PNG_MASKS
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  metrics_set: "coco_mask_metrics"
  eval_instance_masks: true
  use_moving_averages: false
  batch_size: 1
  include_metrics_per_category: true
}

eval_input_reader: {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
  load_instance_masks: true
  mask_type: PNG_MASKS
}`

But I'm getting this error:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "/usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/model_lib_v2.py", line 566, in train_loop
    unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/model_lib_v2.py", line 344, in load_fine_tune_checkpoint
    features, labels = iter(input_dataset).next()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 645, in next
    return self.__next__()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 649, in __next__
    return self.get_next()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 694, in get_next
    self._iterators[i].get_next_as_list_static_shapes(new_name))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1474, in get_next_as_list_static_shapes
    return self._iterator.get_next()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 581, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 825, in get_next
    return self._next_internal()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 764, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2105, in execution_mode
    executor_new.wait()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py", line 67, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0] = 0 is not in [0, 0)
  [[{{node GatherV2_7}}]]
  [[MultiDeviceIteratorGetNextFromShard]]
  [[RemoteCall]]
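A note on reading the final error line: `indices[0] = 0 is not in [0, 0)` is what a gather operation reports when it is asked for element 0 of a tensor whose first dimension has length 0. The `MultiDeviceIteratorGetNextFromShard` frame suggests the empty tensor is some per-image field produced by the input pipeline, not by the model itself. The thread does not establish which field is empty. One possibility, stated here purely as an assumption, is a record that has boxes but no instance masks while `load_instance_masks: true` is set. The sketch below is plain Python with two hypothetical helpers (`gather` and `check_example`, neither from TensorFlow nor the Object Detection API); the feature keys `image/object/bbox/xmin` and `image/object/mask` are assumed to match how the TFRecord was written.

```python
def gather(values, indices):
    """Hypothetical stand-in that mimics a gather op's bounds check."""
    n = len(values)
    out = []
    for pos, idx in enumerate(indices):
        if not 0 <= idx < n:
            # Same wording as the message in the traceback above.
            raise IndexError(f"indices[{pos}] = {idx} is not in [0, {n})")
        out.append(values[idx])
    return out


def check_example(features):
    """Compare box and mask counts for one record.

    `features` is a plain dict mapping feature keys to Python lists,
    e.g. values copied out of a parsed record (keys are assumptions).
    """
    n_boxes = len(features.get("image/object/bbox/xmin", []))
    n_masks = len(features.get("image/object/mask", []))
    problems = []
    if n_boxes > 0 and n_masks == 0:
        problems.append("boxes present but no masks")
    elif n_boxes != n_masks:
        problems.append(f"{n_boxes} boxes vs {n_masks} masks")
    return problems


# Made-up records for illustration only.
good = {"image/object/bbox/xmin": [0.1, 0.4],
        "image/object/mask": [b"png0", b"png1"]}
bad = {"image/object/bbox/xmin": [0.1],
       "image/object/mask": []}

print(check_example(good))  # []
print(check_example(bad))   # ['boxes present but no masks']

try:
    gather(bad["image/object/mask"], [0])
except IndexError as err:
    print(err)              # indices[0] = 0 is not in [0, 0)
```

If every record passes a check of this kind, the empty field is something else, and the mask hypothesis can be ruled out.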

kakkaranupam commented 3 years ago

Hi, is there any update on this issue? I am facing the same error while trying to train a Mask R-CNN Inception ResNet v2 on a custom dataset.

VeeranjaneyuluToka commented 3 years ago

Hi, is there any update on this issue? I am also facing exactly the same issue.

maximdorogov commented 2 years ago

Hello, I'm having the same problem. Has anyone fixed this?

aino-gautam commented 1 year ago

I am facing the exact same error:

python3 model_main_tf2.py --model_dir=models/ark_mrcnn_iresnet_v2 --pipeline_config_path=models/ark_mrcnn_iresnet_v2/pipeline.config
2023-09-12 20:29:03.291482: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-12 20:29:03.950341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-09-12 20:29:05.294629: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.313257: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.313438: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.396335: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.396513: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.396630: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.396707: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:227] Using CUDA malloc Async allocator for GPU: 0
2023-09-12 20:29:05.396776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2187 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5
2023-09-12 20:29:05.398224: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.398369: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.398482: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.398876: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.398995: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.399104: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.399241: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.399352: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-12 20:29:05.399428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2187 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0912 20:29:05.400160 139957487306560 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0912 20:29:05.420384 139957487306560 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0912 20:29:05.420483 139957487306560 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0912 20:29:05.447323 139957487306560 deprecation.py:364] From /home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I0912 20:29:05.452799 139957487306560 dataset_builder.py:162] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I0912 20:29:05.452907 139957487306560 dataset_builder.py:79] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0912 20:29:05.452954 139957487306560 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0912 20:29:05.452991 139957487306560 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic.
W0912 20:29:05.457518 139957487306560 deprecation.py:364] From /home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic.
WARNING:tensorflow:From /home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
W0912 20:29:05.470539 139957487306560 deprecation.py:364] From /home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
WARNING:tensorflow:From /home/dga/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0912 20:29:06.681802 139957487306560 deprecation.py:569] From /home/dga/.local/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /home/dga/.local/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W0912 20:29:10.030629 139957487306560 deprecation.py:364] From /home/dga/.local/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /home/dga/.local/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0912 20:29:11.330377 139957487306560 deprecation.py:364] From /home/dga/.local/lib/python3.10/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Traceback (most recent call last):
  File "/home/dga/ml-git/conda-tf/TensorFlow/workspace/ark_deba_iteration_2/model_main_tf2.py", line 123, in <module>
    tf.compat.v1.app.run()
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/dga/.local/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/dga/.local/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/dga/ml-git/conda-tf/TensorFlow/workspace/ark_deba_iteration_2/model_main_tf2.py", line 114, in main
    model_lib_v2.train_loop(
  File "/home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "/home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/home/dga/miniconda3/envs/tf-py/lib/python3.10/site-packages/object_detection/model_lib_v2.py", line 161, in _ensure_model_is_built
    features, labels = iter(input_dataset).next()
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/distribute/input_lib.py", line 260, in next
    return self.__next__()
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
    return self.get_next()
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/distribute/input_lib.py", line 325, in get_next
    return self._get_next_no_partial_batch_handling(name)
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/distribute/input_lib.py", line 361, in _get_next_no_partial_batch_handling
    replicas.extend(self._iterators[i].get_next_as_list(new_name))
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/distribute/input_lib.py", line 1427, in get_next_as_list
    return self._format_data_list_with_options(self._iterator.get_next())
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 553, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 867, in get_next
    return self._next_internal()
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 777, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/dga/.local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6656, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
  [[{{node GatherV2_7}}]]
  [[MultiDeviceIteratorGetNextFromShard]]
  [[RemoteCall]] [Op:IteratorGetNext] name:

I have tried on a Mac M2 and on Ubuntu 22.04 with a GTX 1650 GPU. I get the same error on both, whereas I can train other models in the same environment. Please help.

rg4352 commented 1 year ago

Any updates on a solution, or on how anyone here solved it? I am facing the same issue.