```
2018-10-14 15:22:48.265058: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***x****
2018-10-14 15:22:48.265078: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at conv_ops.cc:398 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2018-10-14 15:22:58.265240: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 13.29MiB. Current allocation summary follows.
2018-10-14 15:22:58.265356: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 1412, Chunks in use: 1406. 353.0KiB allocated for chunks. 351.5KiB in use in bin. 37.0KiB client-requested in use in bin.
2018-10-14 15:22:58.265382: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 263, Chunks in use: 251. 186.2KiB allocated for chunks. 178.5KiB in use in bin. 151.8KiB client-requested in use in bin.
2018-10-14 15:22:58.265391: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (1024): Total Chunks: 341, Chunks in use: 268. 507.8KiB allocated for chunks. 397.5KiB in use in bin. 346.6KiB client-requested in use in bin.
2018-10-14 15:22:58.265412: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (2048): Total Chunks: 346, Chunks in use: 336. 951.5KiB allocated for chunks. 923.5KiB in use in bin. 882.0KiB client-requested in use in bin.
2018-10-14 15:22:58.265433: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (4096): Total Chunks: 94, Chunks in use: 92. 564.0KiB allocated for chunks. 555.5KiB in use in bin. 546.3KiB client-requested in use in bin.
2018-10-14 15:22:58.265455: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (8192): Total Chunks: 102, Chunks in use: 102. 1.19MiB allocated for chunks. 1.19MiB in use in bin. 1.17MiB client-requested in use in bin.
2018-10-14 15:22:58.265462: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (16384): Total Chunks: 158, Chunks in use: 158. 3.33MiB allocated for chunks. 3.33MiB in use in bin. 3.30MiB client-requested in use in bin.
2018-10-14 15:22:58.265483: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (32768): Total Chunks: 65, Chunks in use: 63. 2.30MiB allocated for chunks. 2.20MiB in use in bin. 2.18MiB client-requested in use in bin.
2018-10-14 15:22:58.265503: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (65536): Total Chunks: 185, Chunks in use: 184. 17.11MiB allocated for chunks. 17.02MiB in use in bin. 17.01MiB client-requested in use in bin.
2018-10-14 15:22:58.265509: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (131072): Total Chunks: 50, Chunks in use: 50. 8.28MiB allocated for chunks. 8.28MiB in use in bin. 8.23MiB client-requested in use in bin.
2018-10-14 15:22:58.265529: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (262144): Total Chunks: 142, Chunks in use: 142. 60.96MiB allocated for chunks. 60.96MiB in use in bin. 60.96MiB client-requested in use in bin.
2018-10-14 15:22:58.265548: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (524288): Total Chunks: 25, Chunks in use: 25. 16.22MiB allocated for chunks. 16.22MiB in use in bin. 16.22MiB client-requested in use in bin.
2018-10-14 15:22:58.265554: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (1048576): Total Chunks: 147, Chunks in use: 147. 251.51MiB allocated for chunks. 251.51MiB in use in bin. 251.51MiB client-requested in use in bin.
2018-10-14 15:22:58.265581: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (2097152): Total Chunks: 25, Chunks in use: 23. 65.92MiB allocated for chunks. 59.80MiB in use in bin. 58.57MiB client-requested in use in bin.
2018-10-14 15:22:58.265588: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (4194304): Total Chunks: 184, Chunks in use: 180. 1.29GiB allocated for chunks. 1.26GiB in use in bin. 1.26GiB client-requested in use in bin.
2018-10-14 15:22:58.265606: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (8388608): Total Chunks: 160, Chunks in use: 160. 2.16GiB allocated for chunks. 2.16GiB in use in bin. 2.06GiB client-requested in use in bin.
2018-10-14 15:22:58.265625: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (16777216): Total Chunks: 18, Chunks in use: 18. 474.96MiB allocated for chunks. 474.96MiB in use in bin. 433.52MiB client-requested in use in bin.
2018-10-14 15:22:58.265645: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (33554432): Total Chunks: 18, Chunks in use: 18. 872.66MiB allocated for chunks. 872.66MiB in use in bin. 789.21MiB client-requested in use in bin.
2018-10-14 15:22:58.265652: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (67108864): Total Chunks: 9, Chunks in use: 9. 741.14MiB allocated for chunks. 741.14MiB in use in bin. 700.45MiB client-requested in use in bin.
2018-10-14 15:22:58.265674: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (134217728): Total Chunks: 3, Chunks in use: 3. 415.88MiB allocated for chunks. 415.88MiB in use in bin. 415.88MiB client-requested in use in bin.
2018-10-14 15:22:58.265682: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (268435456): Total Chunks: 2, Chunks in use: 2. 665.38MiB allocated for chunks. 665.38MiB in use in bin. 568.97MiB client-requested in use in bin.
2018-10-14 15:22:58.265688: I tensorflow/core/common_runtime/bfc_allocator.cc:646] Bin for 13.29MiB was 8.00MiB, Chunk State:
2018-10-14 15:22:58.265695: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7f1336000000 of size 1280
2018-10-14 15:22:58.265700: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7f1336000500 of size 256
2018-10-14 15:22:58.265705: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7f1336000600 of size 256
2018-10-14 15:22:58.265710: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7f1336000700 of size 256
2018-10-14 15:22:58.265715: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7f1336000800 of size 256
```
There are many more lines like the ones above, with "Chunk at ..." entries of much bigger sizes, e.g. 13934592.
```
2018-10-14 15:22:58.301907: I tensorflow/core/common_runtime/bfc_allocator.cc:671] Summary of in-use Chunks by size:
2018-10-14 15:22:58.301918: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1406 Chunks of size 256 totalling 351.5KiB
2018-10-14 15:22:58.301925: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 39 Chunks of size 512 totalling 19.5KiB
2018-10-14 15:22:58.301931: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 212 Chunks of size 768 totalling 159.0KiB
2018-10-14 15:22:58.301936: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 7 Chunks of size 1024 totalling 7.0KiB
2018-10-14 15:22:58.301941: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 5 Chunks of size 1280 totalling 6.2KiB
2018-10-14 15:22:58.301947: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 255 Chunks of size 1536 totalling 382.5KiB
2018-10-14 15:22:58.301952: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792 totalling 1.8KiB
2018-10-14 15:22:58.301957: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 2048 totalling 4.0KiB
2018-10-14 15:22:58.301963: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 330 Chunks of size 2816 totalling 907.5KiB
2018-10-14 15:22:58.301969: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 4 Chunks of size 3072 totalling 12.0KiB
2018-10-14 15:22:58.301974: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 6 Chunks of size 4352 totalling 25.5KiB
2018-10-14 15:22:58.301980: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 72 Chunks of size 6144 totalling 432.0KiB
2018-10-14 15:22:58.301985: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 14 Chunks of size 7168 totalling 98.0KiB
2018-10-14 15:22:58.301991: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 12 Chunks of size 8448 totalling 99.0KiB
2018-10-14 15:22:58.301997: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 9728 totalling 19.0KiB
2018-10-14 15:22:58.302002: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 10496 totalling 10.2KiB
2018-10-14 15:22:58.302008: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 75 Chunks of size 12288 totalling 900.0KiB
2018-10-14 15:22:58.302013: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 12 Chunks of size 16128 totalling 189.0KiB
2018-10-14 15:22:58.302019: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 8 Chunks of size 16640 totalling 130.0KiB
2018-10-14 15:22:58.302024: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 48 Chunks of size 16896 totalling 792.0KiB
2018-10-14 15:22:58.302030: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 4 Chunks of size 18944 totalling 74.0KiB
2018-10-14 15:22:58.302035: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 22272 totalling 21.8KiB
2018-10-14 15:22:58.302040: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 76 Chunks of size 24320 totalling 1.76MiB
2018-10-14 15:22:58.302046: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 20 Chunks of size 28416 totalling 555.0KiB
2018-10-14 15:22:58.302051: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 32256 totalling 31.5KiB
2018-10-14 15:22:58.302057: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 55 Chunks of size 33792 totalling 1.77MiB
2018-10-14 15:22:58.302062: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 44032 totalling 43.0KiB
2018-10-14 15:22:58.302068: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 6 Chunks of size 56576 totalling 331.5KiB
2018-10-14 15:22:58.302080: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 64512 totalling 63.0KiB
2018-10-14 15:22:58.302090: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 8 Chunks of size 66048 totalling 516.0KiB
2018-10-14 15:22:58.302101: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 56 Chunks of size 67328 totalling 3.60MiB
2018-10-14 15:22:58.302111: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 120 Chunks of size 112896 totalling 12.92MiB
2018-10-14 15:22:58.302121: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 7 Chunks of size 131840 totalling 901.2KiB
2018-10-14 15:22:58.302132: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 39 Chunks of size 175872 totalling 6.54MiB
2018-10-14 15:22:58.302143: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 4 Chunks of size 225792 totalling 882.0KiB
2018-10-14 15:22:58.302153: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 351488 totalling 686.5KiB
2018-10-14 15:22:58.302163: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 140 Chunks of size 451584 totalling 60.29MiB
2018-10-14 15:22:58.302172: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 22 Chunks of size 677376 totalling 14.21MiB
2018-10-14 15:22:58.302181: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 702720 totalling 2.01MiB
2018-10-14 15:22:58.302190: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 4 Chunks of size 1354752 totalling 5.17MiB
2018-10-14 15:22:58.302199: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 143 Chunks of size 1806336 totalling 246.34MiB
2018-10-14 15:22:58.302205: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 22 Chunks of size 2709504 totalling 56.85MiB
2018-10-14 15:22:58.302210: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 3098624 totalling 2.96MiB
2018-10-14 15:22:58.302216: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 5419008 totalling 15.50MiB
2018-10-14 15:22:58.302221: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 6464256 totalling 6.16MiB
2018-10-14 15:22:58.302229: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 4 Chunks of size 7225344 totalling 27.56MiB
2018-10-14 15:22:58.302235: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 172 Chunks of size 7560192 totalling 1.21GiB
2018-10-14 15:22:58.302241: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 18 Chunks of size 10838016 totalling 186.05MiB
2018-10-14 15:22:58.302246: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 11340032 totalling 10.81MiB
2018-10-14 15:22:58.302252: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 18 Chunks of size 13934592 totalling 239.20MiB
2018-10-14 15:22:58.302257: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 15119872 totalling 14.42MiB
2018-10-14 15:22:58.302263: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 121 Chunks of size 15120128 totalling 1.70GiB
2018-10-14 15:22:58.302268: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 16333056 totalling 15.58MiB
2018-10-14 15:22:58.302274: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 17280000 totalling 16.48MiB
2018-10-14 15:22:58.302279: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 19439872 totalling 18.54MiB
2018-10-14 15:22:58.302285: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 22680064 totalling 21.63MiB
2018-10-14 15:22:58.302290: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 22680320 totalling 21.63MiB
2018-10-14 15:22:58.302296: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 23029760 totalling 21.96MiB
2018-10-14 15:22:58.302301: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 30038528 totalling 28.65MiB
2018-10-14 15:22:58.302307: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 8 Chunks of size 30240000 totalling 230.71MiB
2018-10-14 15:22:58.302312: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 4 Chunks of size 30240256 totalling 115.36MiB
2018-10-14 15:22:58.302318: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 34560000 totalling 65.92MiB
2018-10-14 15:22:58.302323: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 5 Chunks of size 45360128 totalling 216.29MiB
2018-10-14 15:22:58.302329: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 45360384 totalling 43.26MiB
2018-10-14 15:22:58.302334: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 49717248 totalling 142.24MiB
2018-10-14 15:22:58.302340: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 59983616 totalling 57.20MiB
2018-10-14 15:22:58.302345: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 60278784 totalling 172.46MiB
2018-10-14 15:22:58.302351: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 60480000 totalling 57.68MiB
2018-10-14 15:22:58.302356: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 60480512 totalling 57.68MiB
2018-10-14 15:22:58.302362: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 62841088 totalling 59.93MiB
2018-10-14 15:22:58.302367: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 67780608 totalling 64.64MiB
2018-10-14 15:22:58.302373: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 74319872 totalling 70.88MiB
2018-10-14 15:22:58.302378: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 7 Chunks of size 90720000 totalling 605.62MiB
2018-10-14 15:22:58.302384: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 137779712 totalling 131.40MiB
2018-10-14 15:22:58.302389: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 149151744 totalling 284.48MiB
2018-10-14 15:22:58.302395: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 298303488 totalling 284.48MiB
2018-10-14 15:22:58.302400: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 399398144 totalling 380.90MiB
2018-10-14 15:22:58.302405: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 6.93GiB
```
For future reference, you may want to look at the GitHub Markdown Guide to ensure that your post is easily readable.
Wrapping large code blocks in ```triple backticks``` will allow maintainers to spend less time trying to make sense of the post and more time trying to help with your issue:
```
# some code
some more code
```
I am having a similar issue with a GTX 1080 Ti 11 GB; I am only able to train if I use a resolution of 600x600 px. Is that expected? Is there something I could do to train at 1200x1200? I also have only 16 GB of CPU memory; could that be causing the error?
Totally agree with @tasercake. @CBCBCBCBCBCBCBCB, would you please update your post by following the guide? Also, you don't need to copy and paste the entire error message; it's too long for others who want to help to read. The few important error lines are usually enough to debug the issue.
For 1200x1200 scale you do need more memory.
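To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The 13.29 MiB failed allocation and the 6.93 GiB in-use total are taken from the allocator dump above, and the 8 GB figure is the original poster's GTX 1070; the quadratic scaling with input resolution at the end is only an approximation, not a measurement.

```python
# Back-of-the-envelope check of the numbers in the allocator dump above.
# The tensor shape and the in-use total come from the log; the scaling
# rule at the end is only a rough approximation.

BYTES_PER_FLOAT32 = 4

def tensor_mib(shape):
    """Size in MiB of a float32 tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_FLOAT32 / 2**20

# The allocation that failed: a single activation of shape [64, 672, 9, 9].
print("failed allocation: %.2f MiB" % tensor_mib([64, 672, 9, 9]))  # ~13.29 MiB

# The allocator already had ~6.93 GiB of chunks in use on an 8 GB card,
# so even this relatively small tensor could no longer be placed.

# Activation memory grows roughly with input area, so resizing inputs from
# 1200x1200 down to 600x600 cuts that part of the footprint by about 4x.
print("area ratio 1200^2 / 600^2: %.1f" % ((1200 * 1200) / (600 * 600)))
```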
Closing this issue since it has been resolved. Feel free to reopen if you have any follow-up questions. Thanks!
System information
- What is the top-level directory of the model you are using: /home/user/Work
- Have I written custom code: No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): source
- TensorFlow version (use command below): 1.10.1
- Bazel version (if compiling from source): I don't use it
- CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.1
- GPU model and memory: Geforce GTX 1070
- Exact command to reproduce: No

############################################################
[Question]
I just tried to train "faster_rcnn_nas_coco" and got an error message (included below, after the configuration). I think the reason for the error is my GPU memory. I have a Geforce GTX 1070 with 8 GB. Do I need more GPU memory?

Thanks for reading this!
Here is my "faster_rcnn_nas_coco.config" code
```
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      # TODO(shlens): Only fixed_shape_resizer is currently supported for NASNet
    }
  }

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "faster_rcnn_nas_coco_2018_01_28/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
  }
  label_map_path: "data/object-detection.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/test.record"
  }
  label_map_path: "data/object-detection.pbtxt"
  shuffle: false
  num_readers: 1
}
```

############################################################
[Error log]

Here is my command and the error message. The message was too long, so I cut the front part; I posted some of that part in a comment.
cb@cb-B150M-DS3H:~/Work/models/research/object_detection$ python3 train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_nas_coco.config
2018-10-14 15:22:58.302414: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats: Limit: 7474652775 InUse: 7437627904 MaxInUse: 7437645056 NumAllocs: 10356 MaxAllocSize: 4076716032
2018-10-14 15:22:58.302586: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***x****
2018-10-14 15:22:58.302609: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at conv_ops.cc:398 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'SecondStageFeatureExtractor/cell_12/AvgPool_1', defined at:
File "train.py", line 184, in <module>
tf.app.run()
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 290, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/cb/Work/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 203, in _create_losses
prediction_dict = detection_model.predict(images, true_image_shapes)
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
self._anchors.get(), image_shape, true_image_shapes))
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 776, in _predict_second_stage
scope=self.second_stage_feature_extractor_scope))
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 187, in extract_box_classifier_features
return self._extract_box_classifier_features(proposal_feature_maps, scope)
File "/home/cb/Work/models/research/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 282, in _extract_box_classifier_features
start_cell_num=start_cell_num)
File "/home/cb/Work/models/research/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 99, in _build_nasnet_base
cell_num=true_cell_num)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 311, in call
net = self._cell_base(net, prev_layer)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 289, in _cell_base
prev_layer = self._reduce_prev_layer(prev_layer, net)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 276, in _reduce_prev_layer
prev_layer, curr_num_filters, stride=2)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 117, in factorized_reduction
path2, [1, 1, 1, 1], stride_spec, 'VALID', data_format=data_format)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 2124, in avg_pool
name=name)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 71, in avg_pool
data_format=data_format, name=name)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'SecondStageFeatureExtractor/cell_12/AvgPool_1', defined at:
File "train.py", line 184, in <module>
tf.app.run()
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 290, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/cb/Work/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 203, in _create_losses
prediction_dict = detection_model.predict(images, true_image_shapes)
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
self._anchors.get(), image_shape, true_image_shapes))
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 776, in _predict_second_stage
scope=self.second_stage_feature_extractor_scope))
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 187, in extract_box_classifier_features
return self._extract_box_classifier_features(proposal_feature_maps, scope)
File "/home/cb/Work/models/research/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 282, in _extract_box_classifier_features
start_cell_num=start_cell_num)
File "/home/cb/Work/models/research/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 99, in _build_nasnet_base
cell_num=true_cell_num)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 311, in call
net = self._cell_base(net, prev_layer)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 289, in _cell_base
prev_layer = self._reduce_prev_layer(prev_layer, net)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 276, in _reduce_prev_layer
prev_layer, curr_num_filters, stride=2)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 117, in factorized_reduction
path2, [1, 1, 1, 1], stride_spec, 'VALID', data_format=data_format)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 2124, in avg_pool
name=name)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 71, in avg_pool
data_format=data_format, name=name)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "train.py", line 184, in
tf.app.run()
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 415, in train
saver=saver)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'SecondStageFeatureExtractor/cell_12/AvgPool_1', defined at:
File "train.py", line 184, in <module>
tf.app.run()
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 290, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/cb/Work/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/cb/Work/models/research/object_detection/legacy/trainer.py", line 203, in _create_losses
prediction_dict = detection_model.predict(images, true_image_shapes)
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
self._anchors.get(), image_shape, true_image_shapes))
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 776, in _predict_second_stage
scope=self.second_stage_feature_extractor_scope))
File "/home/cb/Work/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 187, in extract_box_classifier_features
return self._extract_box_classifier_features(proposal_feature_maps, scope)
File "/home/cb/Work/models/research/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 282, in _extract_box_classifier_features
start_cell_num=start_cell_num)
File "/home/cb/Work/models/research/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 99, in _build_nasnet_base
cell_num=true_cell_num)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 311, in call
net = self._cell_base(net, prev_layer)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 289, in _cell_base
prev_layer = self._reduce_prev_layer(prev_layer, net)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 276, in _reduce_prev_layer
prev_layer, curr_num_filters, stride=2)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/cb/Work/models/research/slim/nets/nasnet/nasnet_utils.py", line 117, in factorized_reduction
path2, [1, 1, 1, 1], stride_spec, 'VALID', data_format=data_format)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 2124, in avg_pool
name=name)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 71, in avg_pool
data_format=data_format, name=name)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/cb/anaconda3/envs/MaskRCNN/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,2016,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: SecondStageFeatureExtractor/cell_12/AvgPool_1 = AvgPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
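As the hint in the log says, report_tensor_allocations_upon_oom can be set on tf.RunOptions so that the OOM error lists the tensors that were alive when the allocation failed. The legacy train.py does not expose this option, so the snippet below is only a minimal TF 1.x sketch of the mechanism, with a toy op standing in for the real training step; you would have to thread the options into the slim training loop yourself.

```python
import tensorflow as tf  # TF 1.x API, matching the version in this issue

# Sketch only: report_tensor_allocations_upon_oom is a field of the
# tf.RunOptions proto. The toy op below stands in for the real train_op.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

toy_op = tf.reduce_sum(tf.random_normal([4, 4]))

with tf.Session() as sess:
    try:
        # Pass the options to the run call whose allocations you want traced.
        print(sess.run(toy_op, options=run_options))
    except tf.errors.ResourceExhaustedError as err:
        # With the option set, the message includes the per-tensor
        # allocation report instead of just the generic OOM hint.
        print(err.message)
```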