Issue with Transfer Learning Using Resnet50_V1

a428tm commented 4 years ago

I am new to ML in general and to TF as well. I was following tutorials to learn more about transfer learning, and I was able to run training successfully using this config: https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config

However, when I tried the same thing with resnet50, training failed. I couldn't pinpoint the cause of the error, and I searched the existing issues here without finding an answer. I would appreciate any feedback.

Thanks!

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config

2. Describe the bug

Training via transfer learning fails when using the Resnet50_v1 config (link from item 1 above).

3. Steps to reproduce

I was able to kick off the job using SSD MobileNet V1 with the same settings. The only thing that changed was the config file, which I am attaching here:

# SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal loss (a.k.a Retinanet).
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint

# Achieves 35.2 mAP on COCO14 minival dataset. Doubling the number of training steps to 50k gets 36.9 mAP

# This config is TPU compatible

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 640
        width: 640
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 256
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.0004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint: "gs://z5_lp_bucket-dataset_ssd_resnet50_v1_fpn_shared_box_predictor/data/model.ckpt"
  batch_size: 128
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 3000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .04
          total_steps: 3000
          warmup_learning_rate: .013333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://z5_lp_bucket-dataset_ssd_resnet50_v1_fpn_shared_box_predictor/data/train.record"
  }
  label_map_path: "gs://z5_lp_bucket-dataset_ssd_resnet50_v1_fpn_shared_box_predictor/data/label_map.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 3811
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://z5_lp_bucket-dataset_ssd_resnet50_v1_fpn_shared_box_predictor/data/test.record"
  }
  label_map_path: "gs://z5_lp_bucket-dataset_ssd_resnet50_v1_fpn_shared_box_predictor/data/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
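
For readers unfamiliar with the cosine_decay_learning_rate block in the train_config above, here is a minimal Python sketch of a linear-warmup-then-cosine-decay schedule, using the four values from that config as defaults. The function name `cosine_decay_with_warmup_sketch` is hypothetical, and the formula is an assumption about how such a schedule typically behaves; it is not copied from the Object Detection API and may differ from it in edge cases.

```python
import math


def cosine_decay_with_warmup_sketch(step,
                                    learning_rate_base=0.04,
                                    total_steps=3000,
                                    warmup_learning_rate=0.013333,
                                    warmup_steps=2000):
    """Hypothetical helper: learning rate at a given training step.

    Assumes a linear ramp from warmup_learning_rate to learning_rate_base
    over warmup_steps, followed by a half-cosine decay to zero at
    total_steps. Requires total_steps > warmup_steps.
    """
    if step > total_steps:
        # Past the end of the schedule: treat the rate as zero.
        return 0.0
    if step < warmup_steps:
        # Linear warmup phase.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Cosine decay phase: progress runs from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 1000, 2000, 2500, 3000):
        print(step, round(cosine_decay_with_warmup_sketch(step), 6))
```

Under these assumptions, two thirds of the 3000 steps are spent in warmup: the rate only reaches 0.04 at step 2000 and decays over the final 1000 steps.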

4. Expected behavior

Training should start without any error

5. Additional context

`{ textPayload: "The replica master 0 exited with a non-zero status of 1. Traceback (most recent call last): [...]

Total hbm usage >= 8.33G: reserved 528.00M program 7.82G arguments unknown size

Output size unknown.

Program hbm requirement 7.82G: reserved 12.0K global 2.58M HLO temp 7.82G (61.6% utilization, 0.1% fragmentation (11.23M))

Largest program allocations in hbm:

1. Size: 800.00M
   Operator: op_type="Relu6" op_name="FeatureExtractor/resnet_v1_50/resnet_v1_50/conv1/Relu6"
   Shape: f32[16,320,320,64]{3,0,2,1}
   Unpadded size: 400.00M
   Extra memory due to padding: 400.00M (2.0x expansion)
   XLA label: %fusion.29 = f32[16,320,320,64]{3,0,2,1} fusion(f32[64]{0} %get-tuple-element.32689, f32[64]{0} %get-tuple-element.32690, f32[64]{0} %get-tuple-element.31582, f32[16,320,320,64]{3,0,2,1} %get-tuple-element.30688, f32[64]{0} %get-tuple-element.30619), kind=...
   Allocation type: HLO temp
   ==========================

2. Size: 400.00M
   Operator: op_type="Relu" op_name="FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_2/bottleneck_v1/Relu"
   Shape: f32[16,160,160,256]{3,0,2,1}
   Unpadded size: 400.00M
   XLA label: %fusion.7 = f32[16,160,160,256]{3,0,2,1} fusion(f32[16,160,160,256]{3,0,2,1} %fusion.6, f32[256]{0} %get-tuple-element.32460, f32[256]{0} %get-tuple-element.32459, f32[256]{0} %fusion.5393, f32[16,160,160,256]{3,0,2,1} %get-tuple-element.30695, f32[256]{0}...
   Allocation type: HLO temp
   ==========================

3. Size: 400.00M
   Operator: op_type="Relu" op_name="FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/Relu"
   Shape: f32[16,160,160,256]{3,0,2,1}
   Unpadded size: 400.00M
   XLA label: %fusion.6 = f32[16,160,160,256]{3,0,2,1} fusion(f32[256]{0} %get-tuple-element.32440, f32[256]{0} %get-tuple-element.32439, f32[256]{0} %get-tuple-element.32444,...

(0 successful operations.)

To find out more about why your job exited please check the logs: LINK_TO_VIEW_LOGS insertId: "1amcldwc3uw" resource: {2} timestamp: "2020-05-16T04:04:17.816728130Z" severity: "ERROR" labels: {1} logName: "projects/z5-lp-detect/logs/ml.googleapis.com%2Fjupyter_lp_detector_w_resnet50_v1_1589601566" receiveTimestamp: "2020-05-16T04:04:18.739867303Z" }`
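
The allocation sizes in the log above can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: that the global batch_size of 128 is split evenly across 8 replicas (taken from replicas_to_aggregate: 8 in the config, which matches the leading dimension of 16 in the logged shapes), that "M" in the log means MiB, and that f32 elements take 4 bytes. The helper name `tensor_mib` is hypothetical.

```python
from math import prod

MIB = 2 ** 20          # assumption: "M" in the log means MiB
BYTES_PER_F32 = 4      # f32 elements are 4 bytes each


def tensor_mib(shape, bytes_per_element=BYTES_PER_F32):
    """Hypothetical helper: unpadded size of a dense tensor in MiB."""
    return prod(shape) * bytes_per_element / MIB


BATCH_SIZE = 128       # batch_size from train_config
REPLICAS = 8           # assumption: one replica per replicas_to_aggregate
per_replica_batch = BATCH_SIZE // REPLICAS

# Shapes as reported in the log, with the batch dimension recomputed.
conv1_mib = tensor_mib((per_replica_batch, 320, 320, 64))
block1_mib = tensor_mib((per_replica_batch, 160, 160, 256))

print(per_replica_batch, conv1_mib, block1_mib)  # prints: 16 400.0 400.0
```

Both results match the "Unpadded size: 400.00M" entries in the log; the 800.00M figure for conv1 is that size after the 2.0x padding expansion the log reports. Since these activation sizes scale linearly with the per-replica batch, a smaller batch_size would shrink them proportionally. Whether that alone would bring the program under the HBM limit cannot be determined from the truncated traceback.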

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Training configuration was done via Google Cloud AI Platform + Jupyter Notebook.
Mobile device name if the issue happens on a mobile device: N/A
TensorFlow installed from (source or binary): pip install tensorflow==1.13.1
TensorFlow version (use command below): v1.13.0-rc2-5-g6612da8
Python version: Python 3
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory: Jupyter setting is - 8 vCPUs, 30 GB RAM

a428tm commented 4 years ago

Checking in to see if there is any update. Do I need to provide any additional info?

sayakpaul commented 4 years ago

Hi @a428tm. With reference to what has been specified here (https://github.com/tensorflow/models/tree/master/official/vision/detection#train-retinanet-on-tpu), could you comment on how you are passing the checkpoint files? Are you using pre-trained checkpoints? If yes, which ones?
