tensorflow / models

Models and examples built with TensorFlow

Error message while training CenterNet keypoints network #9508

Closed hhsinhan closed 3 years ago

hhsinhan commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/...

2. Describe the bug

I have tested my environment with MobileNet, CenterNet, and Faster R-CNN pure detection training, and everything works fine. This shows that my environment, including the COCO dataset created by the model_tool shell script, is fine.

Here is the error message I get when I try to train any CenterNet keypoint model:

INFO:tensorflow:batch_all_reduce: 256 all-reduces with algorithm = nccl, num_packs = 1
I1126 19:09:16.592537 140625801361152 cross_device_ops.py:695] batch_all_reduce: 256 all-reduces with algorithm = nccl, num_packs = 1
2020-11-26 19:09:24.650400: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: replica_1/Loss/total_loss/write_summary/summary_cond/branch_executed/_527
Traceback (most recent call last):
  File "PATH/models/research/object_detection/model_main_tf2.py", line 122, in <module>
    tf.compat.v1.app.run()
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "PATH/models/research/object_detection/model_main_tf2.py", line 113, in main
    model_lib_v2.train_loop(
  File "PATH/models/research/object_detection/model_lib_v2.py", line 636, in train_loop
    loss = _dist_train_step(train_input_iter)
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/tom/anaconda3/envs/tf2.2_gpu/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  indices[0] = 0 is not in [0, 0)
     [[node GatherV2_8 (defined at  PATH/models/research/object_detection/utils/target_assigner_utils.py:286) ]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNext_1]]
     [[ToAbsoluteCoordinates_20/Assert/AssertGuard/else/_181/Assert/data_0/_420]]
  (1) Invalid argument:  indices[0] = 0 is not in [0, 0)
     [[node GatherV2_8 (defined at  PATH/models/research/object_detection/utils/target_assigner_utils.py:286) ]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNext_1]]
0 successful operations.
1 derived errors ignored. [Op:__inference__dist_train_step_69913]

Errors may have originated from an input operation.
Input Source operations connected to node GatherV2_8:
 mul_131 (PATH/models/research/object_detection/utils/target_assigner_utils.py:283)

Input Source operations connected to node GatherV2_8:
 mul_131 (defined at PATH/models/research/object_detection/utils/target_assigner_utils.py:283)
Function call stack:
_dist_train_step -> _dist_train_step
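
For context, "indices[0] = 0 is not in [0, 0)" is the generic message tf.gather produces when it is asked to index into a dimension of size zero. A minimal sketch of that failure mode (hypothetical, not taken from the target assigner code) is:

import tensorflow as tf

# Hypothetical minimal reproduction (not from the original report):
# the target assigner gathers keypoint entries, but the input pipeline
# produced an empty keypoint tensor, so any index is out of range.
keypoints = tf.zeros([0, 2])      # zero keypoints for this image
indices = tf.constant([0])        # assigner still asks for entry 0

# On CPU in eager mode this raises:
#   InvalidArgumentError: indices[0] = 0 is not in [0, 0)
gathered = tf.gather(keypoints, indices)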

3. Steps to reproduce

Steps to reproduce the behavior.

4. Expected behavior

A clear and concise description of what you expected to happen.

5. Additional context

Here is my config:

# CenterNet meta-architecture from the "Objects as Points" [1] paper
# with the ResNet-v1-50 backbone. The ResNet backbone has a few differences
# as compared to the one mentioned in the paper, hence the performance is
# slightly worse. This config is TPU compatible.
# [1]: https://arxiv.org/abs/1904.07850
#

model {
  center_net {
    num_classes: 90
    feature_extractor {
      type: "resnet_v1_50_fpn"
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 512
        max_dimension: 512
        pad_to_max_dimension: true
      }
    }
    object_detection_task {
      task_loss_weight: 1.0
      offset_loss_weight: 1.0
      scale_loss_weight: 0.1
      localization_loss {
        l1_localization_loss {
        }
      }
    }
    object_center_params {
      object_center_loss_weight: 1.0
      min_box_overlap_iou: 0.7
      max_box_predictions: 100
      classification_loss {
        penalty_reduced_logistic_focal_loss {
          alpha: 2.0
          beta: 4.0
        }
      }
    }
    keypoint_label_map_path: "path/models/research/object_detection/data/face_person_with_keypoints_label_map.pbtxt"
    keypoint_estimation_task {
      task_name: "human_pose"
      task_loss_weight: 1.0
      loss {
        localization_loss {
          l1_localization_loss {
          }
        }
        classification_loss {
          penalty_reduced_logistic_focal_loss {
            alpha: 2.0
            beta: 4.0
          }
        }
      }
      keypoint_class_name: "Person"
      keypoint_label_to_std {
        key: "left_ankle"
        value: 0.89
      }
      keypoint_label_to_std {
        key: "left_ear"
        value: 0.35
      }
      keypoint_label_to_std {
        key: "left_elbow"
        value: 0.72
      }
      keypoint_label_to_std {
        key: "left_eye"
        value: 0.25
      }
      keypoint_label_to_std {
        key: "left_hip"
        value: 1.07
      }
      keypoint_label_to_std {
        key: "left_knee"
        value: 0.89
      }
      keypoint_label_to_std {
        key: "left_shoulder"
        value: 0.79
      }
      keypoint_label_to_std {
        key: "left_wrist"
        value: 0.62
      }
      keypoint_label_to_std {
        key: "nose"
        value: 0.26
      }
      keypoint_label_to_std {
        key: "right_ankle"
        value: 0.89
      }
      keypoint_label_to_std {
        key: "right_ear"
        value: 0.35
      }
      keypoint_label_to_std {
        key: "right_elbow"
        value: 0.72
      }
      keypoint_label_to_std {
        key: "right_eye"
        value: 0.25
      }
      keypoint_label_to_std {
        key: "right_hip"
        value: 1.07
      }
      keypoint_label_to_std {
        key: "right_knee"
        value: 0.89
      }
      keypoint_label_to_std {
        key: "right_shoulder"
        value: 0.79
      }
      keypoint_label_to_std {
        key: "right_wrist"
        value: 0.62
      }
      keypoint_regression_loss_weight: 0.1
      keypoint_heatmap_loss_weight: 1.0
      keypoint_offset_loss_weight: 1.0
      offset_peak_radius: 3
      per_keypoint_offset: true
    }
  }
}

train_config: {

  batch_size: 6
  num_steps: 250000

  data_augmentation_options {
    random_horizontal_flip {
      keypoint_flip_permutation: 0
      keypoint_flip_permutation: 2
      keypoint_flip_permutation: 1
      keypoint_flip_permutation: 4
      keypoint_flip_permutation: 3
      keypoint_flip_permutation: 6
      keypoint_flip_permutation: 5
      keypoint_flip_permutation: 8
      keypoint_flip_permutation: 7
      keypoint_flip_permutation: 10
      keypoint_flip_permutation: 9
      keypoint_flip_permutation: 12
      keypoint_flip_permutation: 11
      keypoint_flip_permutation: 14
      keypoint_flip_permutation: 13
      keypoint_flip_permutation: 16
      keypoint_flip_permutation: 15
    }
  }

  data_augmentation_options {
    random_crop_image {
      min_aspect_ratio: 0.5
      max_aspect_ratio: 1.7
      random_coef: 0.25
    }
  }

  data_augmentation_options {
    random_adjust_hue {
    }
  }

  data_augmentation_options {
    random_adjust_contrast {
    }
  }

  data_augmentation_options {
    random_adjust_saturation {
    }
  }

  data_augmentation_options {
    random_adjust_brightness {
    }
  }

  data_augmentation_options {
    random_absolute_pad_image {
       max_height_padding: 200
       max_width_padding: 200
       pad_color: [0, 0, 0]
    }
  }

  optimizer {
    adam_optimizer: {
      epsilon: 1e-7  # Match tf.keras.optimizers.Adam's default.
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 1e-3
          total_steps: 250000
          warmup_learning_rate: 2.5e-4
          warmup_steps: 5000
        }
      }
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false

  #fine_tune_checkpoint_version: V2
  #fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  #fine_tune_checkpoint_type: "classification"
}

train_input_reader: {
  label_map_path: "path/models/research/object_detection/data/mscoco_label_map.pbtxt"
  tf_record_input_reader {
    input_path: "path/models/research/object_detection/dataset_tools/coco/coco_train*"
  }
  num_keypoints: 17
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_visualizations: 10
  max_num_boxes_to_visualize: 20
  min_score_threshold: 0.2
  batch_size: 1;
  parameterized_metric {
    coco_keypoint_metrics {
      class_label: "person"
      keypoint_label_to_sigmas {
        key: "nose"
        value: 0.026
      }
      keypoint_label_to_sigmas {
        key: "left_eye"
        value: 0.025
      }
      keypoint_label_to_sigmas {
        key: "right_eye"
        value: 0.025
      }
      keypoint_label_to_sigmas {
        key: "left_ear"
        value: 0.035
      }
      keypoint_label_to_sigmas {
        key: "right_ear"
        value: 0.035
      }
      keypoint_label_to_sigmas {
        key: "left_shoulder"
        value: 0.079
      }
      keypoint_label_to_sigmas {
        key: "right_shoulder"
        value: 0.079
      }
      keypoint_label_to_sigmas {
        key: "left_elbow"
        value: 0.072
      }
      keypoint_label_to_sigmas {
        key: "right_elbow"
        value: 0.072
      }
      keypoint_label_to_sigmas {
        key: "left_wrist"
        value: 0.062
      }
      keypoint_label_to_sigmas {
        key: "right_wrist"
        value: 0.062
      }
      keypoint_label_to_sigmas {
        key: "left_hip"
        value: 0.107
      }
      keypoint_label_to_sigmas {
        key: "right_hip"
        value: 0.107
      }
      keypoint_label_to_sigmas {
        key: "left_knee"
        value: 0.087
      }
      keypoint_label_to_sigmas {
        key: "right_knee"
        value: 0.087
      }
      keypoint_label_to_sigmas {
        key: "left_ankle"
        value: 0.089
      }
      keypoint_label_to_sigmas {
        key: "right_ankle"
        value: 0.089
      }
    }
  }
  # Provide the edges to connect the keypoints. The setting is suitable for
  # COCO's 17 human pose keypoints.
  keypoint_edge {  # nose-left eye
    start: 0
    end: 1
  }
  keypoint_edge {  # nose-right eye
    start: 0
    end: 2
  }
  keypoint_edge {  # left eye-left ear
    start: 1
    end: 3
  }
  keypoint_edge {  # right eye-right ear
    start: 2
    end: 4
  }
  keypoint_edge {  # nose-left shoulder
    start: 0
    end: 5
  }
  keypoint_edge {  # nose-right shoulder
    start: 0
    end: 6
  }
  keypoint_edge {  # left shoulder-left elbow
    start: 5
    end: 7
  }
  keypoint_edge {  # left elbow-left wrist
    start: 7
    end: 9
  }
  keypoint_edge {  # right shoulder-right elbow
    start: 6
    end: 8
  }
  keypoint_edge {  # right elbow-right wrist
    start: 8
    end: 10
  }
  keypoint_edge {  # left shoulder-right shoulder
    start: 5
    end: 6
  }
  keypoint_edge {  # left shoulder-left hip
    start: 5
    end: 11
  }
  keypoint_edge {  # right shoulder-right hip
    start: 6
    end: 12
  }
  keypoint_edge {  # left hip-right hip
    start: 11
    end: 12
  }
  keypoint_edge {  # left hip-left knee
    start: 11
    end: 13
  }
  keypoint_edge {  # left knee-left ankle
    start: 13
    end: 15
  }
  keypoint_edge {  # right hip-right knee
    start: 12
    end: 14
  }
  keypoint_edge {  # right knee-right ankle
    start: 14
    end: 16
  }
}
eval_input_reader: {
  label_map_path: "PATH/models/research/object_detection/data/mscoco_label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "PATH/models/research/object_detection/dataset_tools/coco/coco_val*"
  }
  num_keypoints: 17
}
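
Not part of the original post, but one quick sanity check is to parse the pipeline config with the Object Detection API's config_util before launching training, so proto errors and missing keypoint settings surface early (the config path below is a placeholder):

from object_detection.utils import config_util

# Load the pipeline config (placeholder path) and inspect the
# CenterNet keypoint settings that were parsed from it.
configs = config_util.get_configs_from_pipeline_file(
    "PATH/centernet_resnet50_keypoints.config")

center_net = configs["model"].center_net
print(center_net.keypoint_label_map_path)
print(len(center_net.keypoint_estimation_task), "keypoint task(s) configured")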

6. System information

TensorFlow 2.2.0
hhsinhan commented 3 years ago

Hi, I found my own error; sorry for the bother.

I was careless and did not check the script in dataset_tools/download_and_preprocess_mscoco.sh, line 96:

python object_detection/dataset_tools/create_coco_tf_record.py \
  --logtostderr \
  --include_masks \
  --train_image_dir="${TRAIN_IMAGE_DIR}" \
  --val_image_dir="${VAL_IMAGE_DIR}" \
  --test_image_dir="${TEST_IMAGE_DIR}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --output_dir="${OUTPUT_DIR}"

It turns out the default script does not actually include keypoint annotations.

To build the training TFRecords with keypoints, here is what I edited:

TRAIN_KP_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/person_keypoints_train2017.json"
VAL_KP_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/person_keypoints_val2017.json"

# Build TFRecords of the image data.
cd "${CURRENT_DIR}"
python3 create_coco_tf_record.py \
  --logtostderr \
  --include_masks \
  --train_image_dir="${TRAIN_IMAGE_DIR}" \
  --val_image_dir="${VAL_IMAGE_DIR}" \
  --test_image_dir="${TEST_IMAGE_DIR}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --train_keypoint_annotations_file="${TRAIN_KP_ANNOTATIONS_FILE}" \
  --val_keypoint_annotations_file="${VAL_KP_ANNOTATIONS_FILE}" \
  --output_dir="${OUTPUT_DIR}"

After I rebuilt the TFRecords with the keypoint values, the training process works fine:

INFO:tensorflow:Step 100 per-step time 0.371s loss=12.478
I1204 21:17:35.270161 139674450851584 model_lib_v2.py:648] Step 100 per-step time 0.371s loss=12.478
INFO:tensorflow:Step 200 per-step time 0.383s loss=7.781
I1204 21:18:13.289809 139674450851584 model_lib_v2.py:648] Step 200 per-step time 0.383s loss=7.781
INFO:tensorflow:Step 300 per-step time 0.357s loss=6.903
I1204 21:18:51.141005 139674450851584 model_lib_v2.py:648] Step 300 per-step time 0.357s loss=6.903
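
If you want to double-check that the rebuilt shards really carry keypoints before launching training, you can parse one record directly. The feature keys below follow the Object Detection API's COCO TFRecord convention and should be verified against your create_coco_tf_record.py; this is a sketch, not part of the original fix:

import tensorflow as tf

# Inspect one example from the rebuilt training shards (placeholder pattern).
files = tf.io.gfile.glob(
    "path/models/research/object_detection/dataset_tools/coco/coco_train*")

for raw_record in tf.data.TFRecordDataset(files).take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    feature = example.features.feature
    # Assumed keypoint feature keys; adjust if your writer uses different names.
    xs = feature["image/object/keypoint/x"].float_list.value
    ys = feature["image/object/keypoint/y"].float_list.value
    print("keypoint x/y values in first example:", len(xs), len(ys))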

wahabrind commented 2 years ago

Hey @hhsinhan, can you share the JSON files that you used for training? I want to know the format of the keypoint annotations.
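
For reference (not an answer from the original poster), the files used above are the standard COCO person_keypoints_*2017.json annotations, where each annotation stores 17 keypoints as a flat [x, y, visibility] list. A small sketch of inspecting one entry, assuming that standard format:

import json

# Inspect one annotation from the standard COCO person keypoints file.
with open("annotations/person_keypoints_val2017.json") as f:
    coco = json.load(f)

ann = coco["annotations"][0]
# "keypoints" holds 17 * 3 values; visibility is 0 (unlabeled),
# 1 (labeled, not visible) or 2 (labeled and visible).
triplets = list(zip(*[iter(ann["keypoints"])] * 3))
print(ann["image_id"], ann["num_keypoints"], triplets[:3])

# The keypoint names and skeleton edges live in the "person" category.
print(coco["categories"][0]["keypoints"])
print(coco["categories"][0]["skeleton"])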