tensorflow / models

Models and examples built with TensorFlow

[TF2 object detection] Tensorboard always shows 100 validation boxes for groundtruth images #9471

Open ilaripih opened 3 years ago

ilaripih commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection/model_main_tf2.py

2. Describe the bug

I trained an EfficientDet D2 object detection model using my own TFRecord dataset with 12 classes. When I ran the validation loop (model_main_tf2.py with the checkpoint_dir parameter), the ground truth images in Tensorboard all had 100 boxes visualized, even though only a few were provided by the validation dataset.

All of the extra boxes have the class id 1 (with text "additional-panels" in my dataset). I confirmed this by checking the values of groundtruth_boxes and groundtruth_classes here: https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib_v2.py#L709
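For anyone wanting to inspect the same values, a temporary debug print at that spot works. A sketch, assuming the surrounding eval loop exposes the padded tensors under these eval_dict keys (the key names are an assumption on my part, but they match what I saw):

# throwaway debug lines dropped into model_lib_v2.py's eval loop;
# summarize=-1 makes tf.print emit the full padded tensors
tf.print('groundtruth_boxes:', eval_dict['groundtruth_boxes'], summarize=-1)
tf.print('groundtruth_classes:', eval_dict['groundtruth_classes'], summarize=-1)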

groundtruth_boxes for the image shown looks like this (truncated here; rows 5 through 100 are all zero padding):

[[[0.09878296 0.17683971 0.14969572 0.2070736 ]
  [0.10324543 0.6021677  0.15415822 0.62977755]
  [0.15464549 0.6048238  0.19507532 0.62797683]
  [0.19444445 0.6045089  0.22057322 0.62604165]
  [0.         0.         0.         0.        ]
  [0.         0.         0.         0.        ]
  [0.         0.         0.         0.        ]
  ...
  [0.         0.         0.         0.        ]]]

...and groundtruth_classes looks like this:

[[6 6 5 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]

The labels for the real ground truth boxes are correct. It looks like the culprit is this line, where label_id_offset is added to the class ids: https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib_v2.py#L707. Or maybe the bounding box visualization function should ignore these zero-area "padding" boxes (see the sketch below).
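The 100 presumably comes from max_number_of_boxes: 100 in my train_config, which is the size the groundtruth tensors are padded to. A minimal sketch of the kind of filtering I mean, assuming the padded batch layout shown above (strip_padding_boxes is a hypothetical helper, not part of the OD API):

import tensorflow as tf

def strip_padding_boxes(boxes, classes, num_boxes=None):
    """Drop zero-area padding rows from padded groundtruth tensors.

    boxes:     [batch, max_boxes, 4] float, (ymin, xmin, ymax, xmax)
    classes:   [batch, max_boxes] int
    num_boxes: optional [batch] int with the true box count per image
               (num_groundtruth_boxes, when the input pipeline provides it).
    """
    if num_boxes is not None:
        # Trust the explicit per-image count when it is available.
        keep = tf.sequence_mask(num_boxes, maxlen=tf.shape(boxes)[1])
    else:
        # Otherwise drop rows whose area is exactly zero.
        ymin, xmin, ymax, xmax = tf.unstack(boxes, axis=-1)
        keep = (ymax - ymin) * (xmax - xmin) > 0.0
    return tf.ragged.boolean_mask(boxes, keep), tf.ragged.boolean_mask(classes, keep)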

3. Steps to reproduce

I can't provide the TFRecord files I'm using, but this should be reproducible with any dataset whose label map starts with id 1, as instructed here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md#label-maps (a label map of that shape is sketched below).
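For illustration, a minimal label map in that format; the first entry's name is from my dataset, the second is a hypothetical placeholder:

item {
  id: 1
  name: 'additional-panels'
}
item {
  id: 2
  name: 'some-other-class'
}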

  1. Train one checkpoint with the EfficientDet D2 config provided in the "Additional context" section.
  2. Run the evaluation: model_main_tf2.py --checkpoint_dir=<checkpoint_dir> --pipeline_config_path=<pipeline_config_path> --model_dir=<model_dir>
  3. Run Tensorboard with --logdir=
  4. Look at the IMAGES tab in Tensorboard and see a bunch of extra zero-area ground truth bounding boxes in the bottom left corner (see attached image).

4. Expected behavior

Only the ground truth boxes in the validation set should be visualized.

5. Additional context

Model config:

# SSD with EfficientNet-b2 + BiFPN feature extractor,
# shared box predictor and focal loss (a.k.a EfficientDet-d2).
# See EfficientDet, Tan et al, https://arxiv.org/abs/1911.09070
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from an EfficientNet-b2 checkpoint.
#
# Train on TPU-8

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 12
    add_background_class: false
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 3
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 768
        width: 768
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 112
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          force_use_bias: true
          activation: SWISH
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true
            decay: 0.99
            epsilon: 0.001
          }
        }
        num_layers_before_predictor: 3
        kernel_size: 3
        use_depthwise: true
      }
    }
    feature_extractor {
      type: 'ssd_efficientnet-b2_bifpn_keras'
      bifpn {
        min_level: 3
        max_level: 7
        num_iterations: 5
        num_filters: 112
      }
      conv_hyperparams {
        force_use_bias: true
        activation: SWISH
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.99,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 1.5
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint: "/workspace/tf_base_models/efficientnet_b2/ckpt-0"
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint_type: "classification"
  batch_size: 18
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  use_bfloat16: true
  num_steps: 300000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_adjust_brightness {
      max_delta: 0.1
    }
  }
  data_augmentation_options {
    random_adjust_hue {
    }
  }
  data_augmentation_options {
    random_adjust_saturation {
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 8e-2
          total_steps: 300000
          warmup_learning_rate: .001
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  label_map_path: "/workspace/tf_datasets/traffic_signs_detect_labels.pbtxt"
  tf_record_input_reader {
    input_path: "/workspace/tf_datasets/traffic_signs_detect_train.record"
  }
}

eval_config: {
  metrics_set: "oid_V2_detection_metrics"
  use_moving_averages: false
  num_visualizations: 10
}

eval_input_reader: {
  label_map_path: "/workspace/tf_datasets/traffic_signs_detect_labels.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/workspace/tf_datasets/traffic_signs_detect_validation.record"
  }
}

Tensorboard validation ground truth image: tensorboard_object_detection_val

6. System information

ilaripih commented 3 years ago

More info: Looks like it's not just the visualization of the boxes that is affected. I also get a very poor validation score for the class with label id 1:

OpenImagesV2_PerformanceByCategory/AP@0.5IOU/additional-panels: 0.003221

This is not a particularly difficult class to learn, and for all the other classes I get a good validation AP@0.5IOU score. The model's actual performance with the first label is decent, based on a visual inspection of the boxes it detects.

sebderhy commented 3 years ago

Hi, I am having a similar issue. Does anyone know the reason for this issue, and how to solve it? Thanks

Geoyi commented 3 years ago

Reporting back: this definitely doesn't only impact validation, and I believe this bug has also contaminated the way the training dataset is read. We just wasted two weeks training SSD Mobilenet, SSD Resnet50, and then CenterNet Resnet101; the ground truth data was in good condition, yet all models performed poorly across all classes. In particular, the noise (the additional 100 boxes in the bottom left) brought in under class id 1 basically stops the models from learning anything useful from the train and validation data.

I only discovered that the problem comes from a bug in how the tfrecords are read in. I'm currently investigating the util scripts under the OD API but haven't found a solution yet; if anyone has one, I'd really appreciate it.

(Screenshots attached: Screen Shot 2020-11-16 at 8 48 54 AM, Screen Shot 2020-11-16 at 8 46 06 AM)
Geoyi commented 3 years ago

I should have provided more context: the same training dataset was used to train an SSD Resnet102 with TensorFlow 1.15, and we received F1 and mAP scores above 0.6.

We then switched to TF 2.2 in the hope of catching up with the TF community. We froze the Object Detection API codebase before Oct 28th, which is when we saw the bug shown in the last comment I left ☝️. In the Dockerfile, we did:

# Downloading the TensorFlow Models
RUN git clone --progress https://github.com/tensorflow/models.git /tensorflow/models
# Froze the codebase before Oct 28, 2020, https://github.com/tensorflow/models/tree/24e41ffe97c601e52b35682170e7abceed0eae1a
RUN cd /tensorflow/models && git fetch && git checkout 24e41ff

SSD Mobilenet, SSD Resnet50, and then CenterNet Resnet101 models were trained, and all models' mAP, precision, and recall scores came out lower than 0.005 with the exact same training dataset that previously gave us a > 0.6 mAP score.

Based on the comment I left above, I think the TF 2.2 TFRecord reader adds 100 additional bounding boxes of class id 1 to each image in the train, validation, and test datasets.

Unless the bug has been fixed after Oct 28, 2020, I think it still exists and will mess up all model training. @tombstone @kulzc @jch1, any thoughts or guidance on this?

germano2239 commented 3 years ago

I'm having the exact same issue. It seems to me that the bug was introduced by the last commit made on model_lib_v2.py; if I revert just this file to the previous version, everything seems to be working fine, so I'd suggest trying that as a temporary fix.
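For reference, reverting just that one file can be done along these lines; a sketch, where <previous_commit> is whatever your own log shows as the last revision before the offending change:

# list the last two commits that touched the file, then restore the older one
git log -n 2 --oneline -- research/object_detection/model_lib_v2.py
git checkout <previous_commit> -- research/object_detection/model_lib_v2.py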

Geoyi commented 3 years ago

@germano2239, can you provide the commit you're referring to that worked for you? Our codebase is from before Oct 29th, so are you saying the script that works is from Nov 2nd (the log of the script's commits)?

mocialov commented 3 years ago

@Geoyi this commit seems to NOT have this issue: git reset --hard b55044d89751b44e102c97b992cb25cccdbd7ba9 && git clean -f

germano2239 commented 3 years ago

@germano2239, can you provide the commit you're referring to that worked for you? Our codebase is from before Oct 29th, so are you saying the script that works is from Nov 2nd (the log of the script's commits)?

Yes, this is the one that @mocialov is also mentioning. I just reverted the one file, but I think it amounts to the same thing.

Geoyi commented 3 years ago

Guys, I reverted the script back to the commit @germano2239 AND @mocialov mentioned above, as follows (unless I did something wrong):

# Downloading the TensorFlow Models
RUN git clone --progress https://github.com/tensorflow/models.git /tensorflow/models
# Froze the codebase before Nov. 2, 2020. https://github.com/cocodataset/cocoapi/commit/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9
RUN cd /tensorflow/models && git fetch && git checkout 33cca6c

It did not work; I still get the same issue. The code is overwhelming to debug, and I think I will just switch back to TensorFlow 1.15 and the TF1 models codebase.

(Screenshot attached: Screen Shot 2020-11-18 at 9 20 07 AM)
germano2239 commented 3 years ago

It could depend on the specific architecture used; I'm using "faster_rcnn_inception_resnet_v2_keras". I checked twice: for me that commit does make the difference, everything else being unchanged.

Geoyi commented 3 years ago

Do you mind posting some of your logs from model training, e.g. precision, recall, mAP, or loss while training, @germano2239? I've used SSD Mobilenet, SSD Resnet50, and then CenterNet Resnet101, and they all behaved the same as in the screenshots I shared above. I believe the bug exists in the TFRecord reading process in TF2.x and is not specific to the pre-trained model.

germano2239 commented 3 years ago

Also @Geoyi, your real ground truth boxes are all messed up; I didn't have that issue. And in the OP's example image the road signs are OK.

(image attached)

Geoyi commented 3 years ago

Also @Geoyi, your real ground truth boxes are all messed up

Yeah, I really don't understand that part. I've used the same dataset to train SSD Resnet101 under TF 1.15, and everything performed reasonably two months ago. I might need to dig into my training dataset a little bit.

Hackerman28 commented 3 years ago

Hi. I am having a similar issue when training SSD Mobilenet v2. I saw the ground truth images in the tensorboard and thought it was a tfrecord problem, so I created a new tfrecord for my dataset. The issue persisted, so I then thought it was a tensorboard bug. My model's mAP was very low even after training for 50000 iterations. I initially attributed the low mAP scores to the Mobilenet backbone, since SSD Mobilenet v2 already has a low mAP on the COCO dataset, and assumed it would improve with a heavier backbone. But after seeing this issue, I'm worried that my entire training run was wasted due to this bug. I trained two variants (pretrained and from scratch) of both the 320 and 640 SSD Mobilenet v2 models. Can you fix this bug soon so that I can train the models correctly this time?

mocialov commented 3 years ago

Guys, I reverted the script back to the commit @germano2239 AND @mocialov mentioned above, as follows (unless I did something wrong):

# Downloading the TensorFlow Models
RUN git clone --progress https://github.com/tensorflow/models.git /tensorflow/models
# Froze the codebase before Nov. 2, 2020. https://github.com/cocodataset/cocoapi/commit/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9
RUN cd /tensorflow/models && git fetch && git checkout 33cca6c

It did not work; I still get the same issue. The code is overwhelming to debug, and I think I will just switch back to TensorFlow 1.15 and the TF1 models codebase.

(Screenshot attached: Screen Shot 2020-11-18 at 9 20 07 AM)

If I just use THAT specific commit, I still get a bunch of bounding boxes in my ground truth. However, I am doing it as follows now (I know it is not great, but it eliminates the problem):

# clone the current head and install it once, so the *_pb2.py protos get generated
git clone https://github.com/tensorflow/models

cd models/research

protoc object_detection/protos/*.proto --python_out=.
pip install .

# then roll back to the known-good commit and install again
git reset --hard b55044d89751b44e102c97b992cb25cccdbd7ba9 && git clean -f

pip install .

# make the research and slim directories importable (shell)...
export PYTHONPATH=$PYTHONPATH:.:./slim

# ...or the equivalent from inside Python
import os
os.environ['PYTHONPATH'] += ':.:./slim'

Maybe that would at least solve the problem with the extra bounding boxes in your ground truth. However, check your actual bounding boxes, because they don't look right at all.

Geoyi commented 3 years ago

@germano2239 @mocialov, thanks, I just discovered that I swapped x and y when reading and writing the tf-examples: I did [ymin, xmin, ymax, xmax] when the correct order should be [xmin, ymin, xmax, ymax]. I am going to try the solution you provided, @mocialov, and will report back how the new model training goes later today :).
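For anyone else double-checking their TFRecord writer: the OD API dataset docs store each normalized coordinate list under its own key, so there is no single ordering to get wrong as long as each list lands under the right name. A minimal sketch, with made-up example coordinates standing in for real ones:

import tensorflow as tf

def float_list(values):
    # wrap a Python list of floats as a tf.train.Feature
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# example normalized [0, 1] coordinates, one list entry per box (hypothetical values)
xmins, xmaxs = [0.18, 0.60], [0.21, 0.63]
ymins, ymaxs = [0.10, 0.10], [0.15, 0.15]

feature = {
    'image/object/bbox/xmin': float_list(xmins),
    'image/object/bbox/xmax': float_list(xmaxs),
    'image/object/bbox/ymin': float_list(ymins),
    'image/object/bbox/ymax': float_list(ymaxs),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))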

ngaloppo commented 3 years ago

@germano2239 @Geoyi @mocialov I have the same issue. I executed an eval run with commit b55044d89, and the issue remains.

ngaloppo commented 3 years ago

I can confirm that the same issue exists at b55044d by just evaluating on COCO val2017 with the vanilla EfficientDet-d0 config.

janmaltel commented 3 years ago

I've run into the same issue as OP (using tf 2.2.0)!

@mocialov if I follow your instructions, and then run a training job using object_detection/model_main_tf2.py, I get the following error:

ImportError: cannot import name 'string_int_label_map_pb2' from 'object_detection.protos' (/Users/.../git_repos/odo/models/research/object_detection/protos/__init__.py)

And indeed, during the git reset it reports Removing object_detection/protos/string_int_label_map_pb2.py (it removes many of the generated object_detection/protos/*_pb2.py files).
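Presumably regenerating the protos after the reset would fix that import, using the same command that appears earlier in the thread:

protoc object_detection/protos/*.proto --python_out=.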

Does anyone know whether this only affects eval or could it also affect / mess up the training procedure?

Geoyi commented 3 years ago

I've been using TF 1.15 for many different projects with multiple imagery sources, and I've never faced data-reading issues like the ones we've seen with TF2.x (in my case, TensorFlow 2.2 with the Object Detection API). The training dataset did not give any meaningful model result at all, and I suspect the bug has been introduced into the training data, not only eval (BUT I MAY BE WRONG THOUGH). My objects are already very small and hard to detect, and I can't afford to have the bug introduce additional noise into model training.

I moved back to using TensorFlow 1.15 with Faster-RCNN and MobileNet; both worked as I expected.

(Screenshot attached: Screen Shot 2020-11-23 at 5 45 45 PM)

I think I will stick with the current workflow with TF1.15 until the bug in TF2 is fixed.

mocialov commented 3 years ago

I've run into the same issue as OP (using tf 2.2.0)!

@mocialov if I follow your instructions, and then run a training job using object_detection/model_main_tf2.py, I get the following error:

ImportError: cannot import name 'string_int_label_map_pb2' from 'object_detection.protos' (/Users/.../git_repos/odo/models/research/object_detection/protos/__init__.py)

And indeed, during the git reset it reports Removing object_detection/protos/string_int_label_map_pb2.py (it removes many of the generated object_detection/protos/*_pb2.py files).

Does anyone know whether this only affects eval or could it also affect / mess up the training procedure?

You can just revert the one file to the previous commit.

rbavery commented 3 years ago

Has this been fixed?

ecrows commented 2 years ago

I spent the night tearing my hair out debugging the eval util scripts before I realized what was happening. I'm also very interested in what the impact has been on our training over the past weeks.

ilaripih commented 2 years ago

I spent the night tearing my hair out debugging the eval util scripts before I realized what was happening. I'm also very interested in what the impact has been on our training over the past weeks.

At least for us there seemed to be no impact on training, so this is strictly a validation bug. But it's still a very annoying one.