ilaripih opened 4 years ago
More info: Looks like it's not just the visualization of the boxes that is affected. I also get a very poor validation score for the class with label id 1:
OpenImagesV2_PerformanceByCategory/AP@0.5IOU/additional-panels: 0.003221
This is not a particularly difficult class to learn, and for all the other classes I get a good validation AP@0.5IOU score. The model's actual performance with the first label is decent, based on a visual inspection of the boxes it detects.
Hi, I am having a similar issue. Does anyone know the reason for this issue, and how to solve it? Thanks
Reporting back: this definitely does not only impact validation; I believe this bug has also contaminated the way the training dataset is read. We just wasted two weeks training SSD Mobilenet, SSD Resnet50, and then CenterNet Resnet101. The ground truth data was in good condition, yet all models performed poorly across all classes. In particular, the noise (100 additional boxes in the bottom left) brought in under class id 1 basically stopped the models from learning anything useful from the train and validation data.
I only discovered that the problem comes from a bug in how the TFRecords are read in. I'm currently investigating the util scripts under the OD API but haven't found a solution yet; if anyone has one, I'd really appreciate it.
I should have provided more context: the same training dataset was used to train an SSD Resnet101 with TensorFlow 1.15, and we got F1 and mAP scores above 0.6.
We then switched to TF2.2 in the hope of catching up with the TF community. We froze the Object Detection API codebase at a commit from before Oct 28th, and with it we saw the bug shown in the last comment I left ☝️. In the Dockerfile, we did:
```dockerfile
# Downloading the TensorFlow Models
RUN git clone --progress https://github.com/tensorflow/models.git /tensorflow/models
# Froze the codebase before Oct 28, 2020, https://github.com/tensorflow/models/tree/24e41ffe97c601e52b35682170e7abceed0eae1a
RUN cd /tensorflow/models && git fetch && git checkout 24e41ff
```
SSD Mobilenet, SSD Resnet50, and then CenterNet Resnet101 models were trained, and all of their mAP, precision, and recall scores came out lower than 0.005 with the exact same training dataset that previously gave us a > 0.6 mAP score.
Based on the comment I left above, I think the TF2.2 TFRecord reader adds 100 additional bounding boxes of class id 1 to each image in the train, validation, and test datasets.
Unless the bug has been fixed since Oct 28, 2020, I think it still exists and will mess up all model training. @tombstone @kulzc @jch1, any thoughts or guidance on this?
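To rule out the data files themselves, one quick check is to decode a few raw records and count what is actually stored on disk. A minimal sketch, assuming the standard OD API feature names and a hypothetical `train.record` file (adjust the path to your dataset):

```python
import tensorflow as tf

# Decode a few raw records and count the boxes/labels actually stored on disk.
# If these counts match the real annotations, the extra class-1 boxes are being
# added later in the input/eval pipeline rather than by the TFRecord writer.
for raw in tf.data.TFRecordDataset("train.record").take(5):  # hypothetical file name
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    feats = example.features.feature
    num_boxes = len(feats["image/object/bbox/xmin"].float_list.value)
    labels = list(feats["image/object/class/label"].int64_list.value)
    print("boxes:", num_boxes, "labels:", labels)
```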
I'm having the exact same issue. It seems to me that the bug was just introduced by the last commit made to model_lib_v2.py; if I revert just this file to the previous version it seems to be working fine, so I'd suggest trying that as a temporary fix.
@germano2239, can you provide the commit you're referring to that worked for you? Our codebase was from before Oct 29th, so are you saying the version of the script that works is from Nov 2nd (see the log of the script's commits)?
@Geoyi this commit seems to NOT have this issue: `git reset --hard b55044d89751b44e102c97b992cb25cccdbd7ba9 && git clean -f`
> @germano2239, can you provide the commit you're referring to that worked for you? Our codebase was from before Oct 29th, so are you saying the version of the script that works is from Nov 2nd (see the log of the script's commits)?
Yes, this is the one @mocialov is also mentioning. I just reverted the one file, but I think it amounts to the same thing.
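Concretely, reverting just that one file would look something like this (an untested sketch; the commit hash is the pre-regression one from @mocialov's comment above, and the path is relative to models/research):

```bash
# Restore only model_lib_v2.py from the pre-regression commit,
# leaving the rest of the checkout as-is, then reinstall the package.
cd models/research
git checkout b55044d89751b44e102c97b992cb25cccdbd7ba9 -- object_detection/model_lib_v2.py
pip install .
```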
Guys, I reverted the codebase back to the one @germano2239 AND @mocialov mentioned above as follows, unless I did something wrong:
```dockerfile
# Downloading the TensorFlow Models
RUN git clone --progress https://github.com/tensorflow/models.git /tensorflow/models
# Froze the codebase before Nov. 2, 2020. https://github.com/cocodataset/cocoapi/commit/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9
RUN cd /tensorflow/models && git fetch && git checkout 33cca6c
```
It did not work, and I still get the same issue. The code is overwhelming to debug, and I think I will just switch back to TensorFlow 1.15 and the TF1 models codebase.
It could depend on the specific architecture used; I'm using "faster_rcnn_inception_resnet_v2_keras". I checked twice: for me that commit does make the difference, everything else being unchanged.
Do you mind posting some of your model training logs, e.g. precision, recall, mAP, or loss while training, @germano2239? I've used SSD Mobilenet, SSD Resnet50, and then CenterNet Resnet101, and they all behaved the same as in the screenshots I shared above. I believe the bug exists in the TFRecord reading process in TF2.x and is not specific to the pre-trained model.
Also @Geoyi, your real ground truth boxes are all messed up; I didn't have this issue, and in the OP's example image the road signs are OK.
> Also @Geoyi, your real ground truth boxes are all messed up
Yeah, I really don't understand that part. I've used the same dataset to train SSD Resnet101 under TF1.15, and it performed reasonably well two months ago. I might need to dig into my training dataset a little bit.
Hi. I am having a similar issue when training SSD Mobilenet v2. I saw the ground truth images in TensorBoard and thought it was a TFRecord problem, so I created a new TFRecord for my dataset. The issue persisted, so I then thought it was a TensorBoard bug. My model's mAP was very low even after training for 50,000 iterations. I thought that was plausible since SSD Mobilenet v2 already gets a fairly low mAP on the COCO dataset, so I attributed the low scores to the Mobilenet backbone and assumed they would improve with a heavier backbone. But after seeing this issue, I'm worried that my entire training run was wasted due to this bug. I trained two variants (pretrained and from scratch) of both the 320 and 640 SSD Mobilenet v2 models. Can you fix this bug soon so that I can train the models correctly this time?
> Guys, I reverted the codebase back to the one @germano2239 AND @mocialov mentioned above as follows, unless I did something wrong:
>
> ```dockerfile
> # Downloading the TensorFlow Models
> RUN git clone --progress https://github.com/tensorflow/models.git /tensorflow/models
> # Froze the codebase before Nov. 2, 2020. https://github.com/cocodataset/cocoapi/commit/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9
> RUN cd /tensorflow/models && git fetch && git checkout 33cca6c
> ```
>
> It did not work, and I still get the same issue. The code is overwhelming to debug, and I think I will just switch back to TensorFlow 1.15 and the TF1 models codebase.
If I just use THAT specific commit, I still get a bunch of extra bounding boxes in my ground truth. However, I am doing it as follows now (I know it is not great, but it eliminates the problem):
```bash
# Install the Object Detection API as usual
git clone https://github.com/tensorflow/models
cd models/research
protoc object_detection/protos/*.proto --python_out=.
pip install .

# Roll the checkout back to a pre-regression commit and reinstall
git reset --hard b55044d89751b44e102c97b992cb25cccdbd7ba9 && git clean -f
pip install .

export PYTHONPATH=$PYTHONPATH:.:./slim
```

```python
# The same PYTHONPATH addition, done from inside Python
import os
os.environ['PYTHONPATH'] += ':.:./slim'
```
Maybe that would at least solve the problem with the extra bounding boxes in your ground truth. However, check your actual bounding boxes, because they don't look right at all.
@germano2239 @mocialov, thanks, I just discovered that I switched x and y when reading and writing the tf.Examples: I wrote [ymin, xmin, ymax, xmax] where the correct order is [xmin, ymin, xmax, ymax]. I am going to try the solution you provided, @mocialov, and will report back how the new model training goes later today :).
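For reference, the box-related part of building a tf.train.Example with each coordinate list under its own key might look roughly like this (a minimal sketch using the standard OD API field names; image bytes, class text, etc. are omitted, and the helper name is just illustrative):

```python
import tensorflow as tf

def bbox_features(boxes, labels, width, height):
    """Box-related features only; boxes are (xmin, ymin, xmax, ymax) in pixels."""
    xmins = [b[0] / width for b in boxes]
    ymins = [b[1] / height for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymaxs = [b[3] / height for b in boxes]
    return {
        # Each coordinate list goes under its own key, normalized to [0, 1].
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        "image/object/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
    }
```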
@germano2239 @Geoyi @mocialov I have the same issue. I executed an eval run with commit b55044d89, and the issue remains.
I can confirm that the same issue exists at b55044d by just evaluating on COCO val2017, with vanilla EfficientDet-d0 config.
I've run into the same issue as the OP (using tf2.2.0)!

@mocialov, if I follow your instructions and then run a training job using `object_detection/model_main_tf2.py`, I get the following error:

```
ImportError: cannot import name 'string_int_label_map_pb2' from 'object_detection.protos' (/Users/.../git_repos/odo/models/research/object_detection/protos/__init__.py)
```

And indeed, during the git reset it prints `Removing object_detection/protos/string_int_label_map_pb2.py` (it removes many `object_detection/protos/*_pb2.py` files).
Does anyone know whether this only affects eval or could it also affect / mess up the training procedure?
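For what it's worth, those *_pb2.py files appear to be generated from the .proto files, so after the reset/clean they presumably need to be regenerated by rerunning the protoc step from the instructions above (untested guess):

```bash
# string_int_label_map_pb2.py and the other *_pb2.py modules are generated from
# the .proto files, so regenerate them and reinstall after the reset/clean.
cd models/research
protoc object_detection/protos/*.proto --python_out=.
pip install .
```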
I've been using TF1.15 for many different projects with multiple imagery sources, and I've never faced data reading issues like the ones we've seen with TF2.x (in my case I am using TensorFlow 2.2 with the Object Detection API). The training dataset did not give any meaningful model results at all, and I suspect the bug is being introduced into the training data, not only eval (BUT I MAY BE WRONG). My objects are already very small and hard to detect, and I can't afford for the bug to introduce additional noise into model training.
I moved back to TensorFlow 1.15 with Faster R-CNN and MobileNet; both worked as I expected.
I think I will stick with the current workflow with TF1.15 until the bug in TF2 is fixed.
> I've run into the same issue as the OP (using tf2.2.0)!
>
> @mocialov, if I follow your instructions and then run a training job using `object_detection/model_main_tf2.py`, I get the following error:
>
> ```
> ImportError: cannot import name 'string_int_label_map_pb2' from 'object_detection.protos' (/Users/.../git_repos/odo/models/research/object_detection/protos/__init__.py)
> ```
>
> And indeed, during the git reset it prints `Removing object_detection/protos/string_int_label_map_pb2.py` (it removes many `object_detection/protos/*_pb2.py` files). Does anyone know whether this only affects eval or could it also affect / mess up the training procedure?
You can just revert one file to the previous commit
Has this been fixed?
I spent the night tearing my hair out debugging the eval util scripts before I realized what was happening. Also very interested in what the impact has been on our training in past weeks.
> I spent the night tearing my hair out debugging the eval util scripts before I realized what was happening. Also very interested in what the impact has been on our training in past weeks.
At least for us there seemed to be no impact on training, so this is strictly a validation bug. But still a very annoying bug.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/object_detection/model_main_tf2.py
2. Describe the bug
I trained an EfficientDet D2 object detection model using my own TFRecord dataset with 12 classes. When I ran the validation loop (`model_main_tf2.py` with the `checkpoint_dir` parameter), the ground truth images in TensorBoard all had 100 boxes visualized even though only a few were provided by the validation dataset. All of the extra boxes have the class id 1 (with the text "additional-panels" in my dataset). I confirmed this by checking the values of `groundtruth_boxes` and `groundtruth_classes` here: https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib_v2.py#L709

`groundtruth_boxes` for the image shown looks like this: … and `groundtruth_classes` looks like this: …

The labels for the real ground truth boxes are correct. It looks like the culprit is this line, where `label_id_offset` is added to the class ids: https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib_v2.py#L707 Or maybe the bounding box visualization function should ignore these zero-area "padding" boxes.

3. Steps to reproduce
I can't provide the TFRecord files I'm using but this should be reproducible with any dataset where the labels start with id 1 as instructed here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md#label-maps
```
model_main_tf2.py --checkpoint_dir=<checkpoint_dir> --pipeline_config_path=<pipeline_config_path> --model_dir=<model_dir>
```
4. Expected behavior
Only the ground truth boxes in the validation set should be visualized.
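For illustration, the kind of padding filter I have in mind might look roughly like this (a rough sketch, not a tested patch; it assumes the per-image eval dict carries a num_groundtruth_boxes count next to the padded groundtruth_boxes/groundtruth_classes tensors):

```python
import tensorflow as tf

def drop_padding_boxes(eval_dict):
    """Keep only the real ground truth boxes for a single image before drawing.

    Assumes the eval dict holds groundtruth tensors padded to a fixed size and
    the true box count under 'num_groundtruth_boxes'.
    """
    num = tf.cast(eval_dict["num_groundtruth_boxes"], tf.int32)
    eval_dict["groundtruth_boxes"] = eval_dict["groundtruth_boxes"][:num]
    eval_dict["groundtruth_classes"] = eval_dict["groundtruth_classes"][:num]
    return eval_dict
```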
5. Additional context
Model config:
Tensorboard validation ground truth image:
6. System information