tensorflow / models

Models and examples built with TensorFlow

[SSD] Small object detection #3196

Open Tsuihao opened 6 years ago

Tsuihao commented 6 years ago

Hi all,

I have a question regarding the configuration of SSD. I am trying to fine-tune SSD_mobilenet_v1_coco_2017_11_17 on the Bosch Small Traffic Lights dataset.

However, the default setting resizes the image to 300 x 300 (image_resizer). Here is the total loss during training; it stays around 6. (Please ignore the overlap at 5000 steps, which comes from relaunching the training process.) (image: total loss curve)

I think the trend of the total loss is okay. However, when I stop at around 12k steps and feed in the test dataset (around 90 images for a quick try), nothing is detected.

image

Personally, I have some doubts about this issue:

  1. Maybe the small traffic lights are too small for SSD?
  2. However, why does the total loss curve show a correct "learning" process?

Can I simply change the image size in the config to 512 x 512 or an even larger value (1000 x 1000)? Will this work correctly as well?
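As a rough way to reason about the first doubt, the sketch below estimates how many pixels a small object keeps after a fixed-shape resize. The 1280 x 720 frame size and the 12 x 30 pixel traffic light are assumptions chosen for illustration, not measured values from the dataset.

```python
def resized_box_size(box_wh, image_wh, target_wh):
    """Return the (width, height) in pixels of a box after a fixed-shape resize."""
    box_w, box_h = box_wh
    img_w, img_h = image_wh
    tgt_w, tgt_h = target_wh
    return box_w * tgt_w / img_w, box_h * tgt_h / img_h


# Hypothetical 12 x 30 px traffic light in an assumed 1280 x 720 frame.
for target in [(300, 300), (512, 512), (1000, 1000)]:
    w, h = resized_box_size((12, 30), (1280, 720), target)
    print(f"{target[0]}x{target[1]}: {w:.1f} x {h:.1f} px")
```

Under these assumptions the object is less than 3 pixels wide at 300 x 300, which is one plausible reason for empty detections even when the loss curve looks healthy.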

Regards, Hao

AliceDinh commented 5 years ago

Sorry @aysark, I am not sure about your situation. I would try training at different resolutions and image sizes to see the accuracy of the SSD FPN model, and then decide on the size later.

gulingfengze commented 5 years ago

@aysark Try the following:

image_resizer {
  fixed_shape_resizer {
    height: 854
    width: 1200
  }
}

tmyapple commented 5 years ago

Hi everybody, I'm trying to reproduce the SSD_mobilenet_V1 results, but I'm having some trouble. I am using the TensorFlow Object Detection API with the config file that was uploaded with the frozen model, so I expected to get results similar to what was presented on the website. The resizing is done to 300x300, but my goal is to train on a bigger size (I plan to do so once I am sure that everything runs well on the basic example). It looks like some of you succeeded in the task, so I'd appreciate any help/insights/code...

BTW, are you working with the legacy scripts (train.py / eval.py) or the newer model_main.py script? I get different results from the two...

qmaruf commented 5 years ago

What happens if we decrease min_score_thresh? We can get more detections by lowering this threshold. https://github.com/tensorflow/models/blob/master/research/object_detection/utils/visualization_utils.py#L354

maxmine11 commented 5 years ago

@qmaruf

Yeah, the visualization will essentially draw onto your image any box with a confidence score above or equal to your min_score_thresh. A good min_score_thresh is usually around 0.5, but it also depends on what you're trying to see.
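A minimal sketch of what lowering the threshold does to the set of drawn boxes. It mimics only the score filtering step, not the library's drawing code, and the detections are made-up values.

```python
def filter_by_score(detections, min_score_thresh=0.5):
    """Keep only detections whose score is at least min_score_thresh."""
    return [d for d in detections if d["score"] >= min_score_thresh]


# Made-up detections with normalized [ymin, xmin, ymax, xmax] boxes.
detections = [
    {"box": (0.10, 0.10, 0.20, 0.20), "score": 0.92},
    {"box": (0.40, 0.40, 0.45, 0.45), "score": 0.31},
    {"box": (0.70, 0.70, 0.72, 0.72), "score": 0.08},
]

for thresh in (0.5, 0.3, 0.05):
    kept = filter_by_score(detections, thresh)
    print(f"min_score_thresh={thresh}: {len(kept)} box(es) drawn")
```

A lower threshold shows more boxes, but it only reveals low-confidence detections the model already produces; it does not make the model any better at finding small objects.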

wandonye commented 5 years ago

Maybe the last way is really like what you say, crop and re-annotate everything. That will be a lot of overhead.

I might have misunderstood. Why is cropping the images and transforming the annotations into the cropped images so difficult? It should be just a few lines of Python code.
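A sketch of the "few lines of Python" idea, assuming pixel boxes in (xmin, ymin, xmax, ymax) order. The function names and the min_visible rule for boxes cut by a tile border are illustrative choices, not something the Object Detection API provides.

```python
def crop_boxes(boxes, window, min_visible=0.5):
    """Map pixel boxes into a crop window, dropping boxes that are mostly cut off."""
    wx0, wy0, wx1, wy1 = window
    kept = []
    for xmin, ymin, xmax, ymax in boxes:
        area = (xmax - xmin) * (ymax - ymin)
        if area <= 0:
            continue  # skip degenerate annotations
        ix0, iy0 = max(xmin, wx0), max(ymin, wy0)
        ix1, iy1 = min(xmax, wx1), min(ymax, wy1)
        if ix1 <= ix0 or iy1 <= iy0:
            continue  # no overlap with this window
        if (ix1 - ix0) * (iy1 - iy0) / area < min_visible:
            continue  # too little of the object is left inside the crop
        # Shift the clipped box into the crop's own coordinate frame.
        kept.append((ix0 - wx0, iy0 - wy0, ix1 - wx0, iy1 - wy0))
    return kept


def tile_windows(image_w, image_h, tile):
    """Yield non-overlapping (xmin, ymin, xmax, ymax) windows covering the image."""
    for y in range(0, image_h, tile):
        for x in range(0, image_w, tile):
            yield (x, y, min(x + tile, image_w), min(y + tile, image_h))


# Hypothetical annotation in an assumed 1280 x 720 image, tiled at 320 px.
boxes = [(700, 400, 712, 430)]
for window in tile_windows(1280, 720, 320):
    remapped = crop_boxes(boxes, window)
    if remapped:
        print(window, "->", remapped)
```

One caveat with plain tiling: an object that straddles a tile border can be dropped from every tile, so overlapping windows may be worth considering.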

NightFury10497 commented 5 years ago

@Tsuihao @oneTimePad @Luonic @izzrak @augre @fdiazgon

I trained MobileNetV2 SSD for WIDER FACE detection. My images are 512*512 and the objects vary in size. I am sharing my pipeline file and the output I got after freezing the model. Screenshot from 2019-03-13 19-48-30

pipeline.txt

Please help me figure out this!

tmyapple commented 5 years ago

Hi @NightFury10497, lately I've worked with the Google Object Detection API and had my own struggles with it. Something that might help you in the training process is this part of the train_config:

rms_prop_optimizer {
  learning_rate {
    exponential_decay_learning_rate {
      initial_learning_rate: 0.004000000189989805
      decay_steps: 800720
      decay_factor: 0.949999988079071
    }
  }
}
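To see why this schedule may matter for fine-tuning, here is a small sketch of the standard exponential decay formula, lr = initial_lr * decay_factor ^ (step / decay_steps). Treating the decay as a staircase (integer division) by default is an assumption about the config's behaviour; pass staircase=False for the continuous form.

```python
def exponential_decay(step, initial_lr, decay_steps, decay_factor, staircase=True):
    """Learning rate at `step` under exponential decay.

    staircase=True (an assumed default) applies the decay in discrete jumps
    every `decay_steps` steps instead of continuously.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_factor ** exponent


# Values taken from the config quoted above (rounded).
for step in (0, 12_000, 200_000, 800_720):
    lr = exponential_decay(step, 0.004, 800_720, 0.95)
    print(f"step {step:>7}: lr = {lr:.6f}")
```

With decay_steps at 800720, the rate stays at 0.004 for the whole of a short fine-tuning run, so a smaller initial rate or a much smaller decay_steps is what actually changes the training behaviour.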

deimsdeutsch commented 5 years ago

Let me summarize this discussion... please correct me if I am wrong :-)

If I am training on my own custom dataset using the legacy train.py, the best strategy is to crop a higher-resolution image into smaller chunks so that I have many images.

Or, if I am using the newer model_main.py, is that process taken care of automatically?

Please reply.

NightFury10497 commented 5 years ago

@tmyapple @Tsuihao @oneTimePad @Luonic @izzrak @augre @fdiazgon Do I need to resize and create crops of the detected objects as per the above comments as well? Or is it fine to detect only a single class using MobileNetV2 SSD in the TensorFlow Object Detection API? Will it work fine if I change the learning rate?

tmyapple commented 5 years ago

  1. Changing the learning rate may help, because the one currently in pipeline.config is probably not what you need; it is the one that was used for the training that was done from scratch.
  2. Regarding crops: I guess it depends on the image resolution you have. I would first continue as you did, meaning work with the 512x512 fixed resizer and compare it to the results you get at 300x300.
  3. More thoughts: keep in mind the sizes of the objects you intend to detect. The model is trained to have 6 output branches with 6 anchors per pixel (except for the first branch, which has 3 anchors):

     anchor_generator {
       ssd_anchor_generator {
         num_layers: 6
         min_scale: 0.2
         max_scale: 0.95
         aspect_ratios: 1.0
         aspect_ratios: 2.0
         aspect_ratios: 0.5
         aspect_ratios: 3.0
         aspect_ratios: 0.3333
       }
     }

    • Do you really need these 6 output branches?
    • I guess you can even remove the last two aspect ratios (3:1, 1:3), because faces tend to be more "boxy".
    • min_scale: 0.2, max_scale: 0.95. Once you see that the model starts to learn something, this is another thing you may want to tune. min_scale defines the anchor scale relative to the image at the first output branch, and max_scale the anchor scale relative to the image at the last layer (it is interpolated across all the output layers in between). Reducing the scales may help to find smaller objects...
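The interpolation described above can be sketched as follows. This is a simplified illustration that assumes a plain linear interpolation between min_scale and max_scale and the SSD paper's convention for turning a scale and an aspect ratio into a box shape; the real anchor generator adds further details (for example the reduced anchor set in the first branch).

```python
import math


def ssd_anchor_scales(num_layers=6, min_scale=0.2, max_scale=0.95):
    """Linearly interpolate one anchor scale per output layer."""
    if num_layers == 1:
        return [min_scale]
    step = (max_scale - min_scale) / (num_layers - 1)
    return [min_scale + step * i for i in range(num_layers)]


def anchor_shape(scale, aspect_ratio):
    """Return (width, height) relative to the image for one anchor."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


for size in (300, 512):
    pixels = [round(s * size) for s in ssd_anchor_scales()]
    print(f"{size}x{size} input, base anchor sizes in px: {pixels}")
```

With the default min_scale of 0.2, the smallest base anchor on a 300x300 input is about 60 pixels across, which is far larger than a traffic light or a distant face a few pixels wide. That is the motivation for lowering min_scale.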
NightFury10497 commented 5 years ago

@tmyapple @Tsuihao @oneTimePad @Luonic @izzrak @augre @fdiazgon

1st: I trained MobileNetV2 SSD on the WIDER FACE dataset.
2nd: I annotated the images using YOLO, then converted the annotations to XML (VOC) and verified each one in labelImg.
3rd: I created the tfrecords and trained using the pipeline.

NOTE: the images are 300x300 with the annotated objects (with the fixed image resizer at 300x300). I changed the learning rates in the pipeline to the ones you mentioned before, and set decay_steps to 5000.

The ISSUE I am facing is below: Screenshot from 2019-03-14 11-12-46 Screenshot from 2019-03-14 11-12-58

The model is detecting the faces, but the detected bounding boxes are way too small.

NightFury10497 commented 5 years ago

"Reducing the scales may help to find smaller objects..."

Do you mean I should alter:

anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.20000000298023224
    max_scale: 0.949999988079071
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    aspect_ratios: 3.0
    aspect_ratios: 0.33329999446868896
  }
}

to

anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.2
    max_scale: 0.95
  }
}

tmyapple commented 5 years ago

@NightFury10497

  1. Did your loss function seem to converge?
  2. What tool do you use for visualization? I'm not sure how you plotted this image, but I recommend opening TensorBoard (in case you haven't). The events are written there periodically, and you will also get some images from your validation set with their detections.
  3. At the start, to check that everything works as expected, it is common practice to try to overfit on one image. Instead of one image, you can also just use the test.record path as your training input. It will help you diagnose your work.

BTW, I attach an example of the TensorBoard layout: tensorboard_ex

tmyapple commented 5 years ago

Regarding changing the anchor_generator so that it only keeps num_layers, min_scale and max_scale: you shouldn't remove all the anchors. You can try something like this:

anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.2
    max_scale: 0.95
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
  }
}

But if you ask me, you should start with the basic config and tune it from there later on.

NightFury10497 commented 5 years ago

@tmyapple @Tsuihao @oneTimePad @Luonic @izzrak @augre @fdiazgon

Issue is not there in training again. Please specify all the changes I should make in the MobileNetV2 SSD pipeline, for 300*300 images, to detect small objects.

ghost commented 5 years ago

@oneTimePad, @izzrak: do you guys have any idea about this? Hey guys, a quick hijack of the post here. By now (thanks to experiments by @AliceDinh) we know that FPN as a feature extractor paired with SSD helps increase accuracy on small objects. But here is another issue that I'm facing.

Problem Statement: Objects very similar to each other with the distinguishing feature between them being very small.

Example:

  1. Watches -> I trained regular MobileNet SSD on one specific watch (LG Watch). It works great. But the problem is that it detects any watch. I obviously cannot train on all watch brands/types in the world to exclude them during detection.

  2. Cars -> Attached below is the rear view of a Chrysler car. This is a 200 S. I have a dataset of rear views of the car. Now the problem is that the entire car rear looks the same for all tiers. For example, the difference between the 200 S (in the pic) and the 200 C would be the S and C in the badging on the car. (image: 2015-Chrysler-200-in-Detroit-blue-rear-view)

Ideas ->

  1. Reading this discussion, I had an idea: I could do unusual annotations. For example, first annotate the car to localize it from the environment, and then differentiate between cars using annotations on a character like 'S' or 'C'. This way SSD-FPN would help, because small objects like 'S' / 'C' are retained thanks to FPN, and SSD in general can separate the rear view of the car from the rest of the environment. Do you guys think this will help?

  2. I have no clue how to approach the problem with the watches, though. Let's say I have 10 specific watch classes. How would I go about annotating this dataset, and what kind of model can be used for it? The watches are similar to each other except for very minute differences in detail.

For Idea-2, here's what I already know and have. I trained with vanilla MobileNet-SSD and it didn't seem to help. My guess is that because the objects look the same in more than 90% of the pixels, the annotations of the two objects do not differ by much. But based on Idea-1, if instead of annotating the entire watch I annotate only a much smaller area where the difference between the classes is high (meaning more than 60% of the pixels in the annotated region differ between two watches), will detection get better? Of course, it then becomes small object detection because the number of pixels will be small, hence SSD-FPN. Would this help in any way?

Side Questions: Apart from those questions above, a couple of questions which I always confuse myself with:

  1. I currently have around 1500 pictures for each watch class, which I collected for my school project. Is this enough data per class, or do I need more pictures? I have 10 classes that I'm working with. I do know that the amount of data required is proportional to the parameter count of the architecture. I'm talking about SSD-FPN with ResNet50 or MobileNet, and also Faster R-CNN.
  2. For background class images, do I have to match the number of images per class (say 1500, as in the question above), do I need more, or can I get by with 100 images or so? Also, when we say background classes, can they be any images? Can I randomly pull data from other datasets and call it the background class?
  3. Will retaining the aspect ratio of the dataset help? I collected the watch dataset at an image size of 2592x1944 (4:3) and I RESIZE it to 640x480 (4:3) as the input image to the neural network. I don't want to use the high resolution because it uses a lot of memory to train and inference is slow, and I'm looking for an alternative to cropping my image data. Resizing sounds like the default option otherwise? I'm assuming this is better than resizing to a 1:1 aspect ratio because it preserves the integrity of the object. Or does it not matter, given how the anchor boxes and SSD in general work?

Thanks for the help.

Ekko1992 commented 5 years ago

@Deep-Sek Isn't it a better idea to use some other tricks to distinguish between those similar cars? For example, using OCR techniques to read the letters and decide whether it is a "C" series car or an "S" series car.

I had some experience classifying similar classes before, e.g. different types of cars (different brand, year, etc.) and different birds. It is indeed a hard problem, and I think you can have a look at papers in this domain, such as: http://openaccess.thecvf.com/content_cvpr_2017/papers/Fu_Look_Closer_to_CVPR_2017_paper.pdf

ghost commented 5 years ago

@Ekko1992 I skipped OCR techniques altogether because I thought that, since this is "OCR in the wild" where we don't control the environment, the performance would not be good. I'll give it a try asap and keep everyone updated on how it works out. Maybe I can do some affine transformations and control the text density and structure a bit.

Also, will take a look at the paper and try that too. Thanks a lot for the resources. I'll provide an update as soon as I can.

I did try this: http://vis-www.cs.umass.edu/bcnn/docs/bcnn_iccv15.pdf Basically, took this network architecture idea as a feature extractor and replicated it using MobileNet with bilinear connection and then plugged in the regular SSD for detection network after. Can you tell me what you think of that paper? The idea sounds like it should give amazing results.

But I am not sure I had enough data to justify training this huge network with double the parameters. It just took way too long to converge. I'll probably re-attempt it at a later time, after trying out your suggestions.

whuzs commented 5 years ago

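Hello @oneTimePad,

Thanks for the reply. So one approach I could take is: crop the traffic light images and then re-annotate them. I was trying to avoid doing this for all the images, because manual cropping and re-annotating would take days, I assume :p.

In my case, I also used the SSD MobileNet pre-trained on the COCO dataset and fine-tuned it with the traffic light dataset.

I made two assumptions (please correct me if I am wrong):

  1. During the image resize to 300 x 300, TensorFlow will also adjust the annotations in the "tf.record" data. In my case it does not work simply because the original 1280 x 720 image is resized to 300 x 300 and the small traffic lights almost disappear. I suspect this is why I cannot get correct results.
  2. I assumed that the released TensorFlow SSD MobileNet belongs to the SSD300 architecture, not the SSD500 architecture. That is why I tried changing image_resizer to a larger value (512 x 512); however, it still did not work.

Maybe the last way is really like what you say, crop and re-annotate everything. That will be a lot of overhead.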

Even if the images are cropped and re-annotated for training, the image at detection time is still so large that cropping seems to be of little use.

sky5media commented 5 years ago

Hello,

I am also facing the problem of recognizing small objects in an image. In my case I need to be able to detect multiple numbers (0-9) as well as tiny logos. Let's say we have an advertisement billboard of a more or less standard shape which contains 3-4 lines of small logos, each followed by digits. For example:

DHL - 1248265
UPS - 7623652
FedEx - 3726565

The real size of the billboard is pretty big, but we need to detect the numbers from a distance, so they become small in the image, although you could still easily recognize them on a phone screen. I am wondering if the following approach would work with SSD MobileNet V1/V2 models:

I will create a dataset consisting of individual numbers, logos and the whole billboard. Then we detect the whole billboard first, since it is pretty large relative to the image. After getting its bounding box, I will crop the image based on that, maybe enlarge it a bit, and then feed the result back to the model to detect logos and numbers.

So we would actually run the detector twice on the same image. I assume this would still be faster than running ResNet or Faster-RCNN on a mobile device.
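A sketch of the two-pass idea, mainly to show the coordinate bookkeeping. The `detect` and `crop` callables are hypothetical placeholders supplied by the caller (they are not Object Detection API functions); boxes are assumed to be normalized [ymin, xmin, ymax, xmax], and detections are (box, score, label) tuples.

```python
def expand_box(box, margin=0.1):
    """Grow a normalized box by `margin` of its size per side, clipped to [0, 1]."""
    ymin, xmin, ymax, xmax = box
    dh, dw = (ymax - ymin) * margin, (xmax - xmin) * margin
    return (max(0.0, ymin - dh), max(0.0, xmin - dw),
            min(1.0, ymax + dh), min(1.0, xmax + dw))


def to_full_image(inner_box, crop_box):
    """Map a box normalized to the crop back to full-image coordinates."""
    cy0, cx0, cy1, cx1 = crop_box
    h, w = cy1 - cy0, cx1 - cx0
    ymin, xmin, ymax, xmax = inner_box
    return (cy0 + ymin * h, cx0 + xmin * w, cy0 + ymax * h, cx0 + xmax * w)


def two_pass_detect(image, detect, crop, margin=0.1):
    """Detect on the full image, then again on a crop around the best first-pass hit.

    detect(image) -> list of (box, score, label); crop(image, box) -> image.
    Both are hypothetical callables provided by the caller.
    """
    first_pass = detect(image)
    if not first_pass:
        return []
    best_box, _, _ = max(first_pass, key=lambda det: det[1])
    region = expand_box(best_box, margin)
    second_pass = detect(crop(image, region))
    return [(to_full_image(box, region), score, label)
            for box, score, label in second_pass]
```

In practice the first pass would be filtered to the billboard class before picking the best hit. Whether the second pass helps depends on how much the crop enlarges the digits relative to the model's input size.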

Does anyone know if that would make any improvements for detecting process with SSD mobilenet?

jamessmith90 commented 5 years ago

Tensorflow is crap and below-par piece of shitty library written for the benefit of Google cloud.

Thank you.

dexception commented 5 years ago

For those who are visiting, let me break down the entire story for you. Comment out the following in your pipeline.config file. There are bugs depending on which version of TensorFlow you are using, so if you are working with a newer version this problem should not get in your way. For the old version:

data_augmentation_options {
  # random_horizontal_flip {
  # }
}

data_augmentation_options {
  # ssd_random_crop {
  # }
}

qraleq commented 5 years ago

@dexception Which version of TensorFlow are you referring to as the old version? And in which version was this bug fixed?

Thanks.

Gmrevo commented 5 years ago

"OK, I will try 224x224." @hengshanji Did training MobileNet SSD V2 with 224x224 solve the issue?

tcrockett commented 5 years ago

Here is something I tried that I haven't seen anyone else try here. My problem is that my camera input is 1280x960 and I'm looking for small labels. To keep the height from becoming too distorted when the image is fit into the 300x300 input space, I kept the aspect ratio but fit the image into the same number of pixels, e.g.

300 * 300 = 90e3, Y = X * 960/1280, 90e3 = X * X * 960/1280 = X^2 * 960/1280, X = sqrt(90e3 * 1280/960) = 346.41, Y = 259.81.

Rounding X and Y to integers while keeping X * Y < 90e3 with minimal wasted bytes gives an optimal new size of 346x260, with 40 * 3 wasted bytes. img.shape = (260,346,3)

image_resizer {
  fixed_shape_resizer {
    height: 260
    width: 346
  }
}

Retraining an SSD with Inception v2 this way, I should keep the meat of what the model has learned with minimal trouble. This converged to a loss of 1.8 after 86000 steps.
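The sizing arithmetic above can be sketched as a small helper. The function name and the tie-breaking rule (pick the largest integer size that stays within the pixel budget) are illustrative assumptions; the rule picks the largest area, not necessarily the closest aspect ratio.

```python
import math


def best_size(src_w, src_h, pixel_budget):
    """Integer (width, height) near the source aspect ratio within a pixel budget."""
    exact_w = math.sqrt(pixel_budget * src_w / src_h)
    exact_h = exact_w * src_h / src_w
    candidates = [
        (w, h)
        for w in (math.floor(exact_w), math.ceil(exact_w))
        for h in (math.floor(exact_h), math.ceil(exact_h))
        if w * h <= pixel_budget
    ]
    # Rounding both sides down always fits, so the list is never empty.
    return max(candidates, key=lambda wh: wh[0] * wh[1])


width, height = best_size(1280, 960, 300 * 300)
print(width, height, 300 * 300 - width * height)
```

For the 1280x960 camera and a 300 * 300 budget this gives 346x260, leaving 40 unused pixel positions (40 * 3 bytes for a 3-channel image).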

ghost commented 5 years ago

@tcrockett Preserving the aspect ratio should not really affect your training in any way. If your camera input is 4:3 (1280x960) and you resize your input image to 1:1 (300x300), and you're always consistent with this, then it shouldn't matter. For example, if you train your network by resizing your pics from 4:3 to 1:1, then as long as you do the same at inference time (post training) and convert your camera input from 4:3 to 1:1, the distortion applied to the image is consistent and the neural network doesn't care much about it. I can see the network having trouble with detections if you used a different aspect ratio to capture the raw data (before resizing) and then resized that to 1:1. But preserving the aspect ratio doesn't really do anything.

In SSD, the prior boxes have different aspect ratios which is why the aspect ratio of the input image doesn't really matter because the prior boxes will pick up the aspect ratio variation of the objects.

tcrockett commented 5 years ago

(image: logoCmpare) Left is 300x300, right is 260x346. Without adapting the aspect ratio, the width of the logo is represented in the 300x300 space by fewer pixels, which reduces the horizontal detail.
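A small sketch of the per-axis scale factors behind this point. The 40 pixel logo width is a hypothetical value used only for illustration.

```python
def axis_scales(src_wh, dst_wh):
    """Return the (horizontal, vertical) scale factors of a resize."""
    return dst_wh[0] / src_wh[0], dst_wh[1] / src_wh[1]


source = (1280, 960)
logo_width_px = 40  # hypothetical logo width in the camera frame

for target in [(300, 300), (346, 260)]:
    sx, sy = axis_scales(source, target)
    print(f"{target[0]}x{target[1]}: sx={sx:.4f}, sy={sy:.4f}, "
          f"logo width after resize = {logo_width_px * sx:.1f} px")
```

Under these assumptions the square resize shrinks the horizontal axis more than the vertical one (about 0.234 versus 0.313), while 346x260 scales both axes by roughly 0.27, so the logo keeps a little more horizontal detail for the same pixel budget.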

Siggi1988 commented 5 years ago

Hallo Tsuihao,

Is the loss in your graph for the traffic light detection in percent, or must I multiply the values by 100? My problem is the same, because I get values between 1 and 2.

Thanks for your answer Sigg

arvindchandel commented 4 years ago

Can anyone suggest something about retraining an object detection model? Suppose I train TensorFlow Faster R-CNN Inception on custom data with 10 classes like ball, bottle, Coca, etc., and it performs quite well. Later I get new data for 10 more classes like Paperboat, Thums Up, etc., and I want my model to be trained on these too. Is there any method to retrain my existing model on these 10 new classes, upgrading it to 20 classes, rather than starting training from scratch?

lorenzolightsgdwarf commented 4 years ago

Hi guys, here are my 2 cents: in my scenario I want to detect UI elements (buttons, checkbox, etc) from screenshots of 800x800 using ssd_mobile_net_v2. The dimensions of the objects range from 80px to 400px.

Lastly in my case I also have the need for an augmentation that creates an effect of zoom-in zoom-out for simulating projects at different scales and positions. For this I modify the preprocessor as in the pull request https://github.com/tensorflow/models/pull/8043 and used the configuration

data_augmentation_options {
    ssd_random_crop_pad_fixed_aspect_ratio{
         aspect_ratio: 1.0
         min_padded_size_ratio: [0.5,0.5]
         max_padded_size_ratio: [2, 2]        
         operations {
            random_coef: 0.5
            overlap_thresh: 1.0 
            clip_boxes: false 
            min_object_covered: 1.0  
            min_aspect_ratio: 0.25
            max_aspect_ratio: 4
            min_area: 0.1
            max_area: 1.0
        }
   }
}

On Stack Overflow someone explained how to test the augmentation. This is the adapted script to visualize the effect of the above operation

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import functools
import os

import cv2
import numpy as np
# The script uses TF1-style sessions, so import the compat.v1 API explicitly.
import tensorflow.compat.v1 as tf

from object_detection import inputs
from object_detection.core import preprocessor
from object_detection.core import standard_fields as fields
from object_detection.utils import test_case

tf.disable_eager_execution()


class DataAugmentationFnTest(test_case.TestCase):

  def test_apply_image_and_box_augmentation(self):
    # Put your augmentation here.
    data_augmentation_options = [
        (preprocessor.ssd_random_crop_pad_fixed_aspect_ratio, {
            'min_object_covered': [1.0],
            'aspect_ratio': 1.0,
            'aspect_ratio_range': [(0.25, 4)],
            'area_range': [(0.1, 1.0)],
            'overlap_thresh': [0.999999],
            'clip_boxes': [False],
            'random_coef': [0.0],
            'min_padded_size_ratio': (0.25, 0.25),
            'max_padded_size_ratio': (2, 2)})
    ]
    data_augmentation_fn = functools.partial(
        inputs.augment_input_data,
        data_augmentation_options=data_augmentation_options)
    # lena.png is the reference image. cv2.imread is used because
    # scipy.misc.imread has been removed from SciPy; it loads BGR.
    image = cv2.imread('lena.png')
    tensor_dict = {
        fields.InputDataFields.image:
            tf.constant(image.astype(np.float32)),
        fields.InputDataFields.groundtruth_boxes:
            # One ground-truth box in normalized coordinates [y1, x1, y2, x2].
            tf.constant(np.array([[0.5, 0.5, 0.53, 0.53]], np.float32)),
        fields.InputDataFields.groundtruth_classes:
            tf.constant(np.array([1.0], np.float32))
    }
    # This is the size of the resizer.
    final_image_size = (800, 800)
    os.makedirs('test', exist_ok=True)

    augmented_tensor_dict = data_augmentation_fn(tensor_dict=tensor_dict)
    with self.session() as sess:
      for x in range(100):
        out = sess.run(augmented_tensor_dict)
        final_image = out[fields.InputDataFields.image]
        boxes = out[fields.InputDataFields.groundtruth_boxes]
        height, width = final_image.shape[:2]
        print("Final Shape " + str(x) + ": ", final_image.shape)
        print("Final Boxes " + str(x) + ": ", boxes)
        if boxes.shape[0] > 0:
          # Draw the first box; (255, 0, 0) is blue in OpenCV's BGR order.
          y1, x1, y2, x2 = boxes[0]
          final_image = cv2.rectangle(
              np.ascontiguousarray(final_image),
              (int(x1 * width), int(y1 * height)),
              (int(x2 * width), int(y2 * height)), (255, 0, 0), 2)
        else:
          print("Boxes is empty")
        # Clip and cast so the float image can be written as an 8-bit JPEG.
        final_image = np.clip(final_image, 0, 255).astype(np.uint8)
        cv2.imwrite('test/lena_out' + str(x) + '.jpeg',
                    cv2.resize(final_image, final_image_size))


if __name__ == '__main__':
  tf.test.main()
sainisanjay commented 4 years ago

@eumicro what model and how did you fine-tune the model to get accurate prediction?

Hi, sorry my English is not that good. I described how I fine tuned and trained the SSD MobileNet here (only in German, sorry): http://eugen-lange.de/german-traffic-sign-detection/

the main "tuning steps" are:

  • generated my own data set (see my homepage for more details), I think it was the most important "step" ^^...
  • removed 2 first layers from the MobileNet
  • used grayscale pictures

From which file did you remove the first two layers?

synergy178 commented 4 years ago

@sky5media Have you been able to solve your issue? If yes, how? I am also trying to use object detection for OCR, but I have 14 classes and can only detect 9 of them with model_main. With train.py the loss does something weird: it does great for the first epoch and then grows exponentially into the billions.

sainisanjay commented 4 years ago

I am facing much the same issue with the ssd_mobilenet_v2_coco_2018_03_29 pre-trained model. The localisation loss is fluctuating and the loss is still quite high even after 50K steps. I am trying to train the model with 7 classes (Pedestrian; Truck; Car; Van; Bus; MotorBike; Bicycle). I know the same classes are already available in the pre-trained model, but I am feeding in my own images. Any idea what's wrong?

trainingloss

synergy178 commented 4 years ago

@sainisanjay Your learning rate (LR) is too high, I guess. Try setting a scheduled decay of the LR.

Check whether your objects are correctly annotated and easy to distinguish from the background.

Check the EXIF orientation of your pictures as well.

sainisanjay commented 4 years ago

@synergy178, I have the following parameters:

initial_learning_rate: 0.001
decay_steps: 40000
decay_factor: 0.95

I am not really sure how to check the EXIF orientation of my pictures. But I have visualised my TF records with tfrecord-viewer. This tool gives me the same results as the original annotation, as can be seen in the attached image.

sainisanjay commented 4 years ago

Further, I have checked the image orientation with the following two options. Both gave me the same orientation. Option 1: example with EXIF handling

import matplotlib.pyplot as plt
import image_to_numpy
img = image_to_numpy.load_image_file("my_file.jpg")
plt.imshow(img)
plt.show()

Option 2: Plain matplotlib.

from matplotlib import image
from matplotlib import pyplot

# Use a separate variable name so the imported `image` module is not shadowed
img = image.imread("my_file.jpg")
print(img.dtype)
print(img.shape)
pyplot.imshow(img)
pyplot.show()

images: exif, matplot

Since both libraries give the same orientation, I assume the orientation of the images is correct. Is the problem something else?

sky5media commented 4 years ago

@synergy178 unfortunately no, I couldn't solve it.

preronamajumder commented 4 years ago

Here is something I tried that I haven't seen anyone else try here. My problem is that my camera input is 1280x960 and I'm looking for small labels. To keep the height from becoming too distorted when the image is fit into the 300x300 input space, I kept the aspect ratio but fit the image into the same linear space (the same total number of pixels), e.g.

300 * 300 = 90e3, Y = X * 960/1280, 90e3 = X * X * 960/1280 = X^2 * 960/1280, X = sqrt(90e3 * 1280/960) = 346.41, Y = 259.81.

Rounding X and Y to integers to keep X * Y < 90e3 with minimal wasted bytes gives an optimal new size of 346x260 with 40 * 3 wasted bytes. img.shape = (260,346,3)

image_resizer {
  fixed_shape_resizer {
    height: 260
    width: 346
  }
}
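
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in the comment, not code from the original poster; the function name `equal_area_size` and the rule of trying the floor/ceil combinations are assumptions.

```python
import math


def equal_area_size(src_w, src_h, pixel_budget=300 * 300):
    """Pick an integer (width, height) that keeps the source aspect ratio
    as closely as possible while using at most `pixel_budget` pixels."""
    # Real-valued solution of w * h = budget with h / w = src_h / src_w.
    w = math.sqrt(pixel_budget * src_w / src_h)
    h = w * src_h / src_w
    best = None
    # Try the integer neighbours and keep the largest one within budget.
    for cand_w in (math.floor(w), math.ceil(w)):
        for cand_h in (math.floor(h), math.ceil(h)):
            area = cand_w * cand_h
            if area <= pixel_budget and (best is None or area > best[0] * best[1]):
                best = (cand_w, cand_h)
    return best


width, height = equal_area_size(1280, 960)
print(width, height, 300 * 300 - width * height)
```

For a 1280x960 source this reproduces the 346x260 size, with 40 pixels (40 * 3 bytes for a 3-channel image) left unused.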

Retraining an SSD with Inception v2, I should keep the meat of what the model has learned with minimal trouble. This converged to a loss of 1.8 after 86000 steps.

It is not a good idea to use a different height and width in the image resizer if you want to convert the model to UFF to run on edge devices, because you need to put the ratios into the UFF config file manually, and the function used to calculate the ratios takes only one variable as input. For 300x300, the ratios are calculated for 300. But in your case, 260x346, whether you input 260 or 346, the bounding boxes generated by the TensorRT model on the edge device will differ from the ones generated by the TensorFlow model on your PC.

sainisanjay commented 4 years ago

@preronamajumder Did you use transfer learning, or did you train the model from scratch? I believe that if you change the height and width, you cannot use the pre-trained model (300x300) for weight initialization.

HUI11126 commented 3 years ago

https://github.com/DetectionTeamUCAS/FPN_Tensorflow This project is based on Faster R-CNN + FPN, which is accurate at detecting small objects. But I was not able to deploy the project on OpenVINO, since the merge function in "fusion_two_layer" is limited on OpenVINO.

bhavyaj12 commented 3 years ago

Hi all. I'm trying to train an SSD on a custom barcode detection task. The issue is that the dataset images are all different sizes and keep aspect ratio resizer doesn't seem to be working with ssd resnet 50. Is it required for the input images to be the same sizes in 1:1 ratio as in the fixed resizer?

preronamajumder commented 3 years ago

@preronamajumder Did you use transfer learning, or did you train the model from scratch? I believe that if you change the height and width, you cannot use the pre-trained model (300x300) for weight initialization.

I used transfer learning with ssd_mobilenet_v2_coco. The fixed image resizer can be changed, but I started by setting it to 300x300.

preronamajumder commented 3 years ago

Hi all. I'm trying to train an SSD on a custom barcode detection task. The issue is that the dataset images are all different sizes and keep aspect ratio resizer doesn't seem to be working with ssd resnet 50. Is it required for the input images to be the same sizes in 1:1 ratio as in the fixed resizer?

Why don't you try padding the images? It will maintain the aspect ratio of the ground truth boxes and will also give the size required by the detection model.
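
A minimal sketch of that padding idea, assuming a square 300x300 target and centred padding (both are assumptions, as are the helper names `letterbox` and `map_box`). It only computes the geometry; the actual pixel resizing and padding would be done with whatever image library you already use.

```python
def letterbox(src_w, src_h, target=300):
    """Scale the longer side to `target` and centre the result.

    Returns (scale, pad_x, pad_y): the resize factor and the left/top
    padding in pixels. If the leftover is odd, the extra pixel goes to
    the right/bottom side.
    """
    scale = target / max(src_w, src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return scale, pad_x, pad_y


def map_box(box, scale, pad_x, pad_y):
    """Map (x_min, y_min, x_max, y_max) from source to padded coordinates."""
    x_min, y_min, x_max, y_max = box
    return (x_min * scale + pad_x, y_min * scale + pad_y,
            x_max * scale + pad_x, y_max * scale + pad_y)


scale, pad_x, pad_y = letterbox(1280, 960)
print(scale, pad_x, pad_y)
print(map_box((640, 480, 768, 576), scale, pad_x, pad_y))
```

Because the same scale is applied to both axes, a box keeps its width-to-height ratio after the mapping; only its position shifts by the padding.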

NickosKal commented 3 years ago

Hi @Luonic,

Yes, I had successfully trained faster rcnn and obtained an accurate result. As shown: image

However, it is too slow for my use case. That is why I want to try the fastest SSD mobilenet model :)

I have some concerns regarding the annotated information. When you crop the annotated images, how did you "update" the information in the original annotation? Let's say: Original image 1280 x 720 and the annotated traffic light is : boxes: {label: Green, occluded: false, x_max: 752.25, x_min: 749.0, y_max: 355.125, y_min: 345.125}

When you crop it to 300 x 300, the annotated image coordinate system needs to be updated. Did you manually re-annotate them, or is there some image-cropping tool that can help you do this?

Regards, Hao
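
As a rough illustration of the coordinate update asked about above: if a crop window starts at (crop_x, crop_y) in the original image, each box is shifted by that offset and clipped to the window. This is a sketch, not the method anyone in the thread confirmed using; the function name `shift_box_to_crop` and the crop origin (600, 200) are hypothetical, while the box values are the ones quoted above.

```python
def shift_box_to_crop(box, crop_x, crop_y, crop_w=300, crop_h=300):
    """Re-express a box in the coordinate system of a crop window.

    Returns a new dict with clipped coordinates, or None if the box lies
    entirely outside the crop.
    """
    x_min = max(box["x_min"] - crop_x, 0.0)
    y_min = max(box["y_min"] - crop_y, 0.0)
    x_max = min(box["x_max"] - crop_x, float(crop_w))
    y_max = min(box["y_max"] - crop_y, float(crop_h))
    if x_min >= x_max or y_min >= y_max:
        return None  # nothing of the box is left inside the crop
    return dict(box, x_min=x_min, y_min=y_min, x_max=x_max, y_max=y_max)


# Annotation from the 1280 x 720 image quoted above.
light = {"label": "Green", "occluded": False,
         "x_min": 749.0, "x_max": 752.25, "y_min": 345.125, "y_max": 355.125}

# Hypothetical 300 x 300 crop whose top-left corner is at (600, 200).
print(shift_box_to_crop(light, crop_x=600, crop_y=200))
```

The box keeps its size in pixels (3.25 x 10 here), so it covers a much larger fraction of the 300x300 crop than it would of the full frame resized to 300x300.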

Hey @Tsuihao could you share the repo you use for the faster-RCNN please? Thanks in advance!

Petros626 commented 1 year ago

I'm having several problems getting good detection on small objects. My images are 640x480 and the object sizes are typically around 70x35 - 120x60.

I'm using the typical ssd_mobilenet config file, and I train from the ssd_mobilenet_v2 pretrained model. I'm interested in good accuracy with great speed, so I need the SSD architecture. Maybe it is better to move to SSD Inception v2? Or can I change some parameters, like anchors and fixed_shape_resizer (but... how?)

Thank you for any advice,

@eumicro how did you edit the config file to obtain that good detection?

@darkdrake88 @sainisanjay

He removed the first two layers of the architecture in my opinion. I thought a bit about it and I'm sure these layers are excluded:

scientific paper (https://arxiv.org/abs/1801.04381):

224x224x3 conv2d, output_channels=32, stride=2
112x112x32 bottleneck, output_channels=16, stride=1

TF OD API (https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v2.py):

op(slim.conv2d, stride=2, num_outputs=32, kernel_size=[3, 3])
op(ops.expanded_conv, expansion_size=expand_input(1, divisible_by=1),num_outputs=16)

Questions about it:

  1. Is it necessary to rerun the protoc command (refer to the TensorFlow Installation guide), or can I just comment out these two lines and start training?
  2. Why does this change increase the ability of the model to detect smaller objects, which are farther away?
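
One possible intuition for question 2, offered as a sketch and not as a confirmed explanation: the first conv2d has stride 2, so dropping it halves the total downsampling before the detection feature maps. The numbers below assume a 300x300 input and a first SSD feature map at total stride 16 in the unmodified network (both assumptions), and the 10-pixel object is a hypothetical example.

```python
import math


def feature_map_side(input_side, total_stride):
    """Side length of a feature map under 'SAME'-style padding."""
    return math.ceil(input_side / total_stride)


def cells_covered(object_px, total_stride):
    """How many feature-map cells an object of `object_px` pixels spans."""
    return object_px / total_stride


object_px = 10  # hypothetical small object, in input pixels
for label, stride in (("original", 16), ("stride-2 conv removed", 8)):
    print(label, feature_map_side(300, stride), cells_covered(object_px, stride))
```

With the assumed strides, the first feature map grows from 19x19 to 38x38 and the same object spans twice as many cells per side, at the cost of more computation in every later layer.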