[SSD] Small object detection

Tsuihao commented 6 years ago

Hi all,

I have a question regarding the configuration of SSD. An interesting task for me is to fine-tune SSD_mobilenet_v1_coco_2017_11_17 on the Bosch small traffic light dataset.

However, the default setting resizes the image to 300 x 300 (image_resizer). Here is the total loss during training. The loss stays around 6. (Please ignore the overlap at 5000 steps; it is due to relaunching the training process.)

I think the trend of the total loss is okay. However, when I stop at around 12k steps and feed in the test dataset (around 90 images for a short try), nothing is detected.

Personally, I have some doubts about this issue:

Maybe the small traffic lights are too small for SSD?
However, why does the total loss curve display a correct "learning" process?

Can I simply change the image size in the config to 512 x 512 or an even larger value (1000 x 1000)? Will this work correctly as well?

Regards, Hao

oneTimePad commented 6 years ago

Did you try taking 300x300 crops from the images?

You could try training it on smaller images and feed in overlapping crops of size 300x300 that tile the original image, which could be bigger. I was able to train it on 1000x600 images, and it worked on my test set which was also 1000x600. This might be slightly hard since your original set is not 300x300, but if instead you could form a dataset out of random crops of size 300x300 from your original set then maybe...

The images I am actually working with are around 12MP, and I am feeding in crops of size 1000x600. However, with 1000x600, SSD is struggling to learn the classes, but the localization error is very low.

Tsuihao commented 6 years ago

Hi @oneTimePad,

Thanks for your reply.

I have thought about this approach too. However, in this case, I need to take care of the annotations too, right?

Did you first annotate all the images and then convert the annotations to the corresponding cropped images (with some Python script, I assume)?

Or did you first crop them and then annotate manually on those 300 x 300 images?

Luonic commented 6 years ago

@Tsuihao you crop the already annotated images. SSD has issues with detecting small objects, but Faster R-CNN is much better at this.

Tsuihao commented 6 years ago

Hi @Luonic,

Yes, I had successfully trained faster rcnn and obtained an accurate result. As shown:

However, it is too slow for my use case. That is why I want to try the fastest SSD mobilenet model :)

I have some concerns regarding the annotated information. When you crop the annotated images, how do you "update" the information in the original annotation? Let's say the original image is 1280 x 720 and the annotated traffic light is: boxes: {label: Green, occluded: false, x_max: 752.25, x_min: 749.0, y_max: 355.125, y_min: 345.125}

When you crop it to 300 x 300, the annotation's image coordinate system needs to be updated. Did you manually re-annotate them, or is there some image cropping tool that can help with this?
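The coordinate update asked about here is a translation by the crop origin plus clipping to the crop window. Below is a minimal sketch, assuming pixel coordinates in (x_min, y_min, x_max, y_max) order. The function name `remap_box_to_crop`, the `min_visible` threshold, and the example crop origin (600, 200) are hypothetical and do not come from any tool mentioned in this thread.

```python
def remap_box_to_crop(box, crop_x, crop_y, crop_w, crop_h, min_visible=0.5):
    """Translate a box from full-image pixels into crop-local pixels.

    `box` is (x_min, y_min, x_max, y_max) in full-image coordinates.
    Returns None when less than `min_visible` of the box area lies in the crop
    (hypothetical rule for dropping boxes that the crop cuts off).
    """
    x_min, y_min, x_max, y_max = box
    # Clip the box to the crop window, still in full-image coordinates.
    cx_min = max(x_min, crop_x)
    cy_min = max(y_min, crop_y)
    cx_max = min(x_max, crop_x + crop_w)
    cy_max = min(y_max, crop_y + crop_h)
    if cx_max <= cx_min or cy_max <= cy_min:
        return None  # the box lies entirely outside the crop
    full_area = (x_max - x_min) * (y_max - y_min)
    visible_area = (cx_max - cx_min) * (cy_max - cy_min)
    if visible_area < min_visible * full_area:
        return None  # too little of the object is left to be a useful label
    # Shift the origin to the top-left corner of the crop.
    return (cx_min - crop_x, cy_min - crop_y, cx_max - crop_x, cy_max - crop_y)


# The traffic light from the annotation above, with a 300 x 300 crop whose
# top-left corner is at (600, 200) in the 1280 x 720 image.
box = (749.0, 345.125, 752.25, 355.125)
local = remap_box_to_crop(box, crop_x=600, crop_y=200, crop_w=300, crop_h=300)
print(local)  # (149.0, 145.125, 152.25, 155.125)
```

A crop that does not contain the object returns None, so the same function can be used to filter annotations per crop.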

Regards, Hao

oneTimePad commented 6 years ago

Ah, yes. Completely forgot about the annotation. In my case I have a program that generates all of my training data, so I can easily change the training data image size (which will then change the annotations). However, yeah, you could write a program that converts the bounding box coordinates as you mentioned, but as I said I am still struggling with getting the classification accuracy up.

An idea I had was to first train the mobilenet base network, fine-tuning from the checkpoint trained on the COCO dataset or a classification checkpoint, to just classify small crops of the objects of interest. In your case, that means crops of traffic lights, classified by color. Then go back to SSD and fine-tune the model from these weights trained to classify. I haven't tried this yet, but it might help mostly with the classification accuracy.

You mentioned mobilenet(s); have you tried a different base network?

Tsuihao commented 6 years ago

Hi @oneTimePad,

Thanks for the reply. So one thing I could do is crop the traffic light images and then re-annotate all of them. I was trying to avoid this since the manual cropping and re-annotating will take a few days, I assume :p.

In my case, I also used the SSD mobilenet pre-trained on the COCO dataset and fine-tuned it with the traffic light dataset.

There are two assumptions I made (please correct me if I am wrong):

During the image resize to 300 x 300, TensorFlow will also resize the annotations in the "tf.record" data. In my case, it does not work simply because, when the original 1280 x 720 images are resized to 300 x 300, the small traffic lights nearly vanish. I suspect that is the reason I could not get a correct result.
I assume that the released TensorFlow SSD mobilenet uses the SSD300 architecture, not the SSD500 architecture. This is why I was trying to change the image_resizer to a larger value (512 x 512); however, it still did not work.
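To put a number on the first assumption, here is a small arithmetic sketch of what a fixed-shape resize does to the example traffic light box quoted earlier in the thread (x from 749.0 to 752.25, y from 345.125 to 355.125, in a 1280 x 720 image). The helper name `resized_box_size` is hypothetical, and the sketch assumes a plain stretch to the target shape with no padding.

```python
def resized_box_size(box_w, box_h, src_w, src_h, dst_w=300, dst_h=300):
    """Return the (width, height) in pixels of a box after a fixed-shape resize.

    Assumes the whole image is stretched to dst_w x dst_h, so the two axes
    are scaled independently.
    """
    return box_w * dst_w / src_w, box_h * dst_h / src_h


# Box size in the original 1280 x 720 image: 3.25 px wide, 10 px tall.
w, h = resized_box_size(752.25 - 749.0, 355.125 - 345.125, 1280, 720)
print(f"{w:.2f} x {h:.2f} px")  # 0.76 x 4.17 px
```

Under these assumptions the object ends up narrower than one pixel at 300 x 300, which is consistent with the suspicion that it "nearly vanishes".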

Maybe the last resort really is what you said: crop and re-annotate everything. That will be a lot of overhead.

izzrak commented 6 years ago

If you want to train an SSD512 model, you need to start from scratch. The pre-trained model can only be fine-tuned as an SSD300 model.

augre commented 6 years ago

Hi @Tsuihao, did you successfully train the SSD model on small objects? If so, how did you get around the problem?

My original images are 512x512. I am thinking about cropping them to 300x300 around the areas of interest and creating the TFRecords file from the cropped ones. Would this be OK?

Tsuihao commented 6 years ago

Hi @augre,

I have not tried it yet. I am also thinking about the same approach you described and will try it as soon as I have time.

I am not sure how cropping the training images will affect performance. Maybe you can share your experience later :)

chanyoungjung commented 6 years ago

Hi @Tsuihao

Could you share your trained model (faster-rcnn)?

And what framework did you use for training, Caffe or TensorFlow?

Thanks

jhagege commented 6 years ago

@Tsuihao Any progress on this method? I'm having the same issue. Do you have any interesting findings that you remember and could share? Thanks!

sapjunior commented 6 years ago

Try this paper: S3FD: Single Shot Scale-invariant Face Detector (https://arxiv.org/abs/1708.05237). They modified SSD's OHEM and IoU criterion to be more sensitive to small objects like faces.

paolomanchisi commented 6 years ago

Hi, I'm interested in training ssd500 mobilenet from scratch, can someone give me some hints? Thank you.

elifbykl commented 6 years ago

Hi @Tsuihao

I have a problem with ssd_mobilenet_v2_coco. My images are 600x600, but the config file resizes them to 300x300. Is there any possibility of working with 600x600 in this case? Do my training images have to be 300x300? How did you solve the small object problem?

abhishekvahadane commented 6 years ago

@sapjunior : Have you used the implementation on some application other than faces?

Tsuihao commented 6 years ago

@jungchan1 sorry, I cannot provide my trained model. I was using TensorFlow.

@cyberjoac Nope, I did not go further on this topic; however, I am still looking forward to seeing whether anyone can share their experience in this community :)

@elifbykl Resizing 600x600 to 300x300 sounds acceptable to me; however, it also depends on the relative size of the objects you are working with. Based on the above discussion, your training images will be resized to 300x300 due to the fixed SSD architecture provided by TensorFlow. I still have not solved small object detection with SSD.

eumicro commented 6 years ago

I trained a model capable of recognizing 78 German traffic signs. I used TensorFlow's Object Detection API for the training. The model can recognize the signs at a distance of about 15 meters. Here you can download the model and try it out.

Model: http://eugen-lange.de/download/ssd-4-traffic-sign-detection-frozen_inpherence_graph-pb/

julianklumpers commented 6 years ago

@Tsuihao I had a similar problem and needed to slice the images into smaller tiles/crops. However, I had already labelled my dataset and was not sure what tile size was suitable for training. So I wrote a Python script that slices the image into tiles of a given size and recalculates the annotations for you, in a separate .xml file per tile/image it creates.

Here is the code. It's far from perfect, but I needed a quick solution. https://github.com/julianklumpers/slice_image_with_annotations/blob/master/slice_image_with_annotations.py

It uses OpenCV rather than PIL because I tested both and OpenCV was much quicker at slicing and saving the images. It names each tile with its coordinates in the original image, so I can stitch the image back together. Feel free to adjust it to your needs. I will probably make a library some day.

The function creates 2 rows and 2 columns. So if you have an image that is 1000x1000 and you need 500x500 tiles, you just put size=(2,2), since 1000 / 2 = 500.
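For readers who want to see the grid arithmetic without opening the linked script, here is a rough standalone sketch of the same idea. It is not the linked implementation: the name `grid_tiles` and the choice to let the last row and column absorb any remainder pixels are assumptions made for illustration.

```python
def grid_tiles(width, height, rows, cols):
    """Return (x0, y0, x1, y1) pixel windows for a rows x cols grid.

    Tiles in the last row/column absorb the remainder when the image size
    is not an exact multiple of the grid (an assumption of this sketch).
    """
    tile_w, tile_h = width // cols, height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * tile_w, r * tile_h
            x1 = width if c == cols - 1 else x0 + tile_w
            y1 = height if r == rows - 1 else y0 + tile_h
            tiles.append((x0, y0, x1, y1))
    return tiles


# A 1000x1000 image split into a 2x2 grid gives four 500x500 tiles.
tiles = grid_tiles(1000, 1000, rows=2, cols=2)
print(tiles)
# With an image loaded as a NumPy/OpenCV array, each tile would be
# image[y0:y1, x0:x1]; (x0, y0) is also the offset to subtract from the
# bounding-box coordinates that fall inside that tile.
```

Keeping (x0, y0) alongside each tile is what makes it possible to stitch results back together later.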

willSapgreen commented 6 years ago

Hello @Tsuihao,

have you tried the stock SSD_mobilenet_v1_coco_2017_11_17 without training and looked at the result visually?

In my situation, the stock SSD_inception_v2_coco_2017_11_17 performs better on car detection than my model trained with KITTI.

I am still working on this and hopefully can get back to you ASAP.

Best,

Tsuihao commented 6 years ago

Hi @willSapgreen,

Yes, I have tried using the stock SSD_mobilenet_v1_coco_2017_11_17 for traffic light detection, and the result is better than my SSD trained with the traffic light dataset.

However, this result is to be expected, since SSD_mobilenet_v1_coco_2017_11_17 was trained on the COCO dataset. In my case, I need more details about the detected traffic lights, e.g. red, green, yellow, red left, etc.

In your case, you want to detect cars. I believe that a car in the image is much bigger than a traffic light; therefore, you should not have the same issue as me (the traffic light being too small). I would suggest that you:

Check your TensorBoard report (to see whether the training result is good or bad)
Try a different model, e.g. faster_rcnn (to see whether your data/labels are valid)

Regards,

lozuwa commented 6 years ago

Hey, I read that you struggled with resizing/cropping and then labeling again. I had the same problem, so I made some scripts that I am trying to turn into a library. Why don't you check them out: https://github.com/lozuwa/impy

There is a method called reduceDatasetByRois() that takes in an offset and produces images of size (offset)X(offset) which contain the annotations of the original image.

simonegrazioso commented 6 years ago

I'm running into several problems getting good detection on small objects. My images are 640x480 and the objects are typically around 70x35 to 120x60.

I'm using the typical ssd_mobilenet config file, and I train from the ssd_mobilenet_v2 pretrained model. I need good accuracy at high speed, so I need the SSD architecture. Maybe it is better to move to SSD inception v2? Or can I change some parameters, like the anchors and fixed_shape_resizer (but... how?)
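As a rough way to reason about the anchor question, the sketch below compares base anchor sizes with the object size after resizing. It assumes the linear scale spacing from the SSD paper and assumes min_scale=0.2, max_scale=0.95 and num_layers=6 for the ssd_anchor_generator; check these against your own config file. The function name `ssd_anchor_scales` is hypothetical, and the Object Detection API may treat the lowest layer specially, so take the numbers as an approximation.

```python
def ssd_anchor_scales(min_scale=0.2, max_scale=0.95, num_layers=6):
    """Linearly spaced anchor scales, one per feature map (SSD paper formula).

    Scales are fractions of the resized input size. The default values are
    assumptions; read the real ones from your pipeline config.
    """
    step = (max_scale - min_scale) / (num_layers - 1)
    return [min_scale + step * k for k in range(num_layers)]


input_size = 300  # fixed_shape_resizer target, assumed square
anchor_px = [round(s * input_size) for s in ssd_anchor_scales()]
print(anchor_px)  # [60, 105, 150, 195, 240, 285]

# A 70x35 object in a 640x480 image after a stretch to 300x300:
obj_w = 70 * input_size / 640
obj_h = 35 * input_size / 480
print(f"{obj_w:.1f} x {obj_h:.1f} px")  # 32.8 x 21.9 px
```

Under these assumptions the smallest objects are well below the smallest base anchor, which suggests that lowering min_scale or raising the resizer resolution are the parameters worth experimenting with.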

Thank you for any advice,

@eumicro how did you edit the config file to obtain that good detection?

hengshan123 commented 6 years ago

Hi, I have a problem related to this, but it's a little different. I want to train a model to detect my hand (yes, only one class) and run the model on my phone. But the speed is a little slow, about 400 ms. If I resize the image to a smaller size like 100*100, the speed is much faster, but the precision is very bad. I guess I need to train the SSD from scratch, is that right? @izzrak

Luonic commented 6 years ago

You have to go with MobileNet v2. On a modern device you would get around 200 ms per image. It operates on 224x224 images. 100x100 is too small for robust detection. If you want a smooth UI, you can track feature points with a classic CV tracker and, while new predictions are being calculated, animate the UI with the tracked movement.

hengshan123 commented 6 years ago

OK i will try 224*224

AliceDinh commented 6 years ago

I have the same problem with detecting small objects. My input is 660x420 and the objects are about 25x35. I consider my objects medium-sized, but SSD mobilenet v1 gives low accuracy and the training time is long. I tried making my input 660x660 (width:height = 1:1), as recommended by @oneTimePad, to see whether it helps with SSD's resizing step to 300x300. The answer is yes, but not much.

simonegrazioso commented 6 years ago

@AliceDinh, what do you mean by long training time? How many steps? Which learning rate? Did you change the anchor values?

AliceDinh commented 6 years ago

@simonegrazioso

By long training time, I mean that it takes more than 200K steps to get loss ~= 1.0. (With FasterRCNN, I get loss ~= 0.02 after 2K steps.)
Where can I check the learning rate? Is it in TensorBoard? I trained on a server without Internet access, so I could not launch TensorBoard from there.
Change the anchor values? Which specific values should I change?

ashraful100 commented 6 years ago

@AliceDinh

The learning rate is defined inside the configuration. If you haven't changed or edited anything inside the config file, it would be like below for SSD Mobilenet v1:

learning_rate: {
  exponential_decay_learning_rate {
    initial_learning_rate: 0.0002
    decay_steps: 800720
    decay_factor: 0.95
  }
}
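To see what these three values imply over a training run, here is a small sketch that evaluates the usual exponential-decay formula, lr = initial_learning_rate * decay_factor ^ (step / decay_steps). The function name `decayed_learning_rate` is hypothetical, and whether the exponent is floored (the `staircase` flag here) is an assumption about how the schedule is applied, so both variants are shown.

```python
def decayed_learning_rate(step, initial_lr=0.0002, decay_steps=800720,
                          decay_factor=0.95, staircase=True):
    """Evaluate an exponential-decay schedule at a given global step.

    With staircase=True the exponent is floored, so the rate changes only
    once every `decay_steps` steps (assumed behaviour, not confirmed here).
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_factor ** exponent


for step in (0, 77_000, 200_000, 800_720):
    stepped = decayed_learning_rate(step)
    smooth = decayed_learning_rate(step, staircase=False)
    print(f"step {step:>7}: staircase={stepped:.8f} smooth={smooth:.8f}")
```

Either way, with decay_steps at 800720 the learning rate barely moves within the first 200K steps, so the schedule alone is unlikely to explain a loss that plateaus early.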

Can you please share some of your experience training SSD Mobilenet v1?

In my case, my object size varies from, let's say, 6 x 6 pixels to 20 x 20 pixels. My train and test image size is 200 x 200 pixels. Each image contains around 40 objects on average. I used (for now) 300 images for training and 50 images for testing. Over the last week, I trained for 77K+ steps, but the total loss didn't go below 3.5. As a result, I didn't get much better detection.

Should I focus on increasing the number of training images, or on training for more steps with the same images, to get good detection?

I would like to know: did you get better detection after training for more than 200K steps?

@oneTimePad

My original images are bigger than 1000 x 1000 pixels, so I cropped them to 200 x 200 pixels, labelled them with a tool, and trained. After training, I observed that detection is comparatively better on the 200 x 200 training images than on the 1000 x 1000 original images. Could you please explain the reasons behind that?

Thanks in advance.

AliceDinh commented 6 years ago

@ashraful100 my training images are 660x416, with 16-19 objects of size 25x35. I consider my objects medium-sized, not as small as yours. I mostly used the default parameters of SSD mobilenet V1 for training. I modified only the PATH_TO_BE_CONFIGURED entries and the number of classes. After 15K steps, the total loss dropped to 0.8 but stayed there forever. The model can detect most of the objects, but not all of them as I expected. However, whatever it detected, it labeled correctly. I am training with other input sizes to see whether there are any differences, because the resizing step of SSD matters for others. I'll let you know the result later on.

faellie commented 6 years ago

@Tsuihao @Luonic Hi Hao Tsui and Luonic, I am trying to train on a dataset with large images (2040x1563) containing small objects too (~25x25). I am trying to use Faster R-CNN (using the config file faster_rcnn_inception_v2_pets.config). I am having trouble detecting anything. Would you be able to share some experience on how you trained Faster R-CNN on this kind of dataset? (Speed is not my concern here.)

I am very new to this, so any suggestion is much appreciated. I think my current problem is that the RPN doesn't propose any boxes (I see none in TensorBoard).

Note: If I crop the images down to ~600x400, it seems to work, but I am wondering whether it is just impossible otherwise or whether I need to make some configuration changes.

I am also very confused about height_stride: 16 and width_stride: 16 in first_stage_anchor_generator. From what I read, I should change them to a smaller value (since my object is only around 25x25 in size).

But when I change them to 8 or 4, I do not see any improvement on the uncropped images. When I use the dataset with cropped images, the original 16/16 gives the best result.

fediazgon commented 6 years ago

Hi guys, I've successfully finetuned a pre-trained ssd-mobilenet with some tips I read in this conversation. I still have to try if the same method works for really small objects too (<20 pixels).

First, my input image was 1920x1080, so I decided to take crops of 400x400 to train the neural network. I don't think it is important how big the training crops are, as long as you don't lose too much detail and they have the same aspect ratio as the input volume of the model you are finetuning (300x300 in my case). I did not change any hyperparameters.

After that, to perform the detection, I split the image (which can be of any size) into square crops. I've tried multiple sizes, but it seems that taking 6-8 crops is enough for me. If I predict on the whole image (without cropping), it detects some objects but the result is less accurate; however, it is much faster (300 ms vs 800 ms, CPU only).
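When detection runs per crop, each predicted box has to be mapped back into the coordinates of the full image. The sketch below shows that step under the assumption that the detector returns boxes normalized to the crop in [ymin, xmin, ymax, xmax] order, which is the convention of the Object Detection API's detection_boxes output. The function name `crop_box_to_image` and the example numbers are made up for illustration.

```python
def crop_box_to_image(norm_box, crop_x, crop_y, crop_w, crop_h):
    """Map a crop-normalized [ymin, xmin, ymax, xmax] box to full-image pixels.

    Returns (x_min, y_min, x_max, y_max) in the coordinate system of the
    original image, given the crop's top-left corner and size in pixels.
    """
    ymin, xmin, ymax, xmax = norm_box
    return (crop_x + xmin * crop_w, crop_y + ymin * crop_h,
            crop_x + xmax * crop_w, crop_y + ymax * crop_h)


# A detection covering the middle-right part of a 400x400 crop whose
# top-left corner sits at (600, 200) in the full image.
full_box = crop_box_to_image((0.25, 0.5, 0.5, 0.75),
                             crop_x=600, crop_y=200, crop_w=400, crop_h=400)
print(full_box)  # (800.0, 300.0, 900.0, 400.0)
```

If the crops overlap, the same object can be reported by two crops, so a non-maximum suppression pass over the merged boxes is probably needed afterwards.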

So, basically, it is a trade-off here. If I have time, I would like to train the model using distorted images (I mean, with an aspect ratio different from that of the input volume), to see if I can detect objects without cropping.

You can see the result here if you want: https://github.com/fdiazgon/cone-detector-tf

I will try to comment again if I get some interesting results.

PS: really interesting conversation, by the way.

TBdt38 commented 6 years ago

I'm working on similar models, but I'm quite new. Could you elaborate on what you mean when you say that you take crops to train the network?
I'm not sure I understand how it helps to identify small objects. By using cropped images, it is like a zoom in, right? So I understand it would reduce the training losses, but during prediction, won't the problem be the same for tiny objects? Also, what is a good strategy for the cropping? Crop the bounding boxes directly, crop the center of the image, or crop a random area? Thanks for the explanations!

fediazgon commented 6 years ago

By using cropped images, it is like a zoom in right?

Yes. The thing is, you have a 1280*720 image with an object of size 10x10 in it. But the input volume of the NN you are finetuning is of size 300x300, so the NN has to squeeze the image before processing it and, consequently, the object you are trying to detect is no longer 10x10; it is much smaller and maybe impossible to detect.

but during prediction, won't the problem be the same for tiny objects?

Not if you take crops during prediction (at the expense of increasing the detection time).

Also for the cropping, what is good strategy?

I 'scanned' the whole image. Make the crops overlap a few pixels in case the crop cuts an object in half. I did this for an image of 1280x720 with crops of 300x300, and I needed to take eight crops (you can increase the size of the crop).
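A minimal sketch of such a scan is shown below. It is an illustration under assumptions, not the code used for the results above: the names `tile_origins` and `crop_windows`, the 20-pixel overlap, and the rule of placing the last crop flush with the image edge are all made up for the example.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of windows of size `tile` covering [0, length).

    Consecutive windows overlap by at least `overlap` pixels; the last
    window is placed flush with the far edge, so it may overlap more.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if tile >= length:
        return [0]  # one window already spans this axis
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)
    return origins


def crop_windows(width, height, tile, overlap):
    """Return square (x0, y0, x1, y1) crop windows scanning the whole image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


crops_300 = crop_windows(1280, 720, tile=300, overlap=20)
crops_400 = crop_windows(1280, 720, tile=400, overlap=20)
print(len(crops_300), len(crops_400))  # 15 8
```

With these particular settings, full coverage of 1280x720 takes 15 crops at 300x300 and 8 crops at 400x400, which illustrates how a slightly larger crop size reduces the number of forward passes.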

TBdt38 commented 6 years ago

OK, I see! Did you use special routines for the cropping task, or internal TensorFlow methods? I saw that some are apparently available in TF. For prediction, if you crop too, I understand it will certainly take more time! Did you still prefer cropping with ssd mobilenet over using faster rcnn with a single-image prediction? Does mobilenet + multiple cropped images still save processing time?

fediazgon commented 6 years ago

it still save processing time using mobilenet+multiple cropped images?

That's a good question. I haven't tried. The problem that I have is that I need the smallest and least resource-consuming network possible and, from a quick read, mobilenet-ssd seemed to be the best choice (maybe I am wrong). But it would be awesome if I could avoid cropping.

AliceDinh commented 6 years ago

Hi guys, I found that different input image sizes do not affect the accuracy much; the results were similar (for my dataset). I highly recommend SSD-FPN, which improves small object detection significantly.

david-macleod commented 6 years ago

@AliceDinh thank you for the insight, I am looking to try SSD-FPN. Could you please explain further what you mean by

SSD anchor generator by somehow creates crops automatically and randomly?

AliceDinh commented 6 years ago

@david-macleod Try SSD-FPN, training with the TensorFlow Object Detection API version 2 (which replaces train.py with model_main.py); you will see the improvement. About the cropping, that is what I found for my data, which is quite sensitive. I will try with other data and then post here later on.

AliceDinh commented 6 years ago

Hi @Tsuihao, in the result that you posted a while ago, how did you display the time in milliseconds and the total number of detected objects? Please show me! Thanks

giridhar13 commented 6 years ago

@AliceDinh Did you try training with SSD-FPN? If yes, did you find any improvement over the normal SSD?

giridhar13 commented 6 years ago

I was training SSD_resnet_fpn on the Bosch Small TL dataset and hit this issue during training. My losses spiked at step 6K. Should I continue training? My regularization loss does not come down. From the original config file I modified just 2 things: batch size 32 instead of 64, and warmup steps/total steps 800/18000 instead of 2000/25000.

dscha09 commented 5 years ago

@eumicro what model and how did you fine-tune the model to get accurate prediction?

eumicro commented 5 years ago

@eumicro what model and how did you fine-tune the model to get accurate prediction?

Hi, sorry my English is not that good. I described how I fine tuned and trained the SSD MobileNet here (only in German, sorry): http://eugen-lange.de/german-traffic-sign-detection/

the main "tuning steps" are:

generated my own data set (see my homepage for more details), I think it was the most important "step" ^^...
removed the first 2 layers from the MobileNet
used grayscale pictures

AliceDinh commented 5 years ago

@giridhar13 The SSD-FPN that I tried is SSD Mobilenet V1 FPN, and the accuracy is very good compared to SSD Mobilenet V1, where the total loss got to around 0.8 and stayed there forever.

aysark commented 5 years ago

What size are the images in your data? The problem with SSD is that it doesn't work for large images.

AliceDinh commented 5 years ago

660x416 is my image size, object size is about 25x35

aysark commented 5 years ago

@AliceDinh doesn't SSD FPN require fixed shape image resizing? Do you pad your images?

I've tried 640x640 SSD FPN and got poor results.

AliceDinh commented 5 years ago

I got good result without padding

aysark commented 5 years ago

@AliceDinh thank you for the insight. What model architecture do you suggest I use, based on your experience? Image size: 1200x854. Object size: 40x40.

So if I downsize my images, the objects will be really small.

I tried taking 256x256 patches, but the accuracy was not good. How can I train SSD FPN with large imagery?

tensorflow / models

[SSD] Small object detection #3196