pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

detection small object from big images #1535

Open chi8411 opened 5 years ago

chi8411 commented 5 years ago

Hi, I want to detect traffic signs in large images. What can I do to improve the accuracy? I also want to try changing the network. Should I just change the convolutional layers in the cfg? And if I want to reduce the number of downsampling layers in YOLOv3, how should I do that? Thank you.

AlexeyAB commented 5 years ago

@chi8411 Hi,

chi8411 commented 5 years ago

ok, thank you. I will try.

kooscode commented 5 years ago

How big, in pixels, are the width/height of the items you are trying to detect? And how big, resolution-wise, is your source image?

chi8411 commented 5 years ago

How big, in pixels, are the width/height of the items you are trying to detect? And how big, resolution-wise, is your source image?

Hi, my images are 2048x2048. The objects I want to detect are traffic signs, so not big.

kooscode commented 5 years ago

How many pixels wide and tall is 'not big'?

chi8411 commented 5 years ago

How many pixels wide and tall is 'not big'? The width and height range from about 0 to 300 pixels, but most fall within the [30,60] range.

kooscode commented 5 years ago

If the network resizes 2048 down to 1024, you will get an effective size of 15-30 pixels per object, which should be plenty to detect them. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very slow to train and run inference.

I suggest a network size of 512x512: chop your main image into 512x512 blocks/tiles and run inference per block. That way you can use the full resolution of the source image and you won't lose any detail to resizing. You should also train with the same kind of tiles.

We do this for detecting objects in aerial images, and we detect 15x15-pixel objects in 5000x8000-pixel full-resolution images with very high accuracy.
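
For illustration, a minimal Python sketch of the tiling step (file names are placeholders; each saved tile would then be fed to darknet at its native 512x512 resolution):

  import cv2

  TILE = 512  # should match the width/height in your .cfg

  img = cv2.imread("frame_2048x2048.jpg")          # placeholder file name
  h, w = img.shape[:2]

  tiles = []
  for y0 in range(0, h, TILE):
      for x0 in range(0, w, TILE):
          crop = img[y0:y0 + TILE, x0:x0 + TILE]
          name = "tile_x%04d_y%04d.jpg" % (x0, y0)  # keep the tile offset in the name
          cv2.imwrite(name, crop)                   # run darknet inference on each tile file
          tiles.append((x0, y0, name))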

chi8411 commented 5 years ago

if the network resizes 2048 down to 1024, you will get effective size of 15-30 pixels per object, which should be plenty to detect it.. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very very slow to train and inference.

I suggest a network size of 512x512 and then chop your main image into 512x512 blocks/tiles and inference per block.. that way you can utilize full resolution of source image and you wont lose any details due to resizing.. and you should also train with same tiles.

we do this for detecting aerial images and we detect 15x15 pixel objects in 5000x8000 pixel full res images with very high accuracy..

So you mean a 2048x2048 picture is cut into 16 pieces of 512x512, is that right? Won't an object be cut so that it cannot be detected? Sorry, I can't understand "inference per block" and "train with same tiles". Can you explain in more detail, or do you have a paper for reference? Thanks for your help.

wahid18benz commented 5 years ago

@kooscode, I'm trying to train YOLOv3 on a drone dataset with different image sizes: HD, Full HD, and 4K. Do I have to change the image width and height, and the anchor boxes, or not? What do you suggest to improve accuracy?

thanks,

kooscode commented 5 years ago

@chi8411 - Yes, you can cut the 2048x2048 image into 16 images of 512x512.

You can use a sliding window of 512x512 with a stride of 480 pixels or something similar, meaning all the 512x512 squares will have at least a 30-pixel overlap, so you won't miss anything. You would need to remove duplicates, though (see the sketch below).
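
A small Python sketch of that sliding window (assumes the image is at least 512 pixels in each dimension):

  def sliding_windows(width, height, tile=512, stride=480):
      """Top-left corners of 512x512 windows with a 32-pixel overlap, so an object
      split by one tile boundary shows up whole in a neighbouring tile."""
      xs = list(range(0, width - tile + 1, stride))
      ys = list(range(0, height - tile + 1, stride))
      if xs[-1] != width - tile:    # make sure the right edge is covered
          xs.append(width - tile)
      if ys[-1] != height - tile:   # ...and the bottom edge
          ys.append(height - tile)
      for y in ys:
          for x in xs:
              yield x, y

  # a 2048x2048 image gives 5x5 = 25 overlapping windows instead of 16 disjoint ones
  print(len(list(sliding_windows(2048, 2048))))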

@wahid18benz - same questions: what are you trying to detect, how big is it physically, and what GSD are you using? Are you flying at a constant AGL, and with terrain following?

kooscode commented 5 years ago

@wahid18benz, the anchor boxes should match the shape of your objects for better alignment and shape of predicted boxes. For example if you use a pre-defined network and its anchor boxes were meant for things like pedestrians and traffic signs (i.e. long rectangles), it will have a hard time with accurate alignment of perfect square boxes around objects.

chi8411 commented 5 years ago

@kooscode Excuse me, is the picture processed beforehand, or is it cut after a picture enters YOLOv3? Is your sliding window a separate program, or is it added to YOLOv3? I want to know more about this method! I think it's a good way to detect small objects. Thank you.

kooscode commented 5 years ago

@chi8411 - It is not part of YOLO; I wrote it myself. And yes, the image is processed into tiles of 512x512 (or whatever your network input size is) and then inference is run on each tile.

We use a multi-threaded, multi-GPU inferencing system on aerial images: we cut these 512x512 blocks from the same image using a sliding window with a stride, run inference on them in parallel across multiple GPUs, then remove any duplicates and map the detections back to the original image coordinates.

You essentially end up with a neural net of any size being able to run inference on an image of any size, and it is very fast and very accurate since it is a 1:1 resolution match from source image to network input.

We are working on modifying YOLO so we can use a similar region-proposal algorithm to identify which tiles contain objects of interest and then only run the full neural net on those tiles, but right now the sliding window works well.
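
A rough Python sketch of the "remove duplicates and restore to original coordinates" step (the box format and function names here are illustrative, not darknet's API):

  def iou(a, b):
      """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
      ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
      ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
      inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
      union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
      return inter / float(union)

  def merge_tiles(per_tile_dets, iou_thresh=0.5):
      """per_tile_dets: list of (tile_x, tile_y, [(x1, y1, x2, y2, score), ...]).
      Shift boxes into full-image coordinates, then drop overlapping duplicates
      that come from the overlap between neighbouring tiles."""
      boxes = []
      for tx, ty, dets in per_tile_dets:
          for x1, y1, x2, y2, score in dets:
              boxes.append((x1 + tx, y1 + ty, x2 + tx, y2 + ty, score))
      boxes.sort(key=lambda b: b[4], reverse=True)   # keep the highest-scoring copy
      kept = []
      for b in boxes:
          if all(iou(b[:4], k[:4]) < iou_thresh for k in kept):
              kept.append(b)
      return kept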

chi8411 commented 5 years ago

@kooscode So, suppose you want to train on a picture. Is your YOLO input then the 16 tiles and 16 new label files? Or is your input a crop of just the object plus a new label? And what about testing: cut or not cut? If the image is not cut, can the objects still be detected?

Thank you.

wahid18benz commented 5 years ago

@kooscode
I don't have information about GSD and AGL. I'm using the VisDrone dataset (http://aiskyeye.com/views/getInfo?loc=3). I have ten classes: pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle.

kooscode commented 5 years ago

Gotcha. In this case, I would suggest you compute anchor boxes for this particular application.
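
For reference, on the AlexeyAB fork anchors can be recomputed from your own training labels with something along these lines (adjust obj.data and the width/height to your setup):

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608

The printed values then replace the anchors= entries in each [yolo] section of the cfg.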

Erissonleo commented 5 years ago

if the network resizes 2048 down to 1024, you will get effective size of 15-30 pixels per object, which should be plenty to detect it.. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very very slow to train and inference. I suggest a network size of 512x512 and then chop your main image into 512x512 blocks/tiles and inference per block.. that way you can utilize full resolution of source image and you wont lose any details due to resizing.. and you should also train with same tiles. we do this for detecting aerial images and we detect 15x15 pixel objects in 5000x8000 pixel full res images with very high accuracy..

@kooscode Do you crop the image before training, or only when detecting objects? I want to detect cars in video filmed by a UAV. The video resolution is 1920x1080, and the target objects are about 15x15 pixels. I have tried making my training dataset by cutting the images, and I set width/height to 832/832 in the cfg file, but the results are not very good.
Thank you!

kooscode commented 5 years ago

You have to train and run inference on the same full-resolution 512x512 crops from the original image, yes.

ZHI-ANG commented 5 years ago

@chi8411 Are you working on the Tsinghua-Tencent 100K dataset? How is your project going? I'm also working on this dataset and am stuck on small sign detection. I think the approach kooscode advised would still not work well here, because of the differences between the two problems:

Traffic signs are highly similar to each other, so more detail is needed to distinguish among them. However, kooscode might only need to detect one class from the air, or classes with significant differences (@kooscode, is that the case? Sorry if I'm mistaken), so fewer details of the objects are needed. There are more examples like this in the paper "Finding Tiny Faces".

What's more, the early convolutions in YOLOv3 already downsample the image 4x even if you feed in the original resolution. Traffic signs of size [16,32] (which account for nearly 25% of the dataset) still suffer from a lack of detail.

So I think YOLOv3 may not be the best recipe in this case. A two-stage method with region proposal (RP) and classification might be a better choice, where classification works on regions cropped from the original-resolution input image. The only problems I foresee are:

  1. RoI pooling is not implemented in darknet, so RP and classification may have to run separately.
  2. The two parts of the CNN have to be trained separately, and the classification training set has to be created from the Tsinghua-Tencent 100K dataset by yourself.

YKritet commented 5 years ago

Is it possible to start training YOLO with one set of parameters (width and height), stop the training, change these parameters, and then continue training? More specifically, will it negatively affect our decreasing loss curve?

faybak commented 5 years ago

if the network resizes 2048 down to 1024, you will get effective size of 15-30 pixels per object, which should be plenty to detect it.. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very very slow to train and inference.

I suggest a network size of 512x512 and then chop your main image into 512x512 blocks/tiles and inference per block.. that way you can utilize full resolution of source image and you wont lose any details due to resizing.. and you should also train with same tiles.

we do this for detecting aerial images and we detect 15x15 pixel objects in 5000x8000 pixel full res images with very high accuracy..

I don't understand the "512x512 blocks/tiles" part very well; how do I make them? Do we have to use multi-GPU? I have images of different sizes, ranging from 800x600 to 1270x720, and I want to detect very small objects, like a dash on a car number plate. I'm using a Tesla V100.

kooscode commented 5 years ago

You cut up the image into 512x512 squares and run inference on each square.

If you are reading number plates, I would recommend you instead detect the number plate, extract that bounding box as a region of interest, and then feed it into a different network at full resolution.


kooscode commented 5 years ago

Is it possible to start training yolo with a set of parameters (width and height), stop the training, change these parameters an then continue the training ? More like will it affect negatively our decreasing loss curve ?

Yes, you can. I'd recommend you extract the convolutional weights from your first training run and use them for a transfer-learning job with a bigger input size.

The same goes once you have trained: you can run inference with any network input size.
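
For example (paths, file names, and layer count here are just placeholders for a YOLOv3 setup), darknet's partial command can strip the convolutional weights out of the finished run, and the new, larger-input training can start from them:

./darknet partial cfg/yolov3.cfg backup/yolov3_first_run_final.weights yolov3.conv.81 81
./darknet detector train data/obj.data cfg/yolov3-bigger-input.cfg yolov3.conv.81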

faybak commented 5 years ago

then feed that into a different network at full resolution.

How can I feed that into a different network at full resolution? When I crop the detected plate and run detection on the number plate, it doesn't recognize anything. But if I use the original image without cropping, it can recognize some characters, though not all. How can I do that?

kooscode commented 5 years ago

how can I feed that into a different network at full resolution?

Well, if you have the full-resolution image and you crop out the number plate as an ROI, that cropped image is at full resolution, so just feed it into a number plate reader...

https://blog.yellowant.com/automate-license-plate-recognition-in-3-simple-steps-f50886177d2e
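
A tiny sketch of that crop step in Python (the box values here are made up; in practice they come from your plate detector):

  import cv2

  img = cv2.imread("car_1270x720.jpg")   # original, un-resized frame (placeholder name)
  x, y, w, h = 640, 420, 160, 45         # plate bounding box from the detector (made-up values)
  roi = img[y:y + h, x:x + w]            # full-resolution crop, nothing lost to resizing
  cv2.imwrite("plate_roi.png", roi)      # feed this crop to the plate/OCR network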

faybak commented 5 years ago

My idea was to use only YOLO throughout the process. Is that possible?

kooscode commented 5 years ago

Well, I guess you can broach a square peg into a round hole with a huge hammer and enough force..

Why not use the right tool for the right job?

https://nanonets.com/blog/attention-ocr-for-text-recogntion/

matteoguidi commented 4 years ago

we use a multi threaded multi GPU inferencing system on aerial images cut these 512x512 blocks using sliding window with stride from same image and then in parallel inference across multi GPU's and then remove any duplicates and restore back to original image coordinates.

@kooscode does your training set consist of images of size 512x512?

My training dataset consists of 256x256 patches, where the objects I want to detect cover 10% to 40% of the image. But then I want to detect these objects in 20000x20000 images!

Should I train using a width and height of 256 and then test with the same configuration, on 256x256 patches split from the original image?

kooscode commented 4 years ago

@matteoguidi - yes, we trained on 512x512

In your case, you should train on the full-resolution 256x256 patches using a network with a matching 256x256x3 input size.

Then during inference you can cut your 20k x 20k image up into whatever tile size you want, as long as you also adjust your inference network input size to that same size. For example, if you have good hardware and can handle a 928x928 input size, then cut your images into blocks of that size and feed them into the network at a 1:1 ratio of tile size to input size.

I would also suggest that when you tile your image, you use an overlapping stride so you don't miss objects that are cut in half (or worse), and that you de-duplicate after mapping the detected object locations back into the 20k x 20k image.

Does that make sense?
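
For example, to match 928x928 tiles at inference time only the [net] input size in the cfg needs to change (YOLOv3 input sizes have to stay multiples of 32):

  [net]
  width=928
  height=928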

matteoguidi commented 4 years ago

@kooscode yep, this totally makes sense. I will try to do that and see what results I obtain.

Just one last question: did you connect the layers in the .cfg file as Alexey suggested?

for training for small objects (smaller than 16x16 after the image is resized to 416x416) - set layers = -1, 11 instead of https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L720 and set stride=4 instead of https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov3.cfg#L717
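
In other words, the edited region of yolov3.cfg would look roughly like this (only the two changed entries at the linked lines are shown):

  [upsample]
  stride=4

  [route]
  layers = -1, 11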

Thank you very much for your help!

Vinod-Koli commented 4 years ago

@chi8411 @AlexeyAB I am working on a very similar problem: detecting very small objects of size 10x10 in 1920x1080 images.

Could you please tell us about your experiments and results, and what worked best for you? It would save me a lot of time.

Thank you.

matteoguidi commented 4 years ago

@Vinod-Koli Hey there,

In the past months I managed to detect very small objects (10 to 20 pixels) in images of 30,000 by 30,000 pixels.

For training, I trained on 256x256-pixel patches containing the objects, with width and height in the cfg file set to 256. For detection, I split the image into smaller patches (like 1056x1056 pixels) and ran detection on those, trying different width and height values in the cfg file and selecting the ones that performed best.

As the comment from @kooscode said:

I would also suggest when you tile your image, use a overlap stride so you dont miss cut in half (or more) objects and you then de-duplicate after reconstructing detected objects locations back into the 20kx20k image

I also used an overlap between patches (about 15% of the patch dimension) and removed the occasional double detections, and the results were really good.

Try to do the same, and see if it works for your case!

Vinod-Koli commented 4 years ago

@matteoguidi

Thank you for the information; I will try to implement it as you described. In my case, I have to achieve real-time performance of at least 10-15 FPS, so what was the inference time in your case, and which version of YOLO did you use?

and what are your thoughts on this (quoted from @ZHI-ANG )

What's more, the first two convs in YOLO v3 also downsample the images 4X even if you put the original resolution in. The traffic signs of size [16,32](which account for nearly 25% in dataset) still face the lack of details.

Thanks a lot!

matteoguidi commented 4 years ago

In my case, i have to achieve real-time performance of at least 10-15 FPS, so what was the inference time in your and which version of yolo did you use ?

I'm sorry, I can't help you with that since I was doing detection on images, not videos. Still, I used YOLOv3 (the AlexeyAB fork). Detection, depending on the number of images, takes about 5 minutes on a GeForce RTX 2080.

and what are your thoughts on this (quoted from @ZHI-ANG )

What's more, the first two convs in YOLO v3 also downsample the images 4X even if you put the original resolution in. The traffic signs of size [16,32](which account for nearly 25% in dataset) still face the lack of details.

I don't know about that issue, since I was not doing precise classification of the objects, just detection of a single class. Maybe try increasing the resolution, so that when the 4x downsampling happens it does not degrade the details?

Vinod-Koli commented 4 years ago

Thanks for the information!

Leprechault commented 4 years ago

@chi8411 - it is not part of yolo, i wrote it myself and yes, its processed into tiles of 512x512 (or whatever your network input size is) and then inference.

we use a multi threaded multi GPU inferencing system on aerial images cut these 512x512 blocks using sliding window with stride from same image and then in parallel inference across multi GPU's and then remove any duplicates and restore back to original image coordinates.

you essentially end up with a neural net of any size being able to inference an image of any size and its very fast and very accurate since its 1:1 resolution from source image to network input.

We are working on modifying YOLO so we can use the similar region proposal algorithm to identify which tiles contain objects of interest and then only do full neural net on those tiles. but right now sliding window works well.

I like this approach very much. What is the procedure for handling these aerial images? Are the images in JPEG or TIFF format for training YOLO?

kooscode commented 4 years ago

@Leprechault - The file format doesn't matter, although darknet does not handle 16-bit images and also does not support uncompressed TIFF, so we had to resort to 8-bit uncompressed PNG files. But JPG should work fine too.
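
If the source is a 16-bit GeoTIFF, one way to get 8-bit images darknet accepts is gdal_translate (file names here are placeholders):

gdal_translate -ot Byte -scale -of PNG ortho_16bit.tif ortho_8bit.png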

Leprechault commented 4 years ago

@Leprechault - It doesnt matter the file format. although Darknet does not handle 16bit images and also does not support uncompressed TIFF and so we had to resort to using 8-bit uncompressed PNG files. But jpg should work fine too..

Thank you kooscode. But what's the approach for applying YOLOv3 to an aerial orthomosaic, normally a large GeoTIFF (in my case my GeoTIFF is 22000x22000 pixels), and for retrieving the geographic information for the classified objects if I convert the image to PNG or JPEG?

matteoguidi commented 4 years ago

@Leprechault I can tell you what my approach was to run detection on big geoTIFF images (Sentinel-1 data):

  • Read the image as an array
  • Split the image into small patches
  • Convert the array patches to jpg (while keeping the same number of pixels)
  • Perform detection on the jpg patches. Each detection returns the top-left pixel of the detected box in patch coordinates.
  • Convert the coordinates of the detected object from patch coordinates to image coordinates
  • Convert those last coordinates into lat/lon with this formula: https://gdal.org/user/raster_data_model.html#affine-geotransform

I obtained very good and precise results, but it took a lot of tries and work.
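
The last step (image pixel coordinates to geographic coordinates) is just the affine geotransform from the GDAL raster data model; a small Python sketch with the GDAL bindings (file name and pixel values are placeholders):

  from osgeo import gdal

  ds = gdal.Open("orthomosaic.tif")
  gt = ds.GetGeoTransform()  # (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height)

  def pixel_to_geo(px, py, gt):
      """geo_x = gt[0] + px*gt[1] + py*gt[2]; geo_y = gt[3] + px*gt[4] + py*gt[5]"""
      return gt[0] + px * gt[1] + py * gt[2], gt[3] + px * gt[4] + py * gt[5]

  # full-image pixel position of a detection = patch offset + box position inside the patch
  print(pixel_to_geo(10240, 8700, gt))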

Leprechault commented 4 years ago

@matteoguidi thank you very much!!! I'll use your tips as a workflow and give it a try.

robisen1 commented 4 years ago

@kooscode I don't have information about GSD and AGL , I'm using Visdrone Dataset http://aiskyeye.com/views/getInfo?loc=3 I have ten classes : pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle.

That link is dead now, can you share that dataset?

Leprechault commented 4 years ago

if the network resizes 2048 down to 1024, you will get effective size of 15-30 pixels per object, which should be plenty to detect it.. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very very slow to train and inference.

I suggest a network size of 512x512 and then chop your main image into 512x512 blocks/tiles and inference per block.. that way you can utilize full resolution of source image and you wont lose any details due to resizing.. and you should also train with same tiles.

we do this for detecting aerial images and we detect 15x15 pixel objects in 5000x8000 pixel full res images with very high accuracy..

@Leprechault I can tell you what my approach was to make detection on geoTIFF big images (sentinel-1 data).

  • Read image as an array
  • Split the image in small patches
  • Convert the array patches in jpg (while keeping the same number of pixels)
  • Perform detection on jpg patches. Each detection returns the top-left pixel of the detected box in patch coordinates.
  • Convert the coordinates of the detected object from patch coordinates to image coordinates
  • Convert that last coordinates into lat/lon with this formula https://gdal.org/user/raster_data_model.html#affine-geotransform

I obtained very good and precise results, but it took a lot of tries and work.

@matteoguidi thanks for the tips. In my case I have small target objects (10-40 pixels), and I split my images into 80x80-pixel jpg files for fast darknet training. What changes that worked for you would you recommend in the detector.c file to improve the model accuracy?

robisen1 commented 4 years ago

if the network resizes 2048 down to 1024, you will get effective size of 15-30 pixels per object, which should be plenty to detect it.. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very very slow to train and inference.

I suggest a network size of 512x512 and then chop your main image into 512x512 blocks/tiles and inference per block.. that way you can utilize full resolution of source image and you wont lose any details due to resizing.. and you should also train with same tiles.

we do this for detecting aerial images and we detect 15x15 pixel objects in 5000x8000 pixel full res images with very high accuracy..

How do you cut up the images without losing your bounding boxes?

Leprechault commented 4 years ago

if the network resizes 2048 down to 1024, you will get effective size of 15-30 pixels per object, which should be plenty to detect it.. But with a 1024x1024 network input size, you will need a LOT of GPU memory to train, or very small batch sizes, and it will be very very slow to train and inference. I suggest a network size of 512x512 and then chop your main image into 512x512 blocks/tiles and inference per block.. that way you can utilize full resolution of source image and you wont lose any details due to resizing.. and you should also train with same tiles. we do this for detecting aerial images and we detect 15x15 pixel objects in 5000x8000 pixel full res images with very high accuracy..

How do you cut up the images without losing your bounding boxes?

Thanks @robisen1. In my case, I first cropped all the original images to 80x80 pixels, and afterwards I created the bounding boxes in each cropped image (I'm still doing this step). These 22,000 images, each 80x80 pixels in jpg plus bounding box coordinates in txt, I will use for training my model.
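
For what it's worth, a rough Python sketch of carrying the original YOLO-format labels over to the crops instead of re-drawing them by hand (assumes the usual normalized "class x_center y_center width height" label format; the clipping here is crude):

  def remap_yolo_labels(labels, img_w, img_h, tile_x, tile_y, tile=80):
      """labels: [(cls, xc, yc, w, h)] normalized to the ORIGINAL image.
      Returns labels normalized to the tile whose top-left corner is (tile_x, tile_y);
      boxes whose centre falls outside the tile are dropped."""
      out = []
      for cls, xc, yc, w, h in labels:
          ax, ay = xc * img_w, yc * img_h        # absolute centre in the original image
          aw, ah = w * img_w, h * img_h          # absolute box size
          tx, ty = ax - tile_x, ay - tile_y      # centre in tile-local pixels
          if 0 <= tx < tile and 0 <= ty < tile:
              out.append((cls, tx / tile, ty / tile,
                          min(aw, tile) / tile, min(ah, tile) / tile))
      return out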

robisen1 commented 4 years ago

Ohh, you have to recreate the bounding boxes? I have 100k images, and manually redoing the bounding boxes is not possible for me.

kooscode commented 4 years ago

if you have 100k images and you already have the bounding boxes, why would you have to re-do them?

robisen1 commented 4 years ago

if you have 100k images and you already have the bounding boxes, why would you have to re-do them?

They are large, and it seems the recommendation is to cut the images into smaller images? If so, how do I do that when a cut goes through an object or its bounding box?

Leprechault commented 4 years ago

@Leprechault I can tell you what my approach was to make detection on geoTIFF big images (sentinel-1 data).

  • Read image as an array
  • Split the image in small patches
  • Convert the array patches in jpg (while keeping the same number of pixels)
  • Perform detection on jpg patches. Each detection returns the top-left pixel of the detected box in patch coordinates.
  • Convert the coordinates of the detected object from patch coordinates to image coordinates
  • Convert that last coordinates into lat/lon with this formula https://gdal.org/user/raster_data_model.html#affine-geotransform

I obtained very good and precise results, but it took a lot of tries and work.

@matteoguidi please, what is the maximum size of GeoTIFF image that worked for you with ./darknet detector test ...? Because if I try to use a 26338x21685-pixel jpg orthomosaic image (152MB), I get a "too large" error:

  105 conv     18  1 x 1 / 1    76 x  76 x 256   ->    76 x  76 x  18  0.053 BFLOPs
  106 yolo
Loading weights from backup/obj_100.weights...Done!
Cannot load image "ORTO_T4_FULL_CROP.jpg"
STB Reason: too large

matteoguidi commented 4 years ago

@Leprechault I know what your problem is. Don't use the "graphical" part of darknet detector, because I think it is not able to open such large images. Going from memory, the max I could get was around 3k x 3k pixels. Instead, you should save the output to a .txt file with this command:

./darknet detector test cfg/voc.data yolo-voc.cfg yolo-voc.weights -dont_show -ext_output < data/train.txt > result.txt

where you should substitute your own input files (.data, .cfg and .weights). The train.txt file is a simple .txt file that contains, one per line, the path of each image you want detection run on (in your case, your .jpg image). The result.txt file will be created, and the results of the detection will be stored inside.

Don't know if your GPU will be able to run the detection on such a big image, but give it a try.

Leprechault commented 4 years ago

Thanks @matteoguidi for your attention and the proposed solution, but now I have a new error. First, as you recommended, I created a train.txt with the path to my image. Inside my train.txt I just have:

/home/fitlab1/Área de Trabalho/CNN4antsYOLO/darknet/ORTO_T4_FULL_CROP.jpg

and then I run ./darknet detector test obj.data obj.cfg backup/obj_100.weights -dont_show -ext_output data/train.txt result.txt and the output is:

  101 conv    128  1 x 1 / 1    76 x  76 x 256   ->    76 x  76 x 128  0.379 BFLOPs
  102 conv    256  3 x 3 / 1    76 x  76 x 128   ->    76 x  76 x 256  3.407 BFLOPs
  103 conv    128  1 x 1 / 1    76 x  76 x 256   ->    76 x  76 x 128  0.379 BFLOPs
  104 conv    256  3 x 3 / 1    76 x  76 x 128   ->    76 x  76 x 256  3.407 BFLOPs
  105 conv     18  1 x 1 / 1    76 x  76 x 256   ->    76 x  76 x  18  0.053 BFLOPs
  106 yolo
Loading weights from backup/obj_100.weights...Done!
Cannot load image "-dont_show"
STB Reason: can't fopen

...