ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

INCREASING NMS SPEED #679

Closed glenn-jocher closed 4 years ago

glenn-jocher commented 4 years ago

Non-Maximum Suppression (NMS) of bounding boxes is a significant speed constraint during testing. I am opening this issue to try to determine options for speeding up this operation. I am going to compare the default NMS method 'MERGE' with two newly available PyTorch methods. If anyone has any additional methods we could test, please post here.

https://github.com/ultralytics/yolov3/blob/cadd2f75ff5108818048cf48af8a5e8558acf6ee/utils/utils.py#L456
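
For reference, the two torchvision candidates can be called as below. This is a minimal usage sketch with random boxes, not the repo's exact integration (which is at the link above):

```python
import torch
from torchvision.ops.boxes import nms, batched_nms

boxes = torch.rand(100, 4) * 300          # random xyxy boxes (x1, y1, x2, y2)
boxes[:, 2:] += boxes[:, :2]              # ensure x2 >= x1 and y2 >= y1
scores = torch.rand(100)
classes = torch.randint(0, 80, (100,))

keep = nms(boxes, scores, iou_threshold=0.6)                       # class-agnostic NMS
keep_b = batched_nms(boxes, scores, classes, iou_threshold=0.6)    # per-class NMS via internal offsets
```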

The test code is below. Hardware is a 2080Ti.

python3 test.py --weights ultralytics68.pt --nms-thres 0.6 --img-size 512 --device 0

UPDATE: THESE ARE OLD RESULTS, SEE BOTTOM OF THREAD FOR IMPROVED RESULTS

| NMS method | Speed (mm:ss) | COCO mAP @0.5...0.95 | COCO mAP @0.5 |
| --- | --- | --- | --- |
| ultralytics 'OR' | 8:20 | 39.7 | 60.3 |
| ultralytics 'AND' | 7:38 | 39.6 | 60.1 |
| ultralytics 'SOFT' | 12:00 | 39.1 | 58.7 |
| ultralytics 'MERGE' | 11:25 | 40.2 | 60.4 |
| torchvision.ops.boxes.nms() | 5:08 | 39.7 | 60.3 |
| torchvision.ops.boxes.batched_nms() | 6:00 | 39.7 | 60.3 |

glenn-jocher commented 4 years ago

The result of the test is that torchvision.ops.boxes.nms() is fastest but not the highest mAP. The Ultralytics MERGE method increases AP by +0.5, so I will keep it for testing (when calling test.py directly with --conf-thres 0.001), and use torchvision.ops.boxes.nms() for calculating mAP during training with --conf-thres 0.10 (to increase training speed).

https://github.com/ultralytics/yolov3/blob/1e9ddc5a90057d39fb47af52737dd24e99db2f07/utils/utils.py#L513-L517
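
For context, the 'MERGE' idea is essentially score-weighted box fusion on top of standard NMS. A minimal sketch of that approach (not the repo's exact code, which lives at the link above), assuming xyxy boxes:

```python
import torch
from torchvision.ops import box_iou, nms

def merge_nms(boxes, scores, iou_thres=0.6):
    # Standard NMS picks the surviving boxes ...
    i = nms(boxes, scores, iou_thres)
    # ... then each survivor is replaced by the score-weighted average of all
    # boxes that overlap it above the IoU threshold.
    overlap = box_iou(boxes[i], boxes) > iou_thres          # (kept, n) overlap mask
    weights = overlap * scores[None]                        # (kept, n) score weights
    merged = torch.mm(weights, boxes) / weights.sum(1, keepdim=True)
    return merged, scores[i]
```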

FranciscoReveriano commented 4 years ago

I will look more into this during the weekend.

developer0hye commented 4 years ago

Great work!

omizonly commented 4 years ago

torchvision.ops implements operators that are specific to Computer Vision. Those operators currently do not support TorchScript. nms() performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).

AttributeError: module 'torchvision' has no attribute 'ops'

what should I do?

glenn-jocher commented 4 years ago

@omizonly what is your use case for TorchScript?

omizonly commented 4 years ago

@omizonly what is your use case for TorchScript?

tensorflow= 1.3.1

glenn-jocher commented 4 years ago

@omizonly I don't understand, can you elaborate? This repo only runs PyTorch and exports to ONNX for onward use in other formats; however, we clearly cannot support you with problems in those other formats. I suggest you raise an issue on the PyTorch or TF repos.

glenn-jocher commented 4 years ago

I'll close this issue for now as the original issue appears to have been resolved, and/or no activity has been seen for some time. Feel free to comment if this is not the case.

glenn-jocher commented 4 years ago

Quick update with latest code on one T4 GPU. Second line is current default.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp.cfg --img 608
| NMS method | Time (per image) | Time (mm:ss) | COCO mAP @0.5...0.95 | COCO mAP @0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched', multi_cls=False | 43 ms | 3:36 | 40.2 | 60.4 |
| 'vision_batched', multi_cls=True | 48 ms | 4:01 | 40.9 | 61.4 |
| 'merge', multi_cls=True | 172 ms | 14:23 | 41.3 | 61.7 |

FranciscoReveriano commented 4 years ago

Is there a way to make the model print the JSON file if it detects an object regardless of classification?

Zzh-tju commented 4 years ago

Hi, I saw a Fast NMS proposed by YOLACT. How is it? https://arxiv.org/abs/1912.06218

glenn-jocher commented 4 years ago

@Zzh-tju yes that seems an interesting approach. They apply NMS as a matrix operation to remove the for loop, which they say runs much faster with a minimum mAP penalty.

Depending on the conf-thres used, NMS may or may not be a very expensive operation in this repo. For most actual use applications with conf-thres around 0.1-0.9, NMS is not a speed concern, taking <10% of the total processing time for an image, but when calculating mAP near conf-thres = 0.0001 for example, NMS may take up 90% of the processing time.

If you can try to implement a fast NMS experiment here that would be very useful. The NMS function is here. In the meantime I will update this thread with the latest speeds on a T4 colab instance. https://github.com/ultralytics/yolov3/blob/dce753ead4a8378055fc07be54c3f54bcf55e2ed/utils/utils.py#L504-L512

UPDATE: I've posted an issue on yolact repo for this https://github.com/dbolya/yolact/issues/366#issue-575069787
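
For anyone who wants to experiment, here is a minimal Fast NMS sketch along the lines of the YOLACT paper (an illustration only, not a drop-in replacement for the repo function linked above):

```python
import torch
from torchvision.ops import box_iou

def fast_nms(boxes, scores, iou_thres=0.5):
    # Assumes a non-empty set of xyxy boxes already sorted by descending score.
    # One matrix op replaces the sequential suppression loop: a box is dropped if
    # any higher-scoring box overlaps it above the threshold, even when that
    # higher-scoring box was itself suppressed (this is the small mAP cost).
    iou = box_iou(boxes, boxes).triu_(diagonal=1)   # upper-triangular IoU matrix
    keep = iou.max(dim=0)[0] < iou_thres
    return torch.nonzero(keep).flatten()
```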

glenn-jocher commented 4 years ago

Update: I discovered that the majority of the time in test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (computing mAP only with repo code), I get the following times for the 5k COCO2014 val images. The machine is a 12-vCPU V100 instance.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
| NMS method | Time (ms/img) | Time (mm:ss) | mAP @0.5:0.95 | mAP @0.5 |
| --- | --- | --- | --- | --- |
| 'vision_batched' (default) | 15.2 ms | 1:16 | 41.9 | 61.8 |
| 'merge' | 103 ms | 8:35 | 42.3 | 62.0 |
| 'fast_batched' | 14.6 ms | 1:13 | 41.5 | 61.5 |

glenn-jocher commented 4 years ago

@Zzh-tju FastNMS updates have been committed and pushed now after testing. https://github.com/ultralytics/yolov3/blob/f915bf175c02911a1f40fbd2de8494963d4e7914/utils/utils.py#L564-L571

glenn-jocher commented 4 years ago

@Zzh-tju to clear up the timing a bit more, I added profiling code to test.py that specifically tracks inference and NMS times in https://github.com/ultralytics/yolov3/commit/e482392161c30d4e4dbf4b4eebdb4672fcc6a134. This can be accessed with the --profile flag:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:

Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image

So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).

The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!

glenn-jocher commented 4 years ago

CORRECTION: My previous analysis was incorrect, it lacked the torch.cuda.synchronize() operations necessary when profiling cuda operations. I've fixed this in https://github.com/ultralytics/yolov3/commit/1430a1e4083609ab197cf1947a12ab8692b20593. Corrected results, consistent across several runs:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image

Conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than default torchvision.ops.boxes.batched_nms().
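
For readers profiling their own changes, the general pattern is to synchronize before reading the clock. This is a hedged sketch with hypothetical helper names (time_sync, profile_step), not the repo's exact profiling code:

```python
import time
import torch

def time_sync():
    # Wait for all queued CUDA kernels to finish before reading the clock;
    # without this, asynchronous GPU work is attributed to the wrong stage.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

def profile_step(model, imgs, nms_fn):
    t0 = time_sync()
    pred = model(imgs)      # inference
    t1 = time_sync()
    out = nms_fn(pred)      # NMS
    t2 = time_sync()
    print('%.1f/%.1f ms inference/NMS per batch' % ((t1 - t0) * 1e3, (t2 - t1) * 1e3))
    return out
```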

glenn-jocher commented 4 years ago

Inference can be sped up with larger batch sizes, but NMS is run per image in all cases, so the only ways to affect its speed currently are here. Note that the 1.6 ms profile time uses all default settings (none of these speedups are applied).

glenn-jocher commented 4 years ago

Running a few tests to document effects on speed. These are with a V100 from a docker container, which is slightly slower than running natively.

python3 test.py --cfg yolov3-spp.cfg --weights yolov3-spp-ultralytics.pt --img 608

rect=False
- cudnn.deterministic=True, cudnn.benchmark=False: 12.9/1.8/14.8 ms inference/NMS/total per 608x608 image at batch-size 32
- cudnn.deterministic=False, cudnn.benchmark=False: 9.9/1.7/11.6 ms inference/NMS/total per 608x608 image at batch-size 32
- cudnn.deterministic=False, cudnn.benchmark=True: 9.5/1.7/11.1 ms inference/NMS/total per 608x608 image at batch-size 32

rect=True
- cudnn.deterministic=True, cudnn.benchmark=False: 9.8/1.7/11.5 ms inference/NMS/total per 608x608 image at batch-size 32
- cudnn.deterministic=False, cudnn.benchmark=False (default): 6.8/1.7/8.6 ms inference/NMS/total per 608x608 image at batch-size 32
- cudnn.deterministic=False, cudnn.benchmark=True: 18.2/1.7/19.9 ms inference/NMS/total per 608x608 image at batch-size 32
- cudnn.deterministic=False, cudnn.benchmark=False, bs64: 7.0/1.7/8.8 ms inference/NMS/total per 608x608 image at batch-size 64
- cudnn.deterministic=False, cudnn.benchmark=False, bs1: 14.0/2.0/16.0 ms inference/NMS/total per 608x608 image at batch-size 1
- cudnn.deterministic=False, cudnn.benchmark=False, no contiguous() in models.py L207: 6.8/1.7/8.5 ms inference/NMS/total per 608x608 image at batch-size 32
- cudnn.deterministic=False, cudnn.benchmark=False, no contiguous(), reshape in models.py L207: 6.8/1.7/8.5 ms inference/NMS/total per 608x608 image at batch-size 32

Running natively:
- default: Speed: 6.7/1.6/8.2 ms inference/NMS/total per 608x608 image at batch-size 32
- no contiguous(): Speed: 6.6/1.6/8.2 ms inference/NMS/total per 608x608 image at batch-size 32
- no contiguous(), bs1: Speed: 12.8/1.8/14.6 ms inference/NMS/total per 608x608 image at batch-size 1
- yes contiguous(), bs1: Speed: 12.7/1.8/14.5 ms inference/NMS/total per 608x608 image at batch-size 1
- no contiguous(), bs1, img-size 512: Speed: 12.5/1.8/14.3 ms inference/NMS/total per 512x512 image at batch-size 1
- no contiguous(), bs1, img-size 416: Speed: 12.8/1.8/14.6 ms inference/NMS/total per 416x416 image at batch-size 1
- no contiguous(), bs1, img-size 608, yolov3-tiny: Speed: 3.2/1.8/4.9 ms inference/NMS/total per 608x608 image at batch-size 1

glenn-jocher commented 4 years ago

V100:
- Speed: 6.6/1.5/8.1 ms inference/NMS/total per 608x608 image at batch-size 32
- Speed: 17.2/1.5/18.8 ms inference/NMS/total per 800x800 image at batch-size 1
- Speed: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1
- Speed: 11.6/1.5/13.1 ms inference/NMS/total per 512x512 image at batch-size 1
- Speed: 11.6/1.5/13.1 ms inference/NMS/total per 416x416 image at batch-size 1
- Speed: 11.6/1.5/13.1 ms inference/NMS/total per 320x320 image at batch-size 1

2080Ti:
- Speed: 9.2/1.2/10.4 ms inference/NMS/total per 608x608 image at batch-size 32
- Speed: 13.9/1.5/15.4 ms inference/NMS/total per 608x608 image at batch-size 1

CPU: Speed: 753.0/2.9/756.0 ms inference/NMS/total per 608x608 image at batch-size 1

Zzh-tju commented 4 years ago

Does batch_size=32 mean testing 32 images simultaneously, including NMS?

glenn-jocher commented 4 years ago

@Zzh-tju batch-size 32 means for example a 32x3x608x608 tensor is passed to the model for inference. The inference outputs are passed to NMS, which operates sequentially over the images: for img in range(32):

https://github.com/ultralytics/yolov3/blob/4089735c5e515698b0b3b60e8726e6d601cfc090/utils/utils.py#L508
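
A simplified sketch of that structure, shown below, uses hypothetical shapes and assumes xyxy boxes and a single objectness score (the real function also handles xywh conversion, multi-label output, etc.):

```python
import torch
from torchvision.ops import batched_nms

def batched_inference_sequential_nms(model, imgs, conf_thres=0.001, iou_thres=0.6):
    # imgs: (32, 3, 608, 608) -> one batched forward pass for inference ...
    pred = model(imgs)                       # (32, n_anchors, 5 + n_classes)
    output = []
    # ... but suppression still runs once per image.
    for x in pred:                           # x: (n_anchors, 5 + n_classes)
        x = x[x[:, 4] > conf_thres]          # objectness filter
        boxes, scores = x[:, :4], x[:, 4]
        classes = x[:, 5:].argmax(1)
        keep = batched_nms(boxes, scores, classes, iou_thres)
        output.append(x[keep])
    return output
```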

glenn-jocher commented 4 years ago

Test-time augmentation study https://github.com/ultralytics/yolov3/issues/931:

- Default + 0 ops: 11.8/1.5/13.3 ms inference/NMS/total per 608x608 image at batch-size 1
- Default + 1 ops: 18.7/1.6/20.3 ms inference/NMS/total per 608x608 image at batch-size 1
- Default + 2 ops: 26.4/1.8/28.2 ms inference/NMS/total per 608x608 image at batch-size 1

glenn-jocher commented 4 years ago

Updated V100 speeds with fused inference:
- Speed: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1 (NEW RECORD)
- Speed: 6.5/1.5/8.1 ms inference/NMS/total per 608x608 image at batch-size 32 (NEW RECORD)

- Default + 0 ops: 11.1/1.7/12.8 ms inference/NMS/total per 608x608 image at batch-size 1
- Default + 2 ops: 26.1/1.9/28.1 ms inference/NMS/total per 608x608 image at batch-size 1

glenn-jocher commented 4 years ago

SOLOv2 Table 7: Matrix NMS: https://arxiv.org/pdf/2003.10152.pdf


UPDATE: Unable to reproduce using this code:

            elif method == 'matrix_batch':  # Matrix NMS from https://arxiv.org/abs/2003.10152
                iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
                m = iou.max(0)[0].view(-1, 1)  # max values
                decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
                scores *= decay  # decay scores of overlapping boxes instead of hard suppression
                i = torch.full((boxes.shape[0],), fill_value=1).bool()  # keep every box; decayed scores below conf-thres are removed downstream

qtw1998 commented 4 years ago

torchvision.ops implements operators that are specific to Computer Vision. Those operators currently do not support TorchScript. nms() performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).

AttributeError: module 'torchvision' has no attribute 'ops'

what should I do?

Have you solved it? I met the same problems

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

Zzh-tju commented 4 years ago

@glenn-jocher Hi, could you tell me why we cannot do NMS across the whole batch? Currently NMS is done image by image, even though we turn on batch testing.

The number of detections differs from image to image — is that the reason why we cannot perform a true batched NMS?

glenn-jocher commented 4 years ago

@Zzh-tju feel free to play around with the NMS code and try your idea out. If you see performance improvements please submit a PR! Thank you.

Zzh-tju commented 4 years ago

@glenn-jocher I have just figured out a speed improvement and will give you a PR later. You can try it and optimize it further.

Torchvision NMS cannot run in a cross-image mode (if we add an image-related offset to the boxes, the IoU matrix grows quadratically), so I tried Cluster-NMS instead. I keep the preprocessing of NMS unchanged and just replace the core part of your merge NMS with Cluster-Weighted NMS.

| | Batch size | torchvision merge NMS | batch-mode Cluster-Weighted NMS | Cluster-Weighted NMS |
| --- | --- | --- | --- | --- |
| AP | - | 42.9 | 42.9 | 42.9 |
| time | 4 | 3.0 ms | 4.4 ms | 5.5 ms |
| time | 32 | 2.3 ms | 3.0 ms | 4.7 ms |

Now I want to ask: why does NMS time decrease as batch size increases (for torchvision NMS)? What is the maximum batch size we can use? I run on 2x 2080Ti GPUs; batch size 32 takes about 6~7 GB of memory per GPU. I guess if we continue to increase the batch size when testing, the batch-mode Cluster-NMS variants may benefit more. However, given my limited coding ability, it may be possible to optimize the code further.

I think maybe the best way is to integrate the preprocessing of NMS into batch mode as well, even if it brings a slight performance drop. Right now preprocessing takes about 1.3~1.5 ms, and your torchvision merge NMS takes just 0.8 ms, so there is still room for acceleration.
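
For readers following along, the core Cluster-NMS iteration (without the weighted-merge step) looks roughly like this. It is a sketch based on the Cluster-NMS paper, not @Zzh-tju's exact implementation:

```python
import torch
from torchvision.ops import box_iou

def cluster_nms(boxes, scores, iou_thres=0.5, max_iter=200):
    # Assumes a non-empty set of xyxy boxes already sorted by descending score.
    iou = box_iou(boxes, boxes).triu_(diagonal=1)       # upper-triangular IoU matrix
    b = torch.ones(boxes.shape[0], dtype=torch.bool, device=boxes.device)
    for _ in range(max_iter):
        # Only boxes that are currently kept may suppress others (zero out the
        # rows of suppressed boxes), then recompute the keep mask.
        a = iou * b.float().unsqueeze(1)
        b_new = a.max(dim=0)[0] < iou_thres
        if torch.equal(b_new, b):
            break                                        # converged to standard NMS result
        b = b_new
    return torch.nonzero(b).flatten()
```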

glenn-jocher commented 4 years ago

@Zzh-tju ah! Thanks for the interesting study. We've actually discovered that in yolov5 the regression is improved enough that we can stop using merge, and simply use the default pytorch NMS to get the same results. So the current NMS strategy in the yolov5 function is not to use merge anymore.

It is an interesting idea to do a batched NMS approach instead of calling the nms function once per image. Your results show a significant improvement, 2.3 / 3.0 is about 25% faster (!). This would make a huge improvement on yolov5s for example, which has inference time of 2.1ms per image at batch-size 32 FP16, about half of which is used up with NMS. See speeds here. NMS is about 1 ms per image in these numbers, so a 25% speedup there would be noticeable in the table. https://github.com/ultralytics/yolov5#pretrained-checkpoints

glenn-jocher commented 4 years ago

Right now the boxes are offset by (class * max_image_size) to get batched per image (so different classes never overlap). I suppose to run once per batch we would offset boxes by (class * max_image_size * image_index)? Are you using torchvision.ops.nms() or torchvision.ops._batched_nms()?
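
To illustrate the per-image offset trick, here is a hypothetical sketch (max_wh=4096 is an assumed constant larger than any image dimension, not necessarily the repo's value); a per-batch variant would add a further offset derived from the image index:

```python
import torch
from torchvision.ops import nms

def offset_nms(boxes, scores, class_ids, iou_thres=0.6, max_wh=4096):
    # Shift every box by class_id * max_wh so boxes of different classes can
    # never overlap, then a single class-agnostic nms() call does per-class NMS.
    offsets = class_ids.to(boxes) * max_wh
    return nms(boxes + offsets[:, None], scores, iou_thres)
```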

Zzh-tju commented 4 years ago

@glenn-jocher no, you misunderstood me. My question is: why does NMS speed also improve as batch size increases?

glenn-jocher commented 4 years ago

@Zzh-tju in my experiments with yolov5, NMS speed is the same no matter the batch size. For example from the notebook:

!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 1
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 8
!python test.py --weights yolov5s.pt --data coco128.yaml --img 640 --batch 32

Output:

Namespace(augment=False, batch_size=1, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 8725.21it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 128/128 [00:03<00:00, 37.63it/s]
                 all         128         929       0.379        0.74       0.676        0.44
Speed: 9.3/1.8/11.1 ms inference/NMS/total per 640x640 image at batch-size 1

Namespace(augment=False, batch_size=8, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 5722.17it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 16/16 [00:02<00:00,  5.41it/s]
                 all         128         929       0.381       0.744        0.68       0.442
Speed: 4.1/2.2/6.3 ms inference/NMS/total per 640x640 image at batch-size 8

Namespace(augment=False, batch_size=32, conf_thres=0.001, data='./data/coco128.yaml', device='', img_size=640, iou_thres=0.65, merge=False, save_json=False, single_cls=False, task='val', verbose=False, weights='yolov5s.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100% 128/128 [00:00<00:00, 9776.04it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 4/4 [00:04<00:00,  1.12s/it]
                 all         128         929       0.385       0.752       0.692       0.452
Speed: 4.2/2.1/6.3 ms inference/NMS/total per 640x640 image at batch-size 32

So 1.8ms, 2.2ms, 2.1ms at batch sizes 1, 8, 32. Basically NMS speed per image is not correlated to batch size.

Zzh-tju commented 4 years ago

got it @glenn-jocher, I will do more tests with batch size.

Zzh-tju commented 4 years ago

@glenn-jocher Hi, I have just finished some marginal work on Batch Mode Weighted Cluster-NMS for speeding up NMS. You can check https://github.com/Zzh-tju/yolov5 for details. My conclusion is that batch-mode Weighted Cluster-NMS benefits us when TTA is used.

glenn-jocher commented 4 years ago

@Zzh-tju ah, very interesting! I'll check out the forked repo.

glenn-jocher commented 4 years ago

@Zzh-tju I looked things over. You've clearly done a lot of work and experimentation!

I see it's hard to provide substantial gains over the basic NMS, unfortunately. I think this is because box regression is improving over past works, so perhaps the gains from merging two 0.90 IoU boxes are less than, for example, merging two 0.5 IoU boxes. It's unfortunate, because one of the yolov5 changes is actually increased grid sensitivity. In yolov3, only one cell per output layer could trigger on an object. In yolov5, >=3 cells per output layer always trigger per object (the nearest 3), so I'd expect many more boxes to be proposed by yolov5 than by yolov3. It's frustrating that there isn't a better way to exploit all these extra statistics.

One very interesting piece of information I found during the TTA and Ensembling work is that merging output grids always produced better results than appending output boxes together. If you look at the YOLOv5 ensembling module you will see that there are 3 options: https://github.com/ultralytics/yolov5/blob/cab36f72a852ef00e8b42d3283ba9b2fc757b17f/models/experimental.py#L117-L129

If there were a way to mean() TTA output grids the way that the mean ensemble works, this might produce the best results, but it is very complicated due to the varying output shapes, unfortunately, so I abandoned this effort.
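
For reference, a minimal sketch of the three ensembling options mentioned above (a hypothetical mirror of the linked module, not its exact code):

```python
import torch
import torch.nn as nn

class Ensemble(nn.ModuleList):
    # Each member model must produce outputs of the same shape for the
    # mean/max options; the nms option only needs matching box formats.
    def forward(self, x):
        y = [m(x)[0] for m in self]            # run every model on the same input
        # y = torch.stack(y).mean(0)           # mean ensemble: average the output grids
        # y = torch.stack(y).max(0)[0]         # max ensemble
        y = torch.cat(y, 1)                    # nms ensemble: concatenate boxes, let NMS resolve
        return y, None
```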

Zzh-tju commented 4 years ago

@glenn-jocher wait a second, why do TTA output grids have different output shapes?

Zzh-tju commented 4 years ago

@glenn-jocher And I did see an improvement when merging two 0.8 IoU boxes rather than two 0.65 IoU boxes.

glenn-jocher commented 4 years ago

@Zzh-tju ensemble output grids will have the same shape, for example if you run both YOLOv5s and YOLOv5m at the same image size, the 3 output grids from YOLOv5s are the same size as from YOLOv5m.

TTA uses different inference sizes as part of its augmentation, so naturally the output grids change in size and can no longer be directly meaned.

Hmm, interesting, 0.8 IoU is higher than I've ever tried. I think the more accurate the box regressions, the higher you can raise the IoU threshold. What was the improvement you saw using 0.8 IoU?

Zzh-tju commented 4 years ago

@glenn-jocher see the results in https://github.com/Zzh-tju/yolov5. weighted threshold is the merging threshold

Zzh-tju commented 4 years ago

@glenn-jocher Do you mean that when the input size changes, the size of the output grid maps changes too?

glenn-jocher commented 4 years ago

@Zzh-tju yes. YOLOv5 strides are 8, 16, 32 on the small, medium and large object output layers. So a 640x640 image will have 3 output grids of size 20x20, 40x40, 80x80.

The same output grids for a 320x320 image are 10x10, 20x20, 40x40.
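
The relationship is just integer division of the input size by each output stride:

```python
img_size = 640
strides = (8, 16, 32)                        # small-, medium-, large-object output layers
print([img_size // s for s in strides])      # [80, 40, 20]
print([320 // s for s in strides])           # halving the input halves every grid: [40, 20, 10]
```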